Hi all - This article captures the top queries and reports to have in place for a production-grade Apigee infrastructure. It assumes you have ingested the logs into a log aggregator (Splunk, DataDog, ELK, ...) and have enabled APM monitoring. For alerts, instrument per your needs (PagerDuty, New Relic, ...).
Apigee Edge / Business metrics:
- Measure global 2xx rate
- Measure 2xx rate by virtual host
- Total traffic counts by developer app (time windows: rolling 4 hours, daily, weekly)
- Measure 2xx by product
- Measure 2xx, 5xx by product
- Measure 2xx of your target-servers
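
As a rough illustration of the rate metrics above, here is a minimal Python sketch that computes the 2xx rate grouped by an arbitrary dimension. The record fields (`status`, `virtual_host`, `api_product`) and the sample values are assumptions for illustration only; map them to whatever field names your log aggregator actually exposes. In practice you would express the same grouping in your aggregator's own query language.

```python
from collections import defaultdict

def success_rate_by(records, key):
    """Return {group: fraction of 2xx responses} for the given grouping key."""
    totals = defaultdict(int)
    ok = defaultdict(int)
    for rec in records:
        group = rec.get(key, "unknown")
        totals[group] += 1
        if 200 <= int(rec["status"]) < 300:
            ok[group] += 1
    return {group: ok[group] / totals[group] for group in totals}

# Hypothetical parsed log records; field names are assumptions.
sample = [
    {"status": 200, "virtual_host": "secure", "api_product": "payments"},
    {"status": 503, "virtual_host": "secure", "api_product": "payments"},
    {"status": 201, "virtual_host": "default", "api_product": "catalog"},
    {"status": 200, "virtual_host": "default", "api_product": "catalog"},
]

rates_by_product = success_rate_by(sample, "api_product")
rates_by_vhost = success_rate_by(sample, "virtual_host")
```

The same function covers the per-virtual-host, per-product, and per-target-server variants by changing only the `key` argument.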
System metrics:
- Measure tcp-opens / close-waits on routers and message-processors
- Measure connection-counts on your front-end load balancers
- C* (Cassandra) - monitor the token keyspace (an important health metric; problems here can become point-of-no-return issues)
- ZK (ZooKeeper) - measure sync-rate to RMPS. (Seen slowness affect deploy times in multi-region datacenter setups)
- OS specifics - RHEL vs. Ubuntu vs. Amazon-AMI (compare against benchmarks)
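
For the TCP connection-state item in the list above, one possible approach is to tally socket states from the output of `ss -tan` on each router and message-processor host and ship the counts to your monitoring system. The sketch below only does the parsing step; the sample output, addresses, and port are made up for illustration, and how you collect and forward the counts is left to your tooling.

```python
from collections import Counter

def count_tcp_states(ss_output):
    """Count sockets per TCP state from the text output of `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "State".
        if not fields or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts

# Hypothetical `ss -tan` output; addresses and ports are illustrative.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:8998       10.0.0.9:51234
CLOSE-WAIT 1      0      10.0.0.5:8998       10.0.0.9:51240
CLOSE-WAIT 1      0      10.0.0.5:8998       10.0.0.7:40022
LISTEN     0      128    0.0.0.0:8998        0.0.0.0:*
"""

states = count_tcp_states(SAMPLE)
```

A steadily growing `CLOSE-WAIT` count is the signal worth alerting on, since it usually means the local process is not closing sockets that the peer has already closed.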