Analytics API query assistance

Hi.

I have a requirement to identify 5 consecutive API responses (any type of response) taking over 30 seconds (measured from the request entering the Apigee API proxy to the response leaving the proxy), and also any 5xx responses. This is to be treated as an 'outage' of our platform. When this happens we need to identify it as quickly as Apigee can provide the information.

Please can you advise whether I can get this information from the Analytics API by querying for such an event? Ideally this would trigger an alert from Apigee, but if that is not possible, the alert can be triggered from the data being ingested into our event management system.

I would be grateful if anyone can advise whether this is possible through the API and/or Apigee alerting; any examples of such a query would be greatly appreciated.

Thanks

Mike

You can get this information from the Analytics API, but not quickly.
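For reference, the raw numbers are available from the management API's stats resource. Below is a minimal sketch of such a query; the org, environment, credentials, and time range are placeholders, and you should verify the exact metric and filter names against the Analytics API reference for your setup. Keep in mind the analytics pipeline has a delay, so this is not a real-time signal.

# Sketch only: pull per-minute response-time and traffic stats for each
# API proxy from the Edge Analytics (stats) API. Org, env, credentials
# and time range are placeholders; verify metric/filter names in the docs.
import requests

ORG, ENV = "my-org", "prod"   # placeholders
BASE = ("https://api.enterprise.apigee.com/v1/organizations/"
        f"{ORG}/environments/{ENV}")

params = {
    "select": "max(total_response_time),sum(message_count)",
    "timeRange": "01/20/2019 00:00~01/20/2019 01:00",  # MM/DD/YYYY HH:MM~...
    "timeUnit": "minute",
    # To count only server errors, a filter such as this can be added:
    # "filter": "(response_status_code ge 500)",
}

resp = requests.get(f"{BASE}/stats/apiproxy", params=params,
                    auth=("me@example.com", "secret"))
resp.raise_for_status()
print(resp.json())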

What you want is API Monitoring. This is a newer feature of Apigee Edge, and it's something we're pretty excited about.

This video shows you how you can define an alert, with your own criteria.

If you rewind to earlier in that video, it shows you what it might look like when an alert actually fires.

Now, as to your criteria.

You said: "5 consecutive requests with > 30s response time"

That isn't quite possible in Apigee Edge, but you can get something BETTER.

The monitoring system is designed to handle higher-throughput APIs, and rather than counting "5 in a row", we opted to allow people to monitor statistically computed latency. This avoids the problem of a single lucky fast response masking otherwise sustained poor performance.

For example: four requests in a row respond in around 450ms, and then one request responds in 72ms. Then another sequence of four slow requests followed by a fast fifth, and then another five following the same pattern, so that the latencies look like this:

[ 449, 432, 461, 443, 72, 481, 465, 452, 441, 86, 441, 447, 448, 455, 81 ]

Suppose your "good response time" threshold is 100ms. Under your proposed rule "5 consecutive requests over <threshold>" , this series of requests would not fire any alert, even though the system is delivering an exceedingly poor experience to callers.

API Monitoring in Apigee Edge uses latency percentiles - P50, P95, P99, meaning the 50th, 95th, and 99th percentiles - as triggers. In English, the P99 is the latency threshold under which 99% of requests respond. API program managers commonly use this statistical value to track service quality.

To compute the percentiles, you need to measure the latency for a set of requests, let's say all requests in a given hour. Then sort that list of N elements from smallest to largest. For a given percentile P, the (1-based) rank n of the value in the sorted list that represents the Pth percentile is:

n = ceiling((P / 100) x N)

And then the latency at that percentile is the value in the sorted list at that index. Does that make sense? Let's look at a specific example. Suppose over a given time interval, you have 15 requests, with these measured latencies in milliseconds:

[ 141, 142, 233, 384, 225, 216, 267, 198, 349, 410, 111, 152, 173, 284, 168 ]

The sorted list is:

/*  1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 */
[ 111, 141, 142, 152, 168, 173, 198, 216, 225, 233, 267, 284, 349, 384, 410 ]

You can then compute the percentiles for that time interval like this:

pct   formula               n    latency value (ms)
50    ceil((50/100) * 15)    8   216
75    ceil((75/100) * 15)   12   284
90    ceil((90/100) * 15)   14   384

In English, "50% of the calls finished in 216ms or less". and "75% of the calls finished in 284ms or less". Or more succinctly, "P50 latency was 216ms" and "P75 was 284ms" .

Now let's consider my contrived "poor performance" response time data set from above. Unsorted, the latencies are like this:

[ 449, 432, 461, 443, 72, 481, 465, 452, 441, 86, 441, 447, 448, 455, 81 ]

Sorted, the list looks like this:

/* 1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 */
[  72,  81,  86, 432, 441, 441, 443, 447, 448, 449, 452, 455, 461, 465, 481 ]

Therefore the percentile table for this data set looks like this:

pct   formula               n    latency value (ms)
50    ceil((50/100) * 15)    8   447
75    ceil((75/100) * 15)   12   455
90    ceil((90/100) * 15)   14   465

So we can see that even the 50th percentile response time is well over the hypothetical 100ms threshold we established above.

Using latency percentiles to monitor the performance behavior of the system works better when your system serves anywhere from tens to millions of calls and you want to make sure that you have generally good service, governed by a particular pre-defined response-time threshold. A simple rule like "5 in a row" might allow poor service to go unnoticed.

You also said you wanted to be alerted for "any 5xx responses".

That is supported right out of the box with API Monitoring in Apigee Edge.

Many thanks for your advice Dino.

Unfortunately, the requirement to define an outage as 5 consecutive responses >30 seconds (or any combination of 5xx errors and >30 second responses within 5 consecutive responses) has been dictated by the EBA (European Banking Authority) in relation to our Open Banking implementation. I agree that it is an odd method to determine service availability, but it is to be used more as a retrospective method of calculating availability than as a method of alerting that the service is performing poorly. As a provider I/we need to report on metrics such as these.

Is there any way that you know of to get this information from the Analytics API, given that there doesn't appear to be a way of doing so from API Monitoring?

Thanks

Mike