Identify source of 30 second proxy latency

I've recently started getting complaints from clients that proxies which are normally sub-second will occasionally take more than 30 seconds.

I used a custom report to help identify whether the spike is happening on the request or the response side, and it's definitely on the 'request' side of the flow. It's always a spike of roughly 30 seconds.

Here's an example where you can see the request side spike up to 30,000 ms (30 s) while the response side is consistently under 200 ms.

[Attachment: 9637-exhibit-a.png]

This is happening a few times a day across a handful of proxies. We don't know how to recreate it, so I've never been able to catch it while running a trace.

Any advice or suggestions for how I can narrow this down further and identify which specific policy or sharedFlow is creating these delays?

Edit: These proxies are executing in Edge public cloud in a paid org.


Do you have any policies, such as a Service Callout or JavaScript, making external callouts?

You should also look at the analytics relating to your proxy's target. Response processing latency is not target response time; it is time spent in your proxy's response flow, i.e. in its policies.

Hi Dane,

Thanks for taking the time to respond!

Yes, the proxies all use a sharedFlow for validating authentication, and that sharedFlow includes a JavaScript policy. We use this sharedFlow on over 250 proxies, but I'm only seeing this problem on 13 of them, so I'm hesitant to blame the sharedFlow quite yet. It's worth some extra scrutiny, though.

I've been running more custom reports and comparing them with logs, and this does not appear to be a critical problem. Most of the time the client still receives a '200 OK', and it's only happening to about 0.1% of the traffic, so my hunch is that this is related to a component in AWS needing to auto-scale.


I recognize that the "request" and "response" processing latency is time spent within Apigee and not in the targetServer. I could tell from other reports that the backend targets were not the source of these 30-second delays, so my next step was determining whether it was happening in the 'request' or the 'response' flow of the proxy. As seen in exhibit-a, it's definitely the 'request' flow.
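
For what it's worth, the same comparison can also be pulled from the Edge management API rather than the custom report UI. A rough curl sketch is below; the {org}/{env} values and the date range are placeholders, and it assumes basic auth against the public cloud management endpoint:

# Max proxy-side latencies vs. max target response time, per API proxy, in hourly buckets (times are UTC):
$ curl -u "$APIGEE_USER" \
  "https://api.enterprise.apigee.com/v1/organizations/{org}/environments/{env}/stats/apiproxy?select=max(request_processing_latency),max(response_processing_latency),max(target_response_time)&timeRange=02/06/2020%2000:00~02/07/2020%2000:00&timeUnit=hour"

Something like that should make it easy to script a recurring check for the 30-second spikes instead of rebuilding the report by hand.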

As per another question of a similar nature,

https://community.apigee.com/questions/71053/finding-one-error-among-thousands-of-successes.html?chi...

Use the https://github.com/yuriylesyuk/ets utility to automate sampling of traces. For each captured trace, identify requests that take more than 30 seconds. You can create a trivial script that parses the trace files and identifies the time spent in each request and each policy.

Invoke the ets utility in a loop and let it run... until you catch your culprit.

I have attached an awk script that extracts the debug request id, policy name, and policy duration.

Warning: it uses the gawk dialect of awk (for the regex handling)!

Command:

$ awk -f trace-policyelapsedtimes.awk trace-1581013540920.xml | awk '$3 > 30'

will print out any policies whose duration is more than 30 seconds.

trace-policyelapsedtimesawk.txt
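
To tie the loop idea and the awk filter together, here is a rough shell sketch of the sampling loop. The ets invocation is a placeholder (substitute the actual command line from the ets README for your org/env/proxy), and the '$3 > 30' filter assumes the third column of the awk output is the elapsed time in the units the script reports:

#!/bin/sh
# Placeholder settings: adjust for your org and environment.
ORG=myorg
ENV=prod

while true; do
  # Capture a trace sample. This line is a placeholder: substitute the
  # actual ets invocation documented in its README.
  ets-capture-command "$ORG" "$ENV"

  # Flag any policy in the downloaded trace files that ran longer than
  # the threshold, tagging each hit with the trace file it came from.
  for f in trace-*.xml; do
    awk -f trace-policyelapsedtimes.awk "$f" \
      | awk -v trace="$f" '$3 > 30 { print trace, $0 }'
  done

  sleep 300   # wait five minutes before the next sampling round
done

Left running during business hours, that should eventually catch one of the 30-second requests and show which policy is eating the time.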

I'll give this a try, thank you!