Apigee policy execution takes a lot of time

farhanukhan · 07-14-2020 11:46 PM

Hi Guys,

We have a problem where, under load, individual policies take a long time to execute (sometimes 500 ms), even if they are being skipped. Can you please point me in the right direction? Initially we thought it was the thread pool, so we increased the pool from 100 to 300, but it made no difference. Memory and processor also look healthy.

Thanks & Regards

dchiesa1

I can make some suggestions.

Apigee is usually part of a distributed system. To diagnose the overall performance of the system, it makes most sense to analyze all the pieces.

To give one simple example: suppose you have a well-designed Apigee API Proxy, and the target for that proxy is a backend system that runs on a legacy Java app server. Suppose that target runs on a pool of VMs, with plenty of CPU and memory. But the Java apps running on the VMs need to access a database or queue, and THAT system is slow, overburdened, and delivers inconsistent performance. We observe that the response latency of the Java app system, when under load, averages 5s, but is sometimes up to 30s.

The full data flow has the client sending a request to Apigee, which then invokes the Java system, which sometimes invokes the queue and sometimes the database. Apigee must wait on the Java system. As Apigee accepts inbound calls, for some it must hold the request context for 5 seconds. For others it must hold the request context for 30 seconds, or maybe more, while it waits on the Java system. In the beginning, when few calls are being buffered by Apigee, the memory management in the Apigee VM is not under stress. But over time, more and more calls stack up in the memory of Apigee. Now the Apigee VM is doing lots of memory management, garbage collection, and so on, while still waiting. Because the VM CPU is doing so much work managing memory, it can have repercussions on other activities, so that even small policies executing in the API proxy will take a long time. It's not the policy that actually consumes the time. It's that the policy gets starved of memory throughput, so it has to wait for the housekeeping work to complete.

This is just one scenario.

The key point here is that the inconsistency in the latency delivered by the backend system can cause contention in Apigee. Lower timeouts in Apigee can help avoid the "stacking up" of pending requests.
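To put rough numbers on the "stacking up" effect, here is a minimal sketch based on Little's Law (average requests in flight = arrival rate x average time each request is held). The 200 requests/second load and the 10-second timeout are made-up figures for illustration, not measurements from any real deployment, and the function names are hypothetical.

```python
def in_flight_requests(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x average time in system."""
    return arrival_rate_per_s * hold_time_s


def in_flight_with_timeout(arrival_rate_per_s: float,
                           backend_latency_s: float,
                           timeout_s: float) -> float:
    """A timeout caps how long the proxy holds any single request context."""
    return in_flight_requests(arrival_rate_per_s, min(backend_latency_s, timeout_s))


if __name__ == "__main__":
    rate = 200.0     # hypothetical inbound load, requests per second
    timeout = 10.0   # hypothetical target timeout, seconds

    for latency in (0.2, 5.0, 30.0):
        uncapped = in_flight_requests(rate, latency)
        capped = in_flight_with_timeout(rate, latency, timeout)
        print(f"backend latency {latency:5.1f}s: "
              f"~{uncapped:,.0f} contexts held, ~{capped:,.0f} with a {timeout:.0f}s timeout")
```

The point of the arithmetic: at the same request rate, a backend that takes 30s instead of 5s means six times as many request contexts sitting in memory at once. A lower timeout bounds that number, at the cost of failing the slow calls.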

In another scenario, the number of requests is not necessarily large, but the request or response payloads are very large. In that case, the API traffic can saturate the network. In this scenario, it does not matter how much memory or CPU is available to the Apigee VM: the I/O channels are full, so the memory and CPU are underutilized. We would say this is an I/O constrained system.

To determine if your system is I/O constrained, you need to analyze the volume of data sent in requests and responses, and compare that against the network capacity you have for your VM. If you have a virtualized network, keep in mind that the full rated capacity of the network may be shared across many VMs. That means a database system may be generating traffic on the same network, which may cause the Apigee VMs to wait or suspend. I wrote about this in more detail one week ago. When this happens, again, the symptom you *may* see is that a simple policy in Apigee takes a long time to run. The problem is not in the policy. The problem may be in the network.
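As a rough way to do that comparison, here is a back-of-the-envelope sketch. It assumes a proxy receives both the request and the response and sends both onward, so each direction of the link carries approximately request + response bytes per call. It ignores protocol overhead, compression, and other tenants on a shared network. All the numbers and the function name are hypothetical.

```python
def network_utilization(requests_per_s: float,
                        avg_request_bytes: float,
                        avg_response_bytes: float,
                        link_capacity_bits_per_s: float) -> float:
    """Estimated fraction of link capacity used, per direction, by API payloads."""
    bits_per_s = requests_per_s * (avg_request_bytes + avg_response_bytes) * 8
    return bits_per_s / link_capacity_bits_per_s


if __name__ == "__main__":
    # Hypothetical workload: 50 calls/s, 10 KB requests, 2 MB responses,
    # on a link nominally rated at 1 Gbit/s.
    utilization = network_utilization(
        requests_per_s=50,
        avg_request_bytes=10_000,
        avg_response_bytes=2_000_000,
        link_capacity_bits_per_s=1e9,
    )
    print(f"estimated link utilization: {utilization:.1%}")
    if utilization > 0.7:  # arbitrary threshold chosen for this illustration
        print("payload volume alone is close to the rated capacity; look at I/O first")
```

If the estimate is already a large fraction of the rated capacity, and that capacity is shared with other VMs, the network is a likely suspect even when CPU and memory look idle.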

So, analyzing the performance of distributed systems is complicated. My suggestion is to take a careful look at all the linkages and consider latency and network I/O at every point in the chain.

A metaphor you may find useful is a train, made up of a number of train cars connected together. It is not possible for one car in that train to move faster than another. The actors in the data path of a distributed system are like that. With a client, Apigee, a backend app, and a backend datastore in the train... the slowest system determines the performance of the whole system. If the client cannot SEND data into Apigee quickly, then everything will be slow. If the backend system waits a long time before responding, then everything will stack up.

What cars are in your train?

farhanukhan

Thanks Dino for the valuable input. I'll share your recommendations with the team. We actually use APIGEE quite heavily and it's an on-prem deployment.