Solved: Sporadic Javascript policy timeouts

06-13-2016 05:56 PM

We're seeing about 3% of our traffic through Apigee fail due to JavaScript timeouts:

Execution of Javascript.MyPolicy failed with error: Javascript runtime exceeded limit of 200ms

I realize that we can update our policy config to increase the timeout from the default 200ms, but these policies are very simple (retrieving a context variable and setting another) and they generally complete within 3ms. We've increased the memory allocation of the message processor, but the issue remains. Every one of our JavaScript policies hits this error, from the simplest to the more complex ones.
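For readers who want to see what such a policy looks like, here is a minimal sketch of a script that reads one context variable, sets another, and records its own elapsed time. It is not the original poster's code. `makeStubContext` is a hypothetical stand-in for Edge's `context` object so the snippet runs outside Apigee, and the `custom.*` variable names are made up for illustration.

```javascript
// Hypothetical stand-in for Apigee's `context` object, so the sketch runs outside Edge.
function makeStubContext(initial) {
  var vars = {};
  for (var key in initial) {
    if (Object.prototype.hasOwnProperty.call(initial, key)) {
      vars[key] = initial[key];
    }
  }
  return {
    getVariable: function (name) {
      return Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : null;
    },
    setVariable: function (name, value) {
      vars[name] = value;
    }
  };
}

// The kind of trivial policy body described above: read one variable, set another,
// and store the script's own elapsed time so slow runs can be spotted later.
function runPolicy(ctx) {
  var start = Date.now();
  var verb = ctx.getVariable('request.verb');
  ctx.setVariable('custom.verb_lower', verb ? String(verb).toLowerCase() : '');
  ctx.setVariable('custom.js_elapsed_ms', Date.now() - start); // assumed variable name
}

var ctx = makeStubContext({ 'request.verb': 'GET' });
runPolicy(ctx);
```

Inside Edge the body of `runPolicy` would run directly against the real `context`. Note that the elapsed-time variable is only set on runs that finish, so a run killed by the time limit leaves no measurement behind.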

Any suggestions? The performance is generally good, but it just spikes randomly.

[edit] We're using OPDK 4.15.07.00. In particular, the output of "get-version.sh" is:

Installed          Current Version
Apigee Enterprise  1.0.0.1078.fe7934c.1509010011        
Apigee UI          4.15.07.00-768b187-20151006-203453  Version 4.15.07.00-75d1384-20150823-004603 is available for upgrade
                                                       /opt/apigee4/share/installer/apigee-upgrade.sh -c ui
Cassandra          2.0.15                               
Zookeeper          3.4.5                                
QPID               0.14                                 
Postgres           9.3


After working with Apigee Support, we updated to use G1GC and increased the router & message-processor heap allocation. This has eliminated 99.9% of Javascript Timeouts.


anilsr

@Eric Dahl, is this on on-premises Apigee Edge? On which version do you see this issue?

DChiesa

Yes, see Anil's question. Or, if you are in the Edge public cloud, is it a paid organization or a trial organization?

Added our OPDK version info to the question.

DChiesa

It may be undesirable, but you could just raise the limit beyond 200ms to handle the 3% case. It's what I do if I run into such problems. Better to give a delayed response than no response.


So two orders of magnitude increase (from under 3ms to 200ms) is not enough? It should be increased further?

DChiesa

I don't know about a two-orders-of-magnitude increase. I'm just trying to explain my recommendation: better to give a delayed response than no response. As for why it sometimes takes more than 200ms, that is difficult to answer. One would need to examine the conditions on the MP node. You said you increased memory, but it's possible either that memory is not the problem, or that increasing memory merely delays the onset of symptoms. Without doing a benchmark-type analysis, it will be difficult to know the reason for the delay.

For example, it could be that your system is under significant load when you see the delay, in which case all calls are subject to the effects of queuing and contention, even calls to JavaScript callouts. It could be that the system is not under much load at all, which means that caches are not warm, connections are not pooled, and everything takes extra time. It could be that there is too much heap available for the MP, and when a garbage collection event occurs, it "stops the world" for too long. (Tuning a JVM for optimal GC behavior could be the subject of a doctoral thesis.) Or it could be something else.

I won't be able to diagnose over the internet, but I can suggest as a start:

- check machine vitals on the MP - CPU, network, memory

- examine the similar stats on the supporting systems including C* and backend targets

- collect stats on every transaction to produce charts, to see how the variation changes over time. Does it grow steadily only to sink suddenly?

- connect a jconsole or similar to the MP JVM and examine GC events. Do they correlate with the delays in your JS Code?
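As one way to act on the "collect stats on every transaction" suggestion above, the sketch below summarises a batch of per-call script durations against a time limit. The `summarize` helper and the sample numbers are made up for illustration. In practice the durations would come from whatever per-transaction timing you record.

```javascript
// Summarise per-call script durations (in ms) against a time limit.
function summarize(durationsMs, limitMs) {
  if (durationsMs.length === 0) {
    throw new Error('no samples');
  }
  // Sort a copy numerically so the caller's array is left untouched.
  var sorted = durationsMs.slice().sort(function (a, b) { return a - b; });

  // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
  function percentile(p) {
    var rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  var overLimit = sorted.filter(function (d) { return d > limitMs; }).length;
  return {
    count: sorted.length,
    p50: percentile(50),
    p99: percentile(99),
    max: sorted[sorted.length - 1],
    overLimitRatio: overLimit / sorted.length
  };
}

// Made-up sample: mostly fast runs with a single spike.
var samples = [3, 4, 3, 2, 5, 3, 812, 4, 3, 3];
var stats = summarize(samples, 200);
console.log(stats);
```

A low median combined with a very high maximum, as in this sample, points at occasional pauses rather than a slow script. That is the pattern worth lining up against GC events on the MP.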

Good luck!

kurtkanaskie

@Eric Dahl, @Dino Hmmm, seeing this too just recently. Sometimes the script runs in 4ms and others over 800ms. This is for the same GET /ping request run 20 times consecutively via JMeter. The script is checking headers and params for XSS and is not that long (< 40 lines).


Hope this helps: we faced similar problems in our organization, and the cause was a poorly configured Cassandra. If you are accessing KVMs or other data structures, please check this.



tuhinsubhramaha

We are facing a similar issue. Could you please help us? Which G1GC configuration did you change?

More details on using G1 on top of Java 8 are available at "http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html".

To enable the property, open the system.properties file located in /{INSt_DIR}/apigee/edge-message-processor/conf and change "useG1GC" from false to true.