IdleTracker.CloseIdleConnections during load test

Not applicable

This is a long post, so I'll start with the two main questions and then you can see all of the background below:

1. What can I do to get the GC logs to have timestamps? Can we use G1, can we tune? I saw a number of failures (timeouts) with the javascript policies. These normally execute in under 1ms, but were non-deterministically timing-out at 200ms, seemed to line up with full GC?

2. More importantly, my load test started showing a whole bunch of failures, centered around the: IdleTracker.closeIdleConnections() resulting in http 500 failures. Almost like the connection was marked as idle, when in reality it was executing.

Background:

On Prem install version 4.15.07.00

6 routers and message processors (co-hosted) - 50GB memory, 24 core per host

cassandra and zookeeper cohosted on 3 hosts

Test

We load tested with POST requests to a single proxy performing:

* KeyValue map lookup for a configuration setting

* javascript policy to perform some logic based on the K/V map lookup (200ms timeout)

* Api Key verification

* OAuth access token verification

* javascript to rewrite target URL (200ms timeout)

* AssignMessage policy to set a few headers

Each POST had a varying body ranging from 30KB to about 1MB.

The target endpoint was a simple no-op "info" endpoint returning service version, load balanced across 9 processing nodes.

Ramped up from 200 users to 1000 users over 10 minutes, peaking at 6000 requests/sec.

Network route: Load Balancer -> Apigee -> Load Balancer (VIP) -> 9 processing target nodes

Results:

* I first encountered a network blip between apigee and the VIP resulting in a bunch of 503s under light load.

** I don't think this was an apigee problem as the tests were re-run with a much heavier load and this problem was not seen again.

* I saw non-deterministic javascript execution timeouts. These timeouts seemed to correspond to full GC cycles, but the apigee GC logs have no timestamps?

* The more troubling issue that was detected occurred at the end of our tests under the highest load. I started to see a flurry of 500 errors:

** 2016-02-18 01:28:27,954 org: env: NIOThread@17 INFO NIO - IdleTracker.closeIdleConnections() : removing 43 idle channels from list of 3657

** 2016-02-18 01:28:27,966 org: env: NIOThread@6 INFO NIO - IdleTracker.closeIdleConnections() : removing 48 idle channels from list of 240

These error messages seemed to line up with various "broken pipe" type http errors that the client was reporting and show up in the apigee system log for the MP.

(excerpt):

2016-02-18 01:28:27,954 org: env: NIOThread@17 INFO NIO - IdleTracker.closeIdleConnections() : removing 43 idle channels from list of 3657 2016-02-18 01:28:27,966 org: env: NIOThread@6 INFO NIO - IdleTracker.closeIdleConnections() : removing 48 idle channels from list of 240 2016-02-18 01:28:27,969 org: env: NIOThread@6 ERROR HTTP.CLIENT - HTTPClient$Context.onInputException() : ClientChannel[C:172.18.145.43:80 Remote host:0.0.0.0:53720]@123201 useCount=5481 bytesRead=0 bytesWritten=1801 age=1241163ms lastIO=13ms .onExceptionRead exception: {} java.lang.NullPointerException: null at com.apigee.nio.channels.ChunkedInputChannel.doRead(ChunkedInputChannel.java:79) ~[nio-1.0.0.jar:na] at com.apigee.nio.channels.InputChannel.read(InputChannel.java:97) [nio-1.0.0.jar:na] at com.apigee.protocol.http.TrackingInputChannel.doRead(TrackingInputChannel.java:40) ~[http-1.0.0.jar:na] at com.apigee.nio.channels.InputChannel.read(InputChannel.java:97) [nio-1.0.0.jar:na] at com.apigee.nio.util.Bytes.readChunk(Bytes.java:155) [nio-1.0.0.jar:na] at com.apigee.protocol.http.io.BodyPipe.onWrite(BodyPipe.java:148) [http-1.0.0.jar:na] at com.apigee.protocol.http.io.BodyPipe.start(BodyPipe.java:100) [http-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:258) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:240) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:235) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:230) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.runtime.execution.EnqueueToParsePayload$1.run(EnqueueToParsePayload.java:68) [message-processor-1.0.0.jar:na] at com.apigee.nio.NIOSelector.runTasks(NIOSelector.java:317) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector.access$000(NIOSelector.java:29) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:265) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:256) [nio-1.0.0.jar:na] at com.apigee.nio.util.NonNullIterator.computeNext(NonNullIterator.java:16) [nio-1.0.0.jar:na] at com.apigee.nio.util.AbstractIterator.hasNext(AbstractIterator.java:36) [nio-1.0.0.jar:na] at com.apigee.nio.handlers.NIOThread.run(NIOThread.java:59) [nio-1.0.0.jar:na] 2016-02-18 01:28:27,969 org: env: NIOThread@6 INFO HTTP.CLIENT - HTTPClient$Context.logContextDetails() : Request details : host=null path=/_internal/info/ method=GET. Channel details : Bytes read=0 2016-02-18 01:28:27,970 org: env: NIOThread@6 ERROR ADAPTORS.HTTP.FLOW - AbstractResponseListener.onException() : AbstractResponseListener.onError(HTTPResponse@2204e2d1, null) 2016-02-18 01:28:27,971 org: env: NIOThread@6 INFO HTTP.SERVICE - ExceptionHandler.handleException() : Exception java.lang.NullPointerException occurred while writing to channel null 2016-02-18 01:28:27,972 org: env: NIOThread@6 INFO HTTP.SERVICE - ExceptionHandler.handleException() : Exception trace: java.lang.NullPointerException: null at com.apigee.nio.channels.ChunkedInputChannel.doRead(ChunkedInputChannel.java:79) ~[nio-1.0.0.jar:na] at com.apigee.nio.channels.InputChannel.read(InputChannel.java:97) ~[nio-1.0.0.jar:na] at com.apigee.protocol.http.TrackingInputChannel.doRead(TrackingInputChannel.java:40) ~[http-1.0.0.jar:na] at com.apigee.nio.channels.InputChannel.read(InputChannel.java:97) ~[nio-1.0.0.jar:na] at com.apigee.nio.util.Bytes.readChunk(Bytes.java:155) ~[nio-1.0.0.jar:na] at com.apigee.protocol.http.io.BodyPipe.onWrite(BodyPipe.java:148) [http-1.0.0.jar:na] at com.apigee.protocol.http.io.BodyPipe.start(BodyPipe.java:100) [http-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:258) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:240) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:235) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.adaptors.http.message.HttpTransportMessage.parsePayload(HttpTransportMessage.java:230) [http-adaptor-1.0.0.jar:na] at com.apigee.messaging.runtime.execution.EnqueueToParsePayload$1.run(EnqueueToParsePayload.java:68) [message-processor-1.0.0.jar:na] at com.apigee.nio.NIOSelector.runTasks(NIOSelector.java:317) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector.access$000(NIOSelector.java:29) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:265) [nio-1.0.0.jar:na] at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:256) [nio-1.0.0.jar:na] at com.apigee.nio.util.NonNullIterator.computeNext(NonNullIterator.java:16) [nio-1.0.0.jar:na] at com.apigee.nio.util.AbstractIterator.hasNext(AbstractIterator.java:36) [nio-1.0.0.jar:na] at com.apigee.nio.handlers.NIOThread.run(NIOThread.java:59) [nio-1.0.0.jar:na] 2016-02-18 01:28:27,973 org: env: NIOThread@6 INFO HTTP.SERVICE - ExceptionHandler.handleException() : Closing the channel TrackingInputChannel(ChunkedInputChannel(ClientInputChannel(ClientChannel[C:172.18.145.43:80 Remote host:0.0.0.0:53720]@123201 useCount=5481 bytesRead=0 bytesWritten=1801 age=1241167ms lastIO=17ms ))) 2016-02-18 01:28:27,973 org: env: NIOThread@6 INFO HTTP.CLIENT - HTTPClient$Context.onInputClose() : Timestamp overwritten for messageid apirmp03.internal.shutterfly.com_CUR4MPGHU_RouterProxy-2-1447453_21, input ClientChannel[C:172.18.145.43:80 Remote host:0.0.0.0:53720]@123201 useCount=5481 bytesRead=0 bytesWritten=1801 age=1241167ms lastIO=17ms 2016-02-18 01:28:27,974 org: env: NIOThread@6 INFO NIO - IdleTracker.closeIdleConnections() : removing 1 idle channels from list of 1 2016-02-18 01:28:27,975 org: env: NIOThread@6 INFO NIO - IdleTracker.closeIdleConnections() : removing 1 idle channels from list of 1 2016-02-18 01:28:27,976 org: env: NIOThread@6 ERROR HTTP.CLIENT - HTTPClient$Context.onInputException() : ClientChannel[C:172.18.145.43:80 Remote host:0.0.0.0:53249]@122884 useCount=9296 bytesRead=0 bytesWritten=19636 age=1887550ms lastIO=17ms .onExceptionRead exception: {} java.lang.NullPointerException: null at com.apigee.nio.channels.ChunkedInputChannel.doRead(ChunkedInputChannel.java:79) ~[nio-1.0.0.jar:na] at com.apigee.nio.channels.InputChannel.read(InputChannel.java:97) [nio-1.0.0.jar:na]

0 5 281
5 REPLIES 5

Not applicable

@Birute Awasthi ( you said I could tag you 🙂 )

Regarding your first question, here are some GC specific JVM options that might help with printing the timestamp and giving you more information,

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

-XX:+PrintGCDateStamps

-XX:+PrintGCApplicationStoppedTime

-XX:+PrintGCApplicationConcurrentTime

Right, I understand the GC options, just wasn't sure where the correct place was to define these.

For question 2: Are you able to use the Trace UI function to capture the API call which results in the 500 error seen on your client?

Unfortunately, the logs above only tell us that the Message Processor thread is waiting to read a response from your target endpoint address 172.18.145.43:80 and we do not log if this exception resulted in a status code of 500. Another suggestion is to enable debug mode on your message processor to get more logs to help with understanding the root cause of the 500 errors.

Note: the INFO messages about closeIdleConnections is normal. The Message Processor would reap connections once the keep alive timeout hits when there is no further request/response on that connection.

I couldn't trace as we were running a load test consisting of about 6000 requests per second. I'll see if we can run the test again in debug mode to collect additional data.