Solved: Has anyone encountered a "hung client" even though Apigee returns OK using trireme-jdbc based proxy?

07-20-2015 01:02 PM

We have a strange situation where every other response from our API doesn't make it back to the client. The response size is reasonable at about 120K. Running a trace, we see that the response comes back from Apigee and a 200 OK is shown, but the client never gets the result. We tried in Postman and curl with the same results. It's pretty repeatable.

I've looked at the system.log files for the Message Processor and they show "returning 250 rows", which is the limit I set in the API call. I've also downloaded the trace results and they look fine.

It works fine when running Trireme on my Mac, just not when it's in Apigee.

We are running v4.15.04.00.

It's as if Apigee lost the connection to the client somehow. Has anyone else seen this?
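For anyone wanting to reproduce the every-other-call pattern described above, a small loop along these lines can show which attempts hang. This is only a sketch: `run_calls` and `probe` are names made up here, the 30-second timeout is an arbitrary choice, and the URL in the usage note is a placeholder rather than the real endpoint.

```shell
# Run a command COUNT times and report which attempts fail.
# Usage: run_calls COUNT COMMAND [ARGS...]
run_calls() {
  rc_count="$1"; shift
  rc_i=1
  while [ "$rc_i" -le "$rc_count" ]; do
    if "$@"; then
      echo "call $rc_i: ok"
    else
      # $? still holds the exit status of the command tested by "if"
      echo "call $rc_i: FAILED (exit $?)"
    fi
    rc_i=$((rc_i + 1))
  done
}

# Hypothetical probe: fetch a URL, discard the body, give up after 30 seconds.
probe() {
  curl -sS --max-time 30 -o /dev/null "$1"
}
```

It could be run as `run_calls 6 probe "https://api.example.com/your/proxy/path"` (placeholder URL). curl exits with code 28 when `--max-time` is exceeded, so a hung response would show up as `FAILED (exit 28)`.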


Are you saying that the Message Processor logs don't show any problem? Can you enable debug mode on both the Routers and the MPs? You can enable debug mode using the management APIs.


The MP logs look OK and the Trace downloads look OK. I didn't do that for the Router. I turned on debug mode by:

1. Editing logback.xml in <install_root>/apigee4/conf/apigee/message-processor

Change log.level references from INFO to DEBUG and restart the MPs.

2. Adding <EnvironmentVariable name="NODE_DEBUG">net</EnvironmentVariable> to the Target endpoint config.

By using the Management API to turn on debug, are you referring to "Create a debug session"?


You can use the APIs below. The port is 8080 on Management Servers, 8081 on Routers, and 8082 on MPs.

To Enable

curl -v -u sysadmin_user:sysadmin_pass -X POST "http://localhost:8080/v1/logsessions?session=logsession_name"

To Download

curl -v -u sysadmin_user:sysadmin_pass -X GET "http://localhost:8080/v1/logsessions/logsession_name" -o /tmp/debug.zip

Then unzip /tmp/debug.zip and view the log.

To disable

curl -v -u sysadmin_user:sysadmin_pass -X DELETE "http://localhost:8080/v1/logsessions/logsession_name"
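The three calls above can be wrapped so that one command starts a log session, runs a test, downloads the logs, and cleans up. This is a rough sketch rather than an official tool: the function names, the `BASE_URL` and `CREDS` variables, and their defaults are my own, and only the endpoints come from the commands above.

```shell
# Defaults are assumptions; override them per node (8080 mgmt, 8081 Router, 8082 MP).
BASE_URL="${BASE_URL:-http://localhost:8081}"
CREDS="${CREDS:-sysadmin_user:sysadmin_pass}"

start_session()    { curl -sS -u "$CREDS" -X POST "$BASE_URL/v1/logsessions?session=$1"; }
download_session() { curl -sS -u "$CREDS" -X GET "$BASE_URL/v1/logsessions/$1" -o "$2"; }
stop_session()     { curl -sS -u "$CREDS" -X DELETE "$BASE_URL/v1/logsessions/$1"; }

# capture NAME OUTFILE COMMAND [ARGS...]
# Starts the log session, runs COMMAND, then downloads and deletes the session.
capture() {
  cap_name="$1"; cap_out="$2"; shift 2
  start_session "$cap_name" || return 1
  "$@"
  cap_rc=$?
  download_session "$cap_name" "$cap_out"
  stop_session "$cap_name"
  # Hand back the exit status of the wrapped command
  return "$cap_rc"
}
```

For example, `capture logsession_name /tmp/debug.zip ./run_test.sh` would collect Router debug logs around one test run, where `./run_test.sh` is a placeholder for whatever triggers the API call.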


Totally do this. I would be interested to see if the connection is being timed out at the Router after the Message Processor thinks it has sent the response. Look for a connection reaper message containing the client IP address at the same time the client finally times out.

We had some problems with this sort of behavior, which COINCIDENTALLY was also with a Node.js proxy, but I don't want to get jumpy and see my latest pain behind every rock :)


Ah hah!

First, thanks for the tips on the log sessions; they are way easier to use than the other approach. I scripted it so I can run a debug session in one go. I still get lots of other noise in the log files, but I digress.

So I ran the test with 6 API calls, and tests 2 and 4 failed. I then split the log files by test so I could analyze them; my brain doesn't handle big data well.

At the end of ro-test2.log (failed API call) I see:

state changed from REQUEST_COMPLETE_AND_RESPONSE_IN_PROGRESS to CLOSED on event TARGET_TIMED_OUT

In ro-test3.log (successful API call) I see:

state changed from REQUEST_COMPLETE_AND_RESPONSE_IN_PROGRESS to START on event RESPONSE_COMPLETE

At the bottom of the mp-test2.log I see lots of:

12:56:14.868 [Trireme: merck__test__launchpad__launchpad.js] DEBUG i.a.t.core.internal.ScriptRunner - mainLoop: sleeping for 717 pinCount = 33

12:56:15.587 [Trireme: merck__test__launchpad__launchpad.js] DEBUG i.a.t.core.internal.ScriptRunner - mainLoop: sleeping for 58025 pinCount = 33

So it appears that the Router times out waiting on the Message Processor. This happens even in a config that has them both on the same node.
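Sorting the per-test Router logs by their final state change can be scripted. The sketch below assumes one log file per API call, as with the ro-test*.log files above, and that the two event strings quoted above are the only ones of interest; `classify_router_log` is a made-up name.

```shell
# Print a one-line verdict for a per-request Router log file.
classify_router_log() {
  if grep -q 'on event TARGET_TIMED_OUT' "$1"; then
    echo "$1: target timed out"
  elif grep -q 'on event RESPONSE_COMPLETE' "$1"; then
    echo "$1: response complete"
  else
    echo "$1: no matching state change"
  fi
}
```

Run it over all the split files with `for f in ro-test*.log; do classify_router_log "$f"; done`. A file that mixes several requests would be reported as timed out if any one of them timed out, so it only makes sense after splitting the logs per test.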

OK, so now what :)?


Working with Apigee Support, we changed router.properties to use no limit instead of 10 by setting the following property:

HTTPServer.streaming.buffer.limit=0

Thanks Apigee Support!
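To confirm which value a Router would pick up after the edit, the property can be read back from the file. This is a sketch under assumptions: `get_prop` is a made-up helper, the path to router.properties is passed in because it depends on the install, and it assumes the usual properties-file behavior where the last assignment of a key wins.

```shell
# get_prop FILE KEY
# Print the last value assigned to KEY in a properties file; exit 1 if absent.
get_prop() {
  awk -v key="$2" '
    index($0, key "=") == 1 { val = substr($0, length(key) + 2); found = 1 }
    END { if (found) print val; else exit 1 }
  ' "$1"
}
```

For example, `get_prop /path/to/router.properties HTTPServer.streaming.buffer.limit` should print `0` once the change is in place. The Router presumably still has to be restarted for the setting to take effect, which is worth confirming with Support.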


I had a feeling this was going to be the problem, but I did not want to seem too jumpy about it. I'm glad you got this resolved.

If you have any idea on how to monitor trouble related to this please share!


Also, we were told that setting this to 0 turns buffering OFF rather than removing the limit.

Because of that, I am waiting for a 15.04 patch that fixes the bug that was the root cause of this. Make sure you get that detail clarified.


It is listed as fixed in the 15.04.03 patch, which was released on July 24th. I haven't applied the patch yet.


Yup, we are in the process of testing this now. The patch came out late on the 24th.


Please let me know how it goes. I applied the patch but the UI didn't come back up, so I filed a support case.

I ran the revert script and that worked well. It was the first time I got to try that 🙂
