Getting a broken pipe error in system.log

2018-06-25 08:00:43,268 org:srv env:prod api:-api rev:2 messageid:domain.srv.com-8680-196079-1 NIOThread@1 ERROR HTTP.SERVER - HTTPServer$Context$2.onException() : Message id:domain.srv.com-8680-196079-1 ClientOutputChannel(ClientChannel[A:10.10.10.12:8998 Remote host:10.10.10.12:49780]@63462 useCount=1 bytesRead=0 bytesWritten=300 age=18540ms lastIO=1ms ).onExceptionWrite exception:{}
java.io.IOException: Broken pipe
    at com.apigee.nio.channels.ClientOutputChannel.writePending(ClientOutputChannel.java:51) ~[nio-1.0.0.jar:na]
    at com.apigee.nio.channels.OutputChannel.onWrite(OutputChannel.java:116) ~[nio-1.0.0.jar:na]
    at com.apigee.nio.channels.OutputChannel.write(OutputChannel.java:81) ~[nio-1.0.0.jar:na]
    at com.apigee.protocol.http.io.BodyPipe.onWrite(BodyPipe.java:115) [http-1.0.0.jar:na]
    at com.apigee.protocol.http.io.BodyPipe.start(BodyPipe.java:104) [http-1.0.0.jar:na]
    at com.apigee.protocol.http.io.MessageWriter.onWrite(MessageWriter.java:171) [http-1.0.0.jar:na]
    at com.apigee.protocol.http.io.MessageWriter.<init>(MessageWriter.java:111) [http-1.0.0.jar:na]
    at com.apigee.protocol.http.HTTPServer$Context$2.<init>(HTTPServer.java:778) [http-1.0.0.jar:na]
    at com.apigee.protocol.http.HTTPServer$Context.send(HTTPServer.java:778) [http-1.0.0.jar:na]
    at com.apigee.messaging.adaptors.http.configuration.MessageProcessorHttpSkeletonFactory$FlowCompletionListener.onFinish(MessageProcessorHttpSkeletonFactory.java:550) [http-adaptor-1.0.0.jar:na]
    at com.apigee.flow.MessageFlowImpl.onCompletion(MessageFlowImpl.java:349) [message-flow-1.0.0.jar:na]
    at com.apigee.flow.MessageFlowImpl.resume(MessageFlowImpl.java:302) [message-flow-1.0.0.jar:na]
    at com.apigee.flow.execution.ExecutionContextImpl$1.run(ExecutionContextImpl.java:96) [message-flow-1.0.0.jar:na]
    at com.apigee.nio.NIOSelector.runTasks(NIOSelector.java:363) [nio-1.0.0.jar:na]
    at com.apigee.nio.NIOSelector.access$000(NIOSelector.java:34) [nio-1.0.0.jar:na]
    at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:311) [nio-1.0.0.jar:na]
    at com.apigee.nio.NIOSelector$2.findNext(NIOSelector.java:302) [nio-1.0.0.jar:na]
    at com.apigee.nio.util.NonNullIterator.computeNext(NonNullIterator.java:21) [nio-1.0.0.jar:na]
    at com.apigee.nio.util.AbstractIterator.hasNext(AbstractIterator.java:47) [nio-1.0.0.jar:na]
    at com.apigee.nio.handlers.NIOThread.run(NIOThread.java:59) [nio-1.0.0.jar:na]
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_171]
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_171]
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_171]
    at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_171]
    at com.apigee.nio.PlainTransport.write(PlainTransport.java:107) ~[nio-1.0.0.jar:na]
    at com.apigee.nio.ClientChannel.write(ClientChannel.java:317) ~[nio-1.0.0.jar:na]
    at com.apigee.nio.channels.ClientOutputChannel.writePending(ClientOutputChannel.java:47) ~[nio-1.0.0.jar:na]
    ... 19 common frames omitted

I am getting the above error when the client tries to connect to the proxy:

The client continues to get errors throughout the day while trying to call the proxy.

Apigee Analytics doesn't report any of these errors.


Check the Apigee Edge troubleshooting guide.

It looks like it might be a timeout between the message processor and... the router? It happens while the proxy is writing the response: some link in the chain has dropped the connection. It could be the router itself (nginx; check the logs there). It could be the end client, i.e. your code.

You said "The client continues to get errors throughout the day while trying to call the proxy."

By "continuous" Do you mean that EVERY request by the client results in an error?

Or do you mean "intermittent" where SOME requests throughout the day result in errors?

I'm guessing you mean the latter.

If that is the case, then I suspect there is a sporadic timeout somewhere in the chain. If the client (your code) has a timeout of 18 seconds, it would hang up while the server (Apigee Edge) was still trying to send a response. I see "age=18540ms" in the log message, which is 18.5 seconds. That's kind of a long time. Maybe the client gave up? In that case, a broken pipe message in the Apigee Edge log file is expected.

Why would a request take 18 seconds? Normally Apigee Edge would not do this by itself. If something internal in Apigee Edge is taking 18 seconds, like a VerifyApiKey or a KVM lookup, then you need to examine the Apigee logs closely for hints as to the cause. Maybe the disk is filling up. Maybe the CPU is saturated. Maybe I/O is causing delays (network saturation?).

More typically a delay in an API request is caused by a delay exhibited by the backend system.

If a client sets a timeout of 18s, then calls Apigee Edge, Apigee Edge proxies to a backend service, and the backend service delays for 20 seconds... then the client will disconnect, Apigee Edge will get a "broken pipe", and everything is happening as expected.
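To make that concrete, here is a minimal, hypothetical client that behaves the way I'm describing: it gives up after 18 seconds, so if the proxy (and its backend) takes 20 seconds, the client disconnects and Apigee Edge logs the broken pipe when it finally tries to write the response. The URL and class name are placeholders, not taken from your setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class ImpatientClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://org-env.apigee.example/api"))  // placeholder URL
                .timeout(Duration.ofSeconds(18))  // client gives up after 18 seconds
                .GET()
                .build();

        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
        } catch (HttpTimeoutException e) {
            // The client hangs up here; the server side later sees a broken pipe
            // when it tries to write the response to the closed socket.
            System.out.println("Gave up after 18s: " + e.getMessage());
        }
    }
}

If a client like this calls a proxy whose target takes 20 seconds, you should see a broken pipe stack trace very much like yours on the MP at roughly the 18-second mark.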

So you need to look a little more closely at the circumstances under which you are seeing these intermittent timeouts.

The Apigee Edge troubleshooting guide may provide some assistance and suggestions for you.

Thanks for the answer.

The client faced frequent (though not continuous) errors due to delays in the target server's responses.

The client had set a timeout of 19 seconds, and the target server was taking more time than expected, which is why this issue was coming up.

We didn't see any trace of these errors in the Apigee proxy/custom reports. All showed 200s.

Only when we checked the Apigee access logs and found that the error code from nginx was 499 did we start looking at blogs that specifically mention the router timing out before the MP + backend server responds within the specified timeframe.
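For reference, here is a rough, hypothetical sketch of the kind of check we ran: tallying status codes in the router's access log. The file path is a placeholder, and the "-499-" pattern just matches the sample access-log line pasted further down in this thread, so both would need adjusting to your own log format.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class AccessLogStatusCount {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("/opt/nginx/logs/access.log");  // placeholder path
        Pattern status = Pattern.compile("-(\\d{3})-");     // matches e.g. "...15.013--499-42700..."
        Map<String, Long> counts = new TreeMap<>();

        try (Stream<String> lines = Files.lines(log)) {
            lines.forEach(line -> {
                Matcher m = status.matcher(line);
                if (m.find()) {
                    // Tally the first three-digit token that looks like a status code.
                    counts.merge(m.group(1), 1L, Long::sum);
                }
            });
        }
        counts.forEach((code, n) -> System.out.println(code + ": " + n));
    }
}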

RCA of the issue:

We verified the default timeouts of the Router (R) and the Message Processor (MP), which are 57 and 55 seconds respectively.

So we figured that the client must have set a timeout of 19 seconds.

Questions:

1) Even in this case, the reports should have shown HTTP 504 error codes, but they didn't. All we saw were 200s.

PS: The Router and MP use the default timeout strategy (57/55 secs); see the sketch after the link below for how these compare with the client's 19-second timeout.

2) What is the best strategy to increase the timeout period of R and MP?

Is it as simple as setting the timeout period to, say, 2 mins for the Router and 1.55 mins for the MP?

Here is the link we referred to; scenario #3 there is what we faced:

https://docs.apigee.com/api-platform/troubleshoot/runtime/504-gateway-timeout
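To tie the RCA numbers together (the client's 19-second timeout versus the default Router 57s and MP 55s quoted above), here is a small illustrative sketch of which hop gives up first. The class and message text are just ours; they are not output from any Apigee tool.

public class TimeoutHierarchy {
    public static void main(String[] args) {
        int clientTimeoutSec = 19;  // set by the calling application (our case)
        int mpTimeoutSec     = 55;  // default Message Processor -> target timeout
        int routerTimeoutSec = 57;  // default Router timeout

        // Whichever hop has the smallest timeout gives up first.
        if (clientTimeoutSec < mpTimeoutSec && clientTimeoutSec < routerTimeoutSec) {
            System.out.println("Client disconnects first -> 499 in the router access log, "
                    + "broken pipe on the MP, and no 504 is generated.");
        } else if (mpTimeoutSec < routerTimeoutSec) {
            System.out.println("MP times out first -> it can return a 504 that the router "
                    + "forwards to the client.");
        } else {
            System.out.println("Router times out first -> the client sees a router-generated 504 "
                    + "while the MP may still be working.");
        }
    }
}

With 19 < 55 < 57, the client always disconnects first, which matches the 499s we saw in the access log and the broken pipe on the MP.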

In question 1, you said "all we saw were 200s." That's not a question. I guess your question is, "WHY? I expect to see 504s; why do I see only 200s?"

If that is your question, I cannot answer it because I'm not clear on the situation.

As you diagnose the situation, remember this: analytics records get created by logic in the MP. If nginx is not able to contact the MP, then there is no AX record. I don't know how the 499s relate to the 504s or the 200s. You have better insight into that than I do.

Q2. Should you raise the timeouts?

No. I highly advise against it. I don't believe it's a good idea to raise the timeout for API requests to 2 minutes. A 60s timeout is a good, safe level. By raising timeouts, you are treating the symptom, not the cause, of the problem. Higher timeouts can lead to amplified, cascading problems, especially at scale. I think you have other problems in the system, and rather than raising timeouts, you should be working on ways to get the APIs (including the backend) to respond in less than 60 seconds. In fact, much, much less than 60 seconds.

Here is the nginx log:

2018-06-25T18:00:16-04:00 xx.com 192.xx.xxx.xxx:431 xxx.xxx.xxx.xxx:8998 15.013--499-42700 GET /api

From the logs, it seems the port is 8998. I assume it's coming in at the router, though I'm not sure.

In case it helps.

Q2, continued: Ideally, we should be able to get the response within 10 seconds at most.

But situations like this, especially when we are in production and the target server team needs a cushion, compel us to rethink our strategy and increase the timeout period for a while.

By cascading effect, do you mean there are other components that may also need timeout configuration?

"Cacading effect" means other systems will fail.

Suppose you have a throughput of 10 requests per second, with an average response time of 10 seconds. That means, at any given moment, there are about 100 requests pending (10 per second * 10 seconds). So the MP must maintain 100 buffers to hold the inbound requests, 100 buffers for the responses, and a bunch of other supporting information. If each request is 1k and each response is 100k, now you have 100 * 100k (roughly 10 MB) just in response buffers.

Now suppose some of those requests don't finish in 10 seconds. They finish in 30 seconds. Or 45. That means the number of buffers the MP needs to manage must increase, and the management of those buffers gets more complicated. Memory gets fragmented; things need to move around. It all costs CPU and wall-clock time.

Now suppose some of the requests take 60+ seconds. The 57s timeout is built into the system as a "relief valve". It allows the system to drop the buffers for requests that are stalled. This relieves memory, CPU, and I/O pressure.

Now suppose you raise that timeout to 120 seconds. Now you've tightened the "relief valve". Now the system cannot drop older requests. It needs to continue to hold onto the request and response buffers. This leads to an out-of-memory situation which can cause service outage and denial of service to client apps.
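Here is a rough back-of-the-envelope version of that arithmetic, using the same made-up numbers (10 requests per second, 100k responses). It only illustrates the scaling, not actual MP internals.

public class BufferPressure {
    // pending requests ~= arrival rate * average latency (Little's Law);
    // response memory ~= pending requests * average response size.
    static void estimate(double requestsPerSecond, double avgLatencySeconds, long avgResponseBytes) {
        double pending = requestsPerSecond * avgLatencySeconds;
        double responseBufferMb = pending * avgResponseBytes / (1024.0 * 1024.0);
        System.out.printf("latency %.0fs -> ~%.0f requests in flight, ~%.1f MB of response buffers%n",
                avgLatencySeconds, pending, responseBufferMb);
    }

    public static void main(String[] args) {
        long oneHundredKb = 100 * 1024;
        estimate(10, 10, oneHundredKb);   // the healthy case: ~100 in flight, ~10 MB
        estimate(10, 45, oneHundredKb);   // some requests slow to 45s
        estimate(10, 120, oneHundredKb);  // a 2-minute timeout worst case
    }
}

Going from a 10-second average to a 2-minute worst case means holding roughly 12 times as much response data in memory at any given moment, before counting request buffers and bookkeeping.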

It's a bad idea.

Normally in larger companies there is an architecture or governance team that can advise on these things. A 2-minute timeout for an HTTP request seems... way too high. The server is not fit for online service, in my opinion, if it exhibits this behavior. It needs to be fixed THERE, at that layer, not at the Apigee layer.

If the roof of your house is leaking, you don't try to fix the problem by opening an umbrella! You fix the roof!

btw, port 8998 is the inbound port to the MP. The router uses this to send requests to the MP. You can find this information in the operations guide for OPDK.

If the router is failing to send to the MP on 8998, it is possible you are already experiencing resource issues (denial of service) on the MP. Have you checked the CPU? Memory? I/O?

All of this is described in the Troubleshooting Guide!