Resolved: Unable to Deploy Proxies to Message Processor, Call Timeout

We wanted to share an issue and solution we recently encountered. We are on-prem customers running 4.15.07.03. Last week we were unable to deploy any proxy to our planet. During deployment, the management portal would take a long time to respond and eventually show an error saying that the proxy was deployed but traffic might not flow. After querying the management API, we could see that our proxy had been deployed to only one of our two message processors.

/v1/organizations/<org>/environments/<env>/apis/<proxy>/revisions/1/deployments

    "server": [
        {
            "status": "deployed",
            "type": [
                "message-processor"
            ],
            "uUID": "5969f7e8-a32b-45e3-87f5-8b982bc2bf24"
        },
        {
            "error":  "Call timed out;  either  server  is  down  or  server  is  not reachable",
            "status": "error",
            "type": [
                "message-processor"
            ],
            "uUID": "2cfbf65f-4ed0-4f17-99f4-811bce25b39e"
        },
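
For reference, the deployment status above came from a management API call along these lines; the host, credentials, and org/env/proxy names below are placeholders, not values from our environment:

```shell
#!/bin/sh
# Build the management API URL for checking a proxy revision's deployment
# status. All four arguments are placeholders for your own values.
deployment_url() {
  org=$1; env=$2; proxy=$3; rev=$4
  echo "http://localhost:8080/v1/organizations/$org/environments/$env/apis/$proxy/revisions/$rev/deployments"
}

# Usage with hypothetical credentials:
#   curl -u admin@example.com:password "$(deployment_url myorg test myproxy 1)"
deployment_url myorg test myproxy 1
```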

We also saw the following error message in the management server logs...

2016-01-01 00:05:52,222 org:nminternal env:<env> qtp451189693-42887 ERROR DISTRIBUTION - RemoteServicesUnDeploymentHandler.unDeployFromServers() : RemoteServicesUnDeploymentHandler.unDeployFromServers : UnDeployment exception for server with uuid 2cfbf65f-4ed0-4f17-99f4-811bce25b39e : cause = RPC Error 504: Call timed out communication error = true 
com.apigee.rpc.RPCException: Call timed out 
at com.apigee.rpc.impl.AbstractCallerImpl.handleTimeout(AbstractCallerImpl.java:64) ~[rpc-1.0.0.jar:na] 
at com.apigee.rpc.impl.RPCMachineImpl$OutgoingCall.handleTimeout(RPCMachineImpl.java:483) ~[rpc-1.0.0.jar:na] 
at com.apigee.rpc.impl.RPCMachineImpl$OutgoingCall.access$000(RPCMachineImpl.java:402) ~[rpc-1.0.0.jar:na] 
at com.apigee.rpc.impl.RPCMachineImpl$OutgoingCall$1.run(RPCMachineImpl.java:437) ~[rpc-1.0.0.jar:na] 
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:532) ~[netty-all-4.0.0.CR1.jar:na] 
at io.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:430) ~[netty-all-4.0.0.CR1.jar:na] 
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:371) ~[netty-all-4.0.0.CR1.jar:na] 
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_91]

We resolved this issue by recycling both the management server and the impacted message processor.

Thank you @Steven Wolfe for sharing the solution with the community. I am sure it will be helpful for others too.

Thank you for sharing your experience @Steven Wolfe! I, too, have encountered this issue in our various clusters and came up with a similar fix.

NOTE: This behavior was observed with OPDK 4.15.07.01.

The situation improved markedly after we increased the value of rpc.timeout in ${INSTALL_ROOT}/apigee4/conf/apigee/management-server/cluster.properties and restarted the management server services.
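
As a concrete sketch, the change amounts to a properties fragment like the one below. The 40-second value is purely illustrative (not a recommendation), and the path may differ between OPDK versions, so confirm both against your install:

```properties
# ${INSTALL_ROOT}/apigee4/conf/apigee/management-server/cluster.properties
# Raise the RPC call timeout (in seconds); 40 is an example value.
rpc.timeout=40
```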

One of my current projects is to determine how to trigger the configuration refresh manually and avoid service restarts.

@Jason Harrington If you come up with a good way to do this please share.

This is driving me crazy.

Just an FYI that we are also occasionally noticing the partial deploy or undeploy while running 4.16.01, so sadly the issue still exists. I grepped the management server logs and found this error related to the discussion.

UnDeployment exception for server with uuid e93ead52-cbd5-4e87-82fe-96aac9b015f0 : cause = RPC Error 504: Call timed out communication error = true

I saw that the default setting is rpc.timeout=10.

What setting has been working for anyone that has encountered this?

I had increased the value of rpc.timeout to 40 (seconds), per advice from Apigee as part of a support case. It helped with some issues; however, I still have regular timeouts on one MP (which MP appears somewhat random, but I have no statistics to share). YMMV.

Thanks. Have you been able to determine whether the MPs where the RPC calls time out are always in a remote datacenter from the management server? I seem to recall that the logs show the server UUIDs but haven't checked myself. I will be looking closer and can let you know what I find.

Yes, I just saw your post. Our issue was #4.

Thank you for the link, @Benjamin Goldman! I'm still working out which bucket my recent deployment failures fall into. Increasing the timeout just makes it take longer to receive bad news. 😉

Hi,

We encountered a similar issue and were able to resolve it using the following steps.

1) Validate that hostname -i returns the host's IP address, not 127.0.0.1.

2) Restart all servers.

Then verify that each component is registered with the management server using its respective IP address and not 127.0.0.1:

curl -u username:password -X GET http://localhost:8080/v1/servers?pod=central
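
Step 1 above can be sanity-checked with a small script. This is only a sketch that classifies the resolved address; adapt it to your environment:

```shell
#!/bin/sh
# Classify an IP as loopback or routable, to confirm that "hostname -i"
# does not resolve to 127.0.0.1 (see step 1 above).
check_ip() {
  case "$1" in
    127.*) echo "loopback" ;;   # bad: components would register as 127.0.0.1
    *)     echo "routable" ;;   # ok
  esac
}

# Usage on a real host:
#   check_ip "$(hostname -i)"
check_ip "127.0.0.1"   # prints "loopback"
check_ip "10.0.0.5"    # prints "routable"
```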