Gateway timeout error

We received the below error in production. Based on the logs, the request times out at the Router level after about 57 seconds. During the support call it was indicated that the request reaches the Router and the Router connects to the MP as well, but since the MP does not respond within a specific time, the Router closes the connection and sends back a 504 response to the client.

Client received below error:

{"fault":{"faultstring":"Gateway timeout","detail":{"code":"GATEWAY_TIMEOUT"}}}

Router log:

2016-07-02 20:31:34,756  Router-ClientThread-0-0 ERROR Proxy-session - RouterProxySession$ClientContext.onTimeout() : Message Id: xxx-xx-xxx.xxx.xxx_BVYhDJbU_RouterProxy-6-269973_9 Session state: CLOSED MP channel [id: 0xb5e0a114, 0.0.0.0/0.0.0.0:30342 :> xxx-xx-xxx.xxx.xxx/yy.aaa.b.xx:8998] timed out

Router Timeout.

==

ServerContainer.io.timeout.millis=58000

==

Support recommended increasing the above value. The issue is intermittent and hard to replicate.
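For reference, the change would just be raising this value, along the lines of the below (120000 here is only an example value, and the exact properties file where this is overridden depends on the Edge version, which we still need to confirm with Support):

==

ServerContainer.io.timeout.millis=120000

==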

How can we identify why the MP is not responding within the expected time? We can't enable debug in production. We are trying to replicate the issue in non-prod, but is there a better solution, or should we proceed with tuning the above parameter?

Please suggest.

-Vinay



@vinay poreddy what do you see in the trace?


If you see the 504 in the trace, it implies that the MP is timing out waiting for a response from the backend/upstream.

If you see a lot of time being spent in any policy, it implies an issue with your code or with that particular MP, and that is why the Router is timing out.

Trace should provide you more info.
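Also, since you can't enable debug in prod, one thing that can help is to time the backend directly (outside of Edge) with roughly the same timeout the Router uses, so you can tell whether the slowness is in the backend or inside Edge. A rough sketch in Java; the URL and timeout values below are placeholders, not your actual config:

import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class BackendLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - point this at the backend the MP calls, not at Edge
        URL url = new URL("https://backend.example.com/v1/xxxx/yyyy");
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);   // TCP connect budget (placeholder)
        conn.setReadTimeout(58000);     // same order as ServerContainer.io.timeout.millis
        try {
            int status = conn.getResponseCode();   // blocks until response headers arrive
            long elapsedMs = (System.nanoTime() - start) / 1000000;
            System.out.println("HTTP " + status + " after " + elapsedMs + " ms");
        } catch (SocketTimeoutException e) {
            long elapsedMs = (System.nanoTime() - start) / 1000000;
            System.out.println("Timed out after " + elapsedMs + " ms");
        } finally {
            conn.disconnect();
        }
    }
}

Running this in a loop during the window when the 504s show up gives you a timeline of backend latency to compare against the Router timestamps.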

We verified the load information but didn't see any issue with load. The transaction didn't even get handed off to the MP; it timed out at the Router level because the MP didn't respond. It is hard to debug with trace as it has happened only once.

We are looking for ways to trap this and identify the root cause.

@vinay poreddy,

If the Router timed out before the MP responded, then you will most likely see the below exception in the MP logs:

java.nio.channels.ClosedChannelException

Can you please check whether the above exception appears in the MP logs at the time the 504 error occurred?

If you do see it, you might be able to get the message_id for the request, and with that you can query the PG database to check the target response time and the request processing time to see where most of the time was spent.
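Something along these lines, for example. The JDBC URL, credentials and the fact table name below are placeholders (the actual table name depends on your org/env and installation), and the column I am filtering on (gateway_flow_id) is an assumption, so use whichever id field your fact table actually has:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AxLatencyLookup {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust JDBC URL, credentials and table name for your installation
        String url = "jdbc:postgresql://pg-host:5432/apigee";
        String table = "analytics.\"test.prod.fact\"";   // assumption: <org>.<env>.fact
        String sql = "SELECT gateway_flow_id, total_response_time, "
                   + "request_processing_latency, target_response_time "
                   + "FROM " + table + " WHERE gateway_flow_id = ?";
        try (Connection conn = DriverManager.getConnection(url, "apigee", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "<gateway_flow_id of the failed request>");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)
                            + " total=" + rs.getLong(2)
                            + " request_processing=" + rs.getLong(3)
                            + " target=" + rs.getLong(4));
                }
            }
        }
    }
}

If request_processing_latency is large and target_response_time is small (or empty), the time was spent inside Edge rather than in the backend.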

Regards,

Amar

We didn't get the below error:

java.nio.channels.ClosedChannelException

From analytics we found the below record, and there is no message_id. Note that all the target_* timestamps are empty, and the gap between client_received_end_timestamp (02:30:37.754) and client_sent_start_timestamp (02:31:34.755) is about 57 seconds, which lines up with the router-level timeout we saw in the logs.

organization  | test
environment | prod
apiproxy | (not set)
request_uri | /v1/xxxx/yyyy?apikey=abc
proxy | (not set)
proxy_basepath | (not set)
request_verb | GET
request_size | -1
response_status_code | 504
is_error | 1
client_received_start_timestamp | 2016-07-03 02:30:31.58
client_received_end_timestamp | 2016-07-03 02:30:37.754
target_sent_start_timestamp |
target_sent_end_timestamp |
target_received_start_timestamp |
target_received_end_timestamp |
client_sent_start_timestamp | 2016-07-03 02:31:34.755
client_sent_end_timestamp | 2016-07-03 02:31:34.755
client_ip | (not set)
access_token | (not set)
client_id | (not set)
developer | (not set)
developer_app | (not set)
api_product | (not set)
flow_resource | (not set)
target | (not set)
target_url | (not set)
target_host | (not set)
apiproxy_revision | (not set)
proxy_pathsuffix | (not set)
proxy_client_ip | (not set)
target_basepath | (not set)
client_host | (not set)
target_ip | (not set)
request_path | /v1/xxxx/yyyy?apikey=abc
response_size | 79
developer_email | (not set)
virtual_host | (not set)
gateway_flow_id | xxx-xx-xxx.xx.yyy_dfdfd_RouterProxy-6-11
sla |
message_count | 1
total_response_time | 63175
request_processing_latency |
response_processing_latency |
target_response_time |
cache_hit |
x_forwarded_for_ip | (not set)
useragent | Java1.6.0_29
target_response_code |
groupid |
target_error |
policy_error |
ax_ua_device_category | Other
ax_ua_agent_type | Library
ax_ua_agent_family | Java
ax_ua_agent_version | 1.6
ax_ua_os_family | JVM
ax_ua_os_version | 1.6
ax_geo_city | (not set)
ax_geo_country | (not set)
ax_geo_continent | (not set)
ax_geo_timezone | (not set)
ax_session_id | (not set)
ax_market_id | (not set)
ax_partner_id | (not set)
ax_channel_id | (not set)
ax_business_unit_id | (not set)
ax_traffic_referral_id | (not set)
ax_device_id | (not set)
ax_client_org_name | (not set)
ax_client_app_name | (not set)
ax_client_request_id | (not set)
ax_created_time | 2016-07-03 02:31:35.10237
ax_hour_of_day | (not set)
ax_day_of_week | (not set)
ax_week_of_month | (not set)
ax_month_of_year | (not set)
gateway_source | router
ax_cache_executed |
ax_cache_name | (not set)
ax_cache_key | (not set)
ax_cache_source | (not set)
ax_cache_l1_count |
ax_edge_execution_fault_code | GATEWAY_TIMEOUT
ax_edge_is_apigee_fault | 1
ax_dn_region | (not set)
ax_execution_fault_policy_name | (not set)
ax_execution_fault_flow_name | (not set)
ax_execution_fault_flow_state | (not set)

This is not an answer, but we are also seeing the same issue.

All requests time out, but if trace is turned on, everything starts running smoothly... for a while. And then we are back to the same timeouts again.

We tried increasing the Java heap memory, etc., and although that seemed to marginally extend the time before transactions started timing out, we cannot be sure it actually helped.

We had also earlier increased the time for which the connection stays open between the MP and the Router. Checking the MP logs doesn't show us anything at all.

Currently we have this issue in our dev and UAT environments. I hope it will not appear in prod, because that would be a major issue.

Rebooting the servers seems to fix the issue for a longer period of time than just turning trace on, which leads me to believe it might have something to do with the internal Java heap/cache or something like that. But one cannot keep restarting the servers all the time.
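One low-impact way to check the heap theory without restarting (and without turning trace on) is to poll the MP JVM's heap over JMX and see whether the timeouts line up with the heap filling up. This is only a sketch under the assumption that remote JMX is enabled on the MP; the host, port and polling interval are placeholders:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MpHeapMonitor {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint - remote JMX must be enabled on the MP JVM
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://mp-host:1099/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            for (int i = 0; i < 120; i++) {            // about an hour of samples
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.println(System.currentTimeMillis()
                        + " heap used=" + heap.getUsed() / (1024 * 1024) + "MB"
                        + " max=" + heap.getMax() / (1024 * 1024) + "MB");
                Thread.sleep(30000);                   // poll every 30 seconds (placeholder)
            }
        } finally {
            connector.close();
        }
    }
}

If heap usage climbs steadily and the 504s start once it is near the max, that would support the heap/GC theory; if heap is flat when the timeouts occur, the cause is probably elsewhere.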

There have been similar issues reported earlier by others (https://community.apigee.com/questions/10153/router-logs.html), but those don't seem to have progressed to a solution.

Linking @arghya das here since he has replied earlier to the issue linked above