502 Bad Gateway | tcpdump Apigee X

Hi All

We're getting 502 Bad Gateway errors on our production environment at a particular time, regularly (around midnight every day), and this is not seen in nonprod.

Error: 

{"fault":{"faultstring":"Unexpected EOF at target","detail":{"errorcode":"messaging.adaptors.http.flow.UnexpectedEOFAtTarget","reason":"TARGET_READ_UNEXPECTED_EOF"}}}
 

We've gone through https://docs.apigee.com/api-platform/troubleshoot/runtime/502-bad-gateway and the links in Appendix A below from this community, and have also raised a support ticket.

Having gone through all this, answers to a couple of questions would help us look at this issue more closely and hopefully resolve it:

1. Is this documentation applicable to Apigee X? I believe parts of it, such as the diagnosis steps, should be.

2. Check the Message Processor logs: Apigee X is managed by Google, so I believe we can't check these ourselves. Am I right? If so, can we ask Google to retrieve them via the support ticket?

3. (Most important) Collect the tcpdump output on the Message Processors: Again, I believe we do not have access to take a tcpdump on Apigee X. If I'm wrong, could somebody please help with the procedure?

4. (Most important) Collect a tcpdump on the backend server. We've requested this from the backend team.

5. Can "reason":"TARGET_READ_UNEXPECTED_EOF" in the JSON error help point to the exact problem? My understanding is that it means what the documentation explains: the Message Processor received an EOF while it was still waiting to read a response from the backend server.

Appendix A:
https://www.googlecloudcommunity.com/gc/Apigee/Getting-5XX-error-post-enabling-Streaming-in-APIGEE-X...
https://www.googlecloudcommunity.com/gc/Apigee/Unexpected-EOF-Exception/m-p/57401#M49234
https://www.googlecloudcommunity.com/gc/Apigee/Sometimes-getting-error-502-quot-Unexpected-EOF-at-ta...
https://www.googlecloudcommunity.com/gc/Apigee/Sometimes-getting-error-502-quot-Unexpected-EOF-at-ta...
https://www.googlecloudcommunity.com/gc/Apigee/Unexpected-EOF-at-target-2-way-TLS-to-backend/m-p/506...
https://www.googlecloudcommunity.com/gc/Apigee/API-Proxy-unexpected-EOF-at-target-attachment-file-is...
https://www.googlecloudcommunity.com/gc/Apigee/502-Unexpected-EOF-is-given-by-Apigee-or-Target-Endpo...
https://www.googlecloudcommunity.com/gc/Apigee/Target-Server-for-secure-connection-throws-502-Error-...
https://www.googlecloudcommunity.com/gc/Apigee/Getting-502-Bad-Request-exception-Unexpected-EOF/m-p/...
https://www.googlecloudcommunity.com/gc/Apigee/Request-sent-to-the-target-server-is-not-showing-on-t...

@kurtkanaskie @dchiesa1 @Sai Saran Vaidyanathan
@dknezic @ganadurai @Harish123 @Manisha_Chennu @Peeyush_Singhai @markjkelly 


Hello,

In my experience, most cases of this error are caused by the backend (aka target) application: the backend application server is not working properly, or has an issue that interrupts the response. The MP's syslog or the backend's logs may provide more details.

Regards

 

Answers to a few of them:

1. Apigee X relevant documentation on 502 Bad Gateway: https://cloud.google.com/apigee/docs/api-platform/errorcatalog/mp-runtime-errorcatalog#502-Unexpecte...

2. Google can help us get the Message Processor logs (we can't do this ourselves).

3. Google can help us collect the tcpdump output on the Message Processors (we can't do this ourselves).

Answers to your queries:

1. Refer to this documentation on 502 Bad Gateway: https://cloud.google.com/apigee/docs/api-platform/errorcatalog/mp-runtime-errorcatalog#:~:text=TARGE...

2. With respect to this query: ``Check the Message Processor logs: Apigee X is managed by Google and we can't check that I believe, am I right? If yes, can we ask Google on the support ticket? I mean can they get it?``

In Apigee X, both the management plane and the runtime are managed by Google Cloud Platform (GCP). As a result, you do not have direct access to the Message Processor logs. However, you can raise a support ticket with Google Cloud support and request information on logs.

3. With respect to your query: ``Collect the tcpdump output on the Message Processors: Again, I believe we do not have access to get the tcpdump on Apigee X. If I'm wrong, could somebody please help with the procedure to get them?``

  • Raising a support ticket with Google to request a tcpdump on Apigee X is possible. However, it is important to note that tcpdump is not likely to be helpful in troubleshooting this particular error, because it typically occurs due to an issue at the target end, or for one of the reasons stated in point 5.

  • In cases where this error occurs, the runtime logs will typically show that the connection with the target was successful. However, the target will then throw an error. In these cases, taking a tcpdump on Apigee X is unlikely to provide any useful information.

  • Tcpdump is primarily a tool for troubleshooting network-related issues. For example, if the trace or runtime logs do not provide sufficient information about why the connection between Apigee and the target server failed, and it is necessary to check the TLS handshake, then taking a tcpdump on Apigee X may be recommended.

  • However, it is important to note that even in these cases, the request for a tcpdump may or may not be considered, depending on the specific circumstances of the investigation.

  • Please note that tcpdump is a valuable tool for troubleshooting network issues. However, its effectiveness is diminished in Apigee X and hybrid due to the ubiquitous use of Transport Layer Security (TLS).
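For point 4 of the original question (tcpdump on the backend server), which is within your control, here is a minimal capture sketch, assuming a Linux backend and HTTPS on port 443. The interface and `<APIGEE_EGRESS_IP>` are placeholders you must replace with your own values:

```shell
# Capture traffic between the backend and Apigee around midnight (run as root).
# <APIGEE_EGRESS_IP> is a placeholder for your Apigee instance's egress IP.
tcpdump -i any -s 0 -w /tmp/backend-midnight.pcap \
  host <APIGEE_EGRESS_IP> and tcp port 443
```

Open the resulting .pcap in Wireshark and look at which side sends the FIN or RST first; that usually tells you who closed the connection.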

5. This error (TARGET_READ_UNEXPECTED_EOF) occurs under one of the following scenarios:

a) TargetServer is not properly configured to support TLS/SSL connections in Apigee.

b) The backend server may close the connection abruptly, while Apigee is waiting for a response from the backend server.

c) Keep alive timeouts configured incorrectly on Apigee and backend server.
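For scenario (c), the Apigee-side timeouts are set on the TargetEndpoint. A hedged config sketch follows; the property names are taken from the Apigee endpoint properties reference, so verify them and the defaults for your Apigee X version, and note the backend URL is a placeholder:

```xml
<TargetEndpoint name="default">
  <HTTPTargetConnection>
    <URL>https://backend.example.com</URL>
    <Properties>
      <!-- Response read timeout; the documented default is 55 seconds. -->
      <Property name="io.timeout.millis">55000</Property>
      <!-- Keep Apigee's idle keep-alive shorter than the backend's, so that
           Apigee, not the backend, closes idle connections. -->
      <Property name="keepalive.timeout.millis">30000</Property>
    </Properties>
  </HTTPTargetConnection>
</TargetEndpoint>
```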

 

 

Thanks @ishitasaxena, your comprehensive response is much appreciated.

5. This error (TARGET_READ_UNEXPECTED_EOF) occurs under one of the following scenarios:

a) TargetServer is not properly configured to support TLS/SSL connections in Apigee.

[Shrenik: I guess this should be OK, since not all requests fail but only a small %]

b) The backend server may close the connection abruptly, while Apigee is waiting for a response from the backend server.

[Shrenik: Yes, this is what we're asking the backend team about; we've been told the backend server has a WAF, and they're checking what could be happening]

c) Keep alive timeouts configured incorrectly on Apigee and backend server.

[Shrenik: Interesting, how do we find this out? I believe the default is 55 secs. I see the 502 happening within just a few ms, say 35 ms and 85 ms, so again I think this is not the issue]

error on our production at particular time regularly (say around midnight everyday) 

If there is a cadence to the error - in other words it appears and then disappears, around the same time, each day - then the source of the error is likely not the static unchanging configuration, like TargetServer TLS.  

You've observed that Apigee itself is not reaching its 55s timeout; the failure happens in under 1 second. The 502 is not a system abruptly closing a connection. It's a system actively rejecting the request. This suggests that there is a network device or system, somewhere between your Apigee and the target (possibly including the target), that is going actively "offline" at a particular time of day. It could be a network switch or router. For example, if there is a scheduled job ("cron job") that applies updated configuration to a WAF or router every night at midnight, it might cause a service disruption resulting in 502 errors. Maybe this happens only in prod because prod networks are "more important". It could be some other scheduled task. It could be some cron job that actively resets or reboots a software-based router. Or maybe it's not a network device or system; maybe it's the actual target responding with 502. (That would be unusual.)

You can use the HTTP header "x-cloud-trace-context" to correlate the request originating from Apigee to the request received upstream. Suppose there is a request handled at Apigee that sees a 502 around midnight. Check the x-cloud-trace-context header on Apigee, and see if you can find an inbound request at the target with the same value for x-cloud-trace-context. If you can find that, then it means the target received the request and responded with 502. Probably unlikely.

More likely is, there is an intervening network system - check those for the x-cloud-trace-context header you see on Apigee.
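If you do start logging x-cloud-trace-context, note that its documented format is `TRACE_ID/SPAN_ID;o=OPTIONS`; systems in the path may rewrite the span portion, so correlate on the TRACE_ID prefix only. A small shell sketch (the header value here is a made-up example):

```shell
# x-cloud-trace-context looks like TRACE_ID/SPAN_ID;o=OPTIONS.
# Strip everything from the first "/" onward to keep just the trace ID.
header='105445aa7843bc8bf206b12000100000/1;o=1'  # example value, not real
trace_id="${header%%/*}"
echo "$trace_id"
```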

If the problem really does happen around midnight, every day, then you could set up a scheduled task to start a debug session programmatically just a few minutes before midnight.  You can even specify a condition "trace only requests that result in 502 status code".  Then the next day you will be able to download that trace session, and examine a batch of failed requests. 

 

Thanks for your valuable insights @dchiesa1 

I've forwarded the same to the target system team, who are investigating; I hope they find it useful.

We're filtering all the headers, so we can't really get x-cloud-trace-context from the older logs (and also can't get it from trace because of this), but we can add it to the filter config and will definitely try using it.

And your other point is interesting: set up a scheduled task to start a debug session programmatically just a few minutes before midnight, with the condition "trace only requests that result in 502 status code".

How do we set up a scheduled task? Is there an example or written code around that? Kindly share if you have one 🙂

What I mean by "scheduled task" is a task scheduled via Google Cloud Scheduler. Actually, it's a little more involved.

FIRST, you need logic that tells Apigee to create a debugsession. This will be a bash script, or a Python script, or a nodejs app, or a Windows powershell script, etc. Your choice. And inside the script, you're just invoking the Apigee API to create a debugsession. (doc here) Pseudo code example:

 

POST :apigee/v1/organizations/:org/environments/:env/apis/:api/revisions/:rev/debugsessions
Authorization: Bearer :token
content-type: application/json

{
  "filter": "response.status.code = 502",
  "timeout": "600"
}
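A concrete bash rendering of that pseudo-code, as a sketch: the org/env/proxy/revision arguments are placeholders, and the function below only prints the curl command so you can inspect it before running it for real with a valid token.

```shell
# Build (but do not execute) the curl command that creates a debugsession.
# All arguments are caller-supplied placeholders.
create_debugsession_cmd() {
  local org="$1" env="$2" api="$3" rev="$4" token="$5"
  local url="https://apigee.googleapis.com/v1/organizations/${org}/environments/${env}/apis/${api}/revisions/${rev}/debugsessions"
  printf '%s\n' "curl -X POST '${url}' \\"
  printf '%s\n' "  -H 'Authorization: Bearer ${token}' \\"
  printf '%s\n' "  -H 'Content-Type: application/json' \\"
  printf '%s\n' "  -d '{\"filter\": \"response.status.code = 502\", \"timeout\": \"600\"}'"
}

create_debugsession_cmd my-org prod my-proxy 1 "\$TOKEN"
```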

 

SECOND, You need to put that logic somewhere. That will be Cloud Run. With the gcloud command line tool, you can create a Cloud Run job from a source directory, Cloud Run will know what to do with it. Do this like so:

 

PROJECT_ID=my-google-cloud-project-that-will-run-the-job
gcloud config set core/project $PROJECT_ID
JOB_NAME=debugsession-creation-job
JOB_SERVICE_ACCOUNT=${JOB_NAME}-sa

# create a job-specific service account
gcloud iam service-accounts create ${JOB_SERVICE_ACCOUNT}
JOB_SA_EMAIL=${JOB_SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com

# Add Environment Admin role to the SA, to allow it to create a debug session.
# Can also use a custom role; the required permission is
# apigee.tracesessions.create .
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${JOB_SA_EMAIL}" \
  --role="roles/apigee.environmentAdmin"

# add SA user to yourself, to allow creation of the cloud run job which acts
# as this SA. If you have Editor or Owner role in the project, you already 
# have the required permission.
MY_USER=me@example.com
gcloud iam service-accounts add-iam-policy-binding ${JOB_SA_EMAIL} \
    --member user:${MY_USER} \
    --role "roles/iam.serviceAccountUser"

# create the job that runs with that service account
REGION=us-west1
gcloud run jobs deploy ${JOB_NAME} \
    --source . \
    --tasks 1 \
    --max-retries 3 \
    --region ${REGION} \
    --project=${PROJECT_ID} \
    --service-account ${JOB_SA_EMAIL}

 

Ok that creates a job that runs in the cloud, from your source code. When that job runs, Apigee will create the debugsession, and will begin collecting debug data for transactions, just as if you created it within the UI. Then Apigee will retain the session data for 24 hours. So later, you can sign in to the Admin console and view the debugsession for that particular proxy.

THIRD, you want to tell Google Cloud to run that job every day, just before midnight. For that you configure Google Cloud Scheduler using the HTTP target type. It will send out an HTTP GET or POST request to a URL that you designate. In your case, the URL will be a special URL that triggers your Cloud Run job, and you will set the schedule to be "every day at 23:56" or similar. (You can use Cron Guru to figure out how to set the schedule.) Setting this up looks like so:

 

SCHEDULER_JOB_NAME=debugsession-job-trigger

# create service account for the scheduler
SCHEDULER_SERVICE_ACCOUNT="${SCHEDULER_JOB_NAME}-sa"
gcloud iam service-accounts create ${SCHEDULER_SERVICE_ACCOUNT}

SCHEDULER_SA_EMAIL=${SCHEDULER_SERVICE_ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com

# grant it rights to invoke your Cloud Run job
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SCHEDULER_SA_EMAIL}" \
  --role="roles/run.invoker"

# every day, at 23:56
SCHEDULE="56 23 * * *"

gcloud scheduler jobs create http ${SCHEDULER_JOB_NAME} \
  --location ${REGION} \
  --schedule="${SCHEDULE}" \
  --uri="https://${REGION}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/${JOB_NAME}:run" \
  --http-method POST \
  --oauth-service-account-email ${SCHEDULER_SA_EMAIL}

 

The only trick here is how you get the token to use within your Cloud Run job: the token that gets sent to apigee.googleapis.com in the request to create a debug session. Actually it's very simple. If you followed the instructions above, you configured the Cloud Run job to run as a particular Service Account, and we added the Environment Admin role to that SA. To get a token for that service account from within a Cloud Run job, you can just invoke a particular endpoint from within your Cloud Run logic:

 

 curl http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token \
      -H "Metadata-Flavor: Google"

 

Of course, substitute something else for curl if you are using Java or nodejs or Python, etc. The point is, just send a GET to that URL (with the required header). The response will be a JSON payload with an access_token field. Extract that, then insert it into the request I showed above, to create a debug session.
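For example, extracting access_token in bash: the payload below is a fabricated sample so the sketch is self-contained; in the real job you'd pipe the curl output instead, and jq is a cleaner alternative if it's available in your image.

```shell
# Sample metadata-server response (fabricated values for illustration).
resp='{"access_token":"ya29.sample-token","expires_in":3599,"token_type":"Bearer"}'
# Pull out the access_token field with portable sed.
token=$(printf '%s' "$resp" | sed -E 's/.*"access_token" *: *"([^"]+)".*/\1/')
echo "$token"
```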

Whew.

It sounds a little involved, but it's really just 2 pieces: a Cloud Run job that invokes the Apigee API to create a debugsession, and a scheduler job to trigger the cloud run job on a repeated schedule. The other stuff is just service accounts that allow you to do this securely, without storing secrets in the source code for your Cloud Run job.

To tear it all down,

 

gcloud scheduler jobs delete ${SCHEDULER_JOB_NAME} --location ${REGION}
gcloud run jobs delete ${JOB_NAME} --region ${REGION}
# note: service-accounts delete requires the full email, not the short name
gcloud iam service-accounts delete ${SCHEDULER_SA_EMAIL}
gcloud iam service-accounts delete ${JOB_SA_EMAIL}

 

By the way, this is all really cheap. Cloud Run and Cloud Scheduler have free tiers, so you will probably not be charged anything for using this.