Java Policy Classpath Issue and Null messageid flow variable - on-prem

Not applicable

Apigee On-Prem version 4.14.07.00 with six routers/message processors behind a VIP

I am debugging two problems at the moment and could use some insight:

1. I have a Java policy that performs SHA-256/SHA-512 hashing to provide digital signature verification for request payloads. The policy is compiled against the Bouncy Castle library, and I see that Apigee ships with this library as well. Running a functional test with 20 requests, I get a non-deterministic failure rate of about 10%. My Java policy has a try/catch block that aborts on failure, and the exception I'm seeing is:

Exception during execution: java.lang.SecurityException: class "org.bouncycastle.crypto.digests.SHA256Digest"'s signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:806)

...

I had added the bouncycastle library to the org before I realized that Apigee provided it. I have since removed my library and undeployed/redeployed the proxy. This didn't solve the issue.
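For anyone hitting the same conflict: one way to sidestep a duplicate Bouncy Castle jar entirely is to hash with the JDK's built-in `java.security.MessageDigest` instead. This is a different technique from the Bouncy Castle `SHA256Digest` my policy uses, shown only as a minimal sketch; the class and method names (`JdkDigestSketch`, `hex`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes a payload with the JDK provider only,
// so no third-party crypto jar has to be on the classpath.
final class JdkDigestSketch {

    // algorithm is a standard JCA name such as "SHA-256" or "SHA-512".
    static String hex(String algorithm, byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(payload);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two lowercase hex chars per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JDK is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex("SHA-256", payload));
    }
}
```

Whether this is an option depends on what else the policy needs from Bouncy Castle; it only covers the plain digest step.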

This seems to be a clear classpath issue, but I don't quite understand why the proxy succeeds 90% of the time. I'm assuming that this is a single message processor that is having difficulties. Other symptoms of note are that in trace mode, the failed calls will not show up in the tracing UI. My policy is being executed and I'm attaching the stack trace as a response header.
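To check where a class is actually coming from on a given instance, something like the following could be run from inside the policy and the result attached to the response. It is a rough sketch using only standard JDK reflection calls; `ClassOriginSketch` and `describeOrigin` are hypothetical names, and the output format is arbitrary.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical diagnostic: reports which jar (code source) and class loader
// supplied a class, plus how many signers the JVM recorded for it.
final class ClassOriginSketch {

    static String describeOrigin(String className, ClassLoader loader) {
        try {
            // initialize=false: load the class without running static initializers.
            Class<?> c = Class.forName(className, false, loader);
            ProtectionDomain pd = c.getProtectionDomain();
            CodeSource cs = (pd == null) ? null : pd.getCodeSource();
            String location = (cs == null || cs.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : cs.getLocation().toString();
            Object[] signers = c.getSigners();
            int signerCount = (signers == null) ? 0 : signers.length;
            return className + " <- " + location
                    + " loader=" + c.getClassLoader()
                    + " signers=" + signerCount;
        } catch (ClassNotFoundException | LinkageError | SecurityException e) {
            // A signer mismatch surfaces here as SecurityException during class definition.
            return className + " not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassOriginSketch.class.getClassLoader();
        System.out.println(describeOrigin("org.bouncycastle.crypto.digests.SHA256Digest", loader));
    }
}
```

Comparing this output across requests should show whether the failing calls resolve the class from a different jar or loader than the successful ones.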

The other odd thing I found led me to issue #2.

2. Null messageid. We have an AssignMessage policy that sets a response header with the messageid so we can easily trace problems. However, when I run my above test with 20 requests, I get 3 - 4 responses with a null messageid. The exception/failure from my first problem (classpath) always results in a null messageid, but I am also getting successful request/responses with a null messageid. When I trace the successful requests, the AssignMessage policy step shows no messageid. This is a little worrying. If I'm not mistaken, the router is what assigns the messageid, correct?

I do not have log files yet as I'm still working with my OPS team to get access. This is more of a fishing attempt to see if anything jumps out.

2 ACCEPTED SOLUTIONS

adas
Participant V

@Jonathan Baney you mentioned that you added the jar to the org resources and then removed it. It's possible that one of the message processors is still holding on to that reference somehow. Since you have 6 of them and we do not know which one it is, can you try restarting all 6 and then see whether the problem persists?

Regarding the null messageid, I am not sure, but I seem to remember it as a known bug that we already have a patch for. Let me confirm that so that you can apply all the latest patches for your on-prem version. And you are right about the router assigning the messageid.


Not applicable

I was able to get access to the logs. We have 6 message processors, so finding which instance was the offender (or instances) was interesting. The "normal" apigee system.log simply showed a 500 being returned, but did not give any additional information. Luckily we had created an enhanced tracing mechanism to allow deeper tracing with enhanced logging. With this, I was able to narrow down the problem to a single message processor.
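For reference, the basic idea of tagging each response with the instance that served it can be sketched as below. This is not our actual tracing mechanism, just a minimal illustration under assumed names: `InstanceTagSketch`, `headerValue`, and the value format are all hypothetical, and setting the result as a response header from the policy is left out.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: builds a single-line diagnostic value that identifies
// the serving host and summarizes any exception caught by the policy.
final class InstanceTagSketch {

    static String hostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    static String headerValue(String host, Throwable t) {
        String summary = (t == null)
                ? "ok"
                : t.getClass().getName() + ": " + t.getMessage();
        // Header values must stay on one line, so collapse any line breaks.
        summary = summary.replaceAll("[\\r\\n]+", " ");
        if (summary.length() > 200) {
            summary = summary.substring(0, 200); // keep the header small
        }
        return "host=" + host + "; result=" + summary;
    }

    public static void main(String[] args) {
        System.out.println(headerValue(hostName(), new SecurityException("signer mismatch")));
    }
}
```

With a value like this on every response, a run of 20 requests behind the VIP shows directly which host produced the failures.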

I diff'd the bad instance with one that was functioning as desired. There were no config differences and I found no differences in the physical jars loaded.

The ops team restarted the bad instance and the problem disappeared. This seems to be a case of that Apigee message processor holding on to classes from the removed jar even after the jar was unloaded. We are running 4.15.04.00 in our non-prod environment and I have not seen this issue there, so perhaps it's confined to the older version.



Thanks for the input. We are going to attempt the restart today. I'll report back if it helped.
