Apigee Edge cache occasionally returns old data

Hello there,

Case: I have an external API endpoint from which I'm only allowed to fetch data once per minute. The Apigee endpoint receives quite a bit of traffic, roughly 2,000 requests per minute.

Solution: Since Apigee doesn't support any kind of cache warming mechanism where it fetches the data and stores it in the cache by itself, I have used JavaScript policies and key value maps to allow only one of the calls to go through to the backend and fill the response cache. All other calls always receive the data from the response cache. This works, and the logs show that indeed only one of the calls goes through to the backend every minute. This policy is on the target endpoint.
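
For illustration, here is a minimal sketch of what the gating logic inside such a JavaScript policy could look like, assuming a KeyValueMapOperations Get step has already read the timestamp of the last backend fetch into a flow variable. The variable and step names are purely illustrative, not our actual implementation:

    // Runs in an Apigee JavaScript policy on the target endpoint.
    // Assumes a KeyValueMapOperations Get step has populated the flow variable
    // 'kvm.lastRefreshTime' with the epoch millis of the last backend fetch.
    var lastRefresh = parseInt(context.getVariable('kvm.lastRefreshTime'), 10) || 0;
    var now = Date.now();
    var refreshIntervalMs = 60 * 1000; // the backend may only be hit once per minute

    if (now - lastRefresh >= refreshIntervalMs) {
        // This call is elected to refresh the cache: let it reach the backend and
        // stage the new timestamp, which a KeyValueMapOperations Put step then writes back.
        context.setVariable('flow.refreshBackend', 'true');
        context.setVariable('kvm.newRefreshTime', String(now));
    } else {
        // Every other call is served from the existing response cache.
        context.setVariable('flow.refreshBackend', 'false');
    }

Downstream steps can then branch on the 'flow.refreshBackend' variable (for example, to decide whether to invalidate the response cache entry). Without an atomic check-and-set, two message processors could still refresh at roughly the same time, so treat this as a sketch rather than a complete solution.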

I've also added an additional response cache policy on the proxy endpoint to cache the different query parameter combinations, so the policies on the target endpoint don't have to run for every single call. This policy simply works with a 60-second timeout.

Problem: What we've been noticing is that when the platform is under heavy load, the API endpoint occasionally returns old cached data. For example, it has been running fine and updating the cached data every minute at 18:00, and then suddenly returns data from 16:31, only to start running fine again a couple of minutes later without intervention. Our logging shows that the external endpoint was hit every minute during that timespan and kept returning fresh data. The data updated in between also suggests that the policies are doing their job correctly in filling and invalidating the cache.

We're running an on-premises install of Apigee Edge, version 4.16.05.00.

Hopefully one of you bright minds is able to let his or her light shine on this case. 🙂

1 ACCEPTED SOLUTION

We solved (evaded) the problem eventually. The problem was not related to the heavy load, but to the size of the response that had to be cached (size and load always increased simultaneously due to the nature of the data).

Eventual solution from support:

Regarding the caching issue - I've spent quite a lot of time looking into this and talking to the Engineering team, and I believe we may have found a solution. Basically, at this point, if the payload is bigger than 512 KB the message processor may or may not cache it / properly propagate the cache across the cluster (inconsistent behaviour, and this is a bug). However, setting the parameter "skipCacheIfElementSizeInKBExceeds" to a value higher than 512 KB seems to fix this problem (I tested this in my dev env with 2 message processors).

Setting the property "skipCacheIfElementSizeInKBExceeds" to 2000, so that the message processor keeps caching payloads larger than the default 512 KB (up to 2000 KB) in memory, solved our issue.


6 REPLIES

@Bart Waardenburg, you have mentioned a JavaScript policy & KVM in the issue above. Any reason you are using them? Why not the out-of-the-box Response Cache policy with a 60-second cache expiry value? Can you add more details about the JavaScript policy & KVM?

@Anil Sagar Yes, that's what I used at first, but the way the cache policy works is that it keeps the response in cache for 60 seconds. After those 60 seconds it invalidates the entry, and all subsequent calls go through to the backend until one of them returns and repopulates the cache. That resulted in more calls to the backend whenever the backend needed more time to process them, which in turn resulted in longer processing times and thus even more calls, effectively making sure the backend wasn't able to respond to any calls at all.

With the JavaScript and KVM policies I can make sure that while the first call goes through to the backend, all subsequent calls are served from the 'old' cache.


@Bart Waardenburg

Could you provide a code snippet of the Response Cache policies, along with the corresponding steps configured as part of the endpoints? Please provide details of the additional response cache policy as well.

@Bart Waardenburg

Interesting problem.

You wrote

For example, it has been running fine and updating the cached data every minute at 18:00, and then suddenly returns data from 16:31, only to start running fine again a couple of minutes later without intervention.

A couple questions:

  1. This is while the system is serving 2,000 requests per minute, is that right?
  2. How do you know the data is old? How do you know it is from 16:31?
  3. Is it possible that there is an intervening caching layer between the client and Apigee Edge that is caching the response and returning old data? Have you set cache headers (like Cache-Control: max-age=<seconds>) in the response sent from Apigee Edge, so that any network devices between the client and Apigee (even browsers) know when to get fresh data?
  4. Can you reproduce the problem easily?
  5. You said "our logging shows"... What logging do you use on the backend? Do you have a high-throughput log aggregator like Stackdriver or Splunk that you can send data to from the Apigee proxy? I'm thinking that by correlating the logs from the backend with the logs from the proxy side, you will clearly see the exact request during which the problem occurs.

  1. This is while the system is serving 2,000 requests per minute, is that right? Correct.
  2. How do you know the data is old? How do you know it is from 16:31? The source feed provides a timestamp, and this timestamp is also returned to our users. We know the source is updating correctly because this timestamp is up to date every minute in our logging system.
  3. Is it possible that there is an intervening caching layer between the client and Apigee Edge that is caching the response and returning old data? Have you set cache headers (like Cache-Control: max-age=<seconds>) in the response sent from Apigee Edge, so that any network devices between the client and Apigee (even browsers) know when to get fresh data? I have not set any cache control headers. It can't harm to add them (see the sketch after this list).
  4. Can you reproduce the problem easily? No, I have not been able to reproduce it.
  5. You said "our logging shows"... What logging do you use on the backend? Do you have a high-throughput log aggregator like Stackdriver or Splunk that you can send data to from the Apigee proxy? I'm thinking that by correlating the logs from the backend with the logs from the proxy side, you will clearly see the exact request during which the problem occurs. We're logging to Splunk from within Apigee, but I'm currently only logging the errors and the call that goes through to the backend (for volume reasons). I could start logging every call for a couple of hours to see more detailed data.
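
Regarding question 3, here is a minimal sketch of how such a header could be set from a JavaScript step in the proxy response flow; the 60-second max-age is an assumption that simply mirrors the cache timeout (an AssignMessage policy would work equally well):

    // Sketch: set a caching header on the outgoing response so intermediaries and
    // clients know how long the data stays fresh. Attach as a JavaScript policy
    // in the proxy endpoint response flow.
    context.setVariable('response.header.Cache-Control', 'public, max-age=60');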
