Analytics timeout

We are using on-prem Apigee, currently at version 4.19.01. We have had this issue for quite a while, but as upgrading did not resolve it, we need to find a root cause.

What happens is that the UI (?) times out while querying for analytics data. For example, when I go to Analytics > Traffic Composition, the default page (This Month) loads "Top 10 (Proxies, Products, Developers) Traffic" just fine. But "Top 10 Apps Traffic" does not load, and no timeout message is displayed; the component just reads "Loading data..." indefinitely.

Then, if I click "2 Months", I get an error message after approximately 1 minute: "Error fetching analytics data. Unknown error.", and the affected components read "Error fetching data". "Top 10 Apps Traffic" still reads "Loading data..." and never seems to time out.

We also get this error in other analytics components.

I found this page https://docs.apigee.com/api-platform/troubleshoot/analytics/analytics-reports-timing-out.html but the error message stated there isn't the same as we are getting.

My colleague says he has also had this happen when accessing analytics data from the API, so I think it's not an Edge UI issue.
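One way to confirm this outside the UI is to time the stats call against the management API directly. The following is only a minimal sketch: the `/v1/organizations/.../environments/.../stats/<dimension>` path and the `select`, `timeRange` and `timeUnit` parameters follow the Edge management API as I understand it, while the host name, port, credentials, and the helper names `build_stats_url` and `timed_get` are placeholders made up for illustration.

```python
import base64
import time
import urllib.error
import urllib.parse
import urllib.request


def build_stats_url(base_url, org, env, dimension, select, time_range, time_unit="day"):
    """Build a stats URL for the management API; every value is caller-supplied."""
    path = "/v1/organizations/{}/environments/{}/stats/{}".format(
        urllib.parse.quote(org, safe=""),
        urllib.parse.quote(env, safe=""),
        urllib.parse.quote(dimension, safe=""),
    )
    query = urllib.parse.urlencode(
        {"select": select, "timeRange": time_range, "timeUnit": time_unit}
    )
    return base_url.rstrip("/") + path + "?" + query


def timed_get(url, user, password, timeout_s=120):
    """Issue one GET with basic auth; return (HTTP status or None, elapsed seconds)."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": "Basic " + token, "Accept": "application/json"},
    )
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502 or 504.
        status = err.code
    except OSError:
        # Covers connection failures and client-side timeouts: no status at all.
        status = None
    return status, time.monotonic() - started
```

Calling `timed_get(build_stats_url("http://<management-server>:8080", "<organization>", "prod", "apiproxy", "sum(message_count)", "01/01/2019 00:00~02/28/2019 23:59"), "<user>", "<password>")` once for the one-month range and once for the two-month range should show whether the slow response also happens without the UI in between.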

I've had our operations team get me monitoring stats from the Postgres and qpid servers, and while there are minor spikes in both CPU and IOPS, I don't think it's a capacity issue either.

What gives? Anyone seen something similar before? Should we open a support issue for this?

Best regards,

Thomas Qvidahl

Not applicable

@Thomas Qvidahl Yes, we have faced this issue before. It had two different causes in two different cases.

In one case, the Postgres sender and receiver processes were not working, so the standby was unable to receive data from the master. As a result, fetching data took a long time and ended with the error. You can check whether communication between the master and standby is working and whether any huge queries are running in the background.
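A rough way to put a number on "is the standby keeping up" is to compare the WAL positions reported by the master and the standby. The sketch below assumes you have already read both positions from PostgreSQL yourself (for example from the `pg_stat_replication` view on the master; the exact column names depend on the PostgreSQL version, so check yours). The function names are made up for illustration.

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL WAL position such as '16/B374D848' to a byte offset.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(master_lsn, standby_lsn):
    """Return how many bytes of WAL the standby is behind the master."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)
```

A lag that stays near zero suggests replication is healthy; a lag that keeps growing, or a master that reports no connected standby at all, points to the sender/receiver problem described above.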

The other case occurred after the upgrade: we found that indexes were missing in one of the Postgres databases, so we had to create them with a few commands. Apigee Support helped us create the indexes. Once they were in place, the issue resolved itself.

You can find this ticket in Apigee Support. We raised it sometime last year.

cloud

Mr Thomas - I've seen this in production. Lower environments do not have this issue, so it is evidently load-based.

Support mentioned the following (for my timeout ticket):

  1. Check postgres cpu/mem/load
  2. Check connectivity between mgmt to qpid/pg
  3. In browser-developer tools, find out which call is failing - data fetch vs. reports page
  4. Execute the mgmt-api, check qpid servers & queue depth
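For step 3, the browser's network tab can usually export the recorded traffic as a HAR file, which makes it easy to list exactly which calls failed and how long they took. This is only a sketch: it relies on the standard HAR layout (`log.entries[]` with `request.url`, `response.status` and `time` in milliseconds), and the function names and the file name in the usage note are hypothetical.

```python
import json


def load_har(path):
    """Read a HAR file exported from the browser developer tools."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_requests(har, min_status=400):
    """Return (status, elapsed_ms, url) for entries that failed or never completed."""
    failures = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        # Browsers typically record status 0 for aborted or unanswered requests.
        if status == 0 or status >= min_status:
            failures.append((status, entry.get("time"), entry["request"]["url"]))
    return failures
```

Running `failed_requests(load_har("analytics.har"))` after reproducing the problem should show whether it is the data fetch (a `stats` call) or the reports page itself that fails, and after how many milliseconds.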

I've resorted to the following:

  • mgmt apis for stats
  • exporting db to lower-env for post-processing
  • 3rd-party monitoring tools - data-dog, splunk, newrelic, ... (insert your own monitoring)

Side note: we've seen tremendous growth in the filesystem, specifically pg_log.

1. There is not much load on servers.

2. Connectivity is good.

3. See separate answer.

4. I will look into this.

Replying to @sunil soprey:

I can see in browser dev tools that there are two calls that fail with a 502:

https://<domain>/ws/proxy/organizations/<organization>/environments/prod/stats/apiproxy?_optimized=j...