Error rate threshold email notification?

Not applicable

We have a need to send notifications via email when error rates start creeping over a threshold %.

Our rule of thumb: if the Apigee dashboard shows an error rate (combined traffic) approaching or exceeding 1%, it may need some investigation.

I reviewed the question/answer listed here: https://community.apigee.com/questions/3847/alerting-based-on-errors-and-exceptions.html

This looks like something we could use: when we see a 500 error, an alert can be sent. But that might send a bunch of emails. I don't think this works for us, because we would need something to persist state (keep stats in memory somehow) in order to compute the threshold.

Or is this something that is best handled outside of Apigee (e.g., SCOM monitoring)?

Or is there some way to tap into the reports to do this? http://apigee.com/docs/analytics-services/content/using-analytics-dashboards#analyzingspikesordropsi... In particular I am looking at the section 'Viewing moving averages and alerts'.

Thanks

Eric


Hi @Eric Renshaw

Today there is no straightforward mechanism to configure email notifications based on error rates.

I did want to let you know about our API health monitoring capability, which lets you configure probes that send requests against APIs and/or API proxies and watch for errors or latency issues, with configurable rules that trigger alerts. But I recognize this does not support the error-rate monitoring/alerting you are looking for.

Another option would be to implement a client that uses the Apigee Edge Analytics API (see the Smart Docs API reference) to get the data you need to monitor error rates and, in turn, use that to trigger notifications via a monitoring and alerting solution outside of Apigee. Also see the Apigee Edge doc on using the Analytics API and the Analytics command reference for additional examples of using this API.

This gets rather detailed, but as a proof of concept, you could use something like:

curl 'https://api.enterprise.apigee.com/v1/o/{orgName}/environments/{envName}/stats/apis?select=sum(message_count),sum(error_count)&sort=DESC&sortby=sum(message_count),sum(error_count)&t=agg_api&timeRange={fromDateTime}~{toDateTime}&timeUnit={aggregationTimeUnit}&tsAscending=true&limit={numDataPoints}'

Where:

{orgName} is your organization name

{envName} is your environment name, e.g. test or prod

{fromDateTime} is the UTC start point of your query in the format mm/dd/yyyy+hh:mm

{toDateTime} is the UTC end point of your query in the format mm/dd/yyyy+hh:mm

{aggregationTimeUnit} is the time span to aggregate the results over, e.g. minute, hour, day

{numDataPoints} is the maximum number of results you want to get back
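If you are scripting this rather than using curl, you could assemble the same query programmatically. A minimal sketch in Python, assuming illustrative values for the org, environment, and time range (the `safe` characters are kept literal so the `timeRange` and `select` values match the curl example above):

```python
# Sketch: build the /stats/apis query URL from the placeholders above.
# All values here are illustrative, not real org/env names.
from urllib.parse import urlencode

def build_stats_url(org, env, from_dt, to_dt, time_unit, limit):
    """Assemble the stats query shown in the curl example."""
    base = f"https://api.enterprise.apigee.com/v1/o/{org}/environments/{env}/stats/apis"
    params = {
        "select": "sum(message_count),sum(error_count)",
        "sort": "DESC",
        "sortby": "sum(message_count),sum(error_count)",
        "t": "agg_api",
        "timeRange": f"{from_dt}~{to_dt}",
        "timeUnit": time_unit,
        "tsAscending": "true",
        "limit": limit,
    }
    # Keep the characters the API expects literally (+ ~ ( ) , / :)
    return base + "?" + urlencode(params, safe="+~(),/:")

print(build_stats_url("myorg", "prod",
                      "11/13/2015+22:00", "11/13/2015+22:05",
                      "minute", 5))
```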

This will provide you with two collections of data per api proxy in the org/env pair, one representing the sum of errors and one representing the sum of traffic. You can compute the error rate from those figures.

For example if you had one API Proxy in your org/env and you did this:

curl 'https://api.enterprise.apigee.com/v1/o/{orgName}/environments/{envName}/stats/apis?select=sum(message_count),sum(error_count)&sort=DESC&sortby=sum(message_count),sum(error_count)&t=agg_api&timeRange=11/13/2015+22:00~11/13/2015+22:05&timeUnit=minute&tsAscending=true&limit=5'

You would get something like this:

{
  "environments" : [ {
    "dimensions" : [ {
      "metrics" : [ {
        "name" : "sum(error_count)",
        "values" : [ {
          "timestamp" : 1447452000000,
          "value" : "0.0"
        }, {
          "timestamp" : 1447452060000,
          "value" : "0.0"
        }, {
          "timestamp" : 1447452120000,
          "value" : "15.0"
        }, {
          "timestamp" : 1447452180000,
          "value" : "12.0"
        }, {
          "timestamp" : 1447452240000,
          "value" : "0.0"
        } ]
      }, {
        "name" : "sum(message_count)",
        "values" : [ {
          "timestamp" : 1447452000000,
          "value" : "197668.0"
        }, {
          "timestamp" : 1447452060000,
          "value" : "203677.0"
        }, {
          "timestamp" : 1447452120000,
          "value" : "189224.0"
        }, {
          "timestamp" : 1447452180000,
          "value" : "200956.0"
        }, {
          "timestamp" : 1447452240000,
          "value" : "186267.0"
        } ]
      } ],
      "name" : "{yourAPIProxyName}"
    } ],
    "name" : "prod"
  } ]
}

You have 5 traffic data points, one for each minute between 22:00 UTC and 22:05 UTC on 13-Nov-2015, and 5 matching error data points. Note that the timestamps in the response are in Unix epoch time (milliseconds).
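Computing the error rate is then a matter of pairing the two series by timestamp and dividing. A sketch using a couple of the data points from the sample response above (the response structure is reproduced only partially):

```python
# Sketch: pair sum(error_count) and sum(message_count) by timestamp and
# compute a per-minute error percentage for one API proxy's "dimensions" entry.
def error_percentages(dimension):
    """Map timestamp -> error % from one proxy's metrics."""
    series = {m["name"]: {v["timestamp"]: float(v["value"]) for v in m["values"]}
              for m in dimension["metrics"]}
    errors = series["sum(error_count)"]
    messages = series["sum(message_count)"]
    return {ts: 100.0 * errors.get(ts, 0.0) / total
            for ts, total in messages.items() if total > 0}

# Two data points lifted from the sample response above.
sample = {
    "metrics": [
        {"name": "sum(error_count)",
         "values": [{"timestamp": 1447452120000, "value": "15.0"},
                    {"timestamp": 1447452180000, "value": "12.0"}]},
        {"name": "sum(message_count)",
         "values": [{"timestamp": 1447452120000, "value": "189224.0"},
                    {"timestamp": 1447452180000, "value": "200956.0"}]},
    ],
    "name": "{yourAPIProxyName}",
}

rates = error_percentages(sample)
# 15 errors out of 189,224 messages is roughly a 0.008% error rate,
# comfortably under a 1% threshold.
```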

If you wrote a simple script or app to run this request in a continuous loop, say once per minute, incrementing the "from" and "to" parameters by a minute each time and computing the error % from the response, you would have a running error % that you could feed into a monitoring and alerting app, where you can set thresholds and have your alerts triggered.
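A sketch of that loop, where `fetch_stats` is a hypothetical hook standing in for the authenticated analytics request (e.g. the curl query above issued via an HTTP library); the window arithmetic and threshold check are the substance:

```python
# Sketch: poll the analytics API once per minute and flag windows whose
# error rate breaches a threshold. fetch_stats(frm, to) is assumed to
# return (error_sum, message_sum) for the given window.
import time
from datetime import datetime, timedelta

THRESHOLD_PCT = 1.0  # alert at a 1% error rate or higher

def window(end_utc, minutes=1):
    """Return (from, to) strings in the API's mm/dd/yyyy+HH:MM format."""
    start = end_utc - timedelta(minutes=minutes)
    fmt = "%m/%d/%Y+%H:%M"
    return start.strftime(fmt), end_utc.strftime(fmt)

def error_pct(errors, messages):
    """Error percentage for one window; 0 when there was no traffic."""
    return 100.0 * errors / messages if messages else 0.0

def poll_once(fetch_stats, end_utc):
    """Compute the error % for the minute ending at end_utc."""
    frm, to = window(end_utc)
    errors, messages = fetch_stats(frm, to)  # summed counts for the window
    return error_pct(errors, messages)

def monitor(fetch_stats, iterations):
    """Poll once per minute, advancing the window each pass."""
    for _ in range(iterations):
        pct = poll_once(fetch_stats, datetime.utcnow())
        if pct >= THRESHOLD_PCT:
            print(f"ALERT: error rate {pct:.2f}% >= {THRESHOLD_PCT}%")
        time.sleep(60)
```

In practice you would replace the print with a hand-off to whatever alerting tool you use (email, SCOM, etc.).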

There are many permutations of options, as you can see in the Smart Docs and the other reference docs mentioned earlier, including filtering to make the results more specific and changing other query parameters: for example, setting the "from" and "to" one minute apart with the limit set to 1. If you wanted it once per hour, you could set the timeUnit parameter to hour. You would need to experiment to find the set of parameters that meets your needs.

And one last note: there is a delay in Analytics data because the raw data needs to be processed. We can't be precise about the duration of this delay; it is typically on the order of 5 to 10 minutes, but multiple factors influence it.
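Given that delay, a polling client might want to query a window that ends some minutes behind "now" rather than at "now", so the data for the window has already been processed. A small sketch, where the 10-minute lag is an assumption you would tune for your org:

```python
# Sketch: shift the one-minute query window back by an assumed analytics lag.
from datetime import datetime, timedelta

ANALYTICS_LAG = timedelta(minutes=10)  # assumed; tune for your org

def lagged_range(now_utc, minutes=1):
    """Query window ending ANALYTICS_LAG behind now, in mm/dd/yyyy+HH:MM format."""
    end = now_utc - ANALYTICS_LAG
    start = end - timedelta(minutes=minutes)
    fmt = "%m/%d/%Y+%H:%M"
    return start.strftime(fmt), end.strftime(fmt)
```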

Not applicable

I agree with the answer from @mschreuder. This is definitely possible with Analytics, polling the Analytics API. In addition to this solution, you could also enable your API on Apigee with webhooks, so that you can programmatically trigger events that execute logic such as incrementing counters, sending emails, or resetting counters. You can also leverage technologies like BaaS or even S3 to store these events in an asynchronous fashion, so the impact on latency is minimal. I have a working example of this model sending notifications to Slack and PagerDuty from a hack day; it essentially consists of an API proxy enabled with Node.js that intercepts responses after they are sent to the client.

Apigee Webhooks Example #hackday

Hope it helps!