Forming an API Monitoring Strategy - Where to Star...

dcouldwell · ‎02-04-2016

Monitoring your API's health is key to maintaining a trusted, reliable, and robust API program, and to quickly identifying and resolving issues. You can monitor both the proxy and underlying target endpoints.

When designing your API, consider how to monitor in a lightweight and maintainable fashion. Also, think about what role the API may take in monitoring underlying target health.

This article outlines the approach that the Customer Success team at Apigee takes when helping customers form a monitoring strategy.

Starting out

Ask yourself the following questions when you start to think about API Monitoring:

What are the requirements for monitoring API health?
- Is a simple ping enough?
- Are there certain resources that are critical?
- How deep does the monitoring need to be?
Are there requirements for monitoring the target endpoint health through the API?
- Are you looking to monitor both proxy and target health? Differentiating between proxy health vs target health can be key when diagnosing issues in production.
Which environments are important to have monitoring in place?
- Production is obvious, but it could be just as important to monitor alpha, beta and dev integration environments.

Requirements

The main objectives of a monitoring strategy are:

Defining various request/response patterns that touch as many components as possible to test the health of the overall system.
Defining an external system that can execute these requests reliably and consistently. Ideally, this system should have the capability to run requests from multiple different data centres around the world.

A general best practice consists of:

Designing various specialised cheap-to-execute requests that monitor the health of target components and connectivity between the proxy and the target endpoint.
Using a selection of real API resources to assess the health of individual proxy components.
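As a rough illustration of the "external system" objective above, here is a minimal Python sketch of a single probe that an external monitor might run on a schedule. The function name probe, the timeout_s and opener parameters, and the shape of the returned dictionary are assumptions made for this example, not part of any Apigee tooling.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Send one GET request and report the HTTP status and latency.

    `opener` is injectable so the probe can be exercised without a network.
    """
    started = time.monotonic()
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with opener(request, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = None  # no HTTP answer at all: DNS, connection or timeout
    elapsed_ms = int((time.monotonic() - started) * 1000)
    return {
        "url": url,
        "status": status,
        "ok": status == 200,
        "elapsed_ms": elapsed_ms,
    }
```

Running the same probe from several locations and comparing the results is one way to tell a regional network problem from an API problem.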

Common patterns

The following are some examples of resources we commonly use to fulfill the above requirements. The patterns described in this article are:

Ping sub resource -- A specialised sub resource exposed by the proxy to test proxy network connectivity and proxy deployment status.
Status Resource -- A specialised resource to test proxy-to-target network connectivity and assess target API health.
Using Real Requests -- Using the existing API resources to check the health of the system.

Ping sub resource

This is a specialised sub resource exposed by the proxy to test proxy network connectivity and proxy deployment status. The proxy does not hit any target APIs in this scenario.

Although it could be implemented as a first-class resource, we recommend implementing it as a sub resource so that each API proxy bundle carries its own independent monitoring capability.

Example

Here is an example implementation:

Example Request

GET /customer/v1/ping
Accept: application/json

Example Response

HTTP/1.1 200 OK
Content-Type: application/json 
{
    "environment": "prod",
    "clientIp": "100.10.1.0",
    "api": "customer-v1",
    "verb": "GET",
    "responseTime": 20,
    "message": "pong"
}
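
For illustration only, the response above could be assembled by a handler along these lines. This is a Python sketch rather than an Apigee policy. The function name and parameters are hypothetical, and treating responseTime as milliseconds is an assumption, since the example does not state a unit.

```python
import json
import time


def build_ping_response(environment, client_ip, api, verb, started_at):
    """Build a (status, headers, body) tuple for a ping sub resource.

    No target API is called; the handler only reports what the proxy
    itself already knows about the request.
    """
    elapsed_ms = int((time.monotonic() - started_at) * 1000)
    payload = {
        "environment": environment,
        "clientIp": client_ip,
        "api": api,
        "verb": verb,
        "responseTime": elapsed_ms,  # assumed unit: milliseconds
        "message": "pong",
    }
    return 200, {"Content-Type": "application/json"}, json.dumps(payload)
```

Because nothing beyond the proxy is touched, a failing ping points at proxy connectivity or deployment rather than at a target.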

Status resource

This is a specialised resource to test proxy-to-target network connectivity and assess target API health. It is exposed by both the proxy and target APIs, as follows:

A client request hits a proxy /status endpoint.
In turn, the proxy hits the status (or health) endpoints exposed by each target API -- see below.
Status endpoints for target APIs and components will need to do all internal testing necessary to report the health of that component.
Apigee responds in keeping with target responses, as follows:
- If all targets respond with success, Apigee responds with 200 OK. The response includes an array of objects containing health and timing information for each target system.
- If at least one target returns failure, Apigee responds with 500 Internal Server Error. The response includes an array of JSON objects containing health and timing information for target systems. It is important for the status resource to respond as soon as it understands that a particular target system is failing. In other words, if one system is failing, it shouldn't wait until all systems respond.
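
The fail-fast behaviour described above can be sketched in Python as follows. This is a sketch under stated assumptions, not how Apigee implements it: check_targets and _timed_check are hypothetical names, and each check is assumed to be a zero-argument callable that raises when its target is unhealthy. Unlike the example failure response later in this article, the sketch returns only the components that had reported by the time the first failure was seen.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeoutError


def _timed_check(component, check):
    """Run one health check and describe the outcome as a result entry."""
    started = time.monotonic()
    entry = {"component": component, "status": "ok", "response": ""}
    try:
        check()
    except Exception as error:
        entry["status"] = "failure"
        entry["response"] = str(error)
    entry["targetResponseTime"] = int((time.monotonic() - started) * 1000)
    return entry


def check_targets(checks, timeout_s=5.0):
    """Run all checks in parallel; stop waiting at the first failure.

    `checks` maps a component name to its health-check callable.
    Returns (http_status, results).
    """
    results = []
    status = 200
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = [pool.submit(_timed_check, name, check) for name, check in checks.items()]
    try:
        for future in as_completed(futures, timeout=timeout_s):
            entry = future.result()
            results.append(entry)
            if entry["status"] != "ok":
                status = 500
                break  # fail fast: do not wait for the slower targets
    except FuturesTimeoutError:
        status = 500  # some target did not answer within the time budget
    finally:
        pool.shutdown(wait=False)  # return without joining the worker threads
    return status, results
```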

Example

Example request

GET /customer/v1/status
Accept: application/json

Example success response

HTTP/1.1 200 OK
Content-Type: application/json
[
    {
        "name": "customer-v1",
        "component" : "crm",
        "targetResponseTime": 350,
        "status": "ok",
        "response": ""
    },
    {
        "name": "customer-v1",
        "component" : "loyalty",
        "targetResponseTime": 500,
        "status": "ok",
        "response": ""
    }
]

Example failure response

HTTP/1.1 500 Internal Server Error
Content-Type: application/json
[
    {
        "name": "customer-v1",
        "component" : "crm",
        "targetResponseTime": 600,
        "status": "failure",
        "response": "unable to connect to customer database"
    },
    {
        "name": "customer-v1",
        "component" : "loyalty",   
        "targetResponseTime": 500,
        "status": "ok",
        "response": ""
    }  
]

While implementing this resource, you'll learn the quickest and cheapest route to understanding how each target system's health can be checked.
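
On the monitoring side, a client might reduce the status response to the list of components that need attention. The Python sketch below assumes the JSON shape shown in the examples above. The function name failing_components is hypothetical.

```python
import json


def failing_components(status_code, body):
    """Return the components reported as unhealthy in a /status response."""
    entries = json.loads(body)
    failed = [entry["component"] for entry in entries if entry.get("status") != "ok"]
    if status_code != 200 and not failed:
        # The proxy reported a failure but no entry explains it.
        failed.append("unknown")
    return failed
```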

Target considerations

If a target API already exposes a status/health endpoint, use that.
If a target API cannot expose a status endpoint (for example, the API is external and not in the team's control), you can use a simple (and cheap) GET request.
If the target system is not an API (for example, it is a database), the target system will need to expose commands specific to the system to monitor general connectivity between Apigee and this component. For example, MongoDB has a db.serverStatus() command that returns quickly and does not impact MongoDB performance. The proxy /status endpoint can execute db.serverStatus() on MongoDB to report its status.

Security considerations

Protect the status resource if the response contains confidential data.
Consider masking to prevent unnecessary or internal information from leaking from endpoints when reporting errors, such as when database names occur in error strings.
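
One possible way to mask error strings before they leave the status resource is sketched below in Python. The two patterns are illustrative assumptions only (IPv4 addresses with an optional port, and quoted names following the words database, schema or table). A real deployment would need patterns that match its own internal naming.

```python
import re

# Hypothetical patterns; extend them with whatever counts as internal detail.
_SENSITIVE_PATTERNS = [
    # IPv4 address with an optional port, e.g. 10.0.3.7:5432
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?"), "[host]"),
    # Quoted identifier after a keyword, e.g. database 'crm_prod'
    (
        re.compile(r"\b(database|schema|table)\s+(['\"])[^'\"]+\2", re.IGNORECASE),
        r"\1 [redacted]",
    ),
]


def mask_error(message):
    """Replace internal details in an error string with placeholders."""
    for pattern, replacement in _SENSITIVE_PATTERNS:
        message = pattern.sub(replacement, message)
    return message
```

Pattern-based masking will miss anything it does not anticipate, so returning a generic message and logging the detail internally is a stricter alternative.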

Using real requests

This approach uses the existing API resources to check the health of the system. Because the tests run in a production environment, be careful when choosing resources for this. Ideally, the data used by these requests is isolated from all other system data. For example, in a hotel API, a new dummy hotel can be created within the system so that monitoring can make and cancel reservations without affecting real hotel availability.
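
The dummy hotel idea could look roughly like the following Python sketch. Everything specific here is hypothetical: the hotel ID, the resource paths, the request body, the expected status codes, and the send function, which the caller supplies to perform the HTTP call and return a (status_code, parsed_body) pair.

```python
MONITORING_HOTEL_ID = "monitoring-dummy-hotel"  # hypothetical isolated test hotel


def run_reservation_probe(send):
    """Reserve and then cancel a room in the dummy hotel.

    `send(method, path, body)` performs the HTTP call and returns
    (status_code, parsed_json_body). Returns True when both steps succeed.
    """
    status, body = send(
        "POST", f"/hotels/{MONITORING_HOTEL_ID}/reservations", {"nights": 1}
    )
    if status != 201:
        return False
    reservation_id = body["id"]
    # Cancel the reservation so that the dummy hotel never fills up.
    status, _ = send(
        "DELETE",
        f"/hotels/{MONITORING_HOTEL_ID}/reservations/{reservation_id}",
        None,
    )
    return status in (200, 204)
```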

Analytics

If you are using real requests for monitoring, and if APIs are protected by API keys or OAuth, create a new separate application for monitoring. That way, requests can be identified in analytics.

Regardless of the monitoring approach you take, the requests will still appear in any analytics report, so consider adding an identifying marker to the requests so that they can easily be filtered out of reporting.
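
As a small illustration of filtering by the dedicated monitoring application, the Python sketch below splits exported analytics records into real and synthetic traffic. The application name api-monitoring, the field name developer_app, and the list-of-dictionaries record shape are assumptions for the example.

```python
MONITORING_APP = "api-monitoring"  # hypothetical name of the dedicated app


def split_monitoring_traffic(records, app_field="developer_app"):
    """Separate monitoring requests from real traffic.

    `records` is an iterable of dictionaries, one per request.
    Returns (real, synthetic).
    """
    real, synthetic = [], []
    for record in records:
        if record.get(app_field) == MONITORING_APP:
            synthetic.append(record)
        else:
            real.append(record)
    return real, synthetic
```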

Tools

There are a number of tools out there to help you monitor your API. Here are some of the tools we have used:

Apigee Health - https://health.apigee.com
Librato - https://www.librato.com/
Pingdom - https://www.pingdom.com/
Runscope - https://www.runscope.com/
API Metrics - http://apimetrics.io/
Uptime - https://github.com/fzaninotto/uptime

Summary

Think about what you're trying to monitor and why. Think about the cost of monitoring. Don't forget about the security of the resources you are exposing.

Report Inappropriate Content · ‎02-04-2016

Are you looking to monitor both proxy and target health? Differentiating between proxy health vs target health can be key when diagnosing issues production.

should be

Are you looking to monitor both proxy and target health? Differentiating between proxy health vs target health can be key when diagnosing issues in production.

to define various request/response patterns that touch as many components as possible to test the health of the overall system running in a production environment

should be

to define various request/response patterns that touch as many components as possible to test the health of the overall system

Reason: wouldn't limit monitoring strategies based on environment. It could be just as important to monitor alpha, beta, and dev integration environments.

designing various specialised cheap-to-execute requests that monitor the health of target components and connectivity between the proxy and the target services

should be

designing various specialised cheap-to-execute requests that monitor the health of target components and connectivity between the proxy and the target endpoint

Reason: service implies a certain type of endpoint.

resourcea

should be

resource

Reason: cleaner and less confusing.

Also, not sure I like the design of the endpoint, which would be considered poor RESTful design. A more RESTful design would look like this: /collection?action=ping or /collection/resource?action=ping. Same with the status endpoint: it would make more sense to have an apis endpoint with an action=status (could default to that too).

Client request hit a proxy /status endpoint

should be

Client request hits a proxy /status endpoint

If all targets respond with success, Apigee responds with 200 OK with an array of JSON objects containing health and timing information for each target system.

should be

If all targets respond with success, Apigee responds with 200 OK with an array of objects containing health and timing information for each target system.

Reason: we shouldn't contextually constrain ourselves to JSON when XML is a valid RESTful response too.

I think a blurb on the effects of monitoring on analytics may be necessary too.

jonesfloyd · ‎02-04-2016

@Dom Couldwell, ping @docs here when you're finished with Steve's comments. Thanks!

benrodriguez · ‎01-22-2017

Thanks for posting, Dom. My org is going through this process now. We are looking at these tools and integrating with legacy tools we have in place, like ServiceNow and Zendesk.

Report Inappropriate Content · ‎01-22-2017

@Ben Rodriguez - I'd also strongly recommend looking at Stackdriver for health/uptime checks: https://cloud.google.com/monitoring/alerts/uptime-checks