API traffic readiness checks for Edge Private Cloud

ncardace · ‎06-27-2019

For Apigee Edge Private Cloud (OPDK) deployments, API processing capacity can be added or removed without interrupting API traffic, i.e. with zero downtime and without any API client noticing a service interruption or transient errors, provided that:

a) the documented capacity augmentation procedures are followed.

See here.

b) additional steps are performed before traffic is allowed to flow to the newly added traffic components (Routers, Message Processors, or both).

Routers are commonly deployed behind physical or software load balancers (software ones for OPDK deployments in public cloud infrastructure), so Apigee Edge provides a health-check port that shows when a Router is up and ready to serve traffic, so that it can be added to or removed from the corresponding load-balancing pool.

This is extensively covered in the documentation here.

Note that the Router applies a default connection timeout when it connects to the Message Processors as part of an API client request; this configuration is documented here.

In short, after that timeout expires, the Router attempts to connect to another Message Processor for the same environment, if one is available. Otherwise, it returns an error.

To detect from the load balancer or from a Global Traffic Manager whether the environment has at least one Router and one Message Processor available, healthy and capable of processing the API calls within a given organization, the following HTTP GET call can be performed periodically (say, every second):

HTTP GET http://{routerIP}:15999/{org}__{env}

An expected successful response is HTTP 200 OK.
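The periodic check above can be sketched as a small shell helper. This is only an illustration: the function names and the example IP, organization and environment values are hypothetical, and the 1-second curl timeout is an assumption chosen to match the suggested polling interval.

```shell
#!/usr/bin/env bash
# Sketch: probe the Router health-check port for one organization/environment.
# Function names and argument values are illustrative, not part of Apigee Edge.

# Build the health-check URL: http://{routerIP}:15999/{org}__{env}
router_check_url() {
  local router_ip="$1" org="$2" env="$3"
  printf 'http://%s:15999/%s__%s' "$router_ip" "$org" "$env"
}

# Return 0 only if the Router answers HTTP 200 within 1 second (assumed timeout).
router_is_healthy() {
  local code
  code=$(curl -s -o /dev/null -m 1 -w '%{http_code}' "$(router_check_url "$@")") || return 1
  [ "$code" = "200" ]
}
```

For example, `router_is_healthy 10.0.0.10 myorg prod && echo "in pool"` would report the Router as eligible for the pool only on an HTTP 200 response; a timeout, a refused connection or any other status code counts as unhealthy.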

Message Processors execute the policies and the other configured API management steps.
When adding a Message Processor to increase the available capacity in a given planet, region, organization and environment (or several of them at the same time), it is necessary to:

1. place the new Linux server in the same subnet as the other runtime components

2. configure the network and firewalls according to the rules documented here.

3. temporarily block TCP port 8998 on the node (using iptables) until it is ready to process API traffic (the readiness checks in step 5 show how to detect this condition).

4. install the new Message Processor following the document here, then start it.

5. implement a readiness check, i.e. a check that reports accurately, and at the right moment, that the MP is ready to process traffic: it is up, it has fully deployed the API proxies, and it will process new API calls normally.

There are two equivalent approaches to this readiness check, each of which signals when the MP is ready to process API traffic.

a) During bootstrap of the MP:

HTTP GET http://{MP-IP}:8082/v1/servers/self/up

The response to this check progresses through the following states, in order:

- TCP connection refused (port closed)

- HTTP 503 response code with text payload: "Service not up yet"

- HTTP 200 OK response with text payload: "true"

Example command line check:

curl -v http://{MP-IP}:8082/v1/servers/self/up

b) During bootstrap of the MP:

HTTP GET http://{MP-IP}:8998/

with the two mandatory request headers:

"X-Apigee.heartbeat: true"

and

"Connection-keepalive: true"

The response to this check progresses through the following states, in order:

- TCP connection refused (port closed)

- HTTP 599 response code with text payload: "Server Not Ready"

- HTTP 200 OK response with text payload: "Server Ready"

Example command line check:

curl -v -H "X-Apigee.heartbeat: true" -H "Connection-keepalive: true" http://{MP-IP}:8998/

You only need option a) or option b); they are equivalent. In both cases, the last state indicates that the server is ready.

6. remove the iptables rule added in step 3 so that port 8998 on the Message Processor is unblocked and API traffic can start flowing normally, now that the component is ready to process API traffic in all organizations and environments wired to it.
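Steps 3, 5 and 6 can be sketched together in shell. This is a minimal illustration under stated assumptions, not an official procedure: the function names, the retry limit and the exact iptables rule are hypothetical and should be adapted to the local firewall policy, and the iptables commands must run as root on the Message Processor node. The sketch polls option a) on port 8082, which stays reachable while port 8998 is blocked.

```shell
#!/usr/bin/env bash
# Sketch of steps 3, 5 and 6 for a newly added Message Processor.
# All function names, the retry limit and the iptables rule are assumptions.

# Map the HTTP status code of readiness check a) to a state name.
mp_state() {
  case "$1" in
    200) echo ready ;;     # payload "true": proxies deployed
    503) echo starting ;;  # payload "Service not up yet"
    *)   echo down ;;      # e.g. 000 when the TCP connection is refused
  esac
}

# Step 5: poll check a) once per second until ready, or give up.
wait_for_mp() {
  local mp_ip="$1" max_tries="${2:-300}" code i=0
  while [ "$i" -lt "$max_tries" ]; do
    code=$(curl -s -o /dev/null -m 2 -w '%{http_code}' \
      "http://${mp_ip}:8082/v1/servers/self/up")
    [ "$(mp_state "$code")" = ready ] && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Step 3: block inbound API traffic on port 8998 (one possible rule).
block_mp_port() { iptables -I INPUT -p tcp --dport 8998 -j REJECT; }

# Step 6: delete the same rule once the MP reports ready.
unblock_mp_port() { iptables -D INPUT -p tcp --dport 8998 -j REJECT; }
```

A possible sequence is to call `block_mp_port`, install and start the Message Processor, and then run `wait_for_mp 127.0.0.1 && unblock_mp_port`. If the wait times out, the port stays blocked, so a node that never became ready never receives traffic.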

Note: the sequence is reversed during a graceful shutdown, for example when you need to decommission capacity.

This approach is accurate for OPDK versions 18.05 and 19.01. Note that the checks work differently in older releases.
