How to Verify ZooKeeper Failover of Multi-DC Private Cloud Deployments

Apigee Private Cloud production deployments can be created according to a collection of topologies recommended by Apigee:

  • 9-node single data center topology
  • 13-node single data center topology
  • 12-node multiple data center topology

Of these, the 12-node multiple data center topology can be used to make the Apigee API Management system available in two geographical regions:

Image reference: https://docs.apigee.com/private-cloud/v4.19.01/installation-topologies#12hostclusteredinstallation

All Apigee-recommended topologies are designed to provide high availability for all runtime components, such as Routers, Message Processors, Cassandra, ZooKeeper, and Qpid, within a data center. Hence, even if one instance of each component becomes unavailable, the availability of runtime traffic or analytics data is not affected.

In multi-region deployments, data store components such as Cassandra and ZooKeeper have even higher availability due to the additional instances and data replication across data centers. As you may already know, Cassandra stores runtime data such as API keys, OAuth tokens, KVM entries, caches, and quota counters, which directly impacts API proxy latency. Hence, Cassandra nodes always need to be available in each region for optimal performance of the API gateway.

However, if all ZooKeeper nodes in one region become completely unavailable, runtime traffic is not affected, as ZooKeeper does not store any runtime data. It does, however, make the Management Server in that region unavailable. We can work around this by updating that Management Server to point to the ZooKeeper nodes in the other region until the failed ZooKeeper nodes are recovered.

In this article I will explain how this process can be verified using a sample 12-node two data center deployment.

1. First, create a 12-node two data center deployment according to the installation guide: https://docs.apigee.com/private-cloud/v4.19.01/installation-topologies#12hostclusteredinstallation

2. Now, verify the status of the Management Server in each region:

curl -i -u username:password http://{dc-1-management-server-host}:8080/v1/organizations
curl -i -u username:password http://{dc-2-management-server-host}:8080/v1/organizations
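
A healthy Management Server typically returns an HTTP 200 response with a JSON array of organization names (the organization name below is just an illustrative placeholder):

HTTP/1.1 200 OK
Content-Type: application/json

[ "example-org" ]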

In this topology, the Management Server in each region connects only to the ZooKeeper nodes in the same data center. This is controlled by the ZK_CLIENT_HOSTS parameter in the silent configuration file used during installation:

https://docs.apigee.com/private-cloud/v4.19.01/edge-configuration-file-reference
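
For example, in a two data center installation the DC-1 silent configuration file typically lists all ZooKeeper nodes in ZK_HOSTS (with the remote observer marked) but only the local nodes in ZK_CLIENT_HOSTS. The snippet below is an illustrative sketch using placeholder IP variables; refer to the configuration file reference above for the exact values required by your topology:

# DC-1 silent configuration file (illustrative sketch)
ZK_HOSTS="$IP1 $IP2 $IP3 $IP7 $IP8 $IP9:observer"
ZK_CLIENT_HOSTS="$IP1 $IP2 $IP3"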

The resulting ZooKeeper connection string can be seen in the following file on the Management Server:

ssh {dc-1-management-server-host} cat /opt/apigee/edge-management-server/conf/zookeeper.properties | grep connection.string
# zookeeper connection string. format:connection.string={dc-1-zk-host-1}:2181,{dc-1-zk-host-2}:2181,{dc-1-zk-host-3}:2181/

3. Now, stop ZooKeeper on each of the three DC-2 nodes and check the status of the Management Server in that region:

ssh {apigee-zookeeper-node}
apigee-service apigee-zookeeper stop
curl -i -u username:password http://{dc-2-management-server-host}:8080/v1/organizations

You may see a response similar to the following:

HTTP/1.1 500 Server Error
Date: Thu, 02 May 2019 02:39:34 GMT
X-Apigee.fault-code: zookeeper.ErrorCheckingPathExistence
Content-Type: application/json
X-Apigee.user: <masked>
X-Apigee.organization: null
Date: Thu, 02 May 2019 02:39:36 GMT
Content-Length: 263
{
  "code" : "zookeeper.ErrorCheckingPathExistence",
  "message" : "Error while checking path existence for path : /organizations",
  "contexts" : [ ],
  "cause" : {
    "message" : "KeeperErrorCode = ConnectionLoss for /organizations",
    "contexts" : [ ]
  }
}

In the DC-2 Management Server logs, you may see entries similar to the following:

2019-05-02 01:17:09,981 org: env: target: contextId: action: CuratorFramework-0 ERROR o.a.c.ConnectionState - ConnectionState.checkTimeouts() : Connection timed out for connection string ({dc-2-zk-host-1}:2181,{dc-2-zk-host-2}:2181,{dc-2-zk-host-3}:2181) and timeout (3000) / elapsed (3000)
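
Before changing any configuration, you can confirm which ZooKeeper nodes are reachable from the DC-2 Management Server host. The sketch below uses ZooKeeper's "ruok" four-letter command via nc (assuming nc is installed; a running node replies "imok", a stopped node returns nothing or a connection error, and on newer ZooKeeper versions four-letter commands may need to be whitelisted):

echo ruok | nc {dc-2-zk-host-1} 2181    # no reply expected while the node is stopped
echo ruok | nc {dc-1-zk-host-1} 2181    # DC-1 nodes should still reply imok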

4. Now, change the configuration of the DC-2 Management Server to connect to the ZooKeeper nodes in DC-1, using the configuration below.

File:

/opt/apigee/customer/application/management-server.properties

Configuration value:

conf/zookeeper.properties+connection.string={dc-1-zk-host-1}:2181,{dc-1-zk-host-2}:2181,{dc-1-zk-host-3}:2181/
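
For example, a minimal sketch of applying this override on the DC-2 Management Server host (the properties file may not exist yet; per standard Apigee guidance it should be owned by the apigee user):

ssh {dc-2-management-server-host}
echo 'conf/zookeeper.properties+connection.string={dc-1-zk-host-1}:2181,{dc-1-zk-host-2}:2181,{dc-1-zk-host-3}:2181/' >> /opt/apigee/customer/application/management-server.properties
chown apigee:apigee /opt/apigee/customer/application/management-server.properties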

5. Now, restart the Management Server in DC-2 and wait until it becomes ready:

apigee-service edge-management-server restart
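
If your Edge version supports it, you can also use the wait_for_ready action to block until the Management Server reports that it is ready (a hedged alternative to polling the API manually):

apigee-service edge-management-server wait_for_ready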

6. Once the DC-2 Management Server has restarted, execute the following curl command to verify its status:

curl -i -u username:password http://{dc-2-management-server-host}:8080/v1/organizations
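
Once the failed ZooKeeper nodes in DC-2 are recovered, this change can be reverted: start ZooKeeper on each DC-2 node, remove the connection string override, and restart the DC-2 Management Server. A minimal sketch, using the same placeholders as above:

ssh {apigee-zookeeper-node}
apigee-service apigee-zookeeper start    # repeat on each DC-2 ZooKeeper node
ssh {dc-2-management-server-host}
# remove or comment out the connection.string override in /opt/apigee/customer/application/management-server.properties, then:
apigee-service edge-management-server restart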