Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types: on-premises, edge, and public cloud. Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with version and lifecycle parity, and a consistent security and governance model across environments. For on-premises Apigee environments, Anthos provides a cluster whose Kubernetes versioning, components, and release cadence are consistent with GKE in GCP and with Anthos Multi-Cloud on AWS and Azure.
One of the challenges customers face is the mismatch between the release cadence of Apigee Hybrid and Anthos and their own upgrade cycle. It is not uncommon for customers to run versions of Anthos and Apigee that are EOL or approaching EOL, and customers on these older versions frequently come to Apigee Support with platform issues.
This document captures the best practices for upgrading Apigee Hybrid on Anthos (Bare Metal, VMware, and multi-cloud environments such as AWS).
Regardless of the Apigee version in use, some general guidelines will reduce the risks associated with upgrades. This section discusses what to prepare and do before an upgrade. The better the preparation, the smoother the upgrade will go. Treat this as a checklist and tick all necessary items before moving on to the actual upgrade; once the upgrade is in progress, it is hard to change things or interact with the system. If things do not go as intended, it is important to have an action plan at hand for recovering from any potential failure.
Apigee added significant enhancements to the upgrade process in version 1.8. This guide focuses mainly on Apigee versions 1.5, 1.6, and 1.7.
The Anthos upgrade process is an in-place, rolling upgrade that proceeds one node at a time to avoid disruption; each node is put in maintenance mode before it is upgraded. There is also a Blue/Green option, where you set up a parallel user cluster with the new Anthos version connected to the same admin cluster, followed by the admin cluster upgrade. We recommend that you establish a cadence for upgrading the Anthos cluster so that you do not end up on an unsupported version of Anthos.
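As a minimal sketch for Anthos clusters on bare metal (flags can vary by release, so check the bmctl reference for your version), an in-place upgrade is driven with bmctl after updating the target version in the cluster configuration:

# Set anthosBareMetalVersion in the cluster config to the target version, then:
bmctl upgrade cluster -c {CLUSTER_NAME} --kubeconfig {ADMIN_KUBECONFIG}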
We recommend that you back up your clusters regularly to ensure your snapshot data is relatively current. Adjust the backup frequency to reflect how often significant changes are made to your clusters. Starting with Anthos release 1.9, you can use the CLI command (bmctl backup cluster) to perform the backup.
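For example, on Anthos clusters on bare metal (assuming release 1.9 or later, as noted above; the cluster name is a placeholder):

bmctl backup cluster -c {CLUSTER_NAME} --kubeconfig {ADMIN_KUBECONFIG}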
We recommend a minimum of three nodes for your cluster’s HA control plane. During an upgrade, you may want to consider adding additional node(s) to provide extra capacity.
Before any upgrade, be sure to read the release notes so that you are aware of what’s changed since your last upgrade, including any security fixes and known issues.
Automation improves software deployment efficiency, while a Git-based workflow can help you fix issues in production quickly, even with complex software with a large team involved.
In addition, when planning your overall infrastructure upgrade or maintenance strategy, you may want to consider an in-place hardware and OS upgrade process alongside the Anthos software upgrade. For example, before performing a hardware or OS upgrade, put the worker node in maintenance mode so that applications are gracefully drained and rescheduled on other nodes in the cluster.
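A generic sketch using stock kubectl (Anthos also exposes maintenance-mode settings in the cluster configuration, and drain flags vary slightly across kubectl versions; the node name is a placeholder):

kubectl cordon {NODE_NAME} --kubeconfig {KUBECONFIG}
kubectl drain {NODE_NAME} --ignore-daemonsets --delete-emptydir-data --kubeconfig {KUBECONFIG}
# After the hardware/OS maintenance completes:
kubectl uncordon {NODE_NAME} --kubeconfig {KUBECONFIG}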
Since Anthos upgrades demand additional resources, we recommend checking whether the cluster can accommodate these additional resource requests.
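One way to check the headroom is to compare each node’s allocated resource requests against its allocatable capacity (kubectl top requires metrics-server to be installed):

kubectl describe nodes --kubeconfig {KUBECONFIG} | grep -A 8 "Allocated resources"
kubectl top nodes --kubeconfig {KUBECONFIG}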
As a rule of thumb for an in-place upgrade:
If upgrading an Anthos user cluster with two node pools, where each node has 8 vCPUs and 32 GB of RAM configured, the upgrade procedure will consume an additional:
In the case of a Blue/Green deployment, the new cluster requires an exact replica of the resources of the original cluster.
The Apigee Hybrid upgrade is an in-place, rolling upgrade that proceeds one pod at a time. The upgrade should be non-disruptive and should not cause any downtime. However, it is essential to establish an upgrade cadence to ensure that you are not running an unsupported version of Apigee Hybrid. Another approach is a Blue/Green deployment, where you create a parallel cluster and expand Cassandra across the regions.
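As a hedged sketch of the Cassandra expansion, the new cluster’s overrides typically point at a seed host in the existing region (the property names follow the Apigee Hybrid overrides schema; the IP, datacenter, and rack values below are placeholders):

cassandra:
  multiRegionSeedHost: "10.0.0.11"   # IP of a Cassandra node in the existing cluster (placeholder)
  datacenter: "dc-2"                 # name for the new region
  rack: "ra-1"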
All components except Cassandra are stateless and do not persist any data, so backup and restoration are not necessary for them. During recovery, reinstalling those components using the existing overrides is sufficient.
Apigee Hybrid depends on the right Anthos and ASM versions, so version compatibility is important. The compatibility matrix can be found here. It is important to review the version compatibility and create an upgrade plan accordingly. When upgrading, the following constraints need to be considered:
Before any upgrade, review the release notes to ensure that you are aware of any known issues.
Apigee also installs cert-manager in the cert-manager namespace. Some Anthos versions ship their own cert-manager in the kube-system namespace, which may conflict with Apigee’s cert-manager. The installation should have only one cert-manager.
To verify this, check the ClusterIssuer:
kubectl get clusterissuer --kubeconfig {KUBECONFIG}
Make sure the ClusterIssuer in use is the Apigee one.
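A quick additional check for duplicate cert-manager installations across namespaces:

kubectl get deployments -A --kubeconfig {KUBECONFIG} | grep cert-manager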
If you are going for a Blue/Green upgrade, the following considerations apply to load balancers.
In the Blue/Green upgrade pattern, it is important to plan accordingly and make sure the environment and applications are ready upfront. Some key prerequisites need to be addressed:
There are different ways to execute such upgrades, such as creating an Apigee deployment with the higher version on a separate cluster and joining it to the Cassandra ring (Blue/Green upgrade), or upgrading the existing Apigee deployment (in-place upgrade).
Apigee typically uses workload deployment sets to manage workload distribution, so the impact is minimal.
The following table describes the disruption introduced during an Apigee upgrade:
| Function | Blue/Green Upgrade | In-Place Upgrade |
| --- | --- | --- |
| API Access | Not affected | Not affected |
| Publish API Proxies | Not affected | Not advised to push changes |
| Platform Operations | Not affected | Not advised to perform any platform operation during upgrades |
| Peak Traffic | Not affected | Not advised to upgrade during peak traffic |
While it is common to think of Apigee upgrades happening on the same cluster (in-place), there is also the possibility of a so-called Blue/Green upgrade. We speak of a Blue/Green Apigee upgrade when a completely new, empty cluster is installed side by side with the original cluster and joined to the Cassandra ring. Once this is achieved, the original cluster is usually decommissioned.
In such cases, a Blue/Green approach is preferred in order to guarantee minimal application impact.
Pros:
Cons:
Most upgrades are done in place, as this approach has obvious advantages.
If there are multiple environments, such as test, dev, and production, it is highly recommended to start with the least critical environment (for example, test) and verify functionality. Once successful, move on to the next least critical environment. This lets you move from one criticality level to the next, always verifying both the upgrade itself and that the workloads function properly.
Pros:
Cons:
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml
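After applying, you can watch the rolling replacement of pods; Apigee Hybrid runtime components run in the apigee namespace:

kubectl get pods -n apigee --kubeconfig {KUBECONFIG} --watch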
You can follow this document to install a new Anthos cluster. Generally, Anthos cluster installation follows this sequence:
The sections below discuss the configuration files that you can reference to install the Anthos cluster.
Node labels are added as part of the configuration; check that they are properly applied. The apigee-data and apigee-runtime nodes need to be labeled appropriately; follow the instructions here to label nodes.
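For example, assuming the default Apigee Hybrid node selector keys (the node names are placeholders):

kubectl label nodes {DATA_NODE_NAME} cloud.google.com/gke-nodepool=apigee-data --kubeconfig {KUBECONFIG}
kubectl label nodes {RUNTIME_NODE_NAME} cloud.google.com/gke-nodepool=apigee-runtime --kubeconfig {KUBECONFIG}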
The new cluster receives all the proxies and other Apigee resources during the Cassandra sync process. With a load balancer in place, this can be tested easily.
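For example, a simple smoke test through the new cluster’s load balancer (the IP, basepath, and hostname below are placeholders for your environment; --resolve pins the hostname to the LB IP so TLS/SNI routing works):

curl -v -k --resolve {ENV_GROUP_HOSTNAME}:443:{LB_IP} https://{ENV_GROUP_HOSTNAME}/{PROXY_BASEPATH}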
Follow this document to decommission the cluster.
In this scenario, a customer running Apigee Hybrid 1.5.x wants to upgrade to Apigee 1.7.x. On further analysis, it turned out that the customer was running EOL versions of Anthos, ASM, and Apigee, so all the components required upgrading.
A Blue/Green upgrade pattern was suggested, and the following upgrade sequence was created for the customer.
| | Anthos | ASM | Apigee | Note |
| --- | --- | --- | --- | --- |
| **DC1 & DC2 (existing)** | | | | |
| Today | 1.8.x | 1.7.x | 1.5.x | Current versions |
| Step 1 | 1.8.x | 1.9.x | 1.5.x | Upgrade ASM from 1.7 to 1.9 on existing clusters to prepare for compatibility with Apigee hybrid 1.6 |
| Step 2 | 1.8.x | 1.9.x | 1.6.x | Upgrade Apigee hybrid from 1.5 to 1.6 on existing clusters, to allow for expanding to new clusters in DC3 & DC4 |
| **DC3 & DC4 (new)** | | | | |
| Step 3 | 1.10.x | 1.12 | 1.6.x | Install the same version of Apigee hybrid into new DCs, joining Cassandra to the existing ring of DC1 & DC2 |
| Step 4 | | | | Decommission and disconnect DC1 & DC2 |
| Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7 |
| End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time |
Fig.: Blue/Green Deployment for Apigee Hybrid
In this scenario, a customer running Apigee Hybrid 1.5.x wants to upgrade to Apigee 1.7.x. On further analysis, it turned out that the customer was running EOL versions of Anthos, ASM, and Apigee, so all the components required upgrading. In this case, the 1.6.x and 1.5.x versions of Cassandra are joined in the same ring.
| | Anthos | ASM | Apigee | Note |
| --- | --- | --- | --- | --- |
| **DC1 & DC2 (existing)** | | | | |
| Today | 1.6.x | 1.8.x | 1.5.x | Current versions |
| **DC3 & DC4 (new)** | | | | |
| Step 2 | 1.10.x | 1.12 | 1.6.x | Install the N+1 version of Apigee hybrid into new DCs (with an exception), joining Cassandra to the existing ring of DC1 & DC2 |
| Step 3 | | | | Decommission and disconnect DC1 & DC2 |
| Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7 |
| End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time |
Upgrade preparations checklist:
kubectl get nodes --kubeconfig {KUBECONFIG}
Look at the VERSION and AGE to figure out whether the node is upgraded.
If the node is not ready, log in to the node and inspect /var/log/startup.log and /var/log/cloud-init-output.log.
kubectl get pods --kubeconfig {KUBECONFIG} -A
Look at the AGE to figure out whether the pod is upgraded.
kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}
kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c clusterapi-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}
NOTE: The above commands are for the worker nodes. If you want to inspect the controller log for the status of the user cluster control plane nodes, replace the namespace with kube-system.
kubectl logs {ONPREM_USER_CLUSTER_CONTROLLER_POD_NAME} -c onprem-user-cluster-controller --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n kube-system
These nodes have been cordoned and scheduling has been disabled. You can list all the nodes and look for SchedulingDisabled in the STATUS column.
For example, to see only the nodes that are being drained:
kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} | grep "SchedulingDisabled"
To continuously “watch” all the nodes:
kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} --watch
You can get the cluster API controller log for the node {NODE_NAME}. First, get the name of the cluster API controller pod.
kubectl --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} get pods -n {USER_CLUSTER_NAME} | grep clusterapi-controllers
Then get the log of that pod and filter with the node name*.
kubectl logs {CLUSTER_API_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME} | grep "{NODE_NAME}"
* For clusters using DHCP mode, node names are the same as the machine names. For clusters using static IPs, you will need to map the node name to the machine name to search the log.
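One hedged way to list the machine-to-node mapping is to query the Machine objects in the admin cluster (output columns can vary by version):

kubectl get machines --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME} -o wide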
If there are any draining errors, you should see a log message “Failed to completely drain node” in the log you found in the previous step. This log line will include the exact cause of the draining problem.