Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types: on-premises, edge, and public cloud. Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with version and lifecycle parity, and a consistent security and governance model across environments. For on-premises Apigee environments, Anthos provides a cluster whose Kubernetes versioning, components, and release cadence are consistent with GKE in GCP and with Anthos Multi-Cloud on AWS and Azure.
One of the challenges customers face is the mismatch between the release cadence of Apigee Hybrid and Anthos and their own upgrade cycle. It is not uncommon for customers to run versions of Anthos and Apigee that are EOL or approaching EOL, and customers on these older versions frequently come to Apigee Support with platform issues.
This document captures the best practices for upgrading Apigee Hybrid on Anthos (Bare Metal, VMware, and multi-cloud environments such as AWS).
Regardless of the Apigee version in use, some general guidelines will reduce the risks associated with upgrades. This section discusses what to prepare and do before an upgrade. The better the preparation, the smoother the upgrade will go. Treat this as a checklist and tick all necessary items before moving on to the actual upgrade; once the upgrade is in progress, it is hard to change things or interact with the system. If things do not go as intended, it is important to have an action plan at hand for recovering from any potential failure.
Apigee added significant enhancements to the upgrade process in version 1.8. This guide focuses mainly on Apigee versions 1.5, 1.6, and 1.7.
The Anthos upgrade process is an in-place, rolling upgrade that proceeds one node at a time to avoid disruption; each node is put in maintenance mode before it is upgraded. There is also a Blue/Green option, where you set up a parallel user cluster with the new Anthos version connected to the same admin cluster, followed by the admin cluster upgrade. We recommend that you establish a cadence for upgrading the Anthos cluster so that you do not end up on an unsupported version of Anthos.
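As a minimal sketch for Anthos clusters on bare metal (flags can vary by release, so check the bmctl reference for your version), an in-place upgrade is driven with bmctl after updating the target version in the cluster configuration:

# Set anthosBareMetalVersion in the cluster config to the target version, then:
bmctl upgrade cluster -c {CLUSTER_NAME} --kubeconfig {ADMIN_KUBECONFIG}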
We recommend that you back up your clusters regularly to ensure your snapshot data is relatively current. Adjust the backup frequency to reflect how often significant changes are made to your clusters. Starting with Anthos release 1.9, you can use the CLI command (bmctl backup cluster) to perform the backup.
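For example, on Anthos clusters on bare metal (assuming release 1.9 or later, as noted above; the cluster name is a placeholder):

bmctl backup cluster -c {CLUSTER_NAME} --kubeconfig {ADMIN_KUBECONFIG}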
We recommend a minimum of three nodes for your cluster’s HA control plane. During an upgrade, you may want to consider adding additional node(s) to provide extra capacity.
Before any upgrade, be sure to read the release notes so that you are aware of what’s changed since your last upgrade, including any security fixes and known issues.
Automation improves software deployment efficiency, while a Git-based workflow can help you fix issues in production quickly, even with complex software with a large team involved.
In addition, when planning your overall infrastructure upgrade or maintenance strategy, you may want to consider an in-place hardware and OS upgrade process alongside the Anthos software upgrade. For example, before performing a hardware or OS upgrade, put the worker node in maintenance mode so that applications are gracefully drained and rescheduled on other nodes in the cluster.
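A generic sketch using stock kubectl (Anthos also exposes maintenance-mode settings in the cluster configuration, and drain flags vary slightly across kubectl versions; the node name is a placeholder):

kubectl cordon {NODE_NAME} --kubeconfig {KUBECONFIG}
kubectl drain {NODE_NAME} --ignore-daemonsets --delete-emptydir-data --kubeconfig {KUBECONFIG}
# After the hardware/OS maintenance completes:
kubectl uncordon {NODE_NAME} --kubeconfig {KUBECONFIG}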
Since Anthos upgrades demand additional resources, we recommend checking whether the cluster can accommodate these additional resource requests.
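One way to check the headroom is to compare each node’s allocated resource requests against its allocatable capacity (kubectl top requires metrics-server to be installed):

kubectl describe nodes --kubeconfig {KUBECONFIG} | grep -A 8 "Allocated resources"
kubectl top nodes --kubeconfig {KUBECONFIG}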
As a rule of thumb for an in-place upgrade:
If upgrading an Anthos user cluster with two node pools, where each node has 8 vCPUs and 32 GB of RAM configured, the upgrade procedure will consume an additional:
In the case of a Blue/Green deployment, the new cluster requires an exact replica of the resources of the original cluster.
The Apigee Hybrid upgrade is an in-place, rolling upgrade that proceeds one pod at a time. The upgrade should be non-disruptive and should not cause any downtime. However, it is essential to establish an upgrade cadence to ensure that you are not running an unsupported version of Apigee Hybrid. Another approach is a Blue/Green deployment, where you create a parallel cluster and expand Cassandra across the regions.
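As a hedged sketch of the Cassandra expansion, the new cluster’s overrides typically point at a seed host in the existing region (the property names follow the Apigee Hybrid overrides schema; the IP, datacenter, and rack values below are placeholders):

cassandra:
  multiRegionSeedHost: "10.0.0.11"   # IP of a Cassandra node in the existing cluster (placeholder)
  datacenter: "dc-2"                 # name for the new region
  rack: "ra-1"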
All components except Cassandra are stateless and do not persist any data, so backup and restoration are not necessary for them. During recovery, reinstalling those components using the existing overrides is sufficient.
Apigee Hybrid depends on the right Anthos and ASM versions, so version compatibility is important. The compatibility matrix can be found here. It is important to review the version compatibility and create an upgrade plan accordingly. When upgrading, the following constraints need to be considered:
Before any upgrade, review the release notes to ensure that you are aware of any known issues.
Apigee also installs cert-manager in the cert-manager namespace. Some Anthos versions ship their own cert-manager in the kube-system namespace, which may conflict with Apigee’s cert-manager. The installation should have only one cert-manager.
To verify this, check the ClusterIssuer:
kubectl get clusterissuer --kubeconfig {KUBECONFIG}
Make sure the ClusterIssuer in use is the Apigee one.
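A quick additional check for duplicate cert-manager installations across namespaces:

kubectl get deployments -A --kubeconfig {KUBECONFIG} | grep cert-manager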
If you are going for a Blue/Green upgrade, the following considerations apply to load balancers.
In the Blue/Green upgrade pattern, it is important to plan accordingly and make sure the environment and applications are ready upfront. Some key prerequisites need to be addressed:
There are different ways to execute such upgrades, such as creating an Apigee deployment with the higher version on a separate cluster and joining it to the Cassandra ring (Blue/Green upgrade), or upgrading the existing Apigee deployment (in-place upgrade).
Apigee typically uses workload deployment sets to manage workload distribution, so the impact is minimal.
The following table describes the disruption introduced during an Apigee upgrade:
| Function | Blue/Green Upgrade | In-Place Upgrade |
| --- | --- | --- |
| API Access | Not affected | Not affected |
| Publish API Proxies | Not affected | Not advised to push changes |
| Platform Operations | Not affected | Not advised to perform any platform operation during upgrades |
| Peak Traffic | Not affected | Not advised to upgrade during peak traffic |
While it is common to think of Apigee upgrades happening on the same cluster (in-place), there is also the possibility of a so-called Blue/Green upgrade. We speak of a Blue/Green Apigee upgrade when a completely new, empty cluster is installed side by side with the original cluster and joined to the Cassandra ring. Once this is achieved, the original cluster is usually decommissioned.
In such cases, a Blue/Green approach is preferred in order to guarantee minimal application impact.
Pros:
Cons:
Most upgrades are done in place, as this approach has obvious advantages.
If there are multiple environments, such as test, dev, and production, it is highly recommended to start with the least critical environment (for example, test) and verify functionality. Once successful, move on to the next least critical environment. This lets you move from one criticality level to the next, always verifying both the upgrade itself and that the workloads function properly.
Pros:
Cons:
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml
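After applying, you can watch the rolling replacement of pods; Apigee Hybrid runtime components run in the apigee namespace:

kubectl get pods -n apigee --kubeconfig {KUBECONFIG} --watch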
You can follow this document to install a new Anthos cluster. Generally, Anthos cluster installation follows this sequence:
The sections below discuss the configuration files that you can reference to install the Anthos cluster.
Node labels are added as part of the configuration; check that they are properly applied. The apigee-data and apigee-runtime nodes need to be labeled appropriately; follow the instructions here to label nodes.
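For example, assuming the default Apigee Hybrid node selector keys (the node names are placeholders):

kubectl label nodes {DATA_NODE_NAME} cloud.google.com/gke-nodepool=apigee-data --kubeconfig {KUBECONFIG}
kubectl label nodes {RUNTIME_NODE_NAME} cloud.google.com/gke-nodepool=apigee-runtime --kubeconfig {KUBECONFIG}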
The new cluster receives all the proxies and other Apigee resources during the Cassandra sync process. With a load balancer in place, this can be tested easily.
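For example, a simple smoke test through the new cluster’s load balancer (the IP, basepath, and hostname below are placeholders for your environment; --resolve pins the hostname to the LB IP so TLS/SNI routing works):

curl -v -k --resolve {ENV_GROUP_HOSTNAME}:443:{LB_IP} https://{ENV_GROUP_HOSTNAME}/{PROXY_BASEPATH}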
Follow this document to decommission the cluster.
In this scenario, a customer running Apigee Hybrid 1.5.x wants to upgrade to Apigee 1.7.x. On further analysis, it turned out that the customer was running EOL versions of Anthos, ASM, and Apigee, so all the components required upgrading.
A Blue/Green upgrade pattern was suggested, and the following upgrade sequence was created for the customer.
| | Anthos | ASM | Apigee | Note |
| --- | --- | --- | --- | --- |
| **DC1 & DC2 (existing)** | | | | |
| Today | 1.8.x | 1.7.x | 1.5.x | Current versions |
| Step 1 | 1.8.x | 1.9.x | 1.5.x | Upgrade ASM from 1.7 to 1.9 on existing clusters to prepare for compatibility with Apigee hybrid 1.6 |
| Step 2 | 1.8.x | 1.9.x | 1.6.x | Upgrade Apigee hybrid from 1.5 to 1.6 on existing clusters, to allow for expanding to new clusters in DC3 & DC4 |
| **DC3 & DC4 (new)** | | | | |
| Step 3 | 1.10.x | 1.12 | 1.6.x | Install the same version of Apigee hybrid into new DCs, joining Cassandra to the existing ring of DC1 & DC2 |
| Step 4 | | | | Decommission and disconnect DC1 & DC2 |
| Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7 |
| End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time |
Fig.: Blue/Green Deployment for Apigee Hybrid
In this scenario, a customer running Apigee Hybrid 1.5.x wants to upgrade to Apigee 1.7.x. On further analysis, it turned out that the customer was running EOL versions of Anthos, ASM, and Apigee, so all the components required upgrading. In this case, the 1.6.x and 1.5.x versions of Cassandra are joined in the same ring.
| | Anthos | ASM | Apigee | Note |
| --- | --- | --- | --- | --- |
| **DC1 & DC2 (existing)** | | | | |
| Today | 1.6.x | 1.8.x | 1.5.x | Current versions |
| **DC3 & DC4 (new)** | | | | |
| Step 2 | 1.10.x | 1.12 | 1.6.x | Install the N+1 version of Apigee hybrid into new DCs (with an exception), joining Cassandra to the existing ring of DC1 & DC2 |
| Step 3 | | | | Decommission and disconnect DC1 & DC2 |
| Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7 |
| End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time |
Upgrade preparations checklist:
kubectl get nodes --kubeconfig {KUBECONFIG}
Look at the VERSION and AGE to figure out whether the node is upgraded.
If the node is not ready, log in to the node and inspect /var/log/startup.log and /var/log/cloud-init-output.log.
kubectl get pods --kubeconfig {KUBECONFIG} -A
Look at the AGE to figure out whether the pod is upgraded.
kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}
kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c clusterapi-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}
NOTE: The above commands are for the worker nodes. If you want to inspect the controller log for the status of the user cluster control plane nodes, replace the namespace with kube-system.
kubectl logs {ONPREM_USER_CLUSTER_CONTROLLER_POD_NAME} -c onprem-user-cluster-controller --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n kube-system
These nodes have been cordoned and scheduling has been disabled. You can list all the nodes and look for SchedulingDisabled in the STATUS column.
For example, to see only the nodes that are being drained:
kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} | grep "SchedulingDisabled"
To continuously “watch” all the nodes:
kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} --watch
You can get the cluster API controller log for the node {NODE_NAME}. First, get the name of the cluster API controller pod.
kubectl --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} get pods -n {USER_CLUSTER_NAME} | grep clusterapi-controllers
Then get the log of that pod and filter with the node name*.
kubectl logs {CLUSTER_API_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME} | grep "{NODE_NAME}"
* For clusters using DHCP mode, node names are the same as the machine names. For clusters using static IPs, you will need to map the node name to the machine name to search the log.
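One hedged way to list the machine-to-node mapping is to query the Machine objects in the admin cluster (output columns can vary by version):

kubectl get machines --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME} -o wide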
If there are any draining errors, you should see a log message “Failed to completely drain node” in the log you found in the previous step. This log line will include the exact cause of the draining problem.