Google Cloud - Best Practices for Apigee Hybrid Upgrade on Anthos

Purpose of this document

Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types (on-premises, edge, and public cloud). Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with consistent versions and lifecycle, and a consistent security and governance model across environments. In on-premises Apigee environments, Anthos provides a cluster whose Kubernetes versioning, components, and release cadence are consistent with GKE in GCP and Anthos Multi-Cloud in AWS and Azure.

One of the challenges customers face is the mismatch between the release cadence of Apigee Hybrid and Anthos and their own upgrade cycle. It is not uncommon for customers to run versions of Anthos and Apigee that are either EOL or approaching EOL, and customers on older versions of Apigee and Anthos often come to Apigee Support with issues on their platform.

This document captures the best practices for upgrading Apigee Hybrid on Anthos (Bare Metal, VMware, and Multi-Cloud such as AWS).

 

Upgrade best practices

Regardless of the version of Apigee being used, there are some general guidelines that will reduce the risks associated with upgrades. This section discusses what to prepare and do before an upgrade takes place. The better the preparations, the smoother the upgrade will go. Consider this a checklist and tick all necessary items before moving on with the actual upgrade. Once the upgrade is in progress, it is hard to change things or interact with the system, so if things do not go as intended it is important to have an action plan at hand for recovering from any potential failure.

Apigee has added great enhancements to the upgrade process in version 1.8. This guide will mainly focus on Apigee versions 1.5, 1.6 and 1.7. 

Apigee Hybrid upgrade on Anthos

Plan for the Upgrade

Anthos Considerations

  • Do you want to establish a cadence for upgrades to ensure a smooth operation? 

The Anthos upgrade process is an in-place, rolling upgrade that proceeds one node at a time to avoid disruptions; each node is put in maintenance mode before it is upgraded. There is also an option for a Blue/Green upgrade, where you set up a parallel user cluster with a new version of Anthos connected to the same admin cluster, followed by an admin cluster upgrade. We recommend that you establish a cadence for upgrading the Anthos cluster so that you don’t end up on an unsupported version of Anthos.
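As an illustration, on Anthos clusters on bare metal the rolling upgrade is typically driven with bmctl (the cluster name and kubeconfig path below are placeholders; on VMware the equivalent tool is gkectl). Verify the exact syntax against the Anthos documentation for your release:

# In-place rolling upgrade of an Anthos on bare metal cluster (placeholders; confirm flags for your Anthos release)
bmctl upgrade cluster -c {CLUSTER_NAME} --kubeconfig {ADMIN_KUBECONFIG}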

  • Do you want to back up before upgrading? 

We recommend that you back up your clusters regularly to ensure your snapshot data is relatively current. Adjust the rate of backups to reflect the frequency of significant changes to your clusters. Starting with Anthos release 1.9, you can use the CLI command (“bmctl backup cluster”) to perform the backup.
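For example, on Anthos clusters on bare metal (release 1.9 or later) a backup could be taken as follows; the cluster name is a placeholder and the exact flags can vary per release, so confirm them with the bmctl reference:

# Back up the cluster (placeholder cluster name; confirm flags with the bmctl reference for your release)
bmctl backup cluster -c {CLUSTER_NAME}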

  • Does your environment use a high-availability design? 

We recommend that your cluster’s HA control plane have a minimum of three nodes. During an upgrade, you may want to consider adding additional nodes to provide extra capacity.

  • Have you reviewed the latest Anthos release notes? 

Before any upgrade, be sure to read the release notes so that you are aware of what’s changed since your last upgrade, including any security fixes and known issues. 

  • Have you adopted Infrastructure as Code best practices and adopted git-based workflows?

Automation improves software deployment efficiency, while a Git-based workflow can help you fix issues in production quickly, even for complex software with a large team involved.

In addition, when planning for the overall infrastructure upgrade or maintenance strategy, you may want to consider an in-place hardware and OS upgrade process in addition to the Anthos software upgrade. For example, before performing a hardware and OS upgrade, configure the worker node in maintenance mode in the cluster control plane so that applications are gracefully drained and scheduled on other nodes in the cluster.
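The following is a minimal sketch of draining a worker node before hardware or OS maintenance using generic kubectl commands (node name and kubeconfig are placeholders; newer Anthos releases also provide cluster-level maintenance-mode settings):

# Stop new pods from being scheduled onto the node
kubectl cordon {NODE_NAME} --kubeconfig {KUBECONFIG}
# Gracefully evict running pods (use --delete-local-data on older kubectl versions)
kubectl drain {NODE_NAME} --ignore-daemonsets --delete-emptydir-data --kubeconfig {KUBECONFIG}
# After maintenance is complete, make the node schedulable again
kubectl uncordon {NODE_NAME} --kubeconfig {KUBECONFIG}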

  • Resource Requirements

Since Anthos upgrades demand additional resources, we recommend checking if the cluster can handle these additional resource requests.

As a rule of thumb, an in-place upgrade will require:

  • +1 VM per admin cluster upgrade
  • +1 VM per node pool per user cluster upgrade

If upgrading an Anthos user cluster with 2 node pools, where each node has 8 vCPUs and 32GB of RAM configured, the upgrade procedure will consume an additional:

  • 16 vCPUs
  • 64GB of RAM
  • VM disk space + 64GB of VSWAP

In the case of a Blue/Green deployment, the new cluster requires an exact replica of the resources of the original cluster.
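To check the available headroom before scheduling the upgrade, a quick sketch (kubectl top requires the metrics pipeline to be available; the kubeconfig path is a placeholder):

# Current CPU/memory usage per node
kubectl top nodes --kubeconfig {KUBECONFIG}
# Requested vs. allocatable resources per node
kubectl describe nodes --kubeconfig {KUBECONFIG} | grep -A 8 "Allocated resources"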

Apigee Considerations

  • Have you established a cadence for upgrades to ensure a smooth operation? 

The Apigee Hybrid upgrade is an in-place, rolling upgrade that proceeds one pod at a time. The upgrade should be non-disruptive and should not cause any downtime. However, it is essential to establish an upgrade cadence to ensure that you are not running an unsupported version of Apigee Hybrid. Another approach is a Blue/Green deployment, where you create a parallel cluster and expand Cassandra across the regions.

 

  • Do you want to back up before upgrading? 

All components except Cassandra are stateless and do not persist any data, so backup and restore are not necessary for them; during recovery, reinstalling those components using the existing overrides is sufficient. Cassandra holds the runtime data, so prepare a Cassandra backup before upgrading (see the Upgrade Checklist below).

 

  • Have you reviewed the Platform/Version Compatibility?

Apigee Hybrid depends on the right Anthos and ASM versions, so version compatibility is important. The compatibility matrix can be found here. It is important to review the version compatibility and create an upgrade plan accordingly. When upgrading, the following constraints need to be considered:

  • You can upgrade only to the N+1 version of Anthos. For example, if you have Anthos version 1.8 and want to upgrade to version 1.10, you have to first upgrade to 1.9 and then upgrade to 1.10.
  • Region expansion for Apigee Hybrid must use the same version of Hybrid. This means that if your primary region is running, say, version 1.6 and you want to expand to another region, you should install the new region with version 1.6.

 

  • Have you reviewed the latest Apigee release notes? 

Before any upgrade, review the release notes to ensure that you are aware of any known issues.

Cert-Manager

Apigee also installs cert-manager in the cert-manager namespace. Some versions of Anthos ship their own cert-manager in the kube-system namespace, which can conflict with Apigee’s cert-manager. The installation should have only one version of cert-manager.
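A simple way to check for duplicate installations is to list cert-manager deployments across all namespaces (kubeconfig is a placeholder):

kubectl get deployments -A --kubeconfig {KUBECONFIG} | grep -i cert-manager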


To verify this, check the cluster issuer:


kubectl get clusterissuer --kubeconfig {KUBECONFIG}


Make sure that the cluster issuer is the one created by Apigee.


Load Balancers


If you are planning a Blue/Green upgrade, the following considerations apply to load balancers:

  • Load balancers - If manual load balancers are used, they need to be configured for the new cluster.
  • DNS entries and certificates - The new cluster’s ingress may require new DNS entries and/or certificates. These need to be provisioned.

Firewalls

In the case of the Blue/Green upgrade pattern, it is important to plan accordingly and make sure the environment and the applications are ready and prepared up front. Some of the key prerequisites that need to be addressed:

  • All firewall rules are well established and the new cluster can access Google resources.
  • Firewall rules between nodes allow Cassandra communication across clusters; port 7001 needs to be open for Cassandra communication (see the connectivity check below).
  • The new cluster should create resources with the configurations as planned, or the same as the existing clusters.
  • Required VIPs - the new cluster needs VIPs for the admin and user clusters.
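As a basic connectivity check for the Cassandra port, you can test from a node in the new cluster that port 7001 on an existing Cassandra node is reachable (assuming nc is available on the node; the target IP is a placeholder):

# Run from a node in the new cluster; the target IP is a placeholder
nc -zv {EXISTING_CASSANDRA_NODE_IP} 7001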

Upgrade Sequence


There are different ways to execute such upgrades, such as creating an Apigee deployment with a higher version on a separate cluster and joining it to the Cassandra ring (blue/green upgrade), or upgrading the existing Apigee deployment (in-place upgrade).


Traffic Disruption during Upgrade


Apigee typically uses workload deployment sets to manage workload distribution, so the impact is minimal.

The following table describes the disruption introduced during an Apigee upgrade:


Function | Blue/Green Upgrade | In-Place Upgrade
API Access | Not Affected | Not Affected
Publish API Proxies | Not Affected | Not advised to push changes
Platform Operations | Not Affected | Not advised to perform any platform operation during upgrades
Peak Traffic | Not Affected | Not advised to upgrade during peak traffic


Upgrade Approach


In-Place vs Blue/Green Upgrades


While it is common to think of Apigee upgrades as happening on the same cluster (in-place), there is also the possibility of doing a so-called blue-green upgrade. We speak of a blue-green Apigee upgrade when a completely new and empty cluster is installed side by side with the original cluster and joined to the Cassandra ring. Once this is achieved, the original cluster is usually decommissioned.


In such cases, a blue-green approach is preferred in order to guarantee minimal application impact.


Pro:

  • Risk-averse upgrade approach
  • Easy to “fall back” since the original Apigee deployment remains untouched
  • Cluster feature set / configuration can be “cleaned up”
  • New Anthos/Kubernetes versions can easily be verified.

Cons:

  • Installation and migration can take longer than an in-place upgrade, depending on the number of version hops. If the upgrade spans more than 2 versions, this con no longer applies, since an in-place upgrade would need multiple sequential hops.
  • Higher resource usage / availability is required during the migration phase. Once the switch-over is completed, the old cluster can be decommissioned.
  • Can involve complex load balancer configuration to route traffic between the new and existing clusters

Most upgrades are done as in-place upgrades since they have obvious advantages (see the pros below).

If there are multiple environments such as test, dev, and production, it is highly recommended to start with the least critical environment (for example: test) and verify functionality. Once successful, move on to the next least critical environment. This enables you to move from one criticality level to the next, always verifying the upgrade itself and that the workloads are functioning properly.


Pro:

  • Less resource consumption
  • Reduced effort
  • No added load balancer complexity
  • Kubernetes handles application availability and failover within the cluster

Con:

  • Higher risk in case of failure
  • Difficult to fall back in case of an upgrade error 

Anatomy of Blue/Green Upgrade Process 

  1. Prepare the old cluster

    1. Update hostNetwork=true in the existing clusters (Anthos 1.6). This is required for inter-node communication between the new Anthos cluster and the existing cluster. Set hostNetwork=true in the overrides file; if the property is not present, add it in the cassandra section.
    2. Run the following command to apply the hostNetwork change:

$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml
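After the apply completes, you can verify that hostNetwork is in effect: with hostNetwork=true the Cassandra pod IPs should match the node IPs. The namespace and grep pattern below assume the default Apigee installation layout:

kubectl get pods -n apigee -o wide --kubeconfig {KUBECONFIG} | grep cassandra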


  2. Install the new version of Anthos on the new cluster

You can follow this document to install a new Anthos cluster. In general, an Anthos cluster installation has the following sequence:

  • Install the admin workstation.
  • Install the admin cluster.
  • Install all user clusters.

The sub-steps below describe the configuration files you can reuse from the old cluster:

    1. Copy the existing admin-cluster.yaml from the old cluster to the new cluster and make the relevant changes.
    2. Copy user-cluster.yaml from the old cluster and make the appropriate changes; the nodepool section needs to be modified.
    3. Copy the IP block sections from the old cluster to the new cluster and modify as required.
    4. Create the admin cluster and the user cluster.
  3. Firewall checks

Check that port 7001 is open between the new cluster nodes and the old user cluster nodes.

  4. Node labels

Node labels are added as part of the configuration; check that they are applied correctly. The apigee-data and apigee-runtime nodes need to be labeled appropriately. Follow the instructions here to label nodes.
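A minimal sketch of labeling nodes, assuming the default nodeSelector key and values used in the Apigee hybrid overrides (cloud.google.com/gke-nodepool with values apigee-data and apigee-runtime); adjust them to match the nodeSelector section of your overrides:

# Label the Cassandra (data) nodes
kubectl label nodes {DATA_NODE_NAME} cloud.google.com/gke-nodepool=apigee-data --kubeconfig {KUBECONFIG}
# Label the runtime nodes
kubectl label nodes {RUNTIME_NODE_NAME} cloud.google.com/gke-nodepool=apigee-runtime --kubeconfig {KUBECONFIG}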

  5. Expand the Cassandra cluster between the existing cluster and the new cluster

    1. Copy the existing Apigee configuration file from the existing cluster to the new cluster and make the relevant changes. Ensure that the instanceID is different from the existing cluster.
    2. Copy the existing service accounts and certs from the old cluster to the new cluster and put them in the relevant directories under hybrid-files.
    3. Install Apigee 1.7 in the new cluster. Follow all steps up to step 8 of part 2 from the document here.
    4. Follow the instructions here to expand Cassandra.
    5. Execute step 9 to finish the installation.
    6. Move traffic to the new cluster.
  6. Testing and migration

The new cluster has all the proxies and other Apigee resources, and they are synced during the Cassandra replication process. With a load balancer in place, this can be tested easily, for example as shown below.
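For example, a proxy can be exercised against the new cluster’s ingress before DNS is switched by pinning the API hostname to the new ingress IP (hostname, base path, and IP are placeholders):

curl -v https://{API_HOSTNAME}/{PROXY_BASEPATH} --resolve {API_HOSTNAME}:443:{NEW_CLUSTER_INGRESS_IP}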

  7. Decommission the existing cluster

Follow this document to decommission the cluster.


Anatomy of In-Place Upgrade Process 

  1. Apply your overrides to upgrade Cassandra
  2. Apply your overrides to upgrade Telemetry components and check completion
  3. Bring up Redis components.
  4. Apply your overrides to upgrade the org-level components (MART, Watcher and Apigee Connect) and check completion.
  5. Apply your overrides to upgrade your environments. You have two choices:
    • Environment by environment: Apply your overrides to one environment at a time and check completion. Repeat this step for each environment.
    • All environments at one time: Apply your overrides to all environments at once and check completion.
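A sketch of the typical apigeectl sequence for these steps is shown below. The component flags (--datastore, --telemetry, --redis, --org, --env/--all-envs) should be verified against the upgrade guide for your target hybrid version; the overrides path and environment name are placeholders:

# 1. Cassandra (datastore)
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --datastore
# 2. Telemetry components
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --telemetry
# 3. Redis components
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --redis
# 4. Org-level components (MART, Watcher, Apigee Connect)
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --org
# 5a. One environment at a time...
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --env {ENV_NAME}
# 5b. ...or all environments at once
$APIGEECTL_HOME/apigeectl apply -f overrides/overrides.yaml --all-envs
# Check pod rollout between steps
kubectl get pods -n apigee --kubeconfig {KUBECONFIG}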

Upgrade Scenarios

Scenario example 1 


In this scenario, a customer is running Apigee Hybrid 1.5.x and wants to upgrade to version 1.7.x of Apigee. On further analysis, we found that the customer was running EOL versions of Anthos, ASM, and Apigee, which required upgrading all of the components.


A Blue/Green upgrade pattern was suggested to the customer, and the following upgrade sequence was created.


           
   

Step | Anthos | ASM | Apigee | Note
DC1 & DC2 (existing) |  |  |  | 
Today | 1.8.x | 1.7.x | 1.5.x | Current versions
Step 1 | 1.8.x | 1.9.x | 1.5.x | Upgrade ASM from 1.7 to 1.9 on existing clusters to prepare for compatibility with Apigee hybrid 1.6
Step 2 | 1.8.x | 1.9.x | 1.6.x | Upgrade Apigee hybrid from 1.5 to 1.6 on existing clusters, to allow for expanding to new clusters in DC3 & DC4
DC3 & DC4 (new) |  |  |  | 
Step 3 | 1.10.x | 1.12 | 1.6.x | Install the same version of Apigee hybrid into new DCs, joining Cassandra to the existing ring of DC1 & DC2
Step 4 |  |  |  | Decommission and disconnect DC1 & DC2
Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7
End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time


Fig: Blue/Green deployment for Apigee Hybrid

Scenario example 2



In this scenario, a customer is running Apigee Hybrid 1.5.x and wants to upgrade to version 1.7.x of Apigee. On further analysis, we found that the customer was running EOL versions of Anthos, ASM, and Apigee, which required upgrading all of the components. In this case, the Cassandra rings of hybrid 1.6.x and 1.5.x are joined.




 

Step | Anthos | ASM | Apigee | Note
DC1 & DC2 (existing) |  |  |  | 
Today | 1.6.x | 1.8.x | 1.5.x | Current versions
DC3 & DC4 (new) |  |  |  | 
Step 2 | 1.10.x | 1.12 | 1.6.x | Install the N+1 version of Apigee hybrid into the new DCs (with an exception), joining Cassandra to the existing ring of DC1 & DC2
Step 3 |  |  |  | Decommission and disconnect DC1 & DC2
Step 5 | 1.10.x | 1.12 | 1.7.x | Upgrade Apigee hybrid to 1.7
End state | TBD | TBD | TBD | Upgrade all software to latest versions available at the time


Upgrade Checklist


Upgrade preparations checklist:

  • Prepare backups for the user and admin clusters.
  • Prepare Cassandra Backups
  • Check cluster capacity (CPU, Memory and %Ready Times)
  • Check the user cluster utilization and available resources
  • Check Firewall ports and access to Google resources.
  • Ensure availability of VIPs for the control plane and load balancers.

Troubleshooting Commands

Check whether the nodes are ready 

kubectl get nodes --kubeconfig {KUBECONFIG}



Look at the VERSION and AGE to figure out whether the node is upgraded. 

If the node is not ready, try to log in to the node and look at /var/log/startup.log and /var/log/cloud-init-output.log.

Check whether the pods are running. 

kubectl get pods --kubeconfig {KUBECONFIG} -A



Look at the AGE to figure out whether the pod is upgraded. 

Check whether there are errors in the cluster API controller log

kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}

kubectl logs {CLUSTER_API_CONTROLLER_POD_NAME} -c clusterapi-controller-manager --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n {USER_CLUSTER_NAME}



NOTE: The above commands are for the worker nodes. If you want to inspect the controller log for status of the user cluster control plane nodes, replace the namespace with kube-system 

Check whether there are errors in the on-prem user cluster controller log

kubectl logs {ONPREM_USER_CLUSTER_CONTROLLER_POD_NAME} -c onprem-user-cluster-controller --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} -n kube-system


How can I find out what nodes are currently being drained?

These nodes have been cordoned and scheduling has been disabled. You can list all the nodes and look for SchedulingDisabled in the STATUS column.

E.g., to see only nodes that are being drained. 

kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} | grep "SchedulingDisabled"



To continuously “watch” all the nodes: 

kubectl get nodes --kubeconfig {USER_CLUSTER_KUBECONFIG} --watch



A node has been in the draining status for a very long time. How can I find out what’s going on?

You can get the cluster API controller log for the node {NODE_NAME}. First, get the name of the cluster API controller pod. 

kubectl --kubeconfig {ADMIN_CLUSTER_KUBECONFIG} get pods -n {USER_CLUSTER_NAME} | grep clusterapi-controllers



Then get the log of that pod and filter with the node name*. 

kubectl logs {CLUSTER_API_POD_NAME} -c vsphere-controller-manager --kubeconfig {ADMIN_KUBECONFIG} -n {USER_CLUSTER_NAME} | grep "{NODE_NAME}"



* For clusters with the DHCP mode, node names are the same as the “machine” names. For clusters using static IPs, you will need to map the node name to the machine name to search the log.

Locate draining errors in the cluster API controller log. 

If there are any draining errors, you should see a log message “Failed to completely drain node” in the log you found in the previous step. This log line will include the exact cause of the draining problem.


 
