Best Practices for Designing Apigee Topology and Sizing for Anthos Deployment

Background and Overview

Customers deploying Apigee hybrid often face challenges sizing infrastructure for Apigee workloads, and frequently have to scramble for guidance on how to design POC, non-production, single-region production, multi-region production, and multi-org-per-cluster deployments.

This document primarily captures best practices for sizing the infrastructure on Anthos Bare Metal. However, it can also be referenced for other Anthos deployment options or supported Kubernetes platforms.

Anthos Bare Metal Architecture

Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types (on-prem, edge, and public cloud). Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with consistent versions and lifecycle management, and a consistent security and governance model across environments.

rajeshmi_0-1672869026804.png

 

 

In Anthos Bare Metal, there are admin clusters and user clusters. An admin workstation is used to create the admin and user clusters. The admin cluster is responsible for managing user clusters. Both the admin cluster and the user clusters have control plane nodes running the Kubernetes control plane and worker nodes that run your workloads.

The Anthos Bare Metal solution provides different deployment options, as shown below:

rajeshmi_1-1672869026792.png

rajeshmi_2-1672869026809.png

 

There are three deployment models:

  • Standard - A single set of control plane nodes serves both the admin and user cluster roles, and the worker nodes host both admin and user workloads. This is most relevant when the cluster is managed by a single team.
  • Multi-Cluster - There is a full admin cluster and separate user clusters, and each user cluster has its own control plane.
  • Hybrid - This combines Standard and Multi-Cluster: the first user cluster runs inside the admin cluster, so its control plane shares nodes with the admin control plane and its worker nodes share nodes with the admin cluster. Subsequent user clusters are spun up the same way as in the Multi-Cluster model (see the configuration sketch after this list).
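
For reference, the deployment model is selected through the type field of the Anthos Bare Metal Cluster resource that bmctl generates. Below is a minimal sketch; the cluster name, namespace, and version are placeholders, and the Standard model above corresponds to the standalone type.

apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: apigee-cluster                 # placeholder cluster name
  namespace: cluster-apigee-cluster
spec:
  type: hybrid                         # one of: standalone, admin, user, hybrid
  anthosBareMetalVersion: 1.16.0       # placeholder: the version you are installing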

 

Apigee hybrid can be installed on Anthos Bare Metal, but it is necessary to understand the licensing models for Apigee and Anthos before designing the topology architecture. Anthos licensing is based on the vCPUs of user clusters, and billing starts as soon as you register a cluster in a fleet.

Apigee provides an Anthos license credit of 300 vCPUs for Enterprise customers and 800 vCPUs for Enterprise Plus customers. While you can run your own workloads alongside the Apigee workloads, doing so may raise billing and cost concerns.

For simplicity’s sake, the rest of this document assumes that the Apigee workload is deployed on a dedicated Anthos cluster. The choice of Anthos deployment option depends on the SDLC environments and the regions required for deployment.

Apigee Deployment Topologies

Apigee has flexible deployment topologies, but essentially there are two separate sets of workloads you need to manage: stateful workloads, consisting of the Apigee Cassandra components, and stateless runtime workloads. These components scale differently, so Apigee advises putting them into two separate node pools: apigee-data for stateful workloads and apigee-runtime for stateless workloads.
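
For reference, this split is usually expressed in the Apigee hybrid overrides file by pointing each class of workload at its node pool label. A minimal sketch, assuming the conventional cloud.google.com/gke-nodepool label key and the pool names above:

nodeSelector:
  requiredForScheduling: true          # fail scheduling rather than land on the wrong pool
  apigeeData:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-data"
  apigeeRuntime:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-runtime"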

Generally we advise building POC, non-production, and production environments for Apigee hybrid. The sections below describe how you would design the topology for each of these environments.

Non-Production - POC Environment

A typical Apigee POC environment is a sandbox that lets you quickly set up Apigee and tear it down after evaluation. In this case you can use the standalone Anthos cluster deployment pattern and a single node pool for both stateless and stateful workloads.

The Anthos Standard deployment model with 3 nodes hosts the control plane and worker nodes of both the admin and user cluster. Apigee hybrid can be deployed on this cluster with no node pool labels for runtime and data.
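
In the overrides file this usually means relaxing the node selector requirement so that the Apigee pods can schedule on any node; a minimal sketch:

nodeSelector:
  # POC only: allow Apigee pods to schedule even though no apigee-data or
  # apigee-runtime labelled node pools exist.
  requiredForScheduling: false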

 

rajeshmi_3-1672869026790.png

The minimum hardware sizing for a POC would be 3 nodes of 8 cores and 16 GB RAM each to accommodate the hybrid workload. The POC environment can be used for a basic Apigee hybrid installation, quick onboarding of APIs, and general experimentation, but it must not be used for performance tests. The correlation of hardware size to API performance is covered in later sections of this document.

 

rajeshmi_4-1672869026762.png

 

Non-Production - Single Region

You would set up a non-production environment for dev, QA, or staging. In this case, it's advisable to follow the standard practice of creating two node pools, apigee-data and apigee-runtime, when setting up the user cluster.

You can choose either the Standard or the Hybrid model for the Anthos installation. The Hybrid deployment model allows you to define node pools for user clusters, which makes it a better choice for creating the topology that Apigee requires.
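
As an illustration, each of these node pools maps to an Anthos Bare Metal NodePool resource in the user cluster. A minimal sketch for the runtime pool (cluster name, namespace, and node addresses are placeholders), with an equivalent apigee-data pool defined the same way:

apiVersion: baremetal.cluster.gke.io/v1
kind: NodePool
metadata:
  name: apigee-runtime
  namespace: cluster-apigee-nonprod    # placeholder: cluster-<user-cluster-name>
spec:
  clusterName: apigee-nonprod          # placeholder user cluster name
  nodes:
  - address: 10.200.0.10               # placeholder worker node addresses
  - address: 10.200.0.11

The worker nodes also need the matching node pool label (for example cloud.google.com/gke-nodepool=apigee-runtime) so that the Apigee node selectors shown earlier can be satisfied; how you apply the label depends on your setup.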

For a non-production setup, you can choose to run the control plane in non-HA mode. The control plane hosts the Kubernetes masters and the bundled load balancers. If you want HA for those components, the recommended configuration is 3 control plane nodes. For this topology I would go with 1 control plane node.

Please also note that Anthos requires 1 additional node for upgrades. You can learn more about best practices for upgrading Apigee on Anthos here.

 

rajeshmi_5-1672869026732.png

Non-Production - Multi-Region

If your non-production environment is multi-regional, you would set up Anthos Hybrid clusters in both regions. The nodes in the apigee-data node pools need to communicate with each other over port 7001 (Cassandra inter-node TLS). The requirements are almost the same as for a single region, except that the node counts are duplicated.
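
For reference, when expanding to the second region, the Apigee hybrid overrides for the new cluster typically point Cassandra at a seed host in the existing region. A minimal sketch (the seed IP and the datacenter and rack names are placeholders):

cassandra:
  hostNetwork: true                 # commonly required when pod networks are not routable across regions
  multiRegionSeedHost: 10.0.0.11    # placeholder: IP of a Cassandra node in the existing region
  datacenter: "dc-2"                # placeholder datacenter name for the new region
  rack: "ra-1"                      # placeholder rack name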

 

rajeshmi_6-1672869026813.png

rajeshmi_7-1672869026776.png

Production - Single-Region, One Organization

For production, you should choose either the Anthos Hybrid deployment model or the Multi-Cluster model. The admin control plane should be in HA mode, so a minimum of 3 nodes is required. The minimum number of worker nodes would be 6. In the case of the Hybrid deployment model, the user cluster's bundled load balancers (the Apigee LB) are installed on the admin control plane nodes.
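
For reference, the HA admin control plane and the bundled load balancer are both declared in the Anthos Bare Metal cluster resource. A minimal sketch of the relevant fragment of the Cluster spec (node addresses and VIPs are placeholders):

controlPlane:
  nodePoolSpec:
    nodes:
    - address: 10.200.0.2
    - address: 10.200.0.3
    - address: 10.200.0.4            # three control plane nodes for HA
loadBalancer:
  mode: bundled                      # bundled load balancers run on the control plane nodes by default
  vips:
    controlPlaneVIP: 10.200.0.100    # placeholder VIPs
    ingressVIP: 10.200.0.101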

Node sizing guidance based on TPS and the number of environments is provided in a later section.

rajeshmi_8-1672869026785.png

 

Production - Multiple-Region, One Organization

This is no different from the single-region deployment except that you deploy 2 Anthos clusters in 2 different regions and install the Apigee hybrid multi-region setup. Firewall ports (port 7001) need to be opened between data nodes across regions for Cassandra to communicate.

rajeshmi_9-1672869026812.png

 

rajeshmi_10-1672869026727.png

Production - Single-Region, Multiple Organization

There are cases where you want to deploy multiple Apigee organizations on a single Anthos deployment. This is relevant when you are deploying, say, QA, dev, and stage in a single cluster as different Apigee organizations but want a single Anthos control point and configuration. In this case, it's ideal to go with the Anthos Multi-Cluster deployment model, as it gives you better control by creating separate user clusters for each organization you host. You can also go with the Anthos Hybrid deployment model, but that may result in an inconsistent experience across user clusters. Each Apigee organization corresponds to one user cluster and uses the single-org topology defined above.

However, for each organization there will be a minimum of 3 user control plane nodes for high availability. For a non-production deployment, you can have just 1 user control plane node per additional user cluster.
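
As an illustration, each user cluster then gets its own Apigee hybrid overrides file identifying the organization it hosts. A minimal sketch, where the organization, instance, and environment names are placeholders:

# overrides-qa.yaml for the user cluster hosting the QA organization
org: my-company-qa            # placeholder Apigee organization name
instanceID: "qa-instance-1"   # placeholder unique identifier for this cluster
envs:
- name: qa                    # placeholder environment name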

rajeshmi_11-1672869026795.png

 

Production - Multi-Region, Multiple Organization

rajeshmi_12-1672869026810.png

 

rajeshmi_13-1672869026793.png

 


Apigee Hybrid Sizing guidance 

Refer to this community article for estimating the size - https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/Apigee-Hybrid-Estimating-Infrastructu...

The article above provides sizing guidance that should be good enough for most use cases. However, if you want to build your own sizing model, this section gives an overview of how to estimate the infrastructure to a fair degree of accuracy. The model is loosely built on our experience and closely resembles the product guidance; the math presented below is accurate under a certain set of assumptions. Please reach out to a Google Sales associate for guidance on advanced sizing estimates.

Component Taxonomy 

| Nodepool | Org/Env | Component/Pods | Definition/Purpose |
|---|---|---|---|
| apigee-data | org | apigee-cassandra | Cassandra stores the API management state. |
| apigee-runtime | env | message processor | This is the API processing engine. |
| | env | synchronizer | Responsible for maintaining the desired proxy deployment state. |
| | env | udca | Pushes data to analytics. |
| | org | Ingress (Istio ingress or Apigee ingress) | The L7 gateway for TLS termination and routing to the message processors. |
| | org | mart | The management APIs for the runtime components. |
| | org | connect | Establishes the long-polling connection to the control plane for the MART APIs. |
| | org | cert-manager | Responsible for providing the zero-trust security architecture. |
| | org | apigee-controller | Responsible for releasing Apigee pods in Kubernetes. |
| | org | metrics | Collects metrics and forwards them to Cloud Monitoring. |
| | org | redis | Responsible for distributed quota management. |
| | org | watcher | Reports the deployment state to the control plane. |
| | org | logger | The daemonset that runs on all nodes, scrapes stderr and stdout, and pushes logs to Cloud Logging. |

While some components are org-level and others are environment-specific, the section below describes how to size each of them.

Definitions

| Variable | Details | Default Value |
|---|---|---|
| TPS | Total transactions per second going through the system. | User input |
| NumOfEnvironments | Total number of environments. | User input |
| MaxTPSPerCassandra | TPS each Cassandra instance can handle. | 2500 |
| MinimumCassandraPods | The minimum number of Cassandra pods needed in a region. | 3 |
| vCPUPerCassandraPod | The default Cassandra pod size. | 4 or 8 |
| CassandraNodevCPU | Node size for the data node pool. | User input, usually 4 or 8 |
| RuntimeMax | The max QPS a runtime pod can handle. | 400 |
| UDCAMax | The max TPS per UDCA pod. | 750 |
| MinMPPods | The minimum runtime pods needed for any environment. | 2 |
| vCPUPerRuntimePod | The default vCPU for runtime pods. | 1 or 2 |
| DefaultvCPUPerPod | The default vCPU for other pods such as mart, synchronizer, ingress, and udca. | 1 |
| MinPods | The minimum pods for other runtime components. | 2 |
| FixedPods | Total fixed pods (mart, connect, cert-manager, apigee-controller, metrics-app, metrics-proxy, redis-envoy, redis, watcher). | 12 |
| FixedPodsvCPU | The total vCPU for the fixed pods (0.5 vCPU each). | 0.5 × 12 = 6 |
| FailoverNodes | The number of additional nodes you keep so the system can run at full capacity if a node fails. | User input, usually 0 or 1 |
| RuntimeNodevCPU | Node size for the runtime node pool. | User input, usually 4 or 8 |
| RoundUP | Ceiling value. | |
| Anthos Workloads | Optional: ais (webhook), cloudops/gmp (agent), acm (gatekeeper webhook), metallb (daemonset), and asm. | 0 |
| SyncPods | Synchronizer pods. | |

You can derive formulas from these variables that give basic guidance on sizing the infrastructure. By tweaking parameters such as the max QPS per runtime pod, you can arrive at estimates that work for you.
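
As a rough illustration only, this is one way such a model could be assembled from the variables above; it is not the exact set of formulas captured in the images below or in the product guidance:

CassandraPods  = max(MinimumCassandraPods, RoundUP(TPS / MaxTPSPerCassandra))
DataVCPU       = CassandraPods * vCPUPerCassandraPod
DataNodes      = RoundUP(DataVCPU / CassandraNodevCPU) + FailoverNodes

MPPodsPerEnv   = max(MinMPPods, RoundUP(TPS / RuntimeMax))
UDCAPodsPerEnv = max(MinPods, RoundUP(TPS / UDCAMax))
RuntimeVCPU    = NumOfEnvironments * (MPPodsPerEnv * vCPUPerRuntimePod
                 + (UDCAPodsPerEnv + SyncPods) * DefaultvCPUPerPod)
                 + FixedPodsvCPU
RuntimeNodes   = RoundUP(RuntimeVCPU / RuntimeNodevCPU) + FailoverNodes

For example, with the defaults above and 1,000 TPS, the data pool needs max(3, RoundUP(1000/2500)) = 3 Cassandra pods, or 12 vCPUs at 4 vCPUs per pod, which matches the Illustrations table below. The runtime-side numbers depend heavily on how TPS is distributed across environments and on the per-pod assumptions, so treat this sketch only as a starting point for your own model.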

 

 
(Images: sizing formulas and calculations for the data and runtime node pools.)

 

 

Illustrations

| Number of Env | Data vCPU (1000 TPS) | Runtime vCPU (1000 TPS) | Data vCPU (2000 TPS) | Runtime vCPU (2000 TPS) | Data vCPU (5000 TPS) | Runtime vCPU (5000 TPS) |
|---|---|---|---|---|---|---|
| 10 | 12 | 52 | 12 | 53 | 24 | 87 |
| 20 | 12 | 102 | 12 | 103 | 24 | |

Network Sizing

Anthos on Bare Metal allows up to 250 pods per node. It's important to size the network to scale to the required number of nodes. If your cluster uses the default value of /16 for the clusterNetwork.pods.cidrBlocks field, the cluster has a limit of 2^(23-16) = 128 nodes (each node is assigned a /23 block of pod addresses).
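
As an illustrative calculation using the same per-node /23 assumption: a /18 pod CIDR supports up to 2^(23-18) = 32 nodes, a /17 supports 64, and the default /16 supports 128, so choose a pod CIDR large enough for the maximum node count you expect.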

You can define the pod and service CIDR ranges in the clusterNetwork section of the cluster YAML:

clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 172.26.232.0/24

 

The images below capture the pods required for running Apigee on Anthos Bare Metal.

 


(Images: pod counts for the Apigee and Anthos components used for network sizing.)
 
There are many other factors besides the total number of nodes that determine the total number of network IP addresses required. However, for this article I will focus just on total nodes; you can find out more in the article linked above.

Conclusion

This blog post covers topology design and infrastructure sizing under a set of assumptions. For more detailed calculations related to your unique requirements, pricing, and industry, talk to a Google Cloud sales specialist to narrow down the details.

 
