Customers deploying Apigee hybrid often face challenges sizing infrastructure for Apigee workloads. It is not uncommon for customers to scramble for guidance on how to design POC, non-production, single-region production, multi-region production, or multi-org-per-cluster deployments.
This document primarily captures the best practices for sizing the infrastructure on Anthos Bare Metal. However, it can also be referenced for other Anthos deployments or supported Kubernetes platforms.
Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types (on-premises, edge, and public cloud). Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with consistent versions and lifecycle, and a consistent security and governance model across environments.
In Anthos Bare Metal, there are admin clusters and user clusters. An admin workstation is used to create admin and user clusters. The admin cluster is responsible for managing user clusters. Both user and admin clusters have control plane nodes running the Kubernetes masters, and worker nodes that allow you to run workloads.
The Anthos Bare Metal solution provides three deployment models:
- Standalone: a single cluster that manages itself and runs user workloads.
- Multi-cluster: a dedicated admin cluster that manages one or more user clusters.
- Hybrid: an admin cluster that also runs user workloads, with additional user clusters as needed.
Apigee Hybrid can be installed on Anthos Bare Metal, but it is necessary to understand the licensing models for Apigee and Anthos and design the topology accordingly. Anthos licensing is based on the vCPUs of user clusters, and the billing cycle starts as soon as you register the cluster to a fleet.
Apigee provides a license credit of 300 vCPUs for Enterprise and 800 vCPUs for Enterprise Plus customers. While you can run your own workloads alongside the Apigee workload, this may result in additional billing and cost concerns.
For simplicity's sake, the rest of this document assumes that the Anthos cluster built for the Apigee workload is a dedicated cluster. The choice of Anthos deployment option depends on the SDLC environments and regions required for deployment.
Apigee has flexible deployment topologies, but essentially there are two separate sets of workloads you need to manage: the stateful Cassandra components and the stateless runtime components. These components scale differently, so Apigee advises putting them into two separate node pools: apigee-data for stateful workloads and apigee-runtime for stateless workloads.
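As a sketch, the node pool label Apigee's pod scheduling expects can be set on an Anthos Bare Metal NodePool resource. The cluster name, namespace, and node addresses below are placeholders; adjust them to your environment:

```yaml
# Hypothetical NodePool for the stateful Cassandra workloads.
# A second NodePool named apigee-runtime would be defined the same way.
apiVersion: baremetal.cluster.gke.io/v1
kind: NodePool
metadata:
  name: apigee-data
  namespace: cluster-apigee-user   # placeholder: cluster-<user-cluster-name>
spec:
  clusterName: apigee-user         # placeholder user cluster name
  nodes:
  - address: 10.200.0.10           # placeholder worker node IPs
  - address: 10.200.0.11
  - address: 10.200.0.12
  labels:
    # Label that Apigee hybrid's default nodeSelector matches on
    cloud.google.com/gke-nodepool: apigee-data
```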
Generally we advise building POC, non-production, and production environments for Apigee Hybrid. The sections below describe how you would design the topology for these environments.
A typical Apigee POC environment is a sandbox that allows you to quickly set up Apigee and tear it down after evaluation. In this case you can use the Standalone Anthos cluster deployment pattern and a single node pool for both stateless and stateful workloads.
An Anthos Standalone deployment with 3 nodes hosts both the admin and user cluster functions, with control plane and worker roles on the same nodes. Apigee hybrid can be deployed on this cluster with no node pool labels for runtime and data.
The minimum hardware sizing for a POC would be 3 nodes of 8 cores and 16 GB RAM each, accounting for the hybrid workload. The POC environment can be used for a basic Apigee hybrid installation, quick onboarding of APIs, and learning, but it must not be used for any performance tests. The correlation of hardware size to API performance is covered in later sections of this document.
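A standalone POC cluster can be described in a single cluster spec. The sketch below shows only the fields relevant to the deployment type; names and IP addresses are placeholders, and a real config needs networking, load balancer, and storage sections as well:

```yaml
# Hypothetical standalone cluster spec for a 3-node POC.
apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: apigee-poc
  namespace: cluster-apigee-poc
spec:
  type: standalone          # single cluster: manages itself and runs workloads
  controlPlane:
    nodePoolSpec:
      nodes:
      - address: 10.200.0.11   # placeholder node IPs
      - address: 10.200.0.12
      - address: 10.200.0.13
```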
You would set up a non-production environment for dev, QA, or staging. In this case, it's advisable to follow the standard practice of creating 2 node pools, apigee-data and apigee-runtime, when setting up the user cluster.
You can choose either a Standard or Hybrid model for the Anthos installation. The Anthos Hybrid deployment model allows you to define node pools for user clusters, which makes it a better choice for creating the topology per Apigee requirements.
For a non-production setup, you can choose to run the control plane in non-HA mode. The control plane hosts the Kubernetes masters and the bundled load balancers. If you would like HA for those functions, the recommended configuration is 3 control plane nodes. For this topology I would go with 1 control plane node.
Please also note that Anthos requires one additional node for upgrades. You can learn more about the best practices for upgrading Apigee on Anthos here.
In case your non-production setup is multi-regional, you would set up Anthos Hybrid clusters in both regions. The nodes in the apigee-data node pools need to communicate with each other over port 7001. The requirements are almost the same as a single region, except that the node counts are duplicated.
For production, you should choose either the Anthos Hybrid deployment model or the multi-cluster model. The admin control plane should be in HA mode, so a minimum of 3 nodes is required. A minimum of 6 worker nodes is needed. In the case of Anthos's Hybrid deployment, the user cluster's bundled load balancers (Apigee LB) are installed on the admin control plane nodes.
The node sizing guidance based on tps and environments is provided in a later section.
This is no different from a single-region deployment except that you deploy 2 Anthos clusters in 2 different regions and install Apigee Hybrid in its multi-region setup. Firewall ports need to be opened between data nodes across regions for Cassandra to communicate.
There are cases where you want to deploy multiple Apigee organizations on a single Anthos cluster deployment. This is relevant when you are deploying, say, QA, dev, and stage in a single cluster as different Apigee organizations but want a single Anthos control and configuration plane. In this case, it's ideal to go with the Anthos multi-cluster deployment model, as it gives you better control to create a separate user cluster for each organization you host. You can also go with the Anthos Hybrid deployment model, but that may result in an inconsistent experience across user clusters. Each Apigee organization will correspond to one user cluster and will use the single-org topology defined above.
However, for each organization there will be a minimum of 3 user control plane nodes for high availability. For non-production deployments, you can have just 1 user control plane node per additional user cluster.
Refer to this community article for estimating the size - https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/Apigee-Hybrid-Estimating-Infrastructu...
The article above provides sizing guidance that should be sufficient for most use cases. However, if you want to build your own sizing model, the section below gives guidance on how to do so, to a fair estimate. This model is loosely built on our experience and closely resembles the product guidance, but the math presented below is accurate only under a certain set of assumptions. Please reach out to a Google Sales associate for guidance on advanced sizing estimates.
| Nodepool | Org/Env | Component/Pods | Definition/Purpose |
|---|---|---|---|
| apigee-data | org | apigee-cassandra | Cassandra stores the API management state. |
| apigee-runtime | env | message processor | This is the API traffic processing engine. |
| | env | synchronizer | This component is responsible for maintaining the desired proxy deployment state. |
| | env | udca | For pushing data to analytics. |
| | org | ingress (istio ingress or apigee ingress) | The L7 gateway for TLS termination and routing to the message processors. |
| | org | mart | These are the management APIs for runtime components. |
| | org | connect | This component establishes the long-polling connection to the control plane for the MART APIs. |
| | org | cert-manager | Manages the certificates that provide the zero-trust security architecture. |
| | org | apigee-controller | This is responsible for releasing Apigee pods in Kubernetes. |
| | org | metrics | Collects runtime metrics. |
| | org | redis | Responsible for distributed quota management. |
| | org | watcher | Reports the deployment state to the control plane. |
| | org | logger | The daemonset that runs on all nodes, scraping data from stderr and stdout and pushing it to Cloud Logging. |
While some components are org-level, others are environment-specific. The section below describes how we can size each of them.
| Variable | Details | Default Value |
|---|---|---|
| TPS | Total transactions per second going through the system. | User input |
| NumOfEnvironments | Total number of environments. | User input |
| MaxTPSPerCassandra | The TPS a single Cassandra instance can handle. | 2500 |
| MinimumCassandraPods | The minimum number of Cassandra pods needed in a region. | 3 |
| vCPUPerCassandraPod | The default Cassandra pod size. | 4 or 8 |
| CassandraNodevCPU | Node size for the data node pool. | User input, usually 4 or 8 |
| RuntimeMax | The max QPS a runtime pod can handle. | 400 |
| UDCAMax | The max TPS per UDCA pod. | 750 |
| MinMPPods | The minimum runtime pods needed for any environment. | 2 |
| vCPUPerRuntimePod | The default vCPU for runtime pods. | 1 or 2 |
| DefaultvCPUPerPod | The default vCPU for other pods such as mart, synchronizer, ingress, and udca. | 1 |
| MinPods | The minimum pods for other runtime components. | 2 |
| FixedPods | Total fixed pods (mart, connect, cert-manager, apigee-controller, metrics-app, metrics-proxy, redis-envoy, redis, watcher). | 12 |
| FixedPodsvCPU | The vCPU per fixed pod is 0.5, so 12 fixed pods total 6 vCPU. | 0.5 * 12 = 6 |
| FailoverNodes | The number of additional nodes kept so the system runs at full capacity if a node fails. | User input, usually 0 or 1 |
| RuntimeNodevCPU | Node size for the runtime node pool. | User input, usually 4 or 8 |
| RoundUP | Ceiling value. | |
| Anthos Workloads | Optional: ais (webhook), cloudops/gmp (agent), acm (gatekeeper webhook), metallb (daemonset), and asm. | 0 |
| SyncPods | Synchronizer pods. | |
You can derive formulas from these variables that give basic guidance on how to size the infrastructure. By tweaking parameters such as the max QPS per runtime pod, you can produce estimates that work for your setup.
| Number of Envs | 1000 TPS: Data vCPU | 1000 TPS: Runtime vCPU | 2000 TPS: Data vCPU | 2000 TPS: Runtime vCPU | 5000 TPS: Data vCPU | 5000 TPS: Runtime vCPU |
|---|---|---|---|---|---|---|
| 10 | 12 | 52 | 12 | 53 | 24 | 87 |
| 20 | 12 | 102 | 12 | 103 | 24 | |
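The variables defined above can be combined into a rough calculator. The sketch below is one possible interpretation of the model, not product guidance: the even spread of TPS across environments, the per-pod vCPU defaults, and the set of per-environment pods counted are all assumptions you should adjust to your own setup.

```python
import math

def size_data_vcpu(tps, max_tps_per_cassandra=2500,
                   min_cassandra_pods=3, vcpu_per_cassandra_pod=4):
    """Estimate total vCPU for the apigee-data node pool."""
    pods = max(min_cassandra_pods, math.ceil(tps / max_tps_per_cassandra))
    return pods * vcpu_per_cassandra_pod

def size_runtime_vcpu(tps, num_envs, runtime_max=400, udca_max=750,
                      min_mp_pods=2, vcpu_per_runtime_pod=2,
                      default_vcpu_per_pod=1, min_pods=2,
                      fixed_pods_vcpu=6):
    """Estimate total vCPU for the apigee-runtime node pool."""
    tps_per_env = tps / num_envs  # assumption: traffic spread evenly across envs
    mp_pods = max(min_mp_pods, math.ceil(tps_per_env / runtime_max))
    udca_pods = max(min_pods, math.ceil(tps_per_env / udca_max))
    sync_pods = min_pods  # synchronizer pods per environment
    per_env_vcpu = (mp_pods * vcpu_per_runtime_pod
                    + (udca_pods + sync_pods) * default_vcpu_per_pod)
    return num_envs * per_env_vcpu + fixed_pods_vcpu

def nodes_needed(total_vcpu, node_vcpu=8, failover_nodes=1):
    """Convert a vCPU estimate into a node count, plus failover headroom."""
    return math.ceil(total_vcpu / node_vcpu) + failover_nodes
```

For example, at 1000 TPS the data pool needs `max(3, ceil(1000/2500)) = 3` Cassandra pods, i.e. 12 vCPU at 4 vCPU per pod, which matches the table above; the runtime numbers will differ from the table depending on which per-environment pods you count.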
Anthos on Bare Metal allows up to 250 pods per node. It's important to size the network to scale to the required number of nodes. Each node is allocated a pod range large enough for roughly twice its maximum pod count (a /23 for 250 pods), so if your cluster uses the default /16 for the clusterNetwork.pods.cidrBlocks field, your cluster has a limit of 2^(23-16) = 128 nodes.
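As a quick sketch of that arithmetic, assuming each node gets the smallest CIDR block holding twice its maximum pod count:

```python
import math

def max_nodes(cluster_pod_prefix, max_pods_per_node=250):
    """Rough node limit for a given cluster pod CIDR prefix length."""
    # Each node needs ~2x its max pod count in IPs; find the bits required.
    per_node_bits = math.ceil(math.log2(2 * max_pods_per_node))  # 500 IPs -> 9 bits -> /23
    per_node_prefix = 32 - per_node_bits
    return 2 ** (per_node_prefix - cluster_pod_prefix)

print(max_nodes(16))  # default /16 pod CIDR with 250 pods/node -> 128 nodes
```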
You can define the pod and service CIDR ranges in the clusterNetwork section of the cluster YAML:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16
services:
cidrBlocks:
- 172.26.232.0/24
The sections above capture the pods and sizing required for running Apigee on Anthos Bare Metal.
This blog post gives the topology design and infrastructure sizing under some assumptions. For more detailed calculations related to your unique requirements, pricing, and industry, talk to a Google Cloud sales specialist to help narrow down the details.