Customers deploying Apigee hybrid often face challenges sizing infrastructure for Apigee workloads. It is not uncommon for customers to scramble for guidance on how to design POC, non-production, single-region production, multi-region production, or multi-org-per-cluster deployments.
This document primarily captures the best practices for sizing the infrastructure on Anthos Bare Metal. However, it can also be referenced for other Anthos deployments or supported Kubernetes platforms.
Anthos is Google’s hybrid and multi-cloud platform, providing a consistent ‘operating system’ for cloud-native deployments across all site types (on-premises, edge, and public cloud). Anthos provides the tooling to deploy a Kubernetes layer on-premises and in the cloud with consistent versions and lifecycle, and a consistent security and governance model across environments.
In Anthos Bare Metal, there are admin clusters and user clusters. An admin workstation is used to create admin and user clusters. The admin cluster is responsible for managing user clusters. Both user and admin clusters have control plane nodes running the Kubernetes masters, and worker nodes that allow you to run workloads.
The Anthos Bare Metal solution provides three deployment models:
- Standalone: a single cluster that manages itself and runs user workloads.
- Multi-cluster: a dedicated admin cluster that manages one or more user clusters.
- Hybrid: an admin cluster that also runs user workloads, with additional user clusters as needed.
Apigee Hybrid can be installed on Anthos Bare Metal, but it is necessary to understand the licensing models for Apigee and Anthos and design the topology accordingly. Anthos licensing is based on the vCPUs of user clusters, and the billing cycle starts as soon as you register the cluster to a fleet.
Apigee provides a license credit of 300 vCPUs for Enterprise and 800 vCPUs for Enterprise Plus customers. While you can run your own workloads alongside the Apigee workload, this may result in additional billing and cost concerns.
For simplicity's sake, the rest of this document assumes that the Anthos cluster built for the Apigee workload is a dedicated cluster. The choice of Anthos deployment option depends on the SDLC environments and regions required for deployment.
Apigee has flexible deployment topologies, but essentially there are two separate sets of workloads you need to manage: the stateful Cassandra components and the stateless runtime components. These components scale differently, so Apigee advises putting them into two separate node pools: apigee-data for stateful workloads and apigee-runtime for stateless workloads.
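As a sketch, the node pool label Apigee's pod scheduling expects can be set on an Anthos Bare Metal NodePool resource. The cluster name, namespace, and node addresses below are placeholders; adjust them to your environment:

```yaml
# Hypothetical NodePool for the stateful Cassandra workloads.
# A second NodePool named apigee-runtime would be defined the same way.
apiVersion: baremetal.cluster.gke.io/v1
kind: NodePool
metadata:
  name: apigee-data
  namespace: cluster-apigee-user   # placeholder: cluster-<user-cluster-name>
spec:
  clusterName: apigee-user         # placeholder user cluster name
  nodes:
  - address: 10.200.0.10           # placeholder worker node IPs
  - address: 10.200.0.11
  - address: 10.200.0.12
  labels:
    # Label that Apigee hybrid's default nodeSelector matches on
    cloud.google.com/gke-nodepool: apigee-data
```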
Generally we advise building POC, non-production, and production environments for Apigee Hybrid. The sections below describe how you would design the topology for these environments.
A typical Apigee POC environment is a sandbox that allows you to quickly set up Apigee and tear it down after evaluation. In this case you can use the Standalone Anthos cluster deployment pattern and a single node pool for both stateless and stateful workloads.
An Anthos Standalone deployment with 3 nodes hosts both the admin and user cluster functions, with control plane and worker roles on the same nodes. Apigee hybrid can be deployed on this cluster with no node pool labels for runtime and data.
The minimum hardware sizing for a POC would be 3 nodes of 8 cores and 16 GB RAM each, accounting for the hybrid workload. The POC environment can be used for a basic Apigee hybrid installation, quick onboarding of APIs, and learning, but it must not be used for any performance tests. The correlation of hardware size to API performance is covered in later sections of this document.
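A standalone POC cluster can be described in a single cluster spec. The sketch below shows only the fields relevant to the deployment type; names and IP addresses are placeholders, and a real config needs networking, load balancer, and storage sections as well:

```yaml
# Hypothetical standalone cluster spec for a 3-node POC.
apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: apigee-poc
  namespace: cluster-apigee-poc
spec:
  type: standalone          # single cluster: manages itself and runs workloads
  controlPlane:
    nodePoolSpec:
      nodes:
      - address: 10.200.0.11   # placeholder node IPs
      - address: 10.200.0.12
      - address: 10.200.0.13
```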
You would set up a non-production environment for dev, QA, or staging. In this case, it's advisable to follow the standard practice of creating 2 node pools, apigee-data and apigee-runtime, when setting up the user cluster.
You can choose either a Standard or Hybrid model for the Anthos installation. The Anthos Hybrid deployment model allows you to define node pools for user clusters, which makes it a better choice for creating the topology per Apigee requirements.
For a non-production setup, you can choose to run the control plane in non-HA mode. The control plane hosts the Kubernetes masters and the bundled load balancers. If you would like HA for those functions, the recommended configuration is 3 control plane nodes. For this topology I would go with 1 control plane node.
Please also note that Anthos requires one additional node for upgrades. You can learn more about the best practices for upgrading Apigee on Anthos here.
In case your non-production setup is multi-regional, you would set up Anthos Hybrid clusters in both regions. The nodes in the apigee-data node pools need to communicate with each other over port 7001. The requirements are almost the same as a single region, except that the node counts are duplicated.
For production, you should choose either the Anthos Hybrid deployment model or the multi-cluster model. The admin control plane should be in HA mode, so a minimum of 3 nodes is required. A minimum of 6 worker nodes is needed. In the case of Anthos's Hybrid deployment, the user cluster's bundled load balancers (Apigee LB) are installed on the admin control plane nodes.
The node sizing guidance based on tps and environments is provided in a later section.
This is no different from a single-region deployment except that you deploy 2 Anthos clusters in 2 different regions and install Apigee Hybrid in its multi-region setup. Firewall ports need to be opened between data nodes across regions for Cassandra to communicate.
There are cases where you want to deploy multiple Apigee organizations on a single Anthos cluster deployment. This is relevant when you are deploying, say, QA, dev, and stage in a single cluster as different Apigee organizations but want a single Anthos control and configuration plane. In this case, it's ideal to go with the Anthos multi-cluster deployment model, as it gives you better control to create a separate user cluster for each organization you host. You can also go with the Anthos Hybrid deployment model, but that may result in an inconsistent experience across user clusters. Each Apigee organization will correspond to one user cluster and will use the single-org topology defined above.
However, for each organization there will be a minimum of 3 user control plane nodes for high availability. For non-production deployments, you can have just 1 user control plane node per additional user cluster.
Refer to this community article for estimating the size - https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/Apigee-Hybrid-Estimating-Infrastructu...
The article above provides sizing guidance that should be sufficient for most use cases. However, if you want to build your own sizing model, the section below gives guidance on how to do so, to a fair estimate. This model is loosely built on our experience and closely resembles the product guidance, but the math presented below is accurate only under a certain set of assumptions. Please reach out to a Google Sales associate for guidance on advanced sizing estimates.
| Nodepool | Org/Env | Component/Pods | Definition/Purpose |
|---|---|---|---|
| apigee-data | org | apigee-cassandra | Cassandra stores the API management state. |
| apigee-runtime | env | message processor | This is the API traffic processing engine. |
| | env | synchronizer | This component is responsible for maintaining the desired proxy deployment state. |
| | env | udca | For pushing data to analytics. |
| | org | ingress (istio ingress or apigee ingress) | The L7 gateway for TLS termination and routing to the message processors. |
| | org | mart | These are the management APIs for runtime components. |
| | org | connect | This component establishes the long-polling connection to the control plane for the MART APIs. |
| | org | cert-manager | Manages the certificates that provide the zero-trust security architecture. |
| | org | apigee-controller | This is responsible for releasing Apigee pods in Kubernetes. |
| | org | metrics | Collects runtime metrics. |
| | org | redis | Responsible for distributed quota management. |
| | org | watcher | Reports the deployment state to the control plane. |
| | org | logger | The daemonset that runs on all nodes, scraping data from stderr and stdout and pushing it to Cloud Logging. |
While some components are org-level, others are environment-specific. The section below describes how we can size each of them.
| Variable | Details | Default Value |
|---|---|---|
| TPS | Total transactions per second going through the system. | User input |
| NumOfEnvironments | Total number of environments. | User input |
| MaxTPSPerCassandra | The TPS a single Cassandra instance can handle. | 2500 |
| MinimumCassandraPods | The minimum number of Cassandra pods needed in a region. | 3 |
| vCPUPerCassandraPod | The default Cassandra pod size. | 4 or 8 |
| CassandraNodevCPU | Node size for the data node pool. | User input, usually 4 or 8 |
| RuntimeMax | The max QPS a runtime pod can handle. | 400 |
| UDCAMax | The max TPS per UDCA pod. | 750 |
| MinMPPods | The minimum runtime pods needed for any environment. | 2 |
| vCPUPerRuntimePod | The default vCPU for runtime pods. | 1 or 2 |
| DefaultvCPUPerPod | The default vCPU for other pods such as mart, synchronizer, ingress, and udca. | 1 |
| MinPods | The minimum pods for other runtime components. | 2 |
| FixedPods | Total fixed pods (mart, connect, cert-manager, apigee-controller, metrics-app, metrics-proxy, redis-envoy, redis, watcher). | 12 |
| FixedPodsvCPU | The vCPU per fixed pod is 0.5, so 12 fixed pods total 6 vCPU. | 0.5 * 12 = 6 |
| FailoverNodes | The number of additional nodes kept so the system runs at full capacity if a node fails. | User input, usually 0 or 1 |
| RuntimeNodevCPU | Node size for the runtime node pool. | User input, usually 4 or 8 |
| RoundUP | Ceiling value. | |
| Anthos Workloads | Optional: ais (webhook), cloudops/gmp (agent), acm (gatekeeper webhook), metallb (daemonset), and asm. | 0 |
| SyncPods | Synchronizer pods. | |
You can derive formulas from these variables that give basic guidance on how to size the infrastructure. By tweaking parameters such as the max QPS per runtime pod, you can produce estimates that work for your setup.
| Number of Envs | 1000 TPS: Data vCPU | 1000 TPS: Runtime vCPU | 2000 TPS: Data vCPU | 2000 TPS: Runtime vCPU | 5000 TPS: Data vCPU | 5000 TPS: Runtime vCPU |
|---|---|---|---|---|---|---|
| 10 | 12 | 52 | 12 | 53 | 24 | 87 |
| 20 | 12 | 102 | 12 | 103 | 24 | |
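The variables defined above can be combined into a rough calculator. The sketch below is one possible interpretation of the model, not product guidance: the even spread of TPS across environments, the per-pod vCPU defaults, and the set of per-environment pods counted are all assumptions you should adjust to your own setup.

```python
import math

def size_data_vcpu(tps, max_tps_per_cassandra=2500,
                   min_cassandra_pods=3, vcpu_per_cassandra_pod=4):
    """Estimate total vCPU for the apigee-data node pool."""
    pods = max(min_cassandra_pods, math.ceil(tps / max_tps_per_cassandra))
    return pods * vcpu_per_cassandra_pod

def size_runtime_vcpu(tps, num_envs, runtime_max=400, udca_max=750,
                      min_mp_pods=2, vcpu_per_runtime_pod=2,
                      default_vcpu_per_pod=1, min_pods=2,
                      fixed_pods_vcpu=6):
    """Estimate total vCPU for the apigee-runtime node pool."""
    tps_per_env = tps / num_envs  # assumption: traffic spread evenly across envs
    mp_pods = max(min_mp_pods, math.ceil(tps_per_env / runtime_max))
    udca_pods = max(min_pods, math.ceil(tps_per_env / udca_max))
    sync_pods = min_pods  # synchronizer pods per environment
    per_env_vcpu = (mp_pods * vcpu_per_runtime_pod
                    + (udca_pods + sync_pods) * default_vcpu_per_pod)
    return num_envs * per_env_vcpu + fixed_pods_vcpu

def nodes_needed(total_vcpu, node_vcpu=8, failover_nodes=1):
    """Convert a vCPU estimate into a node count, plus failover headroom."""
    return math.ceil(total_vcpu / node_vcpu) + failover_nodes
```

For example, at 1000 TPS the data pool needs `max(3, ceil(1000/2500)) = 3` Cassandra pods, i.e. 12 vCPU at 4 vCPU per pod, which matches the table above; the runtime numbers will differ from the table depending on which per-environment pods you count.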
Anthos on Bare Metal allows up to 250 pods per node. It's important to size the network to scale to the required number of nodes. Each node is allocated a pod range large enough for roughly twice its maximum pod count (a /23 for 250 pods), so if your cluster uses the default /16 for the clusterNetwork.pods.cidrBlocks field, your cluster has a limit of 2^(23-16) = 128 nodes.
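As a quick sketch of that arithmetic, assuming each node gets the smallest CIDR block holding twice its maximum pod count:

```python
import math

def max_nodes(cluster_pod_prefix, max_pods_per_node=250):
    """Rough node limit for a given cluster pod CIDR prefix length."""
    # Each node needs ~2x its max pod count in IPs; find the bits required.
    per_node_bits = math.ceil(math.log2(2 * max_pods_per_node))  # 500 IPs -> 9 bits -> /23
    per_node_prefix = 32 - per_node_bits
    return 2 ** (per_node_prefix - cluster_pod_prefix)

print(max_nodes(16))  # default /16 pod CIDR with 250 pods/node -> 128 nodes
```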
You can define the pod and service CIDR ranges in the clusterNetwork section of the cluster YAML:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16
services:
cidrBlocks:
- 172.26.232.0/24
The sections above capture the pods and sizing required for running Apigee on Anthos Bare Metal.
This blog post gives the topology design and infrastructure sizing under some assumptions. For more detailed calculations related to your unique requirements, pricing, and industry, talk to a Google Cloud sales specialist to help narrow down the details.