Apigee Edge Microgateway Vertical and Horizontal scaling for 1000 TPS and beyond.

kkleva
Participant V

Hello,

I'm looking for some quick insights regarding best practices for vertically and horizontally scaling Apigee Edge Microgateway instances.

Vertical - Can / should I run multiple micro-gateway node.js processes on the same physical server?

Horizontal - What should I consider when running multiple micro-gateways across physical servers?

Apigee Edge Analytics - When sending 1000+ TPS through scaled micro-gateways, how should we plan capacity for analytics ingest on Apigee Edge Cloud?

Your thoughts are appreciated!

Solved

13 REPLIES

prabhat
Participant V

Hi Kristopher,

Our recommendation is to run multiple instances of MGWY on different servers, fronted by a load balancer such as nginx.

It's possible to run multiple instances of MGWY on the same server using Node's cluster module. In upcoming releases we are working on rearranging a few things so that multiple instances can "easily" run on the same server, but our recommendation is to stick with multiple servers.
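For illustration, here is a minimal sketch of what fronting two Microgateway instances with nginx could look like. The host names, port 8000, and the config path are placeholders/assumptions, not values taken from this thread.

# Hypothetical nginx front end for two Microgateway hosts (names, port, and path are placeholders)
sudo tee /etc/nginx/conf.d/edgemicro.conf > /dev/null <<'EOF'
upstream edgemicro {
    least_conn;                        # send new requests to the least-busy instance
    server mgw-host-1:8000;            # Microgateway instance 1
    server mgw-host-2:8000;            # Microgateway instance 2
    keepalive 64;                      # reuse upstream connections
}

server {
    listen 80;
    location / {
        proxy_pass http://edgemicro;
        proxy_http_version 1.1;        # required for upstream keepalive
        proxy_set_header Connection "";
    }
}
EOF
sudo nginx -t && sudo nginx -s reload  # validate the config, then reload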

There are some aspects of running Node.js at this scale that I think we are leaving out here. I found pretty quickly that the 'open files' ulimit is at least one BLOCKER when attempting to get Edge Micro to run at scale. See: https://community.apigee.com/questions/15162/edge-microgateway-error-during-load-testing.html
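For reference, a minimal sketch of raising the open-files limit on a Linux host; the 65536 value and the 'apigee' account name are placeholders to adjust for your own load and setup.

# Check the current soft limit for open files in this shell
ulimit -n

# Raise it for the current session (placeholder value)
ulimit -n 65536

# Persist it across logins via limits.conf; replace 'apigee' with the account
# that actually runs the Microgateway process
echo "apigee  soft  nofile  65536" | sudo tee -a /etc/security/limits.conf
echo "apigee  hard  nofile  65536" | sudo tee -a /etc/security/limits.conf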

I've still not been able to reach my 1000 TPS goal, so I will continue making attempts, tuning the operating system and network, and testing.

In terms of load balancers: what sort of indicators would prompt you to use auto scaling? Would it be advisable if we are good at predicting traffic?

Update: we've succeeded in running our 1000 TPS load test with just a bit more network and operating system tuning.

This is awesome. If you can, please share the network and OS tuning you had to do, for others' benefit.

Share please.

Not applicable

Thanks Kristopher - that means you didn't see 1000 TPS on one instance then? To get this you load-balanced multiple instances. Did you put these instances on one compute instance, or did you spread them out across multiple?

@Kristopher Kleva

Could you share the network tuning you did? I'm curious what I might be missing.

Here are a couple of things to consider right away.

#1 Ulimits - It appears that the number of open files needed when attempting high TPS is higher than the default most operating systems have set.

#2 Load balancing across micro gateways is a must. I found pretty quickly that we ran into issues with a single node and didn't get great stability until I was load-balancing across 4-5 nodes.

For networking, look into increasing your TCP buffer maximums and limits. Google 'linux TCP tuning' and there are all sorts of articles that can guide you through the basics.
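As an illustration of the kind of buffer tuning meant here (the sizes below are common example values, not the ones used in this test):

# Raise the maximum socket receive/send buffer sizes (bytes)
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216

# Min / default / max auto-tuning ranges for TCP receive and send buffers
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"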

I was not able to get to 1000 TPS with a single instance. However, I didn't see this as a limitation of the Edge Micro-gateway but more of the hosting system and my API target. I was finding that, depending on the response time of the target, I'd run into issues as the throughput was increased. I would not recommend using Edge Micro in production without load-balancing, but in dev and QA it should be fine.

prabhat
Participant V

We did find a performance-related bug that we have now fixed. It will be available in the next release.

@prabhat I want to share what we are seeing now with the micro gateway. For the record, we are testing with the fixes both in the Edge Micro code and in one of the policies in the gateway itself - and we are testing on premises.

We found that we still had what appeared to be a bottleneck where the systems would start to fail after about 30 seconds of work - but we were pretty sure that this was where Linux network tuning would come in handy.

We are running 2 sets of tests:

1) Gatling

2) JMeter with slave traffic generators

I will focus on the Gatling test settings first, because they are the ones that I have at hand.

On the traffic generator server (the Gatling box) I made the following changes:

sudo sysctl net.ipv4.ip_local_port_range="15000 61000"
sudo sysctl net.ipv4.tcp_fin_timeout=10
sudo sysctl net.ipv4.tcp_tw_recycle=1
sudo sysctl net.ipv4.tcp_tw_reuse=1 
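# (For context: the first line widens the ephemeral port range so the load generator
# can hold more concurrent outbound connections; the others shorten FIN-WAIT-2 and let
# TIME-WAIT sockets be reused/recycled so ports free up faster. Note that tcp_tw_recycle
# is known to break clients behind NAT and was removed entirely in Linux 4.12.)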

On the server running the Edge Micro installation we made these changes:

sudo sysctl net.ipv4.ip_local_port_range="15000 61000"
sudo sysctl net.ipv4.tcp_fin_timeout=10
sudo sysctl net.ipv4.tcp_tw_recycle=1
sudo sysctl net.ipv4.tcp_tw_reuse=1
sudo sysctl net.core.somaxconn=1024

sudo ifconfig eth0 txqueuelen 5000
echo "/sbin/ifconfig eth0 txqueuelen 5000" | sudo tee -a /etc/rc.local
sudo sysctl net.core.netdev_max_backlog=2000
sudo sysctl net.ipv4.tcp_max_syn_backlog=2048
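These sysctl changes are runtime-only and are lost on reboot. A hedged sketch of persisting them (the file name under /etc/sysctl.d/ is arbitrary; tcp_tw_recycle is left out of the persisted file for the NAT/removal reason noted above):

# Persist the runtime settings above so they survive a reboot
sudo tee /etc/sysctl.d/90-edgemicro-tuning.conf > /dev/null <<'EOF'
# Wider ephemeral port range and faster socket turnover
net.ipv4.ip_local_port_range = 15000 61000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
# Bigger listen/accept and packet backlogs for connection bursts
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_max_syn_backlog = 2048
EOF
sudo sysctl --system   # or 'sudo sysctl -p /etc/sysctl.d/90-edgemicro-tuning.conf' on older distributions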

If you want the whys, go read this wonderful Stack Overflow post:

http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-lin...

Note - I am in the process of running a 30-minute test against a single-core Node.js edgemicro install. It did pretty well (not perfectly) - but this was also in our internal cloud, which is not as performant as AWS.

I will be re-running the tests at 1000 and 1200 req/second tonight to see where our hardware gives up.

One further note: our open-files limit is set to unlimited 🙂 so that was not part of our problem. In the process of pushing a single instance of edgemicro I did change it to 800k or so, just to make sure that 'unlimited' wasn't some misunderstanding. With the rest of the settings this didn't make any difference (obviously...).
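A quick way to confirm which limit the running gateway process actually has (the 'edgemicro' process-name match is an assumption about how the process shows up in ps):

# Show the limits applied to the running gateway process (not just the current shell)
pid=$(pgrep -f edgemicro | head -n 1)   # assumes 'edgemicro' appears in the process command line
grep 'open files' /proc/"$pid"/limits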