System performance optimization is an iterative process. Depending on the complexity of the APIs and the custom configuration of components, it can be challenging to identify the root cause of performance issues.
This article provides guidance on identifying problem areas, fine-tuning the configuration, and improving the system performance of Apigee private cloud deployments.
The Router, Message Processor, and Cassandra components are involved in processing API requests within an Apigee Edge planet.
Router and Message Processor
The following sections provide guidelines for performance tuning the Router and Message Processor components.
Router and Message Processor JVM Parameters
Set the JVM minimum and maximum heap sizes according to the memory usage of the application.
Set the minimum and maximum heap sizes to the same value.
Set the maximum perm size to 256m. This is not required for Java 8.
Modify the parameters below in /opt/apigee/customer/application/message-processor.properties to set these values:
bin_setenv_min_mem=3072m
bin_setenv_max_mem=3072m
bin_setenv_max_permsize=256m
# The entry below sets the metaspace size for JDK 1.8 to the max perm size.
bin_setenv_meta_space_size=${bin_setenv_max_permsize}
These settings need to be fine-tuned by conducting performance tests.
Note: Increase the JVM heap size beyond 3 GB only if the application (Message Processor) has high memory usage. As the heap size increases, the JVM has to work harder at garbage collection, which may cause higher CPU utilization. This should be considered in VM/machine capacity planning.
Router and Message Processor JMX Port
Add the parameters below to enable a JMX port on the JVM for remote connections. Use a tool such as JConsole to monitor JVM health during a performance test by connecting to the JMX port.
Edit the /opt/apigee/edge-message-processor/bin/start and /opt/apigee/edge-router/bin/start files and add the following parameters:
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port number>
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
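Note that these flags disable JMX authentication and SSL, so enable them only in a test environment. Once the JVM restarts with the flags, JConsole can attach remotely; the host name and port 1099 below are placeholders for whatever values you configured:

```
jconsole mp-host.example.com:1099
```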
Router and Message Processor http timeouts
Depending on the target and Message Processor proxy latencies, the following timeouts on the Router and Message Processor may need to be adjusted:
Router - Modify /opt/apigee/customer/application/router.properties file
#Router client timeout
conf_load_balancing_load.balancing.driver.server.keepalive.timeout=<value in milliseconds>
#Router Message Processor timeout
conf_load_balancing_load.balancing.driver.proxy.read.timeout=<value in milliseconds>
Message Processor - Modify /opt/apigee/customer/application/message-processor.properties file
#Message Processor target call timeout
conf_http_HTTPTransport.io.timeout.millis=<value in milliseconds>
Timeout values should be set such that MP target timeout < Router-to-MP timeout < Router client timeout.
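As an illustration only (the millisecond values below are hypothetical, not recommendations), one set that satisfies this ordering:

```
# message-processor.properties: MP target timeout, 55 seconds
conf_http_HTTPTransport.io.timeout.millis=55000
# router.properties: Router-to-MP timeout, 57 seconds
conf_load_balancing_load.balancing.driver.proxy.read.timeout=57000
# router.properties: Router client keepalive timeout, 60 seconds
conf_load_balancing_load.balancing.driver.server.keepalive.timeout=60000
```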
Performance Tests and Data Collection
A series of tests may be required to identify the root cause of a performance issue and fix it.
Performance tests
1. Breakpoint test
Identify the system breakpoint by gradually adding load and sustaining it for some period. Benchmark the system breakpoint in an initial performance test run. Typically these tests are conducted over a two-hour timeframe.
2. Stress test
Push the load to just below the breakpoint for an extended period. This test may reveal thread blocking/contention issues and/or memory issues.
3. Endurance test
Run the performance test with load above the expected average production TPS for an extended period, typically 12 to 24 hours. The primary focus of these tests is identifying JVM heap/memory management issues, out-of-memory errors, etc.
Data Collection
The following data is critical for identifying the root cause of performance issues.
1. Jstack dumps for Message Processor and Router JVMs
Jstack dumps are thread dumps taken from the JVM using the "jstack" tool bundled with the JDK. Take dumps every 2 minutes throughout the test.
Problem areas: Look for many threads with "BLOCKED" status in the same stack trace. Thread blocks will happen within the system, but if many threads are blocked in the same stack trace across multiple jstack dumps throughout the test, that indicates a problem area. If these threads are blocked on the same object lock, a code change may be required to fix the issue.
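The collection and the BLOCKED-thread check can be scripted. The sketch below is illustrative only: the PID discovery via pgrep and the file naming are assumptions, not an Apigee convention.

```shell
# Hypothetical helper: count threads in the BLOCKED state in one jstack dump file.
count_blocked() {
  grep -c 'java.lang.Thread.State: BLOCKED' "$1"
}

# Illustrative collection loop: one dump every 2 minutes over a 2-hour test.
# PID discovery via pgrep is an assumption; adjust for your environment.
collect_dumps() {
  pid=$(pgrep -f edge-message-processor | head -1)
  for i in $(seq 1 60); do
    jstack "$pid" > "jstack_${pid}_$(date +%Y%m%d%H%M%S).txt"
    sleep 120
  done
}
```

If count_blocked reports a consistently high number across successive dumps, inspect the common stack trace for a shared object lock.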
2. Netstat dumps
Netstat dumps provide the number of sockets in different network states (ESTABLISHED, TIME_WAIT, CLOSE_WAIT). Collect this data on both the Router and Message Processor components.
Problem areas: Look for too many sockets in CLOSE_WAIT; this indicates the application is not closing sockets properly, and eventually the server will run out of the maximum number of sockets allowed for the component.
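A quick way to tally socket states from a netstat dump (the awk field position assumes the common Linux `netstat -an` TCP output layout; verify against your platform):

```shell
# Summarize TCP socket states from `netstat -an` output read on stdin.
# On typical Linux netstat output, field 6 is the state column for TCP lines.
socket_state_summary() {
  awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
}

# Typical usage during a test (run on both Router and Message Processor hosts):
# netstat -an | socket_state_summary
```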
3. Router and Message Processor logs
Make sure the Message Processor and Router logging level is INFO or higher. There will be severe performance degradation if the logging level is DEBUG.
Problem areas: Too many errors being generated could potentially indicate a problem; investigate the code further. If the Router logs contain "Connection Reset by Peer" error messages, that indicates the backend or MP is taking too long to process the transaction and the Router is closing the HTTP connection before the response is ready on the MP. Adjust the HTTP timeouts on the Router and MP according to the API use case.
4. Heap dumps
Take heap dumps at the beginning and end of the test and compare them to identify any potential memory handling issues. It is also a good idea to take heap dumps when performance issues occur during the test. They are handy for identifying memory leaks and other memory-related issues, such as creating a huge number/size of objects for every request and throwing them away. Taking a heap dump can impact system performance, as it is a stop-the-world event for the JVM. It should be done only when there are performance issues, or at the beginning and end of the test.
Tip: There are many tools for viewing heap dumps, but they do not work consistently on all heap dumps. Make sure the heap size displayed matches the JVM configuration. If the heap size shown is too low, most likely the tool is failing to parse the dump; try another one.
5. Jmap class histogram
Sometimes it is hard to catch the application memory usage pattern through a heap dump, as it gives only a snapshot at that point in time. Class histogram data gives the number of objects created per class and the memory occupied in the heap. These dumps don't impact system performance if the "live" option is not used with the jmap command; the "live" option triggers a Full GC and will impact performance. More frequent dumps (every 2 minutes) are better for analysis.
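For trend analysis across histogram dumps, extracting the byte count for a suspect class can help. A minimal sketch follows; the column layout assumed here (rank, #instances, #bytes, class name) matches typical `jmap -histo` output, but verify against your JDK.

```shell
# Print the #bytes column for a given class name from a jmap -histo dump file.
# Assumes the usual 4-column layout: num, #instances, #bytes, class name.
histo_bytes_for_class() {
  awk -v cls="$2" '$4 == cls {print $3}' "$1"
}

# Illustrative collection loop (every 2 minutes, without the "live" option):
# for i in $(seq 1 60); do jmap -histo "$pid" > "histo_$(date +%s).txt"; sleep 120; done
```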
Thread pool Tuning
Most private cloud customers will not need to customize the Message Processor main thread pool configuration. The default max pool size is 100. If the "Apigee-Main" thread pool stays at its max limit and tasks are being rejected by the pool, increasing the thread pool max limit may help fix the problem. Expand the pool in increments of 100 and run tests after each change.
If there is a high number of threads in the pool, context switching and maintaining the threads will cause significant performance overhead. Reduce the keep-alive setting if there is a high number of idle threads in the pool throughout the test.
Adjust the Message Processor thread pool configuration in the apigee/edge-message-processor/token/default.properties file:
#Default value 60000, adjust this entry if there are too many idle threads during the test.
conf_threadpool_keepalive.time=<Value in milliseconds>
#Default value 100, increase in steps of 100 (200, 300 etc). Conduct performance test after each change.
conf_threadpool_maximum.pool.size=
#Default value is 10, when all the threads in the pool are busy, tasks will get pushed to this queue. Adjust this parameter along with max thread pool size for fine-tuning the thread pool.
conf_threadpool_pool.queue.size=
Garbage Collection Tuning
Add the -verbose:gc argument to the Router and Message Processor JVMs in the performance test environment. This will log GC activity to system.out. Redirect the JVM process stdout to system.log by modifying the startup command. This gives verbose information on young generation, old generation, and Full GC activity.
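Beyond -verbose:gc, a commonly used set of HotSpot GC logging flags is shown below. Flag availability varies by JVM version, so treat this as a sketch:

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
```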
Start with the default GC settings for the JVMs, using the JVM heap settings specified in the first section of this article. If GC collection time increases over the duration of the test, or there are many old generation GCs some time into the test, that indicates a problem in heap usage, which will impact system performance. Try fine-tuning the JVM GC parameters as a first resort.
If the application is creating and destroying too many big objects, a code change may be required to reduce the number of objects; GC fine-tuning may not fix the issue.
JVM GC settings
1. Use the CMS GC collection algorithm if the default algorithm is not working well for the setup:
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
2. Disable explicit GC from the application (-XX:+DisableExplicitGC).
3. Print GC activity timestamps in stdout (-XX:+PrintGCDateStamps). This helps identify the frequency of GC cycles.
4. Divide the heap equally between the new and old generations (-XX:NewRatio=1). This may need to be changed depending on the lifetime of live objects in the heap for a customer setup.
5. Survivor ratio (-XX:SurvivorRatio=8). This is a good starting point and may need to be changed depending on the application.
6. Use parallel threads for young generation collection (-XX:+UseParNewGC).
7. Number of parallel threads (-XX:ParallelGCThreads=2 for an 8-core CPU). By default the JVM allocates 1 thread per CPU core, which may not be required; start by allocating 1/4 of the CPU cores to GC threads.
8. Old generation collection initiation settings (-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly). This initiates old generation collection when the old generation heap is 75% full. The value may need to be fine-tuned based on the memory usage of the application. -XX:+UseCMSInitiatingOccupancyOnly should be set to enforce -XX:CMSInitiatingOccupancyFraction.
Note: G1 algorithm - Java 1.7 and later includes an option for a newer garbage collection algorithm named G1. It works well when the heap size is big (heap > 3 GB). Consider the G1 algorithm if the heap size is large.
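To experiment with G1, the flags below are a typical starting point. The 200 ms pause target is illustrative, not a recommendation, and CMS-specific flags should be removed when switching:

```
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
```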
Cassandra
Find the recommended settings for Cassandra here: http://docs.datastax.com/en/cassandra/2.0/cassandra/install/installRecommendSettings.html
Hi @akinadiyil
I've tried changing all the timeout settings for both router and message processor which you mention above, and I still see an HTTP 504 Gateway timeout after ~57000ms. I expect the response to complete in roughly 120 seconds, hence the need to augment the timeout settings.
I changed the config values using the best practices, as described here:
http://docs.apigee.com/private-cloud/latest/how-configure-edge
My config files look like this:
[root@ilapg application]# pwd
/opt/apigee/customer/application
[root@ilapg application]# ls
message-processor.properties  router.properties
[root@ilapg application]# cat router.properties
conf_router_ServerContainer.io.timeout.millis=620000
conf_router_Client.pool.iotimeout=610000
conf_http_HTTPTransport.io.timeout.millis=600000
[root@ilapg application]# cat message-processor.properties
bin_setenv_min_mem=1024m
bin_setenv_max_mem=6144m
bin_setenv_max_permsize=1024m
conf_nodejs_connect.ranges.denied=
conf_http_HTTPTransport.io.timeout.millis=600000
When I restart the RMP, I do see the output saying that the above config values have been updated as expected. However, there is no effect on behavior.
What timeout setting am I missing here? I can't seem to find anything written about this in the documentation either. Thanks in advance for your help!
Best,
Chris
@Chris Covney Please modify the files below for the timeout values, as mentioned in this article.
Router - apigee/edge-router/token/default.properties
Message Processor- apigee/edge-message-processor/token/default.properties
Hi @akinadiyil
Thanks for clarifying. Why do the Apigee docs advertise that the only way to change settings is through the apigee/customer/application/*.properties files, when in fact for timeout settings one must edit the default values directly?
EDIT/UPDATE: I changed the values in the default.properties files to 600000 ms for each of the three timeout-related properties and bounced both RMPs, and still no effect. The timeout still occurs after 57000 ms. What is going on? What setting are we missing here?
Best,
Chris
Thanks again for your help. I wanted to follow up with you to see if you were able to figure out which settings we are missing in order to increase the RMP timeout settings.
I've tried every combination of your instructions plus the official Apigee docs. I still cannot determine how to change the timeout settings. Please let me know if you determine the missing step. Thanks!
Best,
Chris
@Chris Covney - I have updated the Router timeout properties. It looks like there was a change in these property names. Please try these properties and let me know in case of any issues.
Is it advisable to edit the Message Processor start script to add the parameters for garbage collection, or is there a better way/best practice for achieving this?