System performance optimization is an iterative process. Depending on the complexity of the APIs and the custom configuration of components, it can be challenging to identify the root cause of performance issues.
This article provides guidance on identifying problem areas, fine-tuning the configuration, and improving the system performance of Apigee private cloud deployments.
The Router, Message Processor, and Cassandra components are involved in processing API requests within an Apigee Edge planet.
Router and Message Processor
The following sections provide guidelines for performance tuning the Router and Message Processor components.
Router and Message Processor JVM Parameters
Set the JVM minimum and maximum heap sizes according to the memory usage of the application.
Set the minimum and maximum heap sizes to the same value.
Set the maximum perm size to 256m. This is not required for Java 8.
Modify the parameters below in /opt/apigee/customer/application/message-processor.properties to set these values:
bin_setenv_min_mem=3072m
bin_setenv_max_mem=3072m
bin_setenv_max_permsize=256m
# The entry below sets the metaspace size for JDK 1.8 to the max perm size.
bin_setenv_meta_space_size=${bin_setenv_max_permsize}
These settings need to be fine-tuned by conducting performance tests.
Note: Increase the JVM heap size beyond 3 GB only if the application (Message Processor) has high memory usage. As the heap size increases, the JVM has to work harder at garbage collection, which may cause higher CPU utilization. This should be considered in VM/machine capacity planning.
Router and Message Processor JMX Port
Add the parameters below to enable a JMX port on the JVM for remote connections. Use a tool such as JConsole to monitor JVM health during a performance test by connecting to the JMX port.
Edit the /opt/apigee/edge-message-processor/bin/start and /opt/apigee/edge-router/bin/start files and add the following parameters:
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port number>
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
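Note that these flags disable JMX authentication and SSL, so enable them only in a test environment. Once the JVM restarts with the flags, JConsole can attach remotely; the host name and port 1099 below are placeholders for whatever values you configured:

```
jconsole mp-host.example.com:1099
```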
Router and Message Processor http timeouts
Depending on the target and Message Processor proxy latencies, the following timeouts on the Router and Message Processor may need to be adjusted:
Router - Modify /opt/apigee/customer/application/router.properties file
#Router client timeout
conf_load_balancing_load.balancing.driver.server.keepalive.timeout=<value in milliseconds>
#Router Message Processor timeout
conf_load_balancing_load.balancing.driver.proxy.read.timeout=<value in milliseconds>
Message Processor - Modify /opt/apigee/customer/application/message-processor.properties file
#Message Processor target call timeout
conf_http_HTTPTransport.io.timeout.millis=<value in milliseconds>
Timeout values should be set such that MP target timeout < Router-to-MP timeout < Router client timeout.
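As an illustration only (the millisecond values below are hypothetical, not recommendations), one set that satisfies this ordering:

```
# message-processor.properties: MP target timeout, 55 seconds
conf_http_HTTPTransport.io.timeout.millis=55000
# router.properties: Router-to-MP timeout, 57 seconds
conf_load_balancing_load.balancing.driver.proxy.read.timeout=57000
# router.properties: Router client keepalive timeout, 60 seconds
conf_load_balancing_load.balancing.driver.server.keepalive.timeout=60000
```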
Performance Tests and Data Collection
A series of tests may be required to identify the root cause of a performance issue and fix it.
Performance tests
1. Breakpoint test
Identify the system breakpoint by gradually adding load and sustaining it for some period. Benchmark the system breakpoint in an initial performance test run. Typically these tests are conducted over a two-hour timeframe.
2. Stress test
Push the load to just below the breakpoint for an extended period. This test may reveal thread blocking/contention issues and/or memory issues.
3. Endurance test
Run the performance test with load above the expected average production TPS for an extended period, typically 12 to 24 hours. The primary focus of these tests is identifying JVM heap/memory management issues, out-of-memory errors, etc.
Data Collection
The following data is critical for identifying the root cause of performance issues.
1. Jstack dumps for Message Processor and Router JVMs
Jstack dumps are thread dumps taken from the JVM using the "jstack" tool bundled with the JDK. Take dumps every 2 minutes throughout the test.
Problem areas: Look for many threads with "BLOCKED" status in the same stack trace. Thread blocks will happen within the system, but if many threads are blocked in the same stack trace across multiple jstack dumps throughout the test, that indicates a problem area. If these threads are blocked on the same object lock, a code change may be required to fix the issue.
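The collection and the BLOCKED-thread check can be scripted. The sketch below is illustrative only: the PID discovery via pgrep and the file naming are assumptions, not an Apigee convention.

```shell
# Hypothetical helper: count threads in the BLOCKED state in one jstack dump file.
count_blocked() {
  grep -c 'java.lang.Thread.State: BLOCKED' "$1"
}

# Illustrative collection loop: one dump every 2 minutes over a 2-hour test.
# PID discovery via pgrep is an assumption; adjust for your environment.
collect_dumps() {
  pid=$(pgrep -f edge-message-processor | head -1)
  for i in $(seq 1 60); do
    jstack "$pid" > "jstack_${pid}_$(date +%Y%m%d%H%M%S).txt"
    sleep 120
  done
}
```

If count_blocked reports a consistently high number across successive dumps, inspect the common stack trace for a shared object lock.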
2. Netstat dumps
Netstat dumps provide the number of sockets in different network states (ESTABLISHED, TIME_WAIT, CLOSE_WAIT). Collect this data on both the Router and Message Processor components.
Problem areas: Look for too many sockets in CLOSE_WAIT; this indicates the application is not closing sockets properly, and eventually the server will run out of the maximum number of sockets allowed for the component.
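A quick way to tally socket states from a netstat dump (the awk field position assumes the common Linux `netstat -an` TCP output layout; verify against your platform):

```shell
# Summarize TCP socket states from `netstat -an` output read on stdin.
# On typical Linux netstat output, field 6 is the state column for TCP lines.
socket_state_summary() {
  awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
}

# Typical usage during a test (run on both Router and Message Processor hosts):
# netstat -an | socket_state_summary
```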
3. Router and Message Processor logs
Make sure the Message Processor and Router logging level is INFO or higher. There will be severe performance degradation if the logging level is DEBUG.
Problem areas: Too many errors being generated could potentially indicate a problem; investigate the code further. If the Router logs contain "Connection Reset by Peer" error messages, that indicates the backend or MP is taking too long to process the transaction and the Router is closing the HTTP connection before the response is ready on the MP. Adjust the HTTP timeouts on the Router and MP according to the API use case.
4. Heap dumps
Take heap dumps at the beginning and end of the test and compare them to identify any potential memory handling issues. It is also a good idea to take heap dumps when performance issues occur during the test. They are handy for identifying memory leaks and other memory-related issues, such as creating a huge number/size of objects for every request and throwing them away. Taking a heap dump can impact system performance, as it is a stop-the-world event for the JVM. It should be done only when there are performance issues, or at the beginning and end of the test.
Tip: There are many tools for viewing heap dumps, but they do not work consistently on all heap dumps. Make sure the heap size displayed matches the JVM configuration. If the heap size shown is too low, most likely the tool is failing to parse the dump; try another one.
5. Jmap class histogram
Sometimes it is hard to catch the application memory usage pattern through a heap dump, as it gives only a snapshot at that point in time. Class histogram data gives the number of objects created per class and the memory occupied in the heap. These dumps don't impact system performance if the "live" option is not used with the jmap command; the "live" option triggers a Full GC and will impact performance. More frequent dumps (every 2 minutes) are better for analysis.
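For trend analysis across histogram dumps, extracting the byte count for a suspect class can help. A minimal sketch follows; the column layout assumed here (rank, #instances, #bytes, class name) matches typical `jmap -histo` output, but verify against your JDK.

```shell
# Print the #bytes column for a given class name from a jmap -histo dump file.
# Assumes the usual 4-column layout: num, #instances, #bytes, class name.
histo_bytes_for_class() {
  awk -v cls="$2" '$4 == cls {print $3}' "$1"
}

# Illustrative collection loop (every 2 minutes, without the "live" option):
# for i in $(seq 1 60); do jmap -histo "$pid" > "histo_$(date +%s).txt"; sleep 120; done
```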
Thread pool Tuning
Most private cloud customers will not need to customize the Message Processor main thread pool configuration. The default max pool size is 100. If the "Apigee-Main" thread pool stays at its max limit and tasks are being rejected by the pool, increasing the thread pool max limit may help fix the problem. Expand the pool in increments of 100 and run tests after each change.
If there is a high number of threads in the pool, context switching and maintaining the threads will cause significant performance overhead. Reduce the keep-alive setting if there is a high number of idle threads in the pool throughout the test.
Adjust the Message Processor thread pool configuration in the apigee/edge-message-processor/token/default.properties file:
#Default value 60000, adjust this entry if there are too many idle threads during the test.
conf_threadpool_keepalive.time=<Value in milliseconds>
#Default value 100, increase in steps of 100 (200, 300 etc). Conduct performance test after each change.
conf_threadpool_maximum.pool.size=
#Default value is 10, when all the threads in the pool are busy, tasks will get pushed to this queue. Adjust this parameter along with max thread pool size for fine-tuning the thread pool.
conf_threadpool_pool.queue.size=
Garbage Collection Tuning
Add the -verbose:gc argument to the Router and Message Processor JVMs in the performance test environment. This will log GC activity to system.out. Redirect the JVM process stdout to system.log by modifying the startup command. This gives verbose information on young generation, old generation, and Full GC activity.
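Beyond -verbose:gc, a commonly used set of HotSpot GC logging flags is shown below. Flag availability varies by JVM version, so treat this as a sketch:

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
```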
Start with the default GC settings for the JVMs, using the JVM heap settings specified in the first section of this article. If GC collection time increases over the duration of the test, or there are many old generation GCs some time into the test, that indicates a problem in heap usage, which will impact system performance. Try fine-tuning the JVM GC parameters as a first resort.
If the application is creating and destroying too many big objects, a code change may be required to reduce the number of objects; GC fine-tuning may not fix the issue.
JVM GC settings
1. Use the CMS GC collection algorithm if the default algorithm is not working well for the setup:
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
2. Disable explicit GC from the application (-XX:+DisableExplicitGC).
3. Print GC activity timestamps in stdout (-XX:+PrintGCDateStamps). This helps identify the frequency of GC cycles.
4. Divide the heap equally between the new and old generations (-XX:NewRatio=1). This may need to be changed depending on the lifetime of live objects in the heap for a customer setup.
5. Survivor ratio (-XX:SurvivorRatio=8). This is a good starting point and may need to be changed depending on the application.
6. Use parallel threads for young generation collection (-XX:+UseParNewGC).
7. Number of parallel threads (-XX:ParallelGCThreads=2 for an 8-core CPU). By default the JVM allocates 1 thread per CPU core, which may not be required; start by allocating 1/4 of the CPU cores to GC threads.
8. Old generation collection initiation settings (-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly). This initiates old generation collection when the old generation heap is 75% full. The value may need to be fine-tuned based on the memory usage of the application. -XX:+UseCMSInitiatingOccupancyOnly should be set to enforce -XX:CMSInitiatingOccupancyFraction.
Note: G1 algorithm - Java 1.7 and later includes an option for a newer garbage collection algorithm named G1. It works well when the heap size is big (heap > 3 GB). Consider the G1 algorithm if the heap size is large.
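To experiment with G1, the flags below are a typical starting point. The 200 ms pause target is illustrative, not a recommendation, and CMS-specific flags should be removed when switching:

```
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
```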
Cassandra
Find the recommended settings for Cassandra here: http://docs.datastax.com/en/cassandra/2.0/cassandra/install/installRecommendSettings.html
Hi @akinadiyil
I've tried changing all the timeout settings for both router and message processor which you mention above, and I still see an HTTP 504 Gateway timeout after ~57000ms. I expect the response to complete in roughly 120 seconds, hence the need to augment the timeout settings.
I changed the config values using the best practices, as described here:
http://docs.apigee.com/private-cloud/latest/how-configure-edge
My config files look like this:
[root@ilapg application]# pwd
/opt/apigee/customer/application
[root@ilapg application]# ls
message-processor.properties  router.properties
[root@ilapg application]# cat router.properties
conf_router_ServerContainer.io.timeout.millis=620000
conf_router_Client.pool.iotimeout=610000
conf_http_HTTPTransport.io.timeout.millis=600000
[root@ilapg application]# cat message-processor.properties
bin_setenv_min_mem=1024m
bin_setenv_max_mem=6144m
bin_setenv_max_permsize=1024m
conf_nodejs_connect.ranges.denied=
conf_http_HTTPTransport.io.timeout.millis=600000
When I restart the RMP, I do see the output saying that the above config values have been updated as expected. However, there is no effect on behavior.
What timeout setting am I missing here? I can't seem to find anything written about this in the documentation either. Thanks in advance for your help!
Best,
Chris
@Chris Covney Please modify the files below for the timeout values, as mentioned in this article.
Router - apigee/edge-router/token/default.properties
Message Processor- apigee/edge-message-processor/token/default.properties
Hi @akinadiyil
Thanks for clarifying. Why do the Apigee docs advertise that the only way to change settings is through the apigee/customer/application/*.properties files, when in fact for timeout settings one must edit the default values directly?
EDIT/UPDATE: I changed the values in the default.properties files to 600000 ms for each of the three timeout-related properties and bounced both RMPs, and still no effect. The timeout still occurs after 57000 ms. What is going on? What setting are we missing here?
Best,
Chris
Thanks again for your help. I wanted to follow up with you to see if you were able to figure out which settings we are missing in order to increase the RMP timeout settings.
I've tried every combination of your instructions plus the official Apigee docs. I still cannot determine how to change the timeout settings. Please let me know if you determine the missing step. Thanks!
Best,
Chris
@Chris Covney - I have updated the Router timeout properties. It looks like there was a change in these property names. Please try these properties and let me know in case of any issues.
Is it advisable to edit the Message Processor start script to add the parameters for garbage collection, or is there a better way/best practice for achieving this?