Connections between the management server and the message-processor hang in the FIN_WAIT2 state and are impacting the health of the system.

Case: OPDK: 4.15.01.00 on RHEL 6.6

The management-server opens TCP connections to the message-processor on port xxxx. As a result, the message-processor server has about 2400 connections in FIN_WAIT2.
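To confirm which peer the stuck connections belong to, the count can be broken down by remote address. This is a minimal sketch, not part of the product: the function name count_fin_wait2 is made up for illustration, and it assumes the usual Linux `netstat -tan` layout with the foreign address in column 5 and the state in column 6.

```shell
# Count FIN_WAIT2 sockets per remote address.
# Reads `netstat -tan` style output on stdin (assumed layout:
# column 5 = foreign address:port, column 6 = state).
count_fin_wait2() {
  awk '$6 == "FIN_WAIT2" {
         peer = $5
         sub(/:[0-9]+$/, "", peer)   # strip the trailing :port
         n[peer]++
       }
       END { for (p in n) print n[p], p }' | sort -rn
}

# Usage on the message-processor host (needs net-tools):
#   netstat -tan | count_fin_wait2
```

If nearly all of the entries point at the management-server address, that supports the picture described here.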

1. FIN_WAIT2 connections are released only when we restart the message-processor; they do not time out.

2. Timeout settings are:

cat /proc/sys/net/ipv4/tcp_fin_timeout

60

3. Connectivity is fine between the 2 machines.

4. A curl call from the management server to the MP returns 200 OK.

curl -v http://<MP>:8082/v1/runtime/organizations
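One possible reading of points 1 and 2 together: according to the Linux kernel documentation, tcp_fin_timeout only applies to orphaned connections, meaning sockets no longer referenced by any application. If the message-processor process still holds the sockets open, the 60-second setting would not release them, which would be consistent with them disappearing only on restart. The sketch below just reads the setting and reports whether it differs from the kernel default of 60; the function names and the optional path argument are assumptions added for illustration and testing.

```shell
# Print the FIN_WAIT2 timeout in seconds.
# The optional path argument is a hypothetical hook so the function
# can be pointed at a test file instead of /proc.
fin_timeout() {
  cat "${1:-/proc/sys/net/ipv4/tcp_fin_timeout}"
}

# Report whether the value differs from the kernel default (60 seconds).
check_fin_timeout() {
  t=$(fin_timeout "$1")
  if [ "$t" -eq 60 ]; then
    echo "tcp_fin_timeout=${t}s (kernel default)"
  else
    echo "tcp_fin_timeout=${t}s (tuned)"
  fi
}

# Usage on the message-processor host:
#   check_fin_timeout
```

Lowering the value is unlikely to help if the sockets are not orphaned, so this is a check rather than a fix.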


Similar issues are seen on systems with Nginx as backend.

The management-server appears not to be sending its FIN. Receiving that FIN would have moved the connection on the message-processor from FIN_WAIT2 into TIME_WAIT, after which it would eventually have closed. In this specific scenario, the message-processor is the server and the management-server is the client.

Code that we call the "reaper" has been added to the product to remove stale connections. We have seen similar behavior in a variety of scenarios, although in all of those cases it was related to target/backend server issues.

This code was made available in 4.15.04, so leveraging it would require an upgrade. It is also possible that this is related to the OS version, as RHEL 6.6 only became officially supported very recently, in 4.15.04.01.

An upgrade is highly recommended in this scenario.

Any more input, @Alex Toombs?