Troubleshooting: SOAP Connection errors with many TCP sockets in a CLOSE_WAIT state (Red Hat Enterprise Linux)
When running activities in production on Red Hat Enterprise Linux, you see errors in the Clipboard tool and in the Pega log. Examining the connections from the client machine to the SOAP service endpoints reveals tens of thousands of TCP sockets in the CLOSE_WAIT state. Further analysis shows that this situation results from the Axis2 connection pooling implemented by HFix-6324 and HFix-6912.
Errors
In the Clipboard tool
'java.net.BindException'
'ResourceUnavailableException'
'SOAP Fault Reason: Cannot assign requested address'
In the Pega log
'SOAP Fault Reason: Cannot assign requested address'
'java.io.IOException: Invalid keystore format'
'java.io.EOFException'
Explanation
The default limits for Axis2 connection pooling are set too high, and on this operating system (Red Hat Enterprise Linux) the TCP sockets are not reclaimed after the specified timeout. This behavior depends on the client-side operating system on which PRPC is running.
Consequently, the operating system is overrun with TCP sockets that cannot be reused but remain allocated to the connection pool managed by Axis2. All endpoints are SOAP services in use by the PRPC application.
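To confirm the symptom on the client machine, you can count sockets by TCP state. The following is a minimal sketch, not a supported Pega utility. It assumes a Linux host where /proc/net/tcp is readable and covers IPv4 sockets only (IPv6 sockets are listed separately in /proc/net/tcp6). The function name count_tcp_states is hypothetical.

```python
from collections import Counter

# Hex state codes used by the Linux kernel in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text content of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        # Columns: sl, local_address, rem_address, st, ...
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts
```

On the machine where PRPC runs, pass the content of /proc/net/tcp to count_tcp_states and read the "CLOSE_WAIT" entry. A count in the thousands that does not fall over time matches the situation described here.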
Suggested approach
To resolve the issue, specify the following Dynamic System Settings:
Pega-IntegrationEngine axis2_max_connections 1000
Pega-IntegrationEngine axis2_max_hostconnections 100
This limits the overall allocation of TCP sockets to the value of axis2_max_connections (1000). Any one host is allowed at most 100 active connections (axis2_max_hostconnections).
When the endpoint closes the connection, it sends a FIN, and the local TCP socket enters the CLOSE_WAIT state. The socket remains in that state until a new connection is required. Axis2 then closes the socket, which moves it through the LAST_ACK state, and the socket is immediately reused for the new connection request. On some operating systems, the socket moves to LAST_ACK or is reclaimed by the operating system on its own.
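The state changes involved can be sketched as a small lookup table. This is an illustration of the passive-close path of the standard TCP state machine, not Axis2 or Pega code. The event labels and the walk function are hypothetical names chosen for the example.

```python
# Passive-close path of the standard TCP state machine, as seen from the
# client that holds the pooled socket. Event labels are illustrative only.
PASSIVE_CLOSE = {
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",  # the endpoint closed its side
    ("CLOSE_WAIT", "app close"): "LAST_ACK",    # the local close sends our FIN
    ("LAST_ACK", "recv ACK"): "CLOSED",         # the endpoint acknowledged our FIN
}

def walk(state, events):
    """Apply events in order; an event that does not apply leaves the state unchanged."""
    for event in events:
        state = PASSIVE_CLOSE.get((state, event), state)
    return state
```

The table shows why the sockets accumulate: after the endpoint sends its FIN, nothing moves the socket out of CLOSE_WAIT until the local application closes it.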
Axis2 HTTP Connection Pooling
Following is a summary of how Axis2 HTTP Connection Pooling works. It assumes that the connection uses an HttpConnection object.
- Initially, the number of connections in the pool is 0.
- When a request for a connection is made, a connection is returned following this sequence:
  - First, the connection pool specific to the host (hostPool) is checked for any free connections.
  - If available, a free connection is returned from this pool. Otherwise, the next condition is evaluated.
  - If the number of connections in hostPool is less than MaxHostConnections and the number of connections in the global connection pool (connectionPool) is less than MaxTotalConnections, a new connection is created and added to hostPool and connectionPool. Otherwise, the next condition is evaluated.
  - If the number of connections in hostPool is less than MaxHostConnections and connectionPool has free connections but has reached MaxTotalConnections, the least used connection is deleted from the pool and a new connection is created and added to hostPool and connectionPool. Otherwise, the request enters the waiting state.
  - Enter waiting state.
    - The request waits as long as specified by the timeout value. The default is 30 seconds.
    - When the timeout occurs, an exception is thrown.
- A connection has an associated socket.
  The Response Timeout configured in the Connect SOAP rule is also used as the Socket Timeout.
- At the end of the request, ServiceClient.cleanupTransport() is called.
  This releases the connection back to the pool. The TCP socket associated with the connection is not closed. It remains open until the platform reclaims it or the corresponding connection is reestablished, whichever occurs first.
  If the connection from the pool is reestablished, the underlying socket is checked for staleness.
  If the operating system does not see the socket in the CLOSE_WAIT state, the socket is reused.
- On Linux, you can observe that a socket remains in the ESTABLISHED state for approximately two minutes after the corresponding HTTP connection is released to the pool.
  At the end of two minutes, the socket timeout occurs, and its state changes to CLOSE_WAIT. It remains in the CLOSE_WAIT state for another two minutes before it is reclaimed by the operating system.
- On UNIX and Linux, the default maximum number of open file descriptors per process, which includes sockets, is commonly 1024.
  You can raise this limit, but doing so is risky and is not a best practice.
- Therefore, the best practice is to reduce the Dynamic System Setting values for axis2_max_hostconnections to 100 and for axis2_max_connections to 1000.
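To see how the recommended pool size compares with the per-process limit, you can read the open-file limit and subtract the pool ceiling. The following is a rough sketch under stated assumptions: it reports the limit of the Python process itself, so it approximates the application server's limit only when run as the same user from the same shell environment. The function names are hypothetical.

```python
import resource

def current_nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def headroom(max_connections, soft_limit):
    """Descriptors left for files, logs, and other sockets when the pool is full."""
    return soft_limit - max_connections

soft, hard = current_nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
print("headroom with axis2_max_connections=1000:", headroom(1000, soft))
```

With a soft limit of 1024 and axis2_max_connections set to 1000, only 24 descriptors remain for everything else the process opens, so the pool ceiling should be read as an upper bound rather than a target.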
See the References section for more information.
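The connection request sequence summarized in this section can be modeled as a small simulation. This is a simplified sketch of the decision order only, not the Axis2 implementation. The class PoolSketch and its methods are hypothetical, "least used" is approximated as the idle connection that was released longest ago, and the waiting state is represented by a return value instead of a blocking wait with a timeout.

```python
from collections import defaultdict

class PoolSketch:
    """Toy model of the connection request sequence; tracks counts, not sockets."""

    def __init__(self, max_host_connections, max_total_connections):
        self.max_host = max_host_connections
        self.max_total = max_total_connections
        self.host_total = defaultdict(int)  # host -> connections, busy or free
        self.free = []                      # idle connections by host, oldest first

    def total(self):
        return sum(self.host_total.values())

    def acquire(self, host):
        """Return the name of the branch that serves this request."""
        if host in self.free:
            self.free.remove(host)          # reuse the oldest idle connection for host
            return "reused"
        if self.host_total[host] < self.max_host:
            if self.total() < self.max_total:
                self.host_total[host] += 1
                return "created"
            if self.free:
                victim = self.free.pop(0)   # assumed stand-in for "least used"
                self.host_total[victim] -= 1
                self.host_total[host] += 1
                return "created after eviction"
        return "wait"                       # the real pool waits, then throws on timeout

    def release(self, host):
        """Return a connection to the pool; the underlying socket stays open."""
        self.free.append(host)
```

For example, with PoolSketch(100, 1000), the 101st concurrent request to a single host returns "wait", which corresponds to the per-host ceiling set by axis2_max_hostconnections.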
Additional information
How to debug SOAP Connect failure, ResourceUnavailableException or RemoteApplicationException