Troubleshooting: WebSphere 5.1 Application Server hangs
Symptom
When this issue occurs, the Application Server hangs intermittently, with no obvious exceptions or error messages in the log files before or during the hang. The Application Server must be stopped and restarted to restore service to users.
In WebSphere, when the Node Agent detects that it can no longer communicate with the hung Application Server, it automatically restarts that Application Server to restore service (subject to site-specific configuration).
Solution
To determine whether you have encountered this known issue, collect IBM’s MustGather information the next time the system hangs.
Steps 7-12 of that procedure are particularly important, because they capture thread dumps while the system is hung.
7. kill -3 [PID_of_hung_JVM]
8. Wait two minutes.
9. kill -3 [PID_of_hung_JVM]
10. Wait two minutes.
11. kill -3 [PID_of_hung_JVM]
12. Wait two minutes.
Each kill -3 command causes the JVM to write a javacore.txt file to install_root, install_root/bin, or the configured working directory. These files are important for analyzing what is causing the hang.
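Steps 7-12 can be scripted so that the dumps are taken at consistent intervals. The following is a minimal sketch, not part of the IBM procedure: capture_dumps is a hypothetical helper name, and the defaults of three dumps and 120 seconds simply mirror the steps above.

```shell
#!/bin/sh
# Sketch only: capture_dumps is a hypothetical helper, not IBM tooling.
# Usage: capture_dumps PID [COUNT] [INTERVAL_SECONDS]
capture_dumps() {
    pid=$1
    count=${2:-3}        # number of thread dumps to request
    interval=${3:-120}   # seconds to wait after each dump (two minutes)
    i=1
    while [ "$i" -le "$count" ]; do
        # SIGQUIT (kill -3) asks a JVM for a thread dump; stop if the PID is gone
        kill -3 "$pid" || return 1
        echo "dump $i of $count requested for PID $pid"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, capture_dumps 12345 would request three dumps from the hypothetical PID 12345, waiting two minutes after each one.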
Log files requested to assist diagnosis:
- PegaRULES.log
- SystemErr.log
- SystemOut.log
- native_stderr.log
- native_stdout.log
- All Node agent log files
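Gathering the named log files can also be scripted. This is a rough sketch under stated assumptions: collect_logs is a hypothetical helper, it assumes that all of the logs (including the Node Agent's) live somewhere below a single directory that you pass in, and it matches only the file names listed above.

```shell
#!/bin/sh
# Sketch only: collect_logs is a hypothetical helper.
# Usage: collect_logs SEARCH_ROOT DEST_DIR
collect_logs() {
    src=$1    # directory to search (assumption: all logs live below it)
    dest=$2   # directory that receives the copies
    mkdir -p "$dest" || return 1
    for name in PegaRULES.log SystemErr.log SystemOut.log \
                native_stderr.log native_stdout.log; do
        find "$src" -type f -name "$name" | while IFS= read -r f; do
            # Flatten the path into the file name so that same-named logs
            # from different servers do not overwrite each other
            cp "$f" "$dest/$(printf '%s' "$f" | tr '/' '_')"
        done
    done
}
```

Node Agent logs that use other file names would need to be added to the list by hand.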
The issue is diagnosed from thread dump information captured at the time of the Application Server hang. The following extract from a javacore.txt file shows the deadlocked threads that cause the Application Server to hang:
1LKREGMONDUMP JVM System Monitor Dump (registered monitors):
2LKREGMON JITC CHA lock (0x750CA748): owner "Servlet.Engine.Transports : 3" (0x86071C20), entry count 3
3LKWAITERQ Waiting to enter:
3LKWAITER "Servlet.Engine.Transports : 0" (0x82513AA0)
2LKREGMON JITC MB UPDATE lock (0x75BA60E8): <unowned>
2LKREGMON JITC Global_Compile lock (0x75BA6038): <unowned>
2LKREGMON Integer lock access-lock (0x750CA5B8): <unowned>
2LKREGMON Free Class Loader Cache Entry lock (0x30253CA8): <unowned>
2LKREGMON IO lock (0x30253BF8): <unowned>
2LKREGMON Evacuation Region lock (0x30253A98): <unowned>
2LKREGMON Heap Promotion lock (0x302539E8): <unowned>
2LKREGMON Sleep lock (0x30253938): <unowned>
3LKNOTIFYQ Waiting to be notified:
3LKWAITNOTIFY "PegaRULES Usage Tracking Daemon" (0x7CE34020)
3LKWAITNOTIFY "Thread-8" (0x7A037120)
3LKWAITNOTIFY "Thread-7" (0x7A036B20)
3LKWAITNOTIFY "Thread-22" (0x7B815B20)
3LKWAITNOTIFY "Thread-10" (0x7AC7C620)
3LKWAITNOTIFY "Thread-25" (0x7F04A8A0)
3LKWAITNOTIFY "EMAIL-Thread-695" (0x97F83420)
IBM confirms this matches a known issue logged under APAR PK12462 – “Deadlock in JIT between CHA & Global Compile Lock”, which is fixed in IBM SDK 1.4.2 Service Release 4.
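When a javacore file is large, it can help to list only the registered monitors that are owned and have threads waiting to enter them. The sketch below is an illustration, not an IBM tool: contended_monitors is a hypothetical helper name, and it assumes that each 2LKREGMON record, including its owner, sits on a single line of the file.

```shell
#!/bin/sh
# Sketch only: contended_monitors is a hypothetical helper.
# Usage: contended_monitors JAVACORE_FILE
contended_monitors() {
    awk '
        # New monitor record: remember it only when it has an owner
        /^2LKREGMON/ {
            owned = ($0 ~ /: owner/)
            if (owned) {
                lock = $0
                sub(/^2LKREGMON[ \t]+/, "", lock)
            }
            next
        }
        # The notify queue lists threads in wait(), not threads blocked on entry
        /^3LKNOTIFYQ/ { owned = 0; next }
        # A thread blocked while trying to enter the current owned monitor
        /^3LKWAITER / && owned {
            waiter = $0
            sub(/^3LKWAITER[ \t]+/, "", waiter)
            print lock " <- blocked: " waiter
        }
    ' "$1"
}
```

An owned monitor with waiters is only a candidate. Confirming a deadlock still requires checking what the owning thread is itself waiting on.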
Additional Resources
- Latest Fixes List of IBM Developer Kits (32-bit) for AIX, Java Technology edition 1.4.2
- IBM WebSphere Application Server SDK Latest Interim Fix for V5.1.1
- IBM APAR PK12462
Article originally written 10/9/2006.