Detect problems with Cassandra nodes by analyzing the operating system (OS)
metrics.
- vmstat
- Identifies IO bottlenecks.
- In the following example, the wait-io (wa) value is higher than ideal and is
likely contributing to poor read/write latencies. Running this command
repeatedly during a period of high latencies shows whether the node is IO
bound and whether that is a likely cause; see the sketch after the sample
output for one way to automate the check.
root@ip-10-123-5-62:/usr/local/tomcat# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 4 0 264572 32008 15463144 0 0 740 792 0 0 6 1 91 2 0
2 3 0 309336 32116 15421616 0 0 55351 109323 59250 89396 13 2 72 13 0
2 2 0 241636 32212 15487008 0 0 57742 50110 61974 89405 13 2 78 7 0
2 0 0 230800 32632 15498648 0 0 63669 11770 64727 98502 15 3 80 2 0
3 2 0 270736 32736 15456960 0 0 64370 94056 62870 94746 13 3 75 9 0
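- One way to automate this check is to sample the wa column (the 16th field
in the vmstat layout above) and flag high values. This is a minimal sketch,
not standard tooling: the threshold of 20 is arbitrary, -n suppresses
repeated headers, and strftime assumes GNU awk.
# Flag vmstat samples whose wait-io (wa, field 16) exceeds an
# illustrative threshold of 20; tune the threshold for your workload.
vmstat -n 5 | awk 'NR > 2 && $16 > 20 { print strftime("%T"), "high wa:", $16 }'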
- netstat -anp | grep 9042
- Shows if network buffers are building up.
- The second and third columns (Recv-Q and Send-Q) show the sizes of the TCP
receive and send queues in bytes. Consistently large values indicate that
either the local Cassandra node or the client cannot keep up with the
network traffic. See the following sample output, and the polling sketch
after it:
root@ip-10-123-5-62:/usr/local/tomcat# netstat -anp | grep 9042
tcp 0 0 10.123.5.62:9042 0.0.0.0:* LISTEN 475/java
tcp 0 0 10.123.5.62:9042 10.123.5.58:36826 ESTABLISHED 475/java
tcp 0 0 10.123.5.62:9042 10.123.5.19:54058 ESTABLISHED 475/java
tcp 0 138 10.123.5.62:9042 10.123.5.36:38972 ESTABLISHED 475/java
tcp 0 0 10.123.5.62:9042 10.123.5.75:50436 ESTABLISHED 475/java
tcp 0 0 10.123.5.62:9042 10.123.5.23:46142 ESTABLISHED 475/java
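- A single snapshot can be misleading; what matters is whether the queues
stay large. A minimal polling sketch, assuming the Linux netstat layout
above where column 2 is Recv-Q and column 3 is Send-Q:
# Every 5 seconds, print port 9042 connections with a nonzero
# receive or send queue; persistently nonzero values suggest that
# one side cannot keep up with the traffic.
while true; do
  netstat -anp | awk '$1 == "tcp" && /:9042/ && ($2 > 0 || $3 > 0)'
  sleep 5
done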
- Log files
- Shows why the Cassandra process stopped working on the node. The relevant
files are usually found in the /var/log/* directory.
- In some cases the OS might have killed the process to protect the system
from a larger failure caused by resource exhaustion. The most common case is
running out of memory, which is indicated by an OOM killer message in the
kernel logs; a sketch for locating such messages follows.
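- To confirm an OOM kill, search the kernel log for the killer's message.
This is a minimal sketch; log file locations vary by distribution, and
journalctl assumes a systemd host.
# The OOM killer logs lines like "Out of memory: Killed process ...".
dmesg -T | grep -i 'out of memory'
# Distribution-dependent syslog locations (one of these usually exists):
grep -i 'killed process' /var/log/syslog /var/log/messages 2>/dev/null
# On systemd hosts, the journal also captures kernel messages:
journalctl -k | grep -i oom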