Monitoring Kafka
Use Java Management Extensions (JMX) to gather Kafka metrics. JMX metrics are always available when Kafka is running.
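As a rough illustration, the following Java sketch reads two broker metrics over JMX with the standard `javax.management` API. It assumes that the broker's JMX endpoint accepts unauthenticated RMI connections at the conventional `service:jmx:rmi:///jndi/rmi://HOST:PORT/jmxrmi` URL on the default port, 9999, and that the standard Apache Kafka MBean names (such as `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`) are exposed unchanged. The class name `KafkaJmxReader` and the `localhost` host are hypothetical; adjust them, and add credentials or TLS settings, for your environment.

```java
import java.io.IOException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class KafkaJmxReader {

    // Builds the conventional RMI connector URL (assumes the default "jmxrmi" binding name).
    static JMXServiceURL serviceUrl(String host, int port) throws IOException {
        return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    // Reads one attribute of one MBean through any MBeanServerConnection, remote or local.
    static Object read(MBeanServerConnection conn, String objectName, String attribute) throws Exception {
        return conn.getAttribute(new ObjectName(objectName), attribute);
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";                 // hypothetical broker host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9999;          // default JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Apache Kafka gauge (assumed name); the current value is in the "Value" attribute.
            Object underReplicated = read(conn,
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions", "Value");
            // Apache Kafka meter (assumed name); also exposes FiveMinuteRate, FifteenMinuteRate, MeanRate.
            Object bytesIn1m = read(conn,
                    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec", "OneMinuteRate");
            System.out.println("Under-replicated partitions: " + underReplicated);
            System.out.println("Incoming bytes/s (1 minute): " + bytesIn1m);
        }
    }
}
```

Because `read` accepts any `MBeanServerConnection`, the same helper works against a remote broker or, for local testing, the JVM's own platform MBean server.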
The default JMX port is 9999. To change it, edit the following entry in the prconfig.xml file:
```xml
<env name="dsm/services/stream/pyJmxPort" value="portNumber" />
```
where portNumber is the custom port number. For more information, see Modifying the prconfig.xml file.
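If you script your monitoring, you might want to pick up the configured port instead of hard-coding it. The following Java sketch parses a prconfig.xml document with the JDK's DOM parser and returns the value of the `dsm/services/stream/pyJmxPort` entry, falling back to 9999 when the entry is absent. The class name `PrconfigJmxPort` and the `<pegarules>` root element in the sample are illustrative assumptions; the code only looks for `env` elements with a matching `name` attribute, wherever they appear in the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PrconfigJmxPort {
    static final String JMX_PORT_SETTING = "dsm/services/stream/pyJmxPort";
    static final int DEFAULT_JMX_PORT = 9999;

    // Returns the pyJmxPort value from a prconfig.xml stream, or 9999 if no matching entry exists.
    static int jmxPort(InputStream prconfig) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        NodeList envs = factory.newDocumentBuilder().parse(prconfig).getElementsByTagName("env");
        for (int i = 0; i < envs.getLength(); i++) {
            Element env = (Element) envs.item(i);
            if (JMX_PORT_SETTING.equals(env.getAttribute("name"))) {
                return Integer.parseInt(env.getAttribute("value").trim());
            }
        }
        return DEFAULT_JMX_PORT;
    }

    // Convenience overload for XML held in a string.
    static int jmxPort(String xml) throws Exception {
        return jmxPort(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<pegarules>"
                + "<env name=\"dsm/services/stream/pyJmxPort\" value=\"9010\" />"
                + "</pegarules>";
        System.out.println(jmxPort(sample));          // 9010
        System.out.println(jmxPort("<pegarules/>"));  // 9999 (default)
    }
}
```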
You can monitor the following metrics:
| Area | Metric | Description |
|---|---|---|
| Disk | Free disk space | Total free storage space, calculated as the sum of the free space on all storage volumes that hold Kafka data. |
| | Disk usage | Total used storage space, calculated as the combined size of all Kafka data directories. |
| Partitions | Total | Number of partitions on this Kafka broker. The number should be similar across all brokers. |
| | Under-replicated | Number of partitions whose number of in-sync replicas is lower than their total number of replicas. An alert is registered if the value is greater than 0. |
| | Offline | Number of partitions that have no active leader and are therefore neither readable nor writable. An alert is registered if the value is greater than 0. |
| | Leaders | Number of partition leaders on this Kafka broker. The number should be similar across all brokers. |
| Incoming byte rate | 1 minute | Incoming byte rate over the last minute. |
| | 5 minute | Incoming byte rate over the last 5 minutes. |
| | 15 minute | Incoming byte rate over the last 15 minutes. |
| | Mean | Mean incoming byte rate since the broker started. |
| Outgoing byte rate | 1 minute | Outgoing byte rate over the last minute. |
| | 5 minute | Outgoing byte rate over the last 5 minutes. |
| | 15 minute | Outgoing byte rate over the last 15 minutes. |
| | Mean | Mean outgoing byte rate since the broker started. |
| Incoming message rate | 1 minute | Incoming message rate over the last minute. |
| | 5 minute | Incoming message rate over the last 5 minutes. |
| | 15 minute | Incoming message rate over the last 15 minutes. |
| | Mean | Mean incoming message rate since the broker started. |
| Processors | Network processors idle time | Average fraction of time the network processors are idle. The value ranges from 0 to 1 and should ideally be greater than 0.3. |
| | Request handler threads idle time | Average fraction of time the request handler threads are idle. The value ranges from 0 to 1 and should ideally be greater than 0.3. |
| Metrics | Replication max lag | Maximum lag, in messages, between the follower replicas and the leader replica. |
| | Is controller | 1 if the broker is the active controller, otherwise 0. The sum across all brokers in the cluster should always be 1, because each cluster has exactly one active controller. |
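The alert conditions in the table can be expressed as simple checks. The following Java sketch (Java 16 or later, for the record type) evaluates a hypothetical per-broker snapshot against them: any under-replicated or offline partitions, idle fractions below 0.3, and an "Is controller" sum other than 1 across the cluster. The `BrokerSnapshot` record, its field names, and the sample values are assumptions for illustration, not part of Pega Platform; in practice the values would be read over JMX.

```java
import java.util.ArrayList;
import java.util.List;

public class KafkaHealthCheck {

    // Hypothetical container for one broker's readings of the metrics in the table.
    record BrokerSnapshot(String broker, long underReplicatedPartitions, long offlinePartitions,
                          double networkProcessorIdle, double requestHandlerIdle, int isController) {}

    // Idle fraction below which processors are considered overloaded, per the table's guidance.
    static final double MIN_IDLE = 0.3;

    // Returns one human-readable finding per violated condition; an empty list means healthy.
    static List<String> check(List<BrokerSnapshot> cluster) {
        List<String> findings = new ArrayList<>();
        int controllers = 0;
        for (BrokerSnapshot b : cluster) {
            if (b.underReplicatedPartitions() > 0) {
                findings.add(b.broker() + ": " + b.underReplicatedPartitions() + " under-replicated partition(s)");
            }
            if (b.offlinePartitions() > 0) {
                findings.add(b.broker() + ": " + b.offlinePartitions() + " offline partition(s)");
            }
            if (b.networkProcessorIdle() < MIN_IDLE) {
                findings.add(b.broker() + ": network processors idle " + b.networkProcessorIdle());
            }
            if (b.requestHandlerIdle() < MIN_IDLE) {
                findings.add(b.broker() + ": request handlers idle " + b.requestHandlerIdle());
            }
            controllers += b.isController();
        }
        // Exactly one broker in the cluster should report itself as the active controller.
        if (controllers != 1) {
            findings.add("cluster: " + controllers + " active controllers (expected 1)");
        }
        return findings;
    }

    public static void main(String[] args) {
        List<BrokerSnapshot> cluster = List.of(
                new BrokerSnapshot("broker-1", 0, 0, 0.85, 0.90, 1),
                new BrokerSnapshot("broker-2", 2, 0, 0.25, 0.70, 0));
        check(cluster).forEach(System.out::println);
    }
}
```

With the sample cluster, `main` prints two findings for `broker-2`: its under-replicated partitions and its low network processor idle fraction. The guidance that partition and leader counts should be similar across brokers is left out, because the table gives no numeric tolerance for it.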
To view the main outline for this article, see Kafka as a streaming service.