Monitoring the Stream service
Ensure that your Stream service operates without errors by regularly monitoring the status of your Stream nodes, partitions, disk space, CPU usage, and database availability.
Starting in version 8.7, the use of an internal Kafka is deprecated. On-premises systems that have been updated from earlier versions of Pega Platform can continue to use Kafka in embedded mode. However, to ensure future compatibility, do not create any new environments using embedded Kafka. When configuring the Stream service in a new environment, use external Kafka.
When the Stream service is running in external mode, the Stream landing page does not display any node details. Under normal conditions, only the NORMAL status is displayed, which means that Pega Platform can connect to the external Kafka and use it for the Stream service.
You can do a simple, manual status check of the Stream service by opening the Stream landing page from the header of Dev Studio. Verify that:
- The status of each Stream node is NORMAL.
- There are no offline partitions.
- There are no under-replicated partitions.
You might see some under-replicated partitions from time to time; however, the value must stay at 0 most of the time. If you constantly see under-replicated partitions, the Stream service might be undersized. In this case, check the CPU usage of your Stream nodes. If CPU usage averages above 80%, consider scaling up your Stream node cluster.
- There is exactly one node for which the Is controller parameter is set to 1.
If there are no controller nodes, or there is more than one controller node, perform a rolling restart of the Stream nodes. A controller node is responsible for managing the states of partitions and replicas, and for performing administrative tasks such as reassigning partitions in case of a failure. For more information, see Operating the Stream service.
- The total number of partitions does not exceed 1,000.
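The checks above can also be scripted with the CLI tools that ship with Kafka. This is a sketch only: the bootstrap and ZooKeeper addresses are assumptions for a typical setup, and `kafka-topics.sh` and `zookeeper-shell.sh` must be on PATH. Any output from the first two commands indicates a problem.

```shell
# Hypothetical addresses; adjust to your environment.
BOOTSTRAP=localhost:9092
ZOOKEEPER=localhost:2181

if command -v kafka-topics.sh >/dev/null 2>&1; then
  # Lists partitions whose replica set is not fully in sync (should print nothing):
  kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --under-replicated-partitions
  # Lists partitions with no active leader, that is, offline (should print nothing):
  kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --unavailable-partitions
  # Shows which broker is currently the controller (exactly one broker id):
  zookeeper-shell.sh "$ZOOKEEPER" get /controller
  tools_found=1
else
  echo "Kafka CLI tools not on PATH; skipping checks"
  tools_found=0
fi
```

Running these commands on a schedule and alerting on non-empty output from the first two gives a basic automated version of the landing-page check.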
Maximum number of partitions per broker
There is a known limit on the number of partitions per broker. The limit depends on the configuration of the machine that the broker is running on. As a general rule, limit the number of partitions per broker to 1,000. When more than 1,000 partitions per broker are configured, issues with the synchronization of partition replicas might occur and lead to under-replicated partitions. The limit of 1,000 partitions is valid for the internal Kafka in Pega Cloud. For the specific numbers in other environment types, see the following table:
Maximum supported partitions per broker in relation to CPU and memory
| Machine type | Maximum supported partitions per broker |
|---|---|
| CPU (core): 2, Memory (GiB): 2 | 300 |
| CPU (core): 2, Memory (GiB): 8 | 1,000 |
| CPU (core): 4, Memory (GiB): 16 | 1,000 |
| CPU (core): 8, Memory (GiB): 32 | 2,000 |
| CPU (core): 16, Memory (GiB): 64 and up | 4,000 |
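To relate the table to an actual cluster: the number of partition replicas that each broker hosts is roughly total partitions × replication factor ÷ broker count, and that result should stay under the limit for your machine type. A minimal arithmetic sketch with illustrative figures (not from any specific deployment):

```shell
# Rough sizing check: partition replicas hosted per broker.
# The figures below are illustrative assumptions.
PARTITIONS=600   # total partitions across all topics
RF=3             # replication factor
BROKERS=3        # number of Stream nodes
per_broker=$(( PARTITIONS * RF / BROKERS ))
echo "~${per_broker} partition replicas per broker"
```

With these figures each broker hosts about 600 replicas, which is within the 1,000-partition limit for a 2-CPU, 8-GiB machine.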
Monitor Pega Platform instances by associating a health check with every node in a cluster. Take unhealthy nodes out of the service and restart them.
For more information, see Verifying that an instance is running.
The Stream service uses the disk to store data. Monitor disk usage to prevent running out of disk space. If a Stream node runs out of disk space, it stops functioning until space is available again. This situation leads to data loss and overall service instability.
Monitor the available disk space and take action, for example, by increasing disk space, when the available free space drops below 30%.
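As an example, the 30% threshold can be checked with a small script that is run on a schedule. The data directory path is an assumption; substitute the storage location of your Stream node.

```shell
# Warn when free space on the filesystem holding the Kafka data
# directory drops below 30%. DATA_DIR is a placeholder path.
DATA_DIR=${DATA_DIR:-/}
THRESHOLD=30

# df -P prints "Use%" as the 5th field; free% = 100 - used%.
used_pct=$(df -P "$DATA_DIR" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
free_pct=$((100 - used_pct))

if [ "$free_pct" -lt "$THRESHOLD" ]; then
  echo "WARN: only ${free_pct}% free on $DATA_DIR"
else
  echo "OK: ${free_pct}% free on $DATA_DIR"
fi
```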
Monitor the CPU usage on every Stream service node. If the CPU average usage is above 70%, consider provisioning a bigger node.
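Average CPU usage can be sampled directly from `/proc/stat` on Linux. The following sketch averages utilization across all cores over a one-second interval and flags the 70% threshold:

```shell
# Two samples of the aggregate "cpu" line from /proc/stat (Linux-only).
read -r _ u1 n1 s1 i1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 rest < /proc/stat

# user + nice + system ticks spent busy versus idle ticks in the interval.
busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))
idle=$(( i2 - i1 ))
cpu_pct=$(( 100 * busy / (busy + idle) ))
echo "Average CPU usage: ${cpu_pct}%"
if [ "$cpu_pct" -gt 70 ]; then
  echo "Consider provisioning a bigger node."
fi
```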
The Stream service relies on the availability of the Pega Platform database. A slow or unavailable database might have a serious impact on the Stream service.
Monitor queries to the following tables:
If queries take more than 1 second, consider tuning your database.
In the case of planned or unplanned database unavailability, consider restarting your Stream nodes, especially if they appear unhealthy.
You might encounter the following issues while monitoring the Stream service:
- Stream nodes are getting terminated.
- Stream node terminates and fails to start.
- The number of file descriptors in use increases constantly.
Note: This can be a side effect of a partition count mismatch between Data Flow and Kafka. In that case, the Kafka log contains errors similar to the following:
```
[2022-02-16 10:37:14,737] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[?:1.8.0_144]
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) ~[?:1.8.0_144]
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[?:1.8.0_144]
    at kafka.network.Acceptor.accept(SocketServer.scala:414) ~[kafka_2.11-126.96.36.199.jar:?]
    at kafka.network.Acceptor.run(SocketServer.scala:357) [kafka_2.11-188.8.131.52.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
```
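To track file descriptor usage over time, you can compare a process's open descriptors against its per-process limit. A Linux-only sketch; it inspects the current shell for illustration, so in a real deployment substitute the Kafka broker's PID (for example, obtained with `pgrep`):

```shell
# PID of the process to inspect; defaults to the current shell.
PID=${PID:-$$}
fd_count=$(ls "/proc/$PID/fd" | wc -l)
fd_limit=$(ulimit -n)
echo "PID $PID uses $fd_count of $fd_limit file descriptors"
```

Alerting when the count approaches the limit gives you warning before the Too many open files error above occurs.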