Metrics to Monitor in Kafka and Zookeeper using JMX Exporter (2024)

In this article, we will explore the critical metrics essential for monitoring Apache Kafka effectively. Understanding and tracking these key metrics are crucial for ensuring the performance, reliability, and scalability of your Kafka clusters in real-time data processing environments.

What is Apache Kafka?

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It’s like a highly efficient and scalable messaging system that can handle large volumes of data in real-time.

Apache Kafka Architecture

[Figure: Apache Kafka architecture diagram]

Let’s break down the components and their interaction using Zomato, a food delivery app, as an example:

Producers:

  • In Kafka, producers are processes or applications that publish streams of data (records) to Kafka topics.
  • In Zomato’s case, various services could act as producers:
    • An order placement service might publish a stream of records whenever a new order is created. This record could include details like customer ID, restaurant ID, and order items.
    • A real-time location service might publish updates on the location of delivery personnel.

Brokers:

  • Kafka brokers are servers that store the published streams of records. They act as the central nervous system of the Kafka architecture.
  • Zomato would likely run a cluster of Kafka brokers to handle the high volume of data generated by its various services.

Topics:

  • Topics are categories or feeds in Kafka where related records are grouped. A topic can have multiple partitions (shards) for scalability.
  • Zomato could have topics for different purposes:
    • A topic named “order_events” might hold all the order placement records.
    • Another topic named “delivery_updates” might hold location updates for delivery personnel.

Consumers:

  • Consumers are processes or applications that subscribe to topics of interest and consume the published streams of records.
  • In Zomato’s scenario, various consumer applications might be subscribed to relevant topics:
    • A service managing order deliveries might subscribe to the “order_events” topic to receive notifications about new orders and assign them to delivery personnel.
    • A real-time tracking dashboard might subscribe to the “delivery_updates” topic to display the live location of delivery personnel.

Zookeeper:

  • While not explicitly shown in the image, Kafka often uses Zookeeper, a distributed coordination service, for tasks like leader election (choosing which replica of a partition handles reads/writes) and maintaining cluster configuration.
  • In Zomato’s case, Zookeeper would ensure coordination among the Kafka brokers in the cluster.
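To make the topic/partition mechanics above concrete, here is a minimal sketch (plain Python, no Kafka client required) of how keyed records map to partitions. Kafka's default partitioner actually hashes the key bytes with murmur2; the hash below is a dependency-free stand-in, and the topic and key names are invented.

```python
# Illustrative sketch: how keyed records map to partitions.
# Kafka's default partitioner uses murmur2 on the key bytes; hashlib.md5
# is used here only to keep the example reproducible and self-contained.
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one order land in the same partition, preserving their order.
orders = ["order-1001", "order-1002", "order-1001", "order-1003"]
placements = {key: partition_for(key) for key in orders}
print(placements)
```

Because the mapping is deterministic, every record keyed by "order-1001" lands in the same partition, which is what lets a consumer see one order's events in sequence.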

Important Metrics to Monitor in Kafka

A few metrics are especially important to watch:

  • Number of active controllers: should always be 1
    Metric: kafka_controller_kafkacontroller_activecontrollercount
  • Number of under-replicated partitions: should always be 0
    Metric: kafka_cluster_partition_underreplicated
  • Number of offline partitions: should always be 0
    Metric: kafka_controller_kafkacontroller_offlinepartitionscount
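If these metrics are scraped into Prometheus via the jmx_exporter (as set up later in this article), the three invariants above translate directly into alerting rules. The rule group below is an illustrative sketch; the alert names, `for:` durations, and severity labels are our own choices, not part of any standard.

```yaml
groups:
  - name: kafka-core-health
    rules:
      # Exactly one broker must act as controller at any time.
      - alert: KafkaNoSingleActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 5m
        labels:
          severity: critical
      # Any under-replicated partition means reduced redundancy.
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_cluster_partition_underreplicated) > 0
        for: 10m
        labels:
          severity: warning
      # Offline partitions are neither readable nor writable.
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
```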

Apache Kafka Metrics

Kafka metrics can be broken down into five categories:

  1. Kafka server (broker) metrics
  2. Kafka Producer metrics
  3. Kafka Consumer metrics
  4. Zookeeper metrics
  5. JVM Metrics

1. Broker Metrics

Monitoring and alerting on issues as they emerge in your broker cluster is critical since all messages must pass through a Kafka broker to be consumed.

Key Broker Metrics:

  • Topic Activity: Track the volume of messages being produced and consumed across different topics. This helps identify popular topics, potential bottlenecks, and overall cluster load.
  • Broker Performance: Monitor key broker metrics like CPU, memory usage, and network I/O. This allows you to identify overloaded brokers and potential resource constraints.
  • Replication: Ensure data integrity and redundancy by monitoring replication metrics. These metrics track the flow of data copies between replicas and identify any replication lags or failures.
  • Consumer Groups: Gain insights into consumer group behavior. Monitor metrics like consumer offsets and lag to ensure consumers are actively processing messages and identify any lagging consumers.
  • Errors: Quickly identify and troubleshoot issues by monitoring error metrics. These metrics track errors like produce request failures, fetch request failures, and invalid message formats.
  • UnderReplicatedPartitions: The number of under-replicated partitions across all topics on the broker. Under-replicated partitions are a leading indicator of one or more brokers being unavailable.
  • IsrShrinksPerSec / IsrExpandsPerSec: If a broker goes down, the in-sync replica sets (ISRs) for some of its partitions shrink. When that broker comes back up, the ISRs expand again once the replicas are fully caught up.
  • ActiveControllerCount: Indicates whether the broker is the active controller; the sum across the cluster should always equal 1, since only one broker acts as the controller at any given time.
  • OfflinePartitionsCount: The number of partitions that don't have an active leader and are therefore neither writable nor readable. A non-zero value indicates that brokers are unavailable.
  • LeaderElectionRateAndTimeMs: A partition leader election happens when ZooKeeper loses its connection to the leader. This metric may indicate that a broker is unavailable.
  • UncleanLeaderElectionsPerSec: If the broker leading a partition becomes unavailable, a new leader may be chosen from the out-of-sync replicas. This metric can indicate potential message loss.
  • TotalTimeMs: The total time taken to process a request.
  • PurgatorySize: The number of requests waiting in purgatory. Can help identify the main causes of delay.
  • BytesInPerSec / BytesOutPerSec: The rate of data that brokers receive from producers and that consumers read from brokers. This is an indicator of the overall throughput or workload of the Kafka cluster.
  • RequestsPerSecond: The frequency of requests from producers, consumers, and follower brokers.
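As a rough illustration of how the "always 1 / always 0" checks from the table above can be evaluated, the sketch below parses a hand-written Prometheus exposition snippet (the kind jmx_exporter serves on its /metrics endpoint) and applies those rules. The sample values are invented; a real scrape comes from the exporter's HTTP port.

```python
# Sketch: evaluating broker-health metrics from a jmx_exporter scrape.
# The payload below is hand-written for illustration.
SCRAPE = """\
kafka_controller_kafkacontroller_activecontrollercount 1.0
kafka_controller_kafkacontroller_offlinepartitionscount 0.0
kafka_cluster_partition_underreplicated{partition="0",topic="order_events"} 0.0
kafka_cluster_partition_underreplicated{partition="1",topic="order_events"} 2.0
"""

def parse_metrics(text):
    """Return a list of (metric_name, value) pairs from exposition text."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        name = name_part.split("{", 1)[0]
        samples.append((name, float(value)))
    return samples

def health_checks(samples):
    """Sum each metric across its label sets and apply the invariants."""
    totals = {}
    for name, value in samples:
        totals[name] = totals.get(name, 0.0) + value
    return {
        "one_active_controller": totals.get(
            "kafka_controller_kafkacontroller_activecontrollercount") == 1.0,
        "no_offline_partitions": totals.get(
            "kafka_controller_kafkacontroller_offlinepartitionscount", 0.0) == 0.0,
        "no_underreplicated": totals.get(
            "kafka_cluster_partition_underreplicated", 0.0) == 0.0,
    }

print(health_checks(parse_metrics(SCRAPE)))
```

In this sample, one partition reports 2 under-replicated replicas, so the under-replication check fails while the other two pass.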

2. Producer Metrics

Producer metrics provide valuable insights into the behavior and performance of applications sending messages to your Kafka cluster.

Key Producer Metrics:

  • Message Production Rate: The number of messages produced per second by the producer application. This helps gauge the overall message volume being sent to Kafka.
  • Batch Size: The average size of message batches sent by the producer. Larger batches can improve throughput, but finding the optimal size depends on factors like topic replication and network latency.
  • Delivery Rate: The rate at which messages are successfully delivered to Kafka brokers. This metric helps identify any bottlenecks or delays in the message production pipeline.
  • Latency: The time it takes for a message to be sent from the producer to the Kafka broker. Analyzing latency can reveal potential issues like network congestion or overloaded brokers.
  • Producer Errors: Track errors encountered by the producer, such as produce request failures or serialization errors. Identifying these errors can help diagnose and fix issues with the producer application.
  • compression-rate-avg: The average compression rate of sent batches.
  • response-rate: The average number of responses received per second, per producer.
  • request-rate: The average number of requests sent per second, per producer.
  • request-latency-avg: The average request latency in milliseconds.
  • outgoing-byte-rate: The average number of outgoing bytes per second.
  • io-wait-time-ns-avg: The average length of time the I/O thread spent waiting for a socket, in nanoseconds.
  • batch-size-avg: The average number of bytes sent per partition per request.
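To see how windowed averages like these are derived, here is a small sketch that recomputes request-rate, request-latency-avg, and outgoing-byte-rate from a list of (size, latency) samples. The window length and sample values are invented; the real producer maintains these over sliding time windows rather than a fixed batch.

```python
# Sketch: deriving producer averages from per-request samples.
# All numbers below are invented for illustration.
WINDOW_SECONDS = 30.0

# (request_size_bytes, latency_ms) for requests completed in the window
requests = [(4096, 12.0), (8192, 15.5), (2048, 9.5), (4096, 11.0)]

request_rate = len(requests) / WINDOW_SECONDS                    # requests/sec
request_latency_avg = sum(l for _, l in requests) / len(requests)  # ms
outgoing_byte_rate = sum(s for s, _ in requests) / WINDOW_SECONDS  # bytes/sec

print(f"request-rate:        {request_rate:.2f} req/s")
print(f"request-latency-avg: {request_latency_avg:.2f} ms")
print(f"outgoing-byte-rate:  {outgoing_byte_rate:.1f} B/s")
```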

3. Consumer Metrics

Consumer metrics are crucial for understanding how efficiently your applications are processing messages from Kafka topics.

Consumer metrics offer a window into various aspects of your Kafka consumers, including:

  • Consumption Rate: Track the number of messages a consumer is processing per second. This helps gauge overall processing efficiency and identify consumers that might be falling behind.
  • Fetch Behavior: Monitor metrics like fetch size and frequency to understand how consumers are requesting data from brokers. This can reveal potential inefficiencies in data fetching strategies.
  • Offsets: Track consumer offsets to determine their progress within a topic partition. Offsets indicate the last message a consumer has successfully processed. Lagging offsets could signal slow processing or consumer failures.
  • Commit Intervals: Monitor how often consumers commit their offsets to Kafka. Frequent commits ensure timely processing updates but can introduce additional overhead. Conversely, infrequent commits might lead to data loss during consumer failures.
  • Errors: Identify and diagnose issues related to message consumption. Consumer error metrics might reveal problems like invalid messages, network errors, or timeouts.
  • records-lag: The number of messages the consumer is behind the producer on a partition.
  • records-lag-max: The maximum record lag. An increasing value means the consumer is not keeping up with the producers.
  • bytes-consumed-rate: The average bytes consumed per second, per consumer, for a specific topic or across all topics.
  • records-consumed-rate: The average number of records consumed per second for a specific topic or across all topics.
  • fetch-rate: The number of fetch requests per second from the consumer.
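Consumer lag is simple arithmetic over offsets, which the sketch below makes explicit: records-lag per partition is the broker's log-end offset minus the group's committed offset, and records-lag-max is the worst of those. All offset values here are invented.

```python
# Sketch: computing records-lag and records-lag-max per partition.
# In practice, log-end offsets come from the brokers and committed
# offsets from the consumer group.
log_end_offsets = {
    ("order_events", 0): 1500,
    ("order_events", 1): 980,
    ("order_events", 2): 2100,
}
committed = {
    ("order_events", 0): 1500,
    ("order_events", 1): 950,
    ("order_events", 2): 1800,
}

records_lag = {tp: log_end_offsets[tp] - committed[tp] for tp in log_end_offsets}
records_lag_max = max(records_lag.values())

print(records_lag)      # lag per (topic, partition)
print(records_lag_max)  # worst partition is this many records behind
```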

4. ZooKeeper Metrics

ZooKeeper, the crucial distributed coordination service for many Kafka deployments, also offers a rich set of metrics to monitor its health and performance.

Categories of ZooKeeper metrics:

  • Cluster State: Monitor metrics like the number of active servers, followers, and observers in your ZooKeeper ensemble. This ensures quorum health and identifies potential issues like server outages or connectivity problems.
  • Request Processing: Track metrics like the number of requests per second (reads, writes), request latencies, and failed requests. This helps identify overloaded servers or potential bottlenecks within ZooKeeper.
  • Watcher Performance: Watchers are a core ZooKeeper feature for notifications on data changes. Monitor metrics like the number of watchers and average watch event latency to ensure efficient change notification mechanisms.
  • Synchronization: ZooKeeper uses synchronization primitives like locks. Track metrics like lock acquisition times and contention rates to identify potential synchronization bottlenecks in your applications.
  • outstanding-requests: The number of requests queued on the server.
  • avg-latency: The average response time to a client request, in milliseconds.
  • num-alive-connections: The number of clients connected to ZooKeeper.
  • followers: The number of active followers.
  • pending-syncs: The number of pending syncs with followers.
  • open-file-descriptor-count: The number of file descriptors in use.
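The same ZooKeeper counters can also be read without JMX through ZooKeeper's `mntr` four-letter command (for example, `echo mntr | nc <zk-host> 2181`). The sketch below parses a hand-written sample of that output; the values are invented.

```python
# Sketch: parsing ZooKeeper 'mntr' output into a metrics dict.
# Sample output is hand-written; real output comes from the client port.
MNTR_OUTPUT = """\
zk_avg_latency 1
zk_outstanding_requests 0
zk_num_alive_connections 5
zk_followers 2
zk_pending_syncs 0
zk_open_file_descriptor_count 46
"""

metrics = {}
for line in MNTR_OUTPUT.splitlines():
    key, value = line.split()
    metrics[key] = int(value)

# A healthy three-node ensemble shows 2 followers and an empty queue.
print(metrics["zk_followers"], metrics["zk_outstanding_requests"])
```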

5. JVM Metrics

While Kafka itself provides valuable metrics, the underlying JVM (Java Virtual Machine) offers another crucial layer of monitoring for your Kafka deployment. JVM metrics expose insights into the health and performance of the Java environment running your Kafka.

  • Memory Usage: Track metrics like heap memory usage, non-heap memory usage, and garbage collection activity. This helps ensure sufficient memory allocation and identify potential memory leaks or excessive garbage collection overhead impacting Kafka’s performance.
  • Threading: Monitor metrics like thread count, CPU usage by threads, and thread pool utilization. This helps identify potential thread starvation or overloaded thread pools, ensuring efficient resource allocation for Kafka tasks.
  • Class Loading: Track metrics like the number of loaded classes and class loading times. This helps identify issues with classpath configuration or excessive class loading impacting application startup times.
  • File Descriptors: Monitor the number of open file descriptors to identify potential resource exhaustion and ensure proper file descriptor management within the Kafka brokers.

JVM garbage collector metrics

  • CollectionCount: The total number of young- or old-generation garbage collections performed by the JVM.
  • CollectionTime: The total amount of time, in milliseconds, that the JVM spent executing young- or old-generation garbage collections.
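Because CollectionCount and CollectionTime are cumulative counters, they are most useful as deltas between scrapes. The sketch below turns two consecutive (invented) readings into a GC-overhead percentage; the 60-second scrape interval is an assumption.

```python
# Sketch: deriving GC overhead from two scrapes of cumulative counters.
# All values are invented for illustration.
SCRAPE_INTERVAL_MS = 60_000.0

prev = {"CollectionCount": 120, "CollectionTime": 4_500}  # cumulative ms
curr = {"CollectionCount": 126, "CollectionTime": 4_830}

collections = curr["CollectionCount"] - prev["CollectionCount"]
gc_time_ms = curr["CollectionTime"] - prev["CollectionTime"]
gc_overhead_pct = 100.0 * gc_time_ms / SCRAPE_INTERVAL_MS

print(f"{collections} collections, {gc_time_ms} ms in GC "
      f"({gc_overhead_pct:.2f}% of the interval)")
```

A sustained overhead of a few percent or more is usually worth investigating, since time spent in GC is time the broker is not serving requests.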

Host metrics

  • Page cache reads ratio: The ratio of reads served from the page cache to reads from disk.
  • Disk usage: The amount of used and available disk space.
  • CPU usage: The CPU is rarely the source of performance issues, but spikes in CPU usage should be investigated.
  • Network bytes sent/received: The amount of incoming and outgoing network traffic.

The official Prometheus jmx_exporter GitHub repository provides sample configuration files for Kafka. For this setup, we'll use the kafka-2_0_0.yml sample configuration.

lowercaseOutputName: true

rules:
  # Special cases and very specific rules
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      topic: "$4"
      partition: "$5"
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      broker: "$4:$5"
  - pattern: kafka.coordinator.(\w+)<type=(.+), name=(.+)><>Value
    name: kafka_coordinator_$1_$2_$3
    type: GAUGE

  # Generic per-second counters with 0-2 key/value pairs
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER

  # Quota specific rules
  - pattern: kafka.server<type=(.+), user=(.+), client-id=(.+)><>([a-z-]+)
    name: kafka_server_quota_$4
    type: GAUGE
    labels:
      resource: "$1"
      user: "$2"
      clientId: "$3"
  - pattern: kafka.server<type=(.+), client-id=(.+)><>([a-z-]+)
    name: kafka_server_quota_$3
    type: GAUGE
    labels:
      resource: "$1"
      clientId: "$2"
  - pattern: kafka.server<type=(.+), user=(.+)><>([a-z-]+)
    name: kafka_server_quota_$3
    type: GAUGE
    labels:
      resource: "$1"
      user: "$2"

  # Generic gauges with 0-2 key/value pairs
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE

  # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's.
  #
  # Note that these are missing the '_sum' metric!
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*), (.+)=(.+)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      "$6": "$7"
      quantile: "0.$8"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      quantile: "0.$6"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      quantile: "0.$4"

  # Generic gauges for MeanRate Percent
  # Ex) kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>MeanRate
  - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>MeanRate
    name: kafka_$1_$2_$3_percent
    type: GAUGE
  - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>Value
    name: kafka_$1_$2_$3_percent
    type: GAUGE
  - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*, (.+)=(.+)><>Value
    name: kafka_$1_$2_$3_percent
    type: GAUGE
    labels:
      "$4": "$5"
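To see what these rules actually do, the sketch below mimics the simplest generic gauge rule together with lowercaseOutputName: true, rewriting one MBean reading into a Prometheus metric name. It is a toy: the real exporter also handles rule ordering, label extraction, and name sanitization.

```python
# Sketch: how a jmx_exporter rule rewrites an MBean reading into a
# Prometheus metric name. Mimics the generic gauge rule
# kafka.(\w+)<type=(.+), name=(.+)><>Value with lowercaseOutputName: true.
import re

RULE = re.compile(r"kafka\.(\w+)<type=(.+), name=(.+)><>Value")

def rewrite(mbean_reading):
    """Return the Prometheus metric name for a matching reading, else None."""
    m = RULE.fullmatch(mbean_reading)
    if not m:
        return None
    return f"kafka_{m.group(1)}_{m.group(2)}_{m.group(3)}".lower()

print(rewrite("kafka.server<type=ReplicaManager, name=LeaderCount><>Value"))
# -> kafka_server_replicamanager_leadercount
```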

Conclusion:

Monitoring Apache Kafka means tracking essential metrics across brokers, producers, consumers, and ZooKeeper to ensure optimal performance and reliability in real-time data processing environments. By focusing on these key metrics, organizations can proactively manage their Kafka clusters and maintain high availability for their streaming applications.

Reference:

For reference, visit the official website.

For any queries, contact us at Fosstechnix.com.

Related Articles:

Install Apache Kafka and Zookeeper on Ubuntu 24.04 LTS


FAQs

How do you prepare to monitor Kafka with JMX metrics?

Configure JMX:
  1. server.hostname specifies the host a JMX client connects to. The default is localhost (127.0.0.1).
  2. jmxremote=true enables the JMX remote agent and enables the connector to listen through a specific port.
  3. jmxremote.authenticate=false indicates that authentication is off by default.

What are the metrics of Kafka monitoring?

Summary of key Kafka monitoring concepts

Key metrics include message throughput, broker resource utilization, consumer lag, and latency. Collecting and analyzing metrics is essential for identifying and troubleshooting issues, optimizing performance, and meeting SLOs and SLAs.

What is the difference between Kafka exporter and JMX exporter?

If you are unfamiliar with them, JMX Exporter gives you the metrics of each individual broker, such as memory, GC, and Kafka-specific metrics (kafkajmx.* in Wavefront), while Kafka Exporter gives you metrics on the overall state of the cluster, such as the offsets of partitions (kafka.* in Wavefront).

Which metric should you monitor on a Kafka producer to determine how many acknowledgments it is receiving per second?

Metric to watch: Response rate

For producers, the response rate represents the rate of responses received from brokers. Brokers respond to producers when the data has been received.

Which metrics can you monitor with a JMX extension?

JMX metrics are available for all Java-based processes monitored by OneAgent. Once your extension is uploaded, Dynatrace automatically begins querying the defined metrics for all Java processes. To find the metrics, go to a relevant process page and click Further details.

How to check JMX metrics?

Open the JMX panel to view the metrics.
  1. Click Connect in the New Connection dialog. The JMX panel opens.
  2. Open the MBeans tab and expand com.genesyslab.gemc.metrics. All of the Web Engagement metrics are there.
  3. To refresh the metrics, click Refresh.

What are the key performance indicators of Kafka?

The four main metric categories Kafka provides are Kafka server (broker) metrics, producer metrics, consumer metrics, and ZooKeeper metrics. These metrics help you monitor Kafka and resolve issues before they become more serious.

What are metrics in monitoring?

An effective monitoring system collects data, aggregates it, stores it, visualizes metrics, and alerts you about any problems in your systems. Metrics are the basic values used to understand historical trends, compare various factors, identify patterns and anomalies, and find errors and problems.

What are the best monitoring tools for Apache Kafka?

Summary of popular Kafka monitoring tools:
  • Prometheus with Kafka Exporter: Excellent metric visualization and querying capabilities.
  • Burrow: Specializes in monitoring Kafka consumer lag.
  • Confluent Control Center: Comprehensive cluster management and monitoring.

How does JMX exporter work?

JMX Exporter uses Java's JMX mechanism to read the monitoring data of the JVM runtime, and then converts it into a metrics format that can be recognized by Prometheus, so that Prometheus can monitor and collect it. The parameters are specified when the JVM starts, and the RMI interface of JMX is exposed.
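In practice, that means passing the agent on the broker's JVM command line. A typical invocation looks like the following configuration fragment; the jar path, agent version, and port 7071 are illustrative placeholders, not fixed values.

```shell
# Attach jmx_exporter as a Java agent when starting a Kafka broker.
# <version> and the paths/ports are placeholders for your own setup.
export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-<version>.jar=7071:/opt/jmx_exporter/kafka-2_0_0.yml"
bin/kafka-server-start.sh config/server.properties

# Metrics are then served in Prometheus format at:
#   http://<broker-host>:7071/metrics
```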

What is the port of JMX exporter in Kafka?

When the JMX exporter is enabled, the JMX port in the Kafka container is set to 5555, and the jmx-exporter sidecar uses it to collect the metrics and expose them on port 5556. Kafka commands use the same env var (JMX_PORT), so they will try to open the port configured for the server.

What is the use of Kafka exporter?

Kafka Exporter is an open source project to enhance monitoring of Apache Kafka brokers and clients. Kafka Exporter is provided with AMQ Streams for deployment with a Kafka cluster to extract additional metrics data from Kafka brokers related to offsets, consumer groups, consumer lag, and topics.

How do I check Kafka metrics?

To monitor Kafka metrics, use Grafana dashboards. First, choose the type of dashboard that suits you and create it. Then choose a data source, such as Prometheus or Graphite.

How many messages can Kafka handle per second?

Kafka generally has better performance. If you are looking for more throughput, Kafka can go up to around 1,000,000 messages per second, whereas the throughput for RabbitMQ is around 4K-10K messages per second. This is due to the architecture, as Kafka was designed around throughput.

How to do performance testing for Kafka using JMeter?

Below are the steps to set up JMeter and Kafka on Windows:
  1. Step 1: Install JMeter. To set up JMeter on your system, visit the Apache JMeter website to download the latest binary file. ...
  2. Step 2: Install and configure Kafka. ...
  3. Step 3: Create a test plan with JMeter for Kafka testing. ...
  4. Step 4: Run the load test.

How to check if JMX is enabled in Kafka?

The JMX feature is enabled in the connector by default. To disable JMX, set the jmx property to false. Snowpipe supports Kafka connector version 1.6.0 and later.

How to enable JMX in Kafka Connect?

JMX is enabled for Kafka by default. You can set the following JVM environment variables to configure JMX monitoring for your Docker image in a Compose file, a Dockerfile, or from the command line when you run Kafka.
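For Docker-based deployments, this is usually done with environment variables in a Compose file. The fragment below is a sketch; the image tag, hostname, and port values are examples, and the exact variable names can differ between Kafka images (these are the ones used by Confluent's images).

```yaml
# docker-compose fragment (illustrative; image name/version are examples)
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    environment:
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka
      KAFKA_JMX_OPTS: >-
        -Dcom.sun.management.jmxremote
        -Dcom.sun.management.jmxremote.authenticate=false
        -Dcom.sun.management.jmxremote.ssl=false
    ports:
      - "9999:9999"
```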

Author: Greg Kuvalis
