Kafka End-to-End Monitoring

SPOUD
Jan 15, 2025


If we had to imagine an organization’s software landscape as a living organism, then surely Apache Kafka would take on the role of its nervous system. Kafka streams information between various applications and services, like nerves relay signals between different body parts. Some services function as “producers”, generating messages on various topics, while others act as “consumers”, eagerly awaiting and processing the messages they need.

If this nervous system is healthy, information flows seamlessly and promptly between producers and consumers. This enables us to build distributed systems that are easily scalable, fault-tolerant, and capable of delivering real-time insights where they are needed. The same would be challenging to achieve with traditional synchronous communication via, for example, HTTP APIs, remote procedure calls, or batch jobs.

On the other hand, if we are heavily invested in this nervous system and it becomes unhealthy, the whole organization will suffer. Therefore, our clients often ask us how to best monitor a Kafka Cluster’s health. This first blog post will look at the most basic monitoring technique we believe everyone should employ: Kafka End-to-End Monitoring.

Kafka e2e Monitoring

What is End-to-End (E2E) Monitoring?

There are many technical details that we could monitor, but what we really want to know is how well our Kafka cluster performs at the one task it was built for: receiving event data from producers and making it available for consumption by consumers. End-to-end monitoring measures exactly that: how well data flows from one end (a producer client) to the other (a consumer client).

In practice, end-to-end monitoring means that you deploy your own dummy producer and have it write data to Kafka. On the other end, you have a dummy consumer consuming this data. Both clients are deployed in the environment where your organization typically deploys its applications (e.g. a Kubernetes cluster). This ensures that the dummy clients will connect to the Kafka Cluster using the same infrastructure supporting real business use cases. If one of the dummy clients detects errors while trying to produce or consume from Kafka, this likely indicates an issue affecting real producers and consumers.

This monitoring approach has the advantage of testing not just your Kafka cluster’s performance but also that of all infrastructure components sitting between the cluster and a producer/consumer. If you monitored just the metrics on the Kafka broker, you would not catch problems caused by a failing proxy or by broken DNS. A synthetic producer/consumer, on the other hand, will experience these problems just like every other producer/consumer.

As a nice side effect, this also validates the whole deployment process and the process of onboarding new applications when you start your Kafka journey, and it gives your operations team a taste of the experience that an application developer gets when using your platform.

e2e Monitoring

This style of monitoring is broad in scope. It is excellent at detecting that something is going wrong, but the same breadth means it cannot tell you where the problem lies or what is causing it. For this reason, we recommend employing E2E monitoring to get an idea of the performance currently provided by the Kafka platform (and the surrounding infrastructure), and then supplementing this approach with a more focused look at Kafka broker metrics (a topic for a future blog post).

E2E Monitoring for Kafka in Practice

Since end-to-end monitoring is a frequent requirement, we (or our clients) have had to implement it ourselves in almost every project that involved setting up Kafka infrastructure. One way to start with E2E monitoring is to write such a dummy client yourself. In our projects, this takes the form of, for example, custom Python scripts that produce and consume messages in an infinite loop.
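To make this concrete, here is a minimal sketch of such a script using the confluent_kafka Python library. It is not a production implementation: the bootstrap address, topic name, and consumer group are placeholders, and error handling is reduced to printing.

import time
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"  # placeholder bootstrap servers
TOPIC = "e2e-monitoring"      # placeholder test topic

producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "e2e-monitor",       # placeholder consumer group
    "auto.offset.reset": "latest",
})
consumer.subscribe([TOPIC])

def on_delivery(err, msg):
    # Producer-side errors (e.g. unreachable brokers) surface here.
    if err is not None:
        print(f"PRODUCE ERROR: {err}")

while True:
    # Embed the send timestamp so the consumer can compute end-to-end latency.
    producer.produce(TOPIC, value=str(time.time()), on_delivery=on_delivery)
    producer.poll(0)  # serve delivery callbacks

    msg = consumer.poll(timeout=1.0)
    if msg is None:
        print("WARNING: nothing consumed in the last second")
    elif msg.error():
        print(f"CONSUME ERROR: {msg.error()}")
    else:
        latency_ms = (time.time() - float(msg.value().decode())) * 1000
        print(f"e2e latency: {latency_ms:.1f} ms")

    time.sleep(1)

Even a script this small already exercises authentication, DNS, and the network path the same way a real application would.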

Another way to accomplish this is to produce and consume using the kcat utility. kcat's logs can be captured and monitored for errors. We can also run kcat and then collect the BytesInPerSec and BytesOutPerSec broker metrics for our test topic; a decrease in either metric may indicate issues producing messages to or consuming them from the Kafka cluster. In Confluent Cloud, a similar effect is achieved more easily using the Metrics API. For example, we can write a bash script that runs kcat every minute to produce a message to Kafka and then query the Metrics API to verify that one message hits our test topic every minute.
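As an illustration (written in Python here so all examples in this post share one language, although a plain bash cron job works just as well), the heartbeat half of that idea could look roughly like this. The broker address and topic are placeholders, and kcat must be installed on the machine.

import subprocess
import time

BROKERS = "localhost:9092"  # placeholder bootstrap servers
TOPIC = "e2e-heartbeat"     # placeholder test topic

while True:
    # Produce one timestamped heartbeat message with kcat (-P = producer mode).
    payload = f"heartbeat {int(time.time())}\n"
    result = subprocess.run(
        ["kcat", "-b", BROKERS, "-t", TOPIC, "-P"],
        input=payload.encode(),
        capture_output=True,
    )
    if result.returncode != 0:
        # kcat reports connection and produce errors on stderr.
        print(f"PRODUCE FAILED: {result.stderr.decode().strip()}")
    time.sleep(60)

The verification half then happens on the other side, for example by checking that one message per minute arrives on the test topic.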

These kinds of solutions are straightforward to implement, but they allow only minimal insight into the experience that the users of our Kafka cluster are getting. Essentially, we are only able to verify whether the cluster is reachable or not. A more sophisticated solution is provided by LinkedIn’s popular Xinfra Monitor (formerly known as Kafka Monitor). Unfortunately, the project has not been maintained for the last few years. This is problematic because Xinfra Monitor depends on ZooKeeper, which has been deprecated in favor of ZooKeeper-less KRaft-mode Kafka clusters. With support for ZooKeeper scheduled for removal in the upcoming Kafka version 4.0, many of our clients are already in the process of migrating away from this technology. That is why we have set out to implement our own reusable solution for E2E monitoring: kafka-synth-client.

Kafka Synth Client is (as the name suggests) a synthetic client for Kafka. It is an application that produces and consumes messages while exposing several metrics that can be scraped by Prometheus. Specifically, the following metrics are available:

  • synth_client_e2e_latency_ms exposes the milliseconds elapsed between the time send was called on the producer and the time the message was received on the consumer's side. The synth client provides 50th, 95th, and 99th percentile values for this latency.
  • synth_client_ack_latency_ms measures the time the Kafka cluster takes to acknowledge a produced message. This is especially interesting when the Kafka producer is configured with acks=all, as this will instruct the Kafka broker to replicate the received message before acknowledging it.
  • synth_client_time_since_last_consumption_seconds displays the seconds elapsed since the last time the synth client consumed a record. Since the synth client produces records once per second, values much greater than one could indicate performance degradation or connectivity issues. This metric is a good candidate for alerting (a minimal sketch of such a check follows after this list).
  • synth_client_producer_error_rate is a re-export of the Kafka producer's record-error-rate metric, which tracks the average per-second number of record sends that resulted in errors. This is another great candidate for alerting.
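To illustrate what such an alert could check, here is a hedged sketch that scrapes a synth client's metrics endpoint and flags stalled consumption. In practice you would typically express this as a Prometheus alerting rule; the endpoint URL and the 60-second threshold below are assumptions for the example, not defaults of kafka-synth-client.

import urllib.request
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://synth-client:8080/metrics"  # assumed scrape endpoint
THRESHOLD_SECONDS = 60                            # arbitrary example threshold

body = urllib.request.urlopen(METRICS_URL).read().decode()

for family in text_string_to_metric_families(body):
    if family.name == "synth_client_time_since_last_consumption_seconds":
        for sample in family.samples:
            if sample.value > THRESHOLD_SECONDS:
                print(f"ALERT: no records consumed for {sample.value:.0f}s "
                      f"(labels: {sample.labels})")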

Given permissions to describe the Kafka cluster, our client will detect the number of running brokers and increase the number of partitions on the test topic to match that number. It will also perform partition re-assignment to ensure each broker is the leader for at least one partition. This ensures that the E2E test interacts with every broker in the cluster. The E2E latency metric described above then breaks latency down by broker, which allows us to detect performance degradation on individual brokers.
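For illustration, the broker-discovery and partition-expansion part of this idea can be done along these lines with the confluent_kafka AdminClient. This is a simplified sketch, not the synth client's actual implementation; the topic name and bootstrap address are placeholders, and the leader re-assignment step is omitted.

from confluent_kafka.admin import AdminClient, NewPartitions

TOPIC = "synth-client-e2e"  # placeholder test topic
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Cluster metadata tells us how many brokers are currently registered.
metadata = admin.list_topics(timeout=10)
broker_count = len(metadata.brokers)
current_partitions = len(metadata.topics[TOPIC].partitions)

# Grow the test topic so every broker can lead at least one partition.
# (Kafka only allows increasing a topic's partition count, never decreasing it.)
if current_partitions < broker_count:
    futures = admin.create_partitions([NewPartitions(TOPIC, broker_count)])
    for topic, future in futures.items():
        future.result()  # raises if the operation failed
        print(f"{topic} now has {broker_count} partitions")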

Suppose your cluster is shared by multiple environments (e.g. on-premise in Zurich and in AWS in the eu-west-1 region). In that case, you may want to deploy a Kafka synth client in each environment and assign each a different rack ID (using the SYNTH_CLIENT_RACK environment variable). This will cause each synth client to report latencies for messages that are produced in one region and consumed in another one:

synth_client_e2e_latency_ms{broker="0",fromRack="onprem-zurich",partition="0",toRack="eu-west-1",quantile="0.5",} 27.445
synth_client_e2e_latency_ms{broker="0",fromRack="eu-west-1",partition="0",toRack="eu-west-1",quantile="0.5",} 11.125

When monitoring the latencies of multiple synth clients on different machines, it is essential to ensure that their clocks are synchronized; otherwise the reported latency values will be inaccurate. For example, if the clock on the client in Zurich is one second behind the clock on the client in AWS, then the latency of a message produced in Zurich and consumed in eu-west-1 will be reported as 1000 milliseconds higher than it actually is. Conversely, the latency of a message produced in eu-west-1 and consumed in Zurich will be reported as 1000 milliseconds lower than it is (potentially even dipping into negative values, which would be quite nonsensical). To mitigate this issue, the synth client regularly synchronizes its clock with an NTP server (by default, time.google.com). The reported latencies should be accurate if all synth clients are configured to synchronize their clocks with the same NTP server.

If you want to try out kafka-synth-client, follow the instructions in our git repository. The application runs in a single container and requires no external dependencies (except, of course, for the Kafka cluster you want to monitor).

Conclusion

In this post, we have looked at the foundational technique for monitoring your Kafka cluster: E2E monitoring. We encourage you to continuously monitor your ability to access and use Kafka from the environments where your applications usually run, as this approach is the fastest way to uncover ongoing incidents. While this method is great for detecting issues, it does not help much when you are looking for their source. In the next blog post, we will look at metrics that should be monitored on the Kafka broker side to detect specific problems.
