This is the second blog post in a series about Debezium. In the first one, we explored different aspects of data integrity when reading database changes with Debezium. This blog post focuses on monitoring and ensuring that a pipeline runs smoothly.
We assume that we have a Debezium connector that successfully reads changes from a database and writes them as events to Kafka. Debezium is a complex system, so we need information from its internals to tell whether it’s healthy and performant. Luckily, Kafka Connect and Debezium provide a mountain of metrics that give us insights into the data pipeline.
Debezium internals
Before diving into monitoring, we must understand some basics about Debezium’s architecture. To run a Debezium connector, you first need to install the appropriate Kafka Connect plugin. Debezium offers a plugin archive for each database system containing the main Debezium JAR file and all its dependencies. Among these dependencies, you’ll find a JAR called debezium-core. The main functionality of every Debezium plugin resides in this core library, which means it is identical across plugins and independent of the database system.
On a high level, we can identify three main components in Debezium’s architecture. Kafka Connect creates an instance of a source task class, e.g. PostgresConnectorTask or OracleConnectorTask. During initialization, the source task creates an in-memory queue and an instance of ChangeEventSourceCoordinator.
The most complex part of a Debezium connector is the ChangeEventSourceCoordinator. It holds the connection to the source database, creates snapshots, reads changes, coordinates the snapshot and streaming phases, transforms data into Debezium's internal format, and more. In short, it is responsible for reading data events from the database. Once events have been prepared, they are appended to the queue. The queue is an in-memory buffer between the ChangeEventSourceCoordinator and the source task. Its size is configurable, but once Debezium runs, the size is fixed. The source task (e.g. PostgresConnectorTask or OracleConnectorTask) reads events from the queue and passes them to a Kafka producer, which writes them into a topic. Overall, the data flows from the source database to the ChangeEventSourceCoordinator, through the queue to the source task, from there to the Kafka producer, and finally to Kafka.
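To make this data flow more tangible, here is a minimal, hypothetical Java sketch of the producer/consumer pattern described above. It does not use Debezium's real classes; the class names, the queue capacity, and the event payloads are purely illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified illustration of Debezium's internal queue, not Debezium's actual code.
public class QueueSketch {

    record ChangeEvent(String payload) {}

    public static void main(String[] args) {
        // The capacity is fixed once the queue is created, just like Debezium's internal queue.
        BlockingQueue<ChangeEvent> queue = new ArrayBlockingQueue<>(8192);

        // "Coordinator" side: reads changes from the database and appends them to the queue.
        Thread coordinator = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    queue.put(new ChangeEvent("change-" + i)); // blocks when the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "Source task" side: drains the queue and hands events to the Kafka producer.
        Thread sourceTask = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    ChangeEvent event = queue.take(); // blocks when the queue is empty
                    System.out.println("producing to Kafka: " + event.payload());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        coordinator.start();
        sourceTask.start();
    }
}
```

The interesting part for monitoring is exactly this blocking behaviour: whichever side is slower determines whether the queue tends to be full or empty.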
The deeper you dive into the code, the more database-specific differences you encounter. This blog post aims to provide an overview of monitoring Debezium, so we will not discuss those differences here. Instead, we stick to the high-level view, which provides enough insight to understand a basic set of metrics.
Monitoring
Every Debezium plugin provides a set of metrics around the three components we discussed above. Some metrics give an impression of the whole pipeline, while others concern the state of the ChangeEventSourceCoordinator or the queue. In addition to the standard metric set, a plugin may produce database-specific metrics, like those for Oracle or MariaDB.
The metrics of a Debezium connector can be broken down into four categories:
- Generic Debezium metrics
- Database-specific metrics
- Kafka Connect metrics
- Producer metrics
Every component in the Kafka ecosystem exposes its metrics via JMX, and Debezium does too. Depending on your metrics storage, you’d either configure a JMX scraper to read the metrics directly from Kafka Connect or run a JMX exporter sidecar that publishes the metrics in a Prometheus-friendly format.
We will not cover how to set up a monitoring pipeline or discuss database-specific metrics here, as doing so would go beyond the scope of this blog post.
Key metrics
Monitoring the queue
The queue between the ChangeEventSourceCoordinator and the source task is a good starting point for building a dashboard. From the queue usage over time, we can quickly tell whether the pipeline is healthy. With the metrics QueueTotalCapacity and QueueRemainingCapacity, you can calculate the number of items in the queue (QueueTotalCapacity - QueueRemainingCapacity) or the queue usage as a percentage ((QueueTotalCapacity - QueueRemainingCapacity) / QueueTotalCapacity).
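If you want to verify these values without a full monitoring stack, you can read them straight from JMX. The following sketch assumes a Kafka Connect worker with JMX enabled on port 9010 and a PostgreSQL connector whose server name (topic prefix) is dbserver1; adjust the host, port, connector type, and server name to your environment.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueUsageCheck {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint of the Kafka Connect worker.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = jmx.getMBeanServerConnection();
            // Streaming metrics of a Postgres connector; "dbserver1" stands in for the server name.
            ObjectName streaming = new ObjectName(
                "debezium.postgres:type=connector-metrics,context=streaming,server=dbserver1");
            long total = ((Number) mbeans.getAttribute(streaming, "QueueTotalCapacity")).longValue();
            long remaining = ((Number) mbeans.getAttribute(streaming, "QueueRemainingCapacity")).longValue();
            double usage = (double) (total - remaining) / total;
            System.out.printf("queue items: %d, usage: %.1f%%%n", total - remaining, usage * 100);
        }
    }
}
```

In production, you would typically let a JMX exporter scrape these attributes and plot them in your monitoring system instead of polling them by hand.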
There are a couple of scenarios that you should look out for.
The queue usage is on a constant level, occasionally empty
The queue is constantly used. The ChangeEventSourceCoordinator and the source task are in equilibrium: neither one is over- or underproductive.
The queue is always empty
An empty queue means the ChangeEventSourceCoordinator doesn't receive any changes. There are two possible reasons for that:
- If the database doesn’t receive any data modification requests, this behaviour is to be expected. No changes in the database mean that Debezium won’t generate any change events, so there is no need to react.
- The case where the data is changed but Debezium is unaware of it is more delicate. The most likely reason is a configuration issue. Check the connector’s configuration and the permissions of the user Debezium uses to read changes from the database.
The queue is almost always empty except for some occasional spikes
Depending on the usage of the database, this may be an expected scenario. For example, it’s common to have regular batch operations in banking software. Each operation may trigger significant changes that Debezium will pick up. Depending on the size of such a batch, the number of change events that Debezium sees can go into the millions. Such occasions will temporarily fill the queue. The source task will start removing those events from the queue and producing them to Kafka, but the rate of changes arriving from the database is usually higher than the production rate to Kafka. As long as the task eventually brings the queue usage down to a regular level, this behaviour is normal.
The queue is always full
In this case, the ChangeEventSourceCoordinator produces events faster than the task can consume them. This behaviour shows that there is a bottleneck on the consuming side of the queue. To determine where exactly the bottleneck is, you will need to look into the producer's metrics and (if possible) the Kafka cluster.
The queue is used but primarily empty
The ChangeEventSourceCoordinator produces events into the queue at a lower rate than the task consumes them, so events entering the queue are picked up immediately by the task. There are two possible explanations for this behaviour:
- If the database receives few modifications, Debezium has equally little work to do. The queue is expected to be empty.
- If modifications on the database are executed at a high rate and the queue is still mostly empty, the graph indicates a bottleneck between the database and the queue. You want to look closer at the ChangeEventSourceCoordinator and the pace at which it reads change events. You must determine whether the bottleneck is the database, the ChangeEventSourceCoordinator, or the network in between.
Monitoring the lag
When we operate a Kafka consumer, we want to monitor its lag. The lag indicates how many messages the consumer must process before it catches up with the producer. Debezium is also a consumer, though it doesn’t consume from Kafka but from a database change stream. The metric MilliSecondsBehindSource gives us information similar to the consumer lag.
When a database executes a change, it records it in some kind of buffer, such as PostgreSQL’s write-ahead log or Oracle’s redo log. Each change to the database produces an entry in that buffer, and Debezium reads the buffer to produce its change events. MilliSecondsBehindSource is defined as the number of milliseconds between a change event entering the buffer and Debezium reading it.
MilliSecondsBehindSource will never be zero. Debezium always needs some time to read a change, but as long as the value stays within an acceptable range, everything works fine.
An example:
Let’s reconstruct what happened here. From the queue usage, we see that the queue was initially empty. Suddenly, the queue fills up and stays full for a few minutes. After some time, the queue is emptied again. This behaviour matches what we’ve discussed before: a large number of changes must have been executed on the database in a short amount of time. The graph for the lag matches this assumption. The database executes changes at a rate Debezium cannot follow, so the database’s change buffer grows quickly. The later a change event arrives in the buffer, the longer it waits, and thus the lag graph increases. When Debezium catches up with the buffer, the lag drops into an acceptable range again.
Observing MilliSecondsBehindSource is crucial because it may raise alerts before critical issues appear. It is alarming if the lag increases over a longer period and never decreases. In that case, Debezium reads changes at too slow a rate, and the buffer between the database and Debezium will grow continuously. Not only does this impact time-critical applications, but it could even lead to data loss: at some point, the database will take action to prevent the buffer from growing too big. Oracle, for example, will remove old parts of the buffer. This means the database may throw away changes Debezium hasn't seen yet.
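As a rough sketch of such an alert, the snippet below polls MilliSecondsBehindSource over JMX and warns when it exceeds a threshold. It reuses the assumed JMX endpoint and server name from the queue example above, and the 60-second threshold is an arbitrary placeholder that you should tune to your latency requirements.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LagWatch {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint and server name, as in the queue example above.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        ObjectName streaming = new ObjectName(
            "debezium.postgres:type=connector-metrics,context=streaming,server=dbserver1");
        long thresholdMs = 60_000; // placeholder for an acceptable lag

        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = jmx.getMBeanServerConnection();
            while (true) { // runs until interrupted
                long lag = ((Number) mbeans.getAttribute(streaming, "MilliSecondsBehindSource")).longValue();
                System.out.println("MilliSecondsBehindSource: " + lag);
                if (lag > thresholdMs) {
                    System.err.println("WARN: Debezium is more than " + thresholdMs + " ms behind the source");
                }
                Thread.sleep(10_000); // poll every 10 seconds
            }
        }
    }
}
```

In practice, you would express the same check as an alert rule in your monitoring system rather than running a hand-rolled poller.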
Conclusion
Getting Debezium up and running is not that difficult, but configuring it to keep up with the pace of the database can be challenging. To build a performant data pipeline with Debezium, it’s crucial to set up proper monitoring. Not only can it show whether and where you have bottlenecks, it can also keep you from running into catastrophes like data loss. In this blog post, we’ve looked into two aspects of monitoring a pipeline. By visualizing them, you should get a good first impression of the health of your data pipeline.
Questions? Reach out to us via info@spoud.io or our website: spoud.io.