Ready, Steady, Connect. Help Your Organization to Appreciate Kafka

Published in The Startup

5 min read · Sep 18, 2020

If you want to enable your organization to leverage the full value of event-driven architectures, it is not enough to just add Kafka to the existing enterprise technology mix and wait for people to join the party. Experience shows some preparation is in order.

written by Matthias Rüedlinger

The Chicken and Egg Problem

The value of Kafka for other teams rises with the number of data streams it offers. To get more people to use Kafka, we must simplify both data production and consumption. It’s a chicken and egg problem: people will not use Kafka if there is no data in it, and no data gets published if nobody uses it. The main objective is therefore to convince data producers in the enterprise to publish their data as real-time events in Kafka. One reason these events do not exist yet can be that the producing teams lack Kafka knowledge, or that the systems they run do not provide Kafka connectivity out of the box.

To reach the tipping point where Kafka is used across the enterprise, we need to convince a critical mass of teams to learn and use Kafka. So you need to make it as simple as possible to move event streams between Kafka and external systems such as databases, document stores, S3 or whatever data sources your enterprise uses. We need some kind of training wheels for Kafka, so that teams that are not yet fully Kafka-savvy can learn and gain experience with Kafka and real-time events from these external systems.

Kafka Connect to the Rescue

Kafka Connect, which is part of Apache Kafka, is a framework for connecting Kafka with external systems. It provides connectors that bring data into or out of Kafka in a standardized and reliable way. A connector is essentially a JAR file that defines how to integrate with a particular external system, and it is configured through a REST API provided by Kafka Connect. These connectors standardize how data is produced to and consumed from external systems. Connect can run in standalone or distributed mode. In distributed mode, Kafka Connect stores its metadata (connector configuration, offsets, etc.) in Kafka. Standalone mode is great for trying things out but is not meant for production. So when you consider running Kafka Connect in production, the way to go is distributed mode, which provides scalability and automatic fault tolerance out of the box.
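As a rough illustration of the REST-based configuration described above, the following Python sketch builds the HTTP request that would register a source connector with a Connect worker. It assumes a worker listening on `http://localhost:8083` (the default REST port) and uses the `FileStreamSourceConnector` from Apache Kafka as a stand-in. The connector name, file path and topic are made up for the example, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a Connect worker is reachable here (8083 is the default REST port).
CONNECT_URL = "http://localhost:8083"


def build_create_request(base_url, name, config):
    """Build (but do not send) a POST /connectors request."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical example values; only the config keys are real Connect settings.
request = build_create_request(
    CONNECT_URL,
    name="demo-file-source",
    config={
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/demo-input.txt",
        "topic": "demo-events",
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit it against a running worker:
# urllib.request.urlopen(request)
```

The point of the sketch is that adding a data stream becomes a matter of posting a small piece of configuration rather than writing and deploying a new application.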

A connector is either a sink or a source connector. Sink connectors write data from Kafka to an external system, and source connectors bring data from such systems into Kafka. Kafka Connect also supports different converters, which handle the serialization and deserialization of formats such as JSON Schema, Avro and Protobuf.
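To make the role of a converter more concrete, here is a small Python sketch of what the plain JSON converter that ships with Apache Kafka does conceptually. The property names (`key.converter`, `value.converter`, `value.converter.schemas.enable`) and the `schema`/`payload` envelope come from Kafka Connect. The `serialize_value` function and the sample record are purely illustrative and not part of any Kafka library.

```python
import json

# Converter settings as they would appear in a worker or connector configuration.
converter_config = {
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
}


def serialize_value(schema, value, schemas_enabled):
    """Illustrative stand-in for the JSON converter's serialization step."""
    if schemas_enabled:
        # With schemas enabled, the value is wrapped together with its schema.
        return json.dumps({"schema": schema, "payload": value})
    return json.dumps(value)


# Hypothetical record and its schema, made up for this example.
customer_schema = {
    "type": "struct",
    "optional": False,
    "fields": [
        {"field": "id", "type": "int64", "optional": False},
        {"field": "name", "type": "string", "optional": False},
    ],
}
customer = {"id": 42, "name": "Alice"}

enabled = converter_config["value.converter.schemas.enable"] == "true"
print(serialize_value(customer_schema, customer, enabled))
print(serialize_value(customer_schema, customer, False))
```

Because the converter is a configuration choice, the same connector can write JSON for one team and a schema-based format for another without any code change.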

There is also support for some transformations before the data is written to Kafka or to the external system. These are called Single Message Transforms (SMTs) and, as the name suggests, each one operates on a single message at a time. They are very useful when the source or sink format cannot be modified and you want to add, remove or rename some fields in the message. For complex transformations, like combining or splitting messages, Kafka Connect is not the right tool and you should have a look at Kafka Streams instead.

Many connectors for different systems are already available under commercial or open-source licenses. If you don’t find a connector that suits your needs, you can always write one yourself in Java. The nice thing is that this is not really complicated for people who are used to developing software applications in Java.
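The following Python sketch shows how such a transformation might be declared and what it does to each message. The `transforms.*` keys and the `ReplaceField$Value` class with its `renames` option are real Kafka Connect settings. The alias `renameUserId`, the field names and the `rename_fields` function are assumptions made for this example; the function only mimics the effect and is not how Connect implements it.

```python
# SMT settings as they would be added to a connector configuration.
# "renameUserId" is an arbitrary alias chosen for this example.
smt_config = {
    "transforms": "renameUserId",
    "transforms.renameUserId.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameUserId.renames": "uid:user_id",
}


def parse_renames(spec):
    """Turn 'old:new,old2:new2' into {'old': 'new', 'old2': 'new2'}."""
    return dict(pair.split(":", 1) for pair in spec.split(",") if pair)


def rename_fields(value, renames):
    """Illustrative stand-in for the rename: one message in, one message out."""
    return {renames.get(field, field): content for field, content in value.items()}


renames = parse_renames(smt_config["transforms.renameUserId.renames"])
messages = [{"uid": 1, "action": "login"}, {"uid": 2, "action": "logout"}]

# Each message is transformed independently; nothing is joined or split.
transformed = [rename_fields(message, renames) for message in messages]
print(transformed)
```

Note that the function never sees more than one message, which is exactly why joins, aggregations or splits belong in a stream processing tool rather than in an SMT.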

Self-Service Data Consumption and Production

Over the past years, the IT landscape has moved from monolithic architectures to distributed microservice architectures where teams have full responsibility for their applications. This means “you build it, you run it”, better known as DevOps. Kafka Connect, in contrast, can be seen as a centralized infrastructure component that is shared by multiple teams.

In our case, to enable teams, we came to the conclusion that Kafka Connect should be treated as a microservice that the teams run themselves for a specific purpose. For example, a data warehouse team would run its own Kafka Connect instance to load Kafka events into its staging area. One reason we think teams should run their own Kafka Connect is that it creates clear boundaries of responsibility when alerts fire, deployments fail or errors occur.

But with this approach, you need an infrastructure team that provides the tooling for monitoring and lifecycle management, so that the DevOps teams can easily set up and run their own Kafka Connect instances. The goal must be that a DevOps team can have a production-ready Kafka Connect instance running within hours, with a high level of automation for deploying connectors and upgrading to new Kafka Connect versions.
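One possible building block for such automation is a small deployment script, sketched below in Python. It relies on the fact that `PUT /connectors/{name}/config` creates a connector or updates an existing one, so the same script can run on every deployment. The worker address, the connector name and its settings, and the idea of keeping the desired state in a dictionary (for example loaded from version-controlled files) are assumptions for illustration, not a description of our actual tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the team's own Connect instance is reachable at this address.
CONNECT_URL = "http://localhost:8083"

# Hypothetical desired state, e.g. loaded from files in the team's repository.
desired_connectors = {
    "dwh-file-sink": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "orders",
        "file": "/tmp/orders.txt",
    },
}


def build_upsert_request(base_url, name, config):
    """Build a PUT /connectors/{name}/config request (create or update)."""
    path = f"/connectors/{urllib.parse.quote(name, safe='')}/config"
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def deploy(base_url, connectors, send=urllib.request.urlopen):
    """Apply every connector config; `send` can be replaced for dry runs."""
    return [
        send(build_upsert_request(base_url, name, config))
        for name, config in sorted(connectors.items())
    ]


# Dry run: collect the requests instead of sending them.
planned = deploy(CONNECT_URL, desired_connectors, send=lambda request: request)
for request in planned:
    print(request.get_method(), request.full_url)
```

Calling `deploy(CONNECT_URL, desired_connectors)` without the `send` override would submit the requests to a running worker. Keeping the dry-run path makes it easy to review changes in a pipeline before they are applied.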

Conclusion

Kafka Connect is a great enabler for teams to integrate external systems with Kafka. It allowed us to solve recurring integration problems in a standardized way that is reliable and fault-tolerant. Once our team had some experience with a specific connector, further integrations with the same type of connector were done very quickly.

As a Java team, we also had good experiences writing connectors ourselves. The main reason for doing so was that the system we had to integrate was very specific and there was no existing solution to our problem. The Connect Java API, which is part of Kafka, is straightforward, and it was quite easy to write our own connector and transformations. The tutorials you find online gave us a good start, but I would recommend having a look at the source code of some existing connectors on GitHub to get inspiration on how other connectors or transformations are implemented.

It would be nice to hear what you think about the centralized vs. decentralized approach to running Kafka Connect, or more generally what your experience with Kafka Connect has been.

Reach out to us here in the comments, through www.agoora.com, or through Twitter.

Written by SPOUD