Kafka cost control
How our open-source solution distributes Kafka costs to user applications
Introduction
For many companies, Kafka is a central piece of infrastructure that facilitates data exchange across teams and departments. As Kafka adoption grows within an organization, that organization will sooner or later have to face an important question: who is going to pay for all this?
This question is not unique to Kafka. Other infrastructure components, notably Kubernetes, as well as a myriad of cloud offerings from the big cloud platforms (AWS, Azure, GCP), also need their costs assigned in some way to the teams and applications incurring them.
In the realm of Kubernetes, rich tooling (OpenCost, Kubecost) was developed to enable teams to monitor their costs. Meanwhile, the aforementioned big cloud platforms have been designed from the ground up to offer their users insight into the costs incurred. For example, Google Cloud Platform groups resources (VMs, CloudRun instances, Databases, etc.) into projects. This makes it possible to break down the final cloud bill by project and then forward the project costs to the respective team. In this example, a project is an abstract concept that ties related resources together. “Who should pay for this virtual machine?” is rarely a difficult question. The VM is in project foo, hence the team behind this project should receive the bill for it.
In the world of Kafka, things are a bit trickier. Kafka is a message broker: it provides rich event-streaming capabilities but no cost-distribution capabilities. Kafka has topics and principals but no higher-level abstractions that we could use to group things together. Even Confluent, the leading Kafka cloud platform, offers nothing on top of these primitives that would allow us to answer the question “Who is going to pay for this topic?”. And even if we had an answer, the next question would be: “What are the costs of this topic?” or “What are the costs incurred by this principal?”. As we will see later in this post, answering these questions is not trivial.
Another aspect is that you may want to reward teams that produce and share valuable data. From a technical perspective, sharing data is a cost factor, but from a business perspective, it creates value.
In this post, we present our solution to the problem of distributing Kafka-related costs. We review the concepts, the technical and organizational challenges, and the solutions. Finally, we describe how we successfully implemented our cost-monitoring application for a customer.
Problem
In the remainder of this post, we will look at two approaches to calculating the costs of Kafka infrastructure and distributing them to the teams and applications using it. We differentiate between the “top-down” approach, where the costs that we want to distribute are known beforehand (for example, thanks to a cloud provider’s billing API), and a “bottom-up” approach, where we have only the raw Kafka usage metrics available to us and we need to approximate the real costs from these metrics according to some pricing rules (a situation that is more common in on-prem Kafka installations).
What the two situations have in common is that at the end of the month, you will get an infrastructure bill. We may know what the entire Kafka platform costs, but we don’t know who is using what; we have no granularity. This is where Kafka-Cost-Control comes in.
Solution proposition
Whether we choose the top-down or the bottom-up approach to calculating the costs of Kafka infrastructure, we will need to collect some metrics. These metrics can be as simple as the number of bytes produced, consumed, or stored in a Kafka cluster. In the bottom-up approach, we attach a price tag to these metrics. For example, we know that 1 GB of storage over a month costs about 10 cents. We want to be able to change this price easily, for example via an API, so that we can adjust it at any time.
In the top-down approach, we are not concerned with the actual monetary amounts. We already have the costs in an external system that we can query whenever needed. In this case, we will use the metrics to compute the usage ratio for each team or application. Then, we will distribute the costs according to this ratio. For example, if team A owns a service account that has produced 50% of the data in the cluster, they will have to pay 50% of the bill for the incoming network traffic.
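To make the ratio idea concrete, here is a minimal Java sketch (all names and numbers are hypothetical) of how a known monthly bill could be split proportionally to each team’s share of a usage metric:

import java.util.Map;

public class TopDownSplit {
    public static void main(String[] args) {
        // hypothetical monthly ingress bill obtained from the provider's billing API
        double monthlyIngressBill = 1200.0;
        // hypothetical bytes produced per team, derived from per-principal metrics
        Map<String, Long> bytesProducedByTeam = Map.of(
                "team-a", 500_000_000_000L,
                "team-b", 300_000_000_000L,
                "team-c", 200_000_000_000L);
        long total = bytesProducedByTeam.values().stream().mapToLong(Long::longValue).sum();
        bytesProducedByTeam.forEach((team, bytes) -> System.out.printf(
                "%s pays %.2f (%.1f%% of usage)%n",
                team, monthlyIngressBill * bytes / total, 100.0 * bytes / total));
    }
}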
Luckily for us, the metrics that both approaches depend on are usually attached to a topic or a principal. Both topics and principals are typically owned by a team or some other entity within an organization. This is great news because it means that if we can say who the owner of a topic or a principal is, we can also associate metrics with that owner. We can then attach everything we know about the owner to the metrics and use this information to distribute the costs at the desired granularity.
For this, we designed a simple rule system where you can match a topic, consumer group, or principal using a regex and assign it a context. The context is a map of arbitrary key/value pairs that will later help us group costs; it basically adds dimensions to the metrics. It may contain a team name, a department, a cost center, an environment, or anything else that you wish to “summarize” in the end. These summarized and context-enriched metrics can then be passed downstream to a monitoring system, a billing system, or a data warehouse for further processing.
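To illustrate the rule system, here is a minimal, self-contained sketch (not the actual Kafka-Cost-Control code; the rule shape and the topic name are assumptions for illustration) that matches a topic name against a regex and expands capture-group references in the context values:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextEnricher {

    // a hypothetical context rule: a regex plus a map of context key/value pairs
    record ContextRule(Pattern regex, Map<String, String> context) {}

    public static void main(String[] args) {
        ContextRule rule = new ContextRule(
                Pattern.compile("(\\w+)\\.prod\\.app\\.(\\w+)\\..+"),
                Map.of("application", "$1.app.$2", "stage", "prod"));
        String topic = "acme.prod.app.webshop.orders";
        Matcher matcher = rule.regex().matcher(topic);
        if (matcher.matches()) {
            // replaceAll expands $1, $2, ... in the context values
            rule.context().forEach((key, value) ->
                    System.out.println(key + " = " + matcher.replaceAll(value)));
        }
    }
}

Running this prints application = acme.app.webshop and stage = prod, which are exactly the kinds of dimensions used for grouping later on.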
Technical implementation
We think we have made it clear by now: we love Kafka. So what better way to implement this than with Kafka itself? We push metrics to a topic and then use a Kafka Streams application to calculate the costs (only in the bottom-up approach) and assign contexts. The prices and context definitions are also stored in compacted topics. This way, our infrastructure needs stay really low, and we are eating our own dog food.
But we know that directly interacting with Kafka topics is not always easy (a simple curl is easier than setting up a proper producer, for instance). That’s why we also implemented a simple GraphQL API that allows you to view and edit the prices and context rules. But why stop there? We added a simple UI for the lazy ones.
So, in summary, we need three sources of information:
- one or more topics with metrics
- a topic with prices
- a topic with contexts
We can then join everything in a Kafka Streams application and output the results to another topic. This aggregated topic can then feed a database, create reports, or drive a billing system. Since we cannot guarantee that input metrics arrive regularly or promptly, we aggregate the data per hour, which is granular enough for a cost-control system.
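To make this more tangible, here is a hedged sketch of the join-and-aggregate topology (topic names, serdes, and value types are illustrative assumptions, not the project’s actual ones; records require Java 16+):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.*;

import java.time.Duration;

public class CostControlTopology {

    // joined value: the owner context (reduced here to an application name) plus the metric
    record Enriched(String application, long bytes) {}

    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // compacted topic with contexts, read as a table keyed by topic name
        KTable<String, String> contexts = builder.table("context-data",
                Consumed.with(Serdes.String(), Serdes.String()));

        // raw usage metrics keyed by topic name, value = bytes observed in the sample
        KStream<String, Long> metrics = builder.stream("metrics-raw",
                Consumed.with(Serdes.String(), Serdes.Long()));

        metrics
                .join(contexts, (bytes, application) -> new Enriched(application, bytes))
                // re-key by the context so that usage is aggregated per owner, not per topic
                .map((topic, enriched) -> KeyValue.pair(enriched.application(), enriched.bytes()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                // hourly windows with a grace period tolerate irregular, late-arriving metrics
                .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofHours(1), Duration.ofMinutes(15)))
                .reduce(Long::sum, Materialized.with(Serdes.String(), Serdes.Long()))
                .toStream()
                .map((window, total) -> KeyValue.pair(
                        window.key() + "@" + window.window().startTime(), total))
                .to("aggregated-data", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}

In the bottom-up case, the real application additionally joins the price topic before aggregating; the sketch omits this for brevity.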
In the documentation, we let you pick a database of your liking, but we provide examples for TimescaleDB. Reports can be made with Grafana or by querying the database directly. Here again, you are free to build your own, but we provide example dashboards.
The project
This project is fully open source. You can check it out on GitHub: spoud/kafka-cost-control. Don’t forget to check the documentation, where you can find the installation guide, the API documentation, and the user manual.
We recommend using Kubernetes for the deployment and provide all the necessary deployment files. A demo environment is also available; see the GitHub repository for the links and credentials.
Implementation example
One of our customers was interested in a cost-control and monitoring solution for their Kafka infrastructure. The customer is a large corporation with many subdivisions, departments, and teams that are in the process of adopting Kafka on Confluent Cloud. For this customer, we had previously implemented a central service that teams use to provision Confluent service accounts and Kafka topics, as well as to request ACLs (permissions to read or write) for these topics. This service enforces a naming convention for all topics that, among other things, includes the name of the department that owns the topic and the name of the application that generates the data on it.
An example topic name would be acme.prod.app.webshop.orders. Based on this topic name, we can tell that the topic holds orders of the prod instance of the webshop application. The acme part of the topic name is the name of the department/division that owns the webshop application. This naming convention has proven to be very useful because it allows us to retroactively generate contexts for already existing topics (the customer already had hundreds of topics when we set out to implement the cost-control solution). In this case, a context object could look as follows:
{
  "creation_time": 1713972458205,
  "entityType": {
    "io.spoud.kcc.data.EntityType": "TOPIC"
  },
  "regex": "(\\w+)\\.prod\\.app\\.(\\w+)\\..+",
  "context": {
    "cluster_id": "lkc-XXXXX",
    "application": "$1.app.$2",
    "stage": "prod",
    "pretty_cluster_name": "GCP Zürich Standard"
  }
}
In this example, any topic that matches the regular expression in the regex field will be assigned the context object in the context field. Similar naming conventions are also enforced for service accounts, allowing us to map them to their applications and thus produce a context for each service account.
Some teams at the customer use ksqlDB, which generates internal topics that do not follow the naming convention. To handle such cases, we extended the provisioning service to allow teams to manually specify a context for a topic. In the background, the service simply creates a context object whose regex matches that topic name exactly and whose context field is populated with the user-provided values.
With this customer, we opted not to set any pricing rules but instead use each application’s share of the total bytes stored/produced/consumed to distribute the costs. This corresponds to the top-down approach described earlier. Furthermore, the customer decided to distribute the CKU (Confluent Unit for Kafka) costs based on the number of partitions that each application has created. The exact Kafka costs to be distributed are queried by the customer from the Confluent Cloud billing API. From there, the customer sinks the aggregated and context-enriched metrics into their TimescaleDB instance and uses simple SQL queries to generate reports that show how much each team/application has to pay.
The final report is a simple table ordered by cost that looks like this (names and numbers are made up):
This human-readable overview is then made available to the teams and applications that are interested in a breakdown of their Kafka costs. Another more technical report is forwarded to the customer’s central billing system, which then uses this data to generate the final bill for each team.
In addition to this monthly reporting, different teams within the customer’s organization may want to see in real time how their usage of Kafka stacks up against other teams. To monitor this, we set up a Grafana dashboard that shows (among other things) the fraction of total bytes produced/consumed by each application and billing unit (a single project might have multiple applications under the same billing unit). We could have set up the dashboard to query the TimescaleDB instance directly, but in this case, the customer’s networking setup would have made this difficult. To address such issues, our Kafka Streams application also exposes the most recent aggregated metrics as gauge metrics in Prometheus format.
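Here is a minimal sketch of that idea using Micrometer’s Prometheus registry (the actual project may wire this up differently; metric and tag names are assumptions):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.atomic.AtomicLong;

public class AggregatedMetricsExporter {
    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        AtomicLong bytesConsumed = new AtomicLong();

        // one gauge per application; tags carry the context dimensions
        Gauge.builder("kcc_bytes_consumed", bytesConsumed, AtomicLong::get)
                .tag("application", "webshop")
                .register(registry);

        // updated whenever a new hourly aggregate arrives from the streams app
        bytesConsumed.set(123_456_789L);
        // the text exposition format that Prometheus scrapes
        System.out.println(registry.scrape());
    }
}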
Furthermore, it was very easy to include additional cost centers in the cost-control solution. For example, the customer uses a dedicated Confluent Cloud cluster that is only reachable via Azure Private Link from inside the customer’s network. This service charges based on inbound and outbound network traffic. We have simply added the Private Link costs to the total network transfer costs to be distributed.
Having all this information in one place also enabled us to generate rich reports that allow business stakeholders to identify cost drivers and see who the big infrastructure users are. Below is an example Sankey diagram we generate for the customer based on collected metrics and additional cost-center information (application names and values are randomized for privacy reasons).
Cross-charging
The customer has identified that some teams are producing and storing a lot of data that is then consumed by many other teams. In the initial setup, these costs were not passed on to the consumers. This is not ideal because it does not incentivize teams to share data with each other. To address this, we have implemented a solution on top of the cost-control service that allows teams to identify their data consumers and “invite” them to share the costs of storing the data they consume.
In the world of Confluent Cloud, this is not a trivial issue to solve because there are no metrics that can tell us how many bytes a principal has consumed from a specific topic. A possible solution would be to try to derive this consumer information from Confluent’s Audit Log, which can be configured to log all fetch requests made by a consuming principal against a topic.
We chose instead to use ACLs (Access Control Lists defining who is allowed to access what data in Kafka) to determine who is consuming what, which is a much simpler (and also portable) solution. The idea is that if a service account is allowed to read from a topic, then its associated application must be a consumer of that topic. This solution has the nice side-effect of incentivizing a “least privilege” policy for permissions, which is generally a good idea for security reasons. Teams that do not follow the least privilege policy and request read permissions for topics they don’t need may still be charged for said topics.
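As an illustration, here is a hedged sketch of deriving presumed consumers from ACLs with the Kafka AdminClient (the bootstrap address and the exact filtering are assumptions, not the project's actual implementation):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntryFilter;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclBindingFilter;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePatternFilter;
import org.apache.kafka.common.resource.ResourceType;

import java.util.Properties;

public class ConsumerDiscovery {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // find every principal that is allowed to READ any topic
            AclBindingFilter filter = new AclBindingFilter(
                    new ResourcePatternFilter(ResourceType.TOPIC, null, PatternType.ANY),
                    new AccessControlEntryFilter(null, null, AclOperation.READ, AclPermissionType.ALLOW));
            for (AclBinding binding : admin.describeAcls(filter).values().get()) {
                System.out.printf("%s presumably consumes %s%n",
                        binding.entry().principal(), binding.pattern().name());
            }
        }
    }
}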
But what if we don’t have a naming convention?
In cases where an organization does not have a naming convention or where the entity that should be billed is not part of the topic/principal name, our solution can still be used. Simply create a context object whose regex matches the topic/principal name exactly and assign the desired context to it:
{
  "creation_time": 1713972458205,
  "entityType": {
    "io.spoud.kcc.data.EntityType": "TOPIC"
  },
  "regex": "^eu-purchases$",
  "context": {
    "application": "webshop",
    "department": "retail",
    "market": "europe",
    "pretty_cluster_name": "GCP Zürich Standard"
  }
}
Conclusion
We have presented our solution to the problem of distributing Kafka-related costs. Our easy-to-deploy Kafka Streams application allows organizations to take their very low-level Kafka metrics and turn them into actionable insights. As we have seen in the implementation example, the solution is not opinionated but flexible. We make no assumptions about the contents of the context that is attached to the aggregated metrics. This allows organizations to group and distribute usage at exactly the granularity required, whether hierarchically or along any combination of dimensions: by team, department, geographical region, or any other criterion.
In the end, Kafka cost control is not just about cutting costs. It’s about creating a more transparent, more efficient way of working with Kafka. Users should not be punished for using resources but rewarded when making data available on Kafka. This is a step towards a more data-driven organization.