How to rule the chaos with Apache Flink?

4 min readMar 4, 2020

Real-time anomaly detection for microservice monitoring with Apache Flink

Good digital customer experience not only needs thoughtful user interfaces, it also requires backend systems that deliver good data on time. In the following blog post we would like to share with you our learnings on building a beautiful monitoring system based on Apache Flink.

When we started to work on the project with our client, a Swiss insurance provider, we faced the complexity of their distributed digital backend. Their system consists of several hundred microservices. A delicate network of processes, that rely on each other. A challenge to unfold and tackle.

First, we needed to figure out how to deal with a diverse set of ever-evolving data producers, with changing schemas and how to map data to a common format. Our contribution to this endeavour was to create a software tool to enable DevOps engineers to evolve the system of several hundred microservices more easily and to find and react to operational challenges on or even ahead of time. We used Apache Flink to provide multiple layers of aggregation up to business-relevant metrics. Apache Flink was chosen because it allows for distributed and parallelized processing and has a very good integration with Apache Kafka.

In addition, we analyze the highly seasonal data with our anomaly detection algorithm MULDER triggered by Apache Flink. The MULDER algorithm, developed by SPOUD, detects anomalies by assigning an anomaly score to each point in time, where the results are easily interpreted by comparing against the typical value of the time series. Also, our algorithm can be used to automatically create alerts with a low false positive rate, and outperforms several popular open source implementations such as Twitter’s Seasonal Hybrid ESD and AWS’ Random Cut Forest algorithms on the Numenta Anomaly Benchmark (NAB).

To get started you are going into your applications architecture and apply distributed tracing. The way you do this is the limitations you will have on analytics. Most important is to measure and order events in time.

Furthermore, very important is to know the causality of your requests or transactions through the forest of services. Which means you should have identifiers which may identify a session (multiple transactions). From there you identify the single transaction or trace in the system. Each action in the trace should reference to its parent action where the trigger was.

In this way you end up with a forest of traces.

Make sure that the ids are added to the log lines because you may want to correlate this later on.

You then may add some identifier to reference the caller and callee application (or producer and consumer), anything that helps you reference other monitoring systems or people / teams responsible for the service. In some cases you may also want to indicate application architecture facets like application hierarchies or different environments. Add this to the tags of your tracing.

The tracing data should then be collected by a central system and put into a queue. This is where Kafka comes into play. With Kafka you are able to serve a queue which handles millions of events a day, an hour maybe a minute.

The next stage is to get rid of as much data as you can. This makes sense if you want to derive high level metrics but not for tracing. If you want to analyze the traces try to store it early in the process and think about retention policies for this data. This can grow extremely fast and for the most use cases it’s not relevant anymore after a few days.

To get the high level metrics think about how to group your data. With Apache Flink you can add event time windows which allow you to consolidate all the data that happened in the same time window into one event. The right grouping will give you some hundred events per minute instead of millions.

Here as well think about your retention strategies and if needed build up larger windows from the small ones. In our case we have chosen to create minute windows in the first stage and then also store hourly aggregations.

Be careful with aggregations of aggregations this may lead to wrong results e.g. averaging averages.

Do you have any questions? Would you be interested in a similar solution for your enterprise or you just simply have a few comments? Please, let us know in the comments or through