In a perfect world, small pieces of software scale up and down with a high-performance message broker like a heart, pumping data in a controlled, efficient manner. However, at some point, in every single system, an application starts pumping out large data requests, in a sporadic, uncontrollable manner. Because here, we are talking about the real world, right? In this article, we will address the problem of adapting a legacy system to play with a stream-based platform.
Enterprises often have several servers, firewalls, databases, mobile devices, API endpoints, and other infrastructure that powers their IT. Because of this, organizations must provide resources to manage logged events across the environment. Logging is a factor in detecting and blocking cyber-attacks, and organizations use log data for auditing during an investigation after an incident. Brokers, such as Apache Kafka, will ingest logging data in real-time, process, store, and route data.
There are two big gaps in the Apache Kafka project when we think of operating a cluster. The first is monitoring the cluster efficiently and the second is managing failures and changes in the cluster. There are no solutions for these inside the Kafka project but there are many good 3rd party tools for both problems. Cruise Control is one of the earliest open source tools to provide a solution for the failure management problem but lately for the monitoring problem as well.
In the first part of this blog, we covered the basics of the Kafka ecosystem and explored the options for exporting Kafka metrics—first using the Jolokia JVM agent and then via the Prometheus JMX agent. Here in this post, we’ll go through some key Kafka metrics that are available on Grafana for building visualizations and alerts. Although Kafka provides hundreds of metrics, as described here , we are going to cover the most important ones to monitor.
Apache Kafka is an open-source platform for distributed data streaming that provides highly reliable and fault-tolerant capabilities to process a large number of events using the publish-subscribe model. Kafka also provides the capability to store and process events per a given use case and requirements, plus it can run as a single node or scale up to a cluster of nodes.
In March 2021, a 200,000 tonne ship got stuck in the Suez Canal, and the global shipping industry suddenly caught the world’s attention. It made us realize ships play an important role in our daily lives. Really important in fact; 90% of the things we consume arrive by ship. Take a look at this map. By visualizing vessel routes over time, the pattern creates a map of the earth. Note the lack of vessels travelling close to the coast of Somalia where piracy is common.
Suppose that you work for the infosec department of a government agency in charge of tax collection. You recently noticed that some tax fraud incident records went missing from a certain Apache Kafka topic. You panic. It is a common requirement for business applications to maintain some form of audit log, i.e. a persistent trail of all the changes to the application’s data. But for Kafka in particular, this can prove challenging.
Suppose that you work for a government tax agency. You recently noticed that some tax fraud incident records have been leaked on the darknet. This information is held in a Kafka Topic. The incident response team wants to know who has accessed this data over the last six months. You panic. It is a common requirement for business applications to maintain some form of audit log, i.e. a persistent trail of all the changes to the application’s data to respond to this kind of situation.
We live in a dynamic world. It is safe to say that companies aim to speed up time-to-market and out-innovate their competition with Kafka, but at the same time struggle with some limitations. These can range from compliance-related setbacks for regulations such as GDPR, CCPA and HIPAA, to self-service slip-ups that could see a whole Kafka cluster going down. Even something as seemingly innocuous as configuring and creating a Kafka Topic can lead to operational U-turns, slowdowns and even downtime.