Bargunan Somasundaram

Call it big data or the big bang of data – we’re in an era of data explosion. Our daily lives generate an enormous amount of data. Let’s do some simple math. About 12 billion ‘smart’ machines are connected to the Internet. With about 7 billion people on the planet, that’s roughly 1.7 devices per person. The data produced every year is measured in exabytes and is growing exponentially. There has always been a search for infrastructure that can handle this amount of data. One such venture at LinkedIn in 2010 resulted in the creation of Kafka. It was later donated to the Apache Software Foundation and is now known as Apache Kafka.

What’s Kafka?

Apache Kafka is a versatile, distributed, replicated publish-subscribe messaging system. It lets you send messages between processes, applications, and servers.

To understand what a publish-subscribe messaging system is, it helps to first understand how a point-to-point messaging system works. In a point-to-point system, messages are kept in a queue, and although multiple consumers can read from the queue, each message is delivered to at most one of them. Once a message is consumed, it disappears from the queue.

In a publish-subscribe system, messages are persisted in a topic. Unlike in a point-to-point system, consumers can subscribe to one or more topics and consume all messages on those topics. A message remains on the topic after being read, so different consumers can each receive the same message. Kafka is a publish-subscribe messaging system.
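The contrast can be sketched with a toy in-memory model (purely illustrative; the names and structures here are not Kafka’s API):

```python
from collections import deque

# Point-to-point: once any consumer takes a message, it is gone.
queue = deque(["m1", "m2", "m3"])
consumer_a = queue.popleft()  # consumer A receives "m1"; it leaves the queue
consumer_b = queue.popleft()  # consumer B receives "m2", never "m1"

# Publish-subscribe: the topic retains every message; each consumer
# tracks its own read position, so all subscribers see the full stream.
topic = ["m1", "m2", "m3"]
positions = {"A": 0, "B": 0}

def poll(consumer):
    """Return the next message for this consumer and advance its position."""
    msg = topic[positions[consumer]]
    positions[consumer] += 1
    return msg

print(poll("A"), poll("B"))  # both consumers independently read "m1"
```

In the queue model the two consumers split the stream; in the topic model each one reads it in full at its own pace.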

“More than 33% of all Fortune 500 companies use Kafka.”

Apache Kafka is a distributed real-time streaming platform, but in the eyes of a developer it is essentially a distributed, structured commit log.

Why Kafka?

The two major concerns of big data are collecting it and being able to analyze it. A messaging system like Kafka helps overcome both challenges: it lets applications focus on the data without worrying about how to share it. For systems with high throughput, Kafka works much better than traditional messaging systems. It also offers better partitioning, replication, and fault tolerance, which makes it a great fit for systems that process large-scale messages.

Following are the reasons to choose Kafka over any other messaging system:

  • One of the most powerful open-source event streaming platforms available
  • Offers solid horizontal scalability
  • A perfect fit for big data projects involving real-time processing
  • Durably stores data in a distributed commit log, meaning the data is persisted on disk
  • Highly reliable, since it is distributed and replicated
  • Offers excellent parallelism, since topics are partitioned

Core components of Kafka

Kafka’s main architectural components include Producers, Topics, Consumers, Consumer Groups, Clusters, Brokers, Partitions, Replicas, Leaders, and Followers.

  • Records: Data is stored as a key-value pair with a timestamp; this unit is called a Record. Kafka records are immutable.
  • Topic: A topic is a stream of records and can be subscribed to by multiple consumers. It is the highest level of abstraction Kafka provides.
  • Partition (a structured commit log): An ordered, immutable sequence of messages that is continually appended to. A single partition always lives on one broker; it is not divided across brokers, or even across disks, since its log files must reside together in one log directory. Each record in a partition is assigned a sequential id number called the offset, which uniquely identifies it within the partition.
  • Segments: Each partition is sub-divided into segments. Instead of storing all the messages of a partition in a single file, Kafka splits them into chunks called segments. The default value for segment size is a high value (1 GB).
  • Brokers: A Kafka broker (also called a node or server) hosts topics. It receives messages from producers, stores them on disk with a unique per-partition offset, and lets consumers fetch messages by topic, partition, and offset.
  • ZooKeeper: ZooKeeper is a centralized service that maintains naming and configuration data and provides flexible and robust synchronization within distributed systems like Kafka. It is responsible for maintaining the leader-follower relationship across all partitions.
  • Cluster: Multiple Kafka brokers together form a cluster. The brokers can be distributed across data centers and physical locations for redundancy and stability, and they coordinate with one another through ZooKeeper.
  • Replication: All distributed systems must make trade-offs between guaranteeing consistency, availability, and partition tolerance (CAP Theorem). Apache Kafka’s design focuses on maintaining highly available and strongly consistent replicas. Strong consistency means that all replicas are byte-to-byte identical, which simplifies the job of an application developer.
  • Producer: Kafka producers send records (sometimes called messages) to topics. A producer sends each record to a single topic, but because sends are asynchronous, a producer can effectively write to multiple topics at once. Because Kafka is designed for broker scalability and performance, the producer (rather than the broker) is responsible for choosing which partition each record is sent to.
  • Consumer: Kafka consumers read from topics. Since Kafka brokers are stateless, each consumer maintains its own partition offset in order to consume messages from a topic. Consumers can rewind or skip to any point in a partition simply by supplying an offset value.
  • Consumer Group: Multiple consumers interested in the same topic form a consumer group, uniquely identified by a group.id. Each consumer group subscribes to one or more topics and maintains its offsets per topic.
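As a rough illustration of the consumer-group idea, the sketch below divides a topic’s partitions among group members round-robin. This is a hypothetical helper, not Kafka’s API; the real assignors (range, round-robin, sticky, cooperative-sticky) are considerably richer.

```python
def assign_round_robin(partitions, members):
    """Deal partitions out to group members one at a time, like cards."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

# Six partitions, three consumers in the group: each consumer ends up
# owning a disjoint subset, so the group consumes the topic in parallel.
print(assign_round_robin([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
```

The key property is that within a group, every partition is owned by exactly one consumer, which is what allows parallel consumption without duplicate delivery inside the group.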

How does Apache Kafka work?

When applications send data to a Kafka broker (node), the data is stored in a topic, which is a logical grouping of partitions. Partitions are the actual unit of storage. In a multi-node configuration, the data is spread over multiple partitions across different machines. Data sent to the Kafka cluster is durably persisted to a partition. As explained before, a partition is an immutable data structure to which data can only be appended.

The data is sent to a partition based on the following rules:

  1. If the producer specifies a partition number in the message record, the message is persisted to that partition.
  2. If the record has no partition number but has a key, the partition is chosen based on the hash value of the key:

hashCode(key) % noOfPartitions

  3. If neither a key nor a partition number is present, Kafka falls back to a round-robin strategy to choose the partition.
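The three rules can be sketched as follows. This is illustrative only: the real default partitioner uses murmur2 on the key bytes, not Python’s hash(), and its round-robin behaviour is per-batch in recent versions.

```python
import itertools

_round_robin = itertools.count()  # process-wide counter for rule 3

def choose_partition(num_partitions, partition=None, key=None):
    if partition is not None:                   # rule 1: explicit partition wins
        return partition
    if key is not None:                         # rule 2: hash of the key
        return hash(key) % num_partitions
    return next(_round_robin) % num_partitions  # rule 3: round-robin fallback

# An explicit partition is honoured, and the same key always lands on
# the same partition within one process.
assert choose_partition(6, partition=4) == 4
assert choose_partition(6, key="user-42") == choose_partition(6, key="user-42")
```

The second property is the important one: keyed records are guaranteed to preserve per-key ordering because they always hash to the same partition.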

To achieve parallelism, each topic can have multiple partitions. Roughly speaking, more partitions means higher throughput and more opportunities for parallel access.

In a distributed environment, even though a topic has multiple partitions, each partition is tied to a single broker; it is not shared among the nodes.

What if a Kafka node fails and all the partitions tied to that node become unavailable?

To overcome this scenario, Kafka uses replication. A copy of each partition is maintained on a configurable number of brokers (the replication factor). At all times, one broker ‘owns’ a partition and is the node through which applications write to and read from it. This broker is called the partition leader. It replicates the data it receives to other brokers, called followers. The followers store the data as well and stand ready to be elected leader if the leader node dies.

At any point in time, the replicas that are identical to the leader (original) partition are called in-sync replicas. For a producer or consumer to write to or read from a partition, it needs to know the partition’s leader, so this information must be available from somewhere. Kafka stores such metadata in ZooKeeper.
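Leader failover can be sketched as a toy model (illustrative only; real leader election is restricted to the in-sync replica set and is coordinated through the cluster controller and ZooKeeper):

```python
# One partition with three replicas: a leader and two in-sync followers.
replicas = {"broker-1": "leader", "broker-2": "follower", "broker-3": "follower"}

def on_broker_failure(dead):
    """Remove the dead broker; if it was the leader, promote a follower."""
    was_leader = replicas.pop(dead) == "leader"
    if was_leader and replicas:
        new_leader = sorted(replicas)[0]  # pick a surviving in-sync follower
        replicas[new_leader] = "leader"

on_broker_failure("broker-1")
print(replicas)  # broker-2 has been promoted to leader
```

Because followers hold a full copy of the partition, promoting one loses no committed data; clients simply redirect their reads and writes to the new leader.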

Kafka must periodically flush data to disk. And even though a partition is an immutable data structure, its data must be purged once the retention period expires. Since deleting a single huge file is error prone, a partition is further subdivided into segments: instead of storing all of a partition’s messages in one file, Kafka splits them into chunks. Whenever the active segment reaches its maximum size, a new segment is created and becomes the new active segment.
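Segment rolling can be modelled in a few lines (a toy sketch; the real broker rolls on log.segment.bytes, 1 GB by default, and also on time):

```python
class Partition:
    """Toy partition: each segment is named after its first message's offset."""

    def __init__(self, max_segment_msgs=3):
        self.max = max_segment_msgs
        self.segments = {0: []}   # base offset -> list of messages
        self.active = 0           # base offset of the active segment
        self.next_offset = 0      # offset the next message will receive

    def append(self, msg):
        if len(self.segments[self.active]) >= self.max:
            # Roll: the new segment's name is the next offset to be written.
            self.active = self.next_offset
            self.segments[self.active] = []
        self.segments[self.active].append(msg)
        self.next_offset += 1

p = Partition()
for i in range(7):
    p.append(f"m{i}")
print(sorted(p.segments))  # segment base offsets: [0, 3, 6]
```

Retention then becomes cheap: expiring old data is just deleting whole closed segment files, never rewriting the active one.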

Inside the partition’s directory in the Kafka data directory, the segments can be viewed as index and log files.

/opt/Kafka-logs # tree

Why does each segment have .index file accompanied by .log file?

One of the common operations in Kafka is to read the message at a particular offset. For this, if it has to go to the log file to find the offset, it becomes an expensive task especially because the log file can grow to huge sizes (Default – 1G). This is where the .index file becomes useful. Index file stores the offset and physical position of the message in the log file.

The 00000000000000077674 prefix shared by the .log, .index, and .timeindex files is the name of the segment. Each segment file is named after the offset of the first message it contains: 00000000000000077674.log holds the messages starting from offset 77674.

Each entry in the index file consists of just two fields, each 32 bits (4 bytes) long:

  • 4 Bytes: Relative Offset                     
  • 4 Bytes: Physical Position

As explained, the file name represents the base offset. In contrast to the log file, where each message carries its absolute offset, the entries in the index file store offsets relative to the base offset. The second field is the physical position in the log file of the message whose absolute offset is base offset + relative offset.

Content of index file:

  • Offset: 77710, position: 77700
  • Offset: 77723, position: 77713

Content of log file:

offset: 77723 position: 77713 CreateTime: 1568206189326 isvalid: true keysize: -1 valuesize: 10 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] payload: Hello World

offset: 77710 position: 77700 CreateTime: 1567613494343 isvalid: true keysize: -1 valuesize: 18 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] payload: Hello world there

If you need to read the message at offset 77723, you first search for it in the index file and learn that the message is at position 77713. Then you go directly to position 77713 in the log file and start reading. This is efficient because the index file is already sorted, so a binary search quickly finds the correct offset.
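The lookup can be sketched as a binary search over a sparse index; the offsets and positions below reuse the example values from this section.

```python
import bisect

BASE = 77674  # the segment's base offset (taken from its file name)

# Sparse index: (relative offset, physical position), sorted by offset.
index = [(77710 - BASE, 77700), (77723 - BASE, 77713)]

def find_position(target_offset):
    """Return the log-file position to start scanning from for this offset."""
    rel = target_offset - BASE
    # Binary search for the last index entry at or before the target.
    i = bisect.bisect_right(index, (rel, float("inf"))) - 1
    if i < 0:
        return 0          # target precedes the first entry: scan from the start
    return index[i][1]

print(find_position(77723))  # -> 77713
```

Because the index is sparse, the position returned for an offset between two entries is the nearest earlier entry; the broker then scans forward in the log file from there, which stays cheap since consecutive entries are close together.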

From the above explanation, we can derive the time complexity for a few scenarios,

  • Finding a particular partition: O(1), constant time, since the broker knows where the partition for a given topic resides.
  • Finding a segment within a partition: O(log(n)), since each segment’s file name is the offset of its first message, so a binary search over the segment names locates the right segment.
  • Finding a message within a segment: O(log(n)), since the index file lists the exact position in the log file for the messages, sorted in ascending order of offset, so the offset can be found with a binary search.

Using these lookups, Kafka locates the requested message and serves it to consumers.

As Apache Kafka is a pub-sub messaging system, consumers who have subscribed to a topic receive its data. The consumer stores (commits) its offset every time it pulls data from a topic; by supplying that offset on the next pull, it reliably resumes reading from exactly where it left off.
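The commit-and-resume behaviour can be sketched as follows (a toy model, not the consumer API):

```python
topic = [f"record-{i}" for i in range(10)]
committed = 0                 # last committed read position for this consumer

def poll(offset, max_records=3):
    """Fetch up to max_records from the topic, starting at the given offset."""
    return topic[offset:offset + max_records]

batch = poll(committed)       # first poll reads records 0..2
committed += len(batch)       # commit the offset after processing the batch
batch = poll(committed)       # a later poll resumes at offset 3
print(batch[0])               # -> record-3
```

Committing after processing, as above, gives at-least-once delivery: if the consumer crashes between processing and committing, the same batch is re-read on restart rather than lost.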

This is how Apache Kafka works!