Apache Kafka is a distributed messaging system that provides fast, highly scalable, durable messaging through a publish-subscribe model. It was originally developed at LinkedIn to meet specific needs that existing messaging products could not satisfy. As a messaging server, Kafka is comparable to popular traditional systems such as ActiveMQ or RabbitMQ, whose main purpose is to decouple processing from data producers and to buffer unprocessed messages.
Part 1 of this post explores what makes Kafka so unique and different from other messaging products. Part 2 considers these characteristics in more detail, and explores Kafka’s role in the IoT (internet of things) world.
A Top-Level View of Kafka
Though Kafka works well as a replacement for traditional message brokers, it’s much more than a messaging server. Kafka works more like a distributed database: when you write a message to Kafka, it’s replicated to multiple servers and committed to disk. In principle, data can be retained indefinitely.
Kafka is designed as a modern distributed system: the cluster is elastically scalable and fault tolerant, and applications can transparently scale out to produce or consume massive distributed streams. This distributed design gives Kafka high availability, resilience to node failures, and support for automatic recovery, making it an ideal fit for communication and integration between components of large-scale data systems.
How Kafka’s Components Work Together
Kafka’s architectural components will look familiar to users of traditional messaging systems:
- Brokers (forming a Kafka cluster)
- Topics (categories or feed names to which messages are published)
- Messages (payload of bytes)
- Producers (that publish messages to a topic)
- Consumers (that subscribe to one or more topics and consume the published messages by pulling data from the brokers)
Yet it’s the way these components work together that makes Kafka so unique. The design is based on the concept of transaction logs (also known as commit logs or write-ahead logs) used by most database systems. Each topic is divided into a set of logs (physically, a set of segment files of equal size) known as partitions. Producers write to the tail of these logs and consumers read the logs at their own pace. Kafka uses ZooKeeper to store configuration, to act as a registry, and to coordinate the nodes in a cluster.
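The log-centric layout described above can be modeled in a few lines. This is a toy sketch of the idea — append-only partition logs with producers writing to the tail and each consumer holding its own read position — not the real Kafka implementation; the `Partition`/`Topic` names and the sample messages are invented for illustration:

```python
class Partition:
    """An append-only log; offsets are just list indices."""
    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message

class Topic:
    """A topic is simply a fixed set of partition logs."""
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

topic = Topic(num_partitions=3)
offset = topic.partitions[0].append(b"sensor-reading-1")  # offset 0
topic.partitions[0].append(b"sensor-reading-2")           # offset 1

# A consumer reads at its own pace, advancing a position it owns:
consumer_offset = 0
while consumer_offset < len(topic.partitions[0].log):
    msg = topic.partitions[0].log[consumer_offset]
    consumer_offset += 1
```

The broker never pushes or deletes on delivery; the consumer simply walks forward through the log.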
Topics are partitioned and replicated across multiple nodes. The number of partitions dictates the maximum parallelism of the message consumers. Each partition can be replicated across a configurable number of servers for fault tolerance; the replication factor controls how many servers will replicate each message that is written. Kafka uses a peer-to-peer configuration of brokers with no single point of failure.
Kafka clients (producers) directly control how a particular piece of data is assigned to a particular partition. The brokers enforce no particular semantics about which messages should go to which partition. Rather, to publish messages the client addresses each message to a specific partition, and when fetching messages it fetches from a specific partition. Thus, if two clients want to use the same partitioning scheme, they must use the same method to compute the mapping of key to partition.
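A common key-to-partition mapping is a deterministic hash of the key modulo the partition count. The sketch below uses CRC-32 purely for illustration (Kafka’s default Java partitioner uses a different hash); the point is that any two producers computing the same function agree on the mapping:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the key, reduced modulo the partition count.
    # Illustrative only — not Kafka's actual hash function.
    return zlib.crc32(key) % num_partitions

# Two producers using the same scheme pick the same partition for a key:
p1 = partition_for(b"device-42", 6)
p2 = partition_for(b"device-42", 6)
assert p1 == p2
```

Keeping all messages for one key in one partition is what makes per-key ordering possible downstream.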
Kafka’s Approach to Messaging
Kafka uses consumer groups (collections of message subscribers/consumers) for message consumption. The partitions of a topic are assigned to the consumers in the group so that each partition is consumed by exactly one consumer in the group. This mechanism parallelizes consumption of messages within a topic; it also means that any consumer instances in a group beyond the number of partitions will sit idle. A consumer can fetch data from a live stream or read back from the committed log. Within a partition, messages are delivered to a consumer in the same order in which they were published.
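The partition-to-consumer assignment can be sketched as a round-robin distribution (Kafka 0.9 ships range and round-robin assignment strategies; this miniature version shows only the round-robin idea, with made-up consumer names):

```python
def assign(partitions, consumers):
    """Distribute partitions round-robin over the consumers in a group.
    Each partition goes to exactly one consumer; surplus consumers
    receive nothing and sit idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

groups = assign(partitions=[0, 1, 2, 3], consumers=["c1", "c2", "c3"])
# groups == {"c1": [0, 3], "c2": [1], "c3": [2]}
```

With four partitions and three consumers, one consumer handles two partitions; a fourth consumer would share the load, and a fifth would be idle.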
The messages in each partition are assigned a sequential ID number called the offset, which uniquely identifies each message within the partition. Published messages are retained in the Kafka cluster for a configurable period of time, and Kafka can save the offset of the last message a consumer read. Because the consumer, not the broker, controls this position, it can consume messages in any order it wants, and re-consume a message if required.
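Consumer-controlled offsets are what make rewinding possible. A minimal sketch, with an invented `SimpleConsumer` class standing in for a real client:

```python
class SimpleConsumer:
    """Toy consumer over one partition log: it owns its offset,
    so it can seek backwards and re-consume retained messages."""
    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        if self.position >= len(self.log):
            return None  # caught up with the tail of the log
        msg = self.log[self.position]
        self.position += 1
        return msg

    def seek(self, offset):
        self.position = offset  # rewind (or skip ahead)

log = [b"m0", b"m1", b"m2"]
c = SimpleConsumer(log)
first = c.poll()          # b"m0"
c.poll(); c.poll()        # read to the end
c.seek(1)                 # rewind to offset 1
again = c.poll()          # b"m1", re-consumed
```

Nothing is deleted on read; replay is just moving the position back within the retention window.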
Consumer rebalancing is triggered whenever broker nodes, or consumers within the same group, are added or removed. Rebalancing lets all the consumers in a group come to consensus on which consumer is consuming which partitions.
It’s noteworthy that Kafka supports only topics; there is no concept of a queue in Kafka. Yet queue semantics can be achieved by design: have a single consumer group consume the messages from a topic.
Kafka is Multi-Talented
Quotas can be defined per broker to throttle clients when needed; this feature can limit the damage caused by misbehaving applications or devices. Another benefit is that Kafka ships with tools that can mirror data across multiple datacenters. This makes it possible to build clusters holding aggregated data mirrored from all datacenters, which applications requiring a global view can read from. This feature is quite useful for IoT use cases in which an organization’s installed base is spread across multiple geographies.
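In Kafka 0.9, cluster-wide default quotas can be set in each broker’s `server.properties`; the byte-rate values below are purely illustrative:

```properties
# server.properties (broker) — illustrative values
# Throttle each producer client-id to ~1 MB/s and each consumer to ~2 MB/s
quota.producer.default=1048576
quota.consumer.default=2097152
```

Clients that exceed their quota are slowed down by the broker rather than disconnected.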
Communication Protocols Supported by Kafka
It is interesting to note that Kafka uses a binary protocol over TCP and does not support any of the popular existing messaging protocols such as JMS, AMQP, HTTP or WebSockets (unlike other messaging systems). The binary protocol defines all APIs as request-response message pairs: the client initiates a socket connection, then writes a sequence of request messages and reads back the corresponding response messages. No handshake is required on connection or disconnection.
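The wire format is straightforward framing: a 4-byte size prefix followed by a request header (API key, API version, correlation ID, client ID) and the request body. The sketch below frames a request in that spirit; it is schematic only — real requests carry full per-API payloads defined by the protocol:

```python
import struct

def frame_request(api_key, api_version, correlation_id, client_id, body=b""):
    """Frame a request: big-endian 4-byte size prefix, then a header of
    int16 api_key, int16 api_version, int32 correlation_id, and a
    length-prefixed client_id string, then the body (schematic sketch)."""
    cid = client_id.encode("utf-8")
    header = struct.pack(">hhih", api_key, api_version,
                         correlation_id, len(cid)) + cid
    payload = header + body
    return struct.pack(">i", len(payload)) + payload

req = frame_request(api_key=3, api_version=0, correlation_id=1,
                    client_id="demo")
size = struct.unpack(">i", req[:4])[0]
# The size prefix tells the peer exactly how many bytes follow,
# so requests and responses can be matched by correlation_id.
```

Because every request carries a correlation ID, the client can pipeline many requests on one connection and pair each response with its request.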
Security in Kafka
Security is one of the most essential requirements in the IoT world. Though earlier versions of Kafka were weak on security, the latest version has a number of features that increase the security of a Kafka cluster.
Kafka supports the following security features (arguably still of beta quality):
- Authentication of connections to brokers from clients using either SSL or SASL (Kerberos)
- Encryption of data transferred between brokers and clients using SSL
- ACLs for authorization of read/write operations by clients
- Integration with external authorization services
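Enabling encrypted client connections in Kafka 0.9 comes down to a few client-side settings; the paths and passwords below are placeholders for illustration:

```properties
# Client configuration (producer or consumer) for SSL in Kafka 0.9
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=changeit
# Needed only if the broker requires client authentication:
ssl.keystore.location=/var/private/ssl/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```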
These features make Kafka reasonably well equipped for internet-scale deployments.
Taking Stock of Kafka’s Version Releases
The current stable version of Kafka is 0.9. Since Kafka is still evolving, there are major changes in the architecture and functioning of the latest version compared to earlier versions.
For example, in previous versions ZooKeeper was tightly coupled to the broker; producer and consumer information was also kept in ZooKeeper. In version 0.9.x, ZooKeeper is kept as a separate component, and producers and consumers need not interact with it directly. Similarly, earlier versions offered two different consumer APIs: a high-level ZooKeeper-based consumer and a low-level consumer API. The latest version provides a single new Java consumer (beta quality); the earlier clients were written in Scala and have now been wrapped with Java APIs. So if we are evaluating Kafka for production systems, we need to be open to architectural changes from release to release. It is noteworthy that nearly 300 bugs, some of them quite important, were fixed in version 0.9.
Though Kafka has not yet reached a stable 1.0 release, a GA version is available in the cloud on a subscription basis as the “Confluent Platform,” with many additional features. Kafka itself presently provides only a Java client, but Confluent Platform provides APIs in a variety of languages, including Java, C/C++, Python, Perl, and Ruby, as well as REST-based endpoints.
Kafka also provides tools for reading data from external sources into Kafka and writing data from Kafka to external destinations. Kafka Connect includes connectors for moving data in and out of files, SQL databases, and the HDFS file system.
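As a taste of Kafka Connect, the bundled file source connector can be configured with a short properties file; the file path and topic name here are examples:

```properties
# connect-file-source.properties — stream lines from a file into a topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-test
```

Each line appended to the file becomes a message on the topic; a matching file sink connector can write a topic back out to a file.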
In part 2 we will explore Kafka’s useful features for IoT.