Kafka

03/07/2018

Kafka is a distributed publish-subscribe messaging system. Publish-subscribe refers to a pattern of communication in distributed systems where the producers/publishers of data produce data categorized in different classes without any knowledge of how the data will be used by the subscribers. The consumers/subscribers can express interest in specific classes of data and receive only those messages. Kafka uses a commit log to persist data. The commit log is an ordered, immutable, append-only data structure that is the main abstract data structure that Kafka manages. The main advantage of Kafka is the fact that it provides a unifying data backbone from which all systems in the organization can consume data independently and reliably.

Topics

Topics represent a user-defined category to which messages are published. An example topic one might find at an advertising company could be AdClickEvents. All consumers of data read from one or more topics. Topics are generally maintained as a partitioned log (see below).

Producers

Producers are processes that publish messages to one or more topics in the Kafka cluster.

Consumers

Consumers are processes that subscribe to topics and read messages from the Kafka cluster.

Partitions

Topics are divided into partitions. A partition represents the unit of parallelism in Kafka. In general, a higher number of partitions means higher throughput. Within each partition each message has a specific offset that consumers use to keep track of how far they have progressed through the stream. Consumers may use Kafka partitions as semantic partitions as well.

Brokers

Brokers in Kafka are responsible for message persistence and replication. Producers talk to brokers to publish messages to the Kafka cluster and consumers talk to brokers to consume messages from the Kafka cluster.

Replication

Kakfa uses replication for fault-tolerance. For each partition of a Kafka topic, Kafka will choose a leader and zero or more followers from servers in the cluster and replicates the log across them. The number of replicas including the leader is determined by a replication factor. The leader of the partition will handle all the reads and writes while the followers consume messages from the leader in order to replicate the log. Since both leader and follower may fail when a server in the cluster fails, the leader keeps track of alive followers (in-sync replicas (ISR)) and removes unhealthy ones. In this case, if the leader dies, an alive follower will become the new leader of this partition. This mechanism allows Kafka to remain functional when failures exist.