A distributed system was famously defined by Leslie Lamport as "a system in which the failure of a computer you didn't even know existed can render your own computer unusable."
Building distributed systems helps increase availability, improve geographic proximity (and hence performance), and improve the fault tolerance of a system.
It comes with its own headaches:
- Network unreliability
- Node unreliability
- Time and synchronization challenges
Communication among nodes in a distributed system often involves broadcasting messages from one node to the others. This is a fundamental aspect of achieving coordination and consistency.
There are different types of broadcast algorithms:
- Eager reliable / gossip broadcast
- FIFO broadcast
- Causal broadcast
- Total order broadcast
Of these, total order broadcast provides the strongest ordering guarantee; it involves a leader-based mechanism to send out messages to the nodes in the system.
[Figure: Total order broadcast]
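To make the leader-based idea concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration and are not Kafka or Raft APIs): the leader acts as a sequencer, stamping each message with a monotonically increasing number, and followers deliver strictly in that order.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the leader acts as a sequencer for total order broadcast.
// Every follower delivers messages in sequence-number order, so all nodes
// observe the same total order.
class SequencerLeader {
    private final AtomicLong nextSeq = new AtomicLong(0);
    private final List<Follower> followers;

    SequencerLeader(List<Follower> followers) {
        this.followers = followers;
    }

    // Any node forwards its message to the leader; the leader assigns the order.
    void broadcast(String payload) {
        long seq = nextSeq.getAndIncrement();
        for (Follower f : followers) {
            f.receive(seq, payload);   // in practice this would be a network send
        }
    }
}

class Follower {
    private long expectedSeq = 0;

    // Deliver only the next expected sequence number.
    void receive(long seq, String payload) {
        if (seq == expectedSeq) {
            System.out.println("delivered in total order: " + seq + " -> " + payload);
            expectedSeq++;
        }
        // A real implementation would buffer out-of-order messages and retransmit gaps.
    }
}
```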
One caveat, though: how is the system managed when the leader node fails, whether through a crash or a liveness failure?
In such cases, leader election comes into the picture.
There are many consensus algorithms for effectively selecting a leader among a group of nodes: Paxos, Multi-Paxos, ZooKeeper Atomic Broadcast (ZAB), and, more recently, Raft.
In this article, we'll try to understand Raft as a consensus algorithm, and then take a look at KRaft, the new control plane implementation in Kafka, and how it uses Raft to elect the leader of its metadata partition.
[Figure: Characteristics of a consensus algorithm]
As shown in the figure above, a consensus algorithm:
- Detects crashes or failures of the leader node (primarily via heartbeats)
- Avoids having two leaders within a single term (a term is a limited duration of time)
- Increments the term by 1 on every election
- Requires a candidate to gather a quorum of votes ((n+1)/2, i.e. a majority) to become the leader, as the sketch below illustrates
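A rough illustration of the term and quorum rules above (a sketch with invented names, not any real library's API):

```java
// Illustrative sketch of the election rules listed above (not production Raft).
class ElectionMath {
    // Majority quorum for a cluster of n voters, e.g. 2 of 3, 3 of 5.
    static int quorum(int n) {
        return n / 2 + 1;          // equivalent to (n + 1) / 2 for odd n
    }

    static boolean wins(int votesGranted, int clusterSize) {
        return votesGranted >= quorum(clusterSize);
    }
}

class CandidateState {
    long currentTerm = 0;

    // On election timeout: increment the term and start a new election.
    long startElection() {
        currentTerm += 1;          // each term has at most one leader
        return currentTerm;
    }
}
```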
[Figure: State transitions in Raft]
Start state: Follower. Each node starts a term as a follower, and on crash recovery it also comes back as a follower.
Intermediate state: Candidate. A follower can become a candidate and ask the other nodes for votes.
After receiving a quorum of votes, the candidate becomes the leader of the system and can then decide the order in which messages are consumed or delivered.
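A condensed sketch of these transitions, with hypothetical names and no timers or networking wired in:

```java
// Simplified Raft role transitions: Follower -> Candidate -> Leader,
// with a step-down back to Follower whenever a higher term is observed.
enum Role { FOLLOWER, CANDIDATE, LEADER }

class RaftNode {
    Role role = Role.FOLLOWER;   // start state, also the state after crash recovery
    long currentTerm = 0;

    // Election timeout fired without hearing from a leader.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;
        // ...send vote requests to peers...
    }

    // A quorum of votes was collected for currentTerm.
    void onQuorumGranted() {
        if (role == Role.CANDIDATE) {
            role = Role.LEADER;  // now decides the order in which entries are delivered
        }
    }

    // Saw a message with a newer term (e.g. another node already won).
    void onHigherTerm(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
        }
    }
}
```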
Kafka as a case study for Raft
Kafka is a distributed log used for asynchronous message processing. Its architecture involves two major components: the control plane and the data plane.
Control plane: responsible for managing metadata for the Kafka cluster.
Data plane: responsible for replicating and managing data for the Kafka cluster.
[Figure: Kafka control plane (KRaft, with no ZooKeeper)]
In the new control plane implementation, called KRaft, Kafka removes the ZooKeeper dependency and manages metadata using an internal single-partition topic called __cluster_metadata.
[Figure: Kafka control plane: cluster metadata management using __cluster_metadata]
A subset of the brokers in a Kafka cluster act as control plane nodes, called controllers, forming the controller pool; the controller that is the leader of the single-partition topic is called the active controller.
Note: there is no in-sync replica (ISR) maintenance for this single partition, since it is the metadata plane itself that decides ISRs for data topics.
Hence, a leader crash is a crucial scenario, and a voting technique is needed to decide a new leader in case the active controller crashes.
In KRaft:
A term is represented by the epoch maintained by each controller.
On leader failure, a follower controller transitions to the candidate state and sends a VoteRequest (a leader election request) to the other controllers.
The images below compare KRaft's vote request with the Raft algorithm's vote request:
[Figure: Similarities of KRaft and Raft (vote request)]
[Figure: Similarities of KRaft and Raft (vote request evaluation)]
When a candidate receives grants from a majority ((n+1)/2) of the nodes in the cluster, it broadcasts its leadership to the cluster.
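A hedged sketch of how a voter might evaluate such a request, following the standard Raft rules that KRaft's vote request closely mirrors (the names here are illustrative, not actual Kafka classes): grant at most one vote per term/epoch, and only to a candidate whose log is at least as up to date as the voter's.

```java
// Sketch of vote-request evaluation: one vote per epoch, and only for a
// candidate whose replicated log is at least as up to date as the voter's.
class Voter {
    long currentEpoch = 0;        // "term" in Raft, "epoch" in KRaft
    Integer votedFor = null;      // candidate id voted for in currentEpoch
    long lastLogEpoch = 0;
    long lastLogOffset = 0;

    boolean handleVoteRequest(int candidateId, long candidateEpoch,
                              long candidateLastLogEpoch, long candidateLastLogOffset) {
        if (candidateEpoch < currentEpoch) {
            return false;                      // stale candidate, reject
        }
        if (candidateEpoch > currentEpoch) {
            currentEpoch = candidateEpoch;     // move to the newer epoch
            votedFor = null;
        }
        // The candidate's log must not be behind the voter's log.
        boolean logOk = candidateLastLogEpoch > lastLogEpoch
                || (candidateLastLogEpoch == lastLogEpoch
                    && candidateLastLogOffset >= lastLogOffset);
        if (logOk && (votedFor == null || votedFor == candidateId)) {
            votedFor = candidateId;            // at most one grant per epoch
            return true;
        }
        return false;
    }
}
```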
The addition of KRaft brings a significant improvement in Kafka's design: it removes a redundant component and makes the ecosystem more consistent with Kafka's own abstractions, gaining durability through the use of internal topics.
This article tried to cover the implementation semantics at a high level, with the help of references I came across in a few courses.
References:
- Distributed Systems course by Martin Kleppmann (lecture notes)
- Kafka Internals course by Jun Rao
All images are taken from the above two references; no copyright infringement is intended.