A distributed system was famously defined by Leslie Lamport as "a system in which the failure of a computer you didn't even know existed can render your own computer unusable."
Building distributed systems helps increase availability, improve geographic proximity (and hence performance), and improve the fault tolerance of a system.
It comes with its own headaches:
- Network unreliability
- Node unreliability
- Time and synchronization challenges
Communication among nodes in a distributed system often involves broadcasting messages from one node to the others. This is a fundamental aspect of achieving coordination and consistency.
There are different types of broadcast algorithms:
- Eager reliable / gossip broadcast
- FIFO broadcast
- Causal broadcast
- Total order broadcast
Of these, total order broadcast provides the strongest ordering guarantee; it involves a leader-based mechanism to send out messages to the nodes in the system.
[Figure: Total order broadcast]
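To make the leader-based idea concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration and are not Kafka or Raft APIs): the leader acts as a sequencer, stamping each message with a monotonically increasing number, and followers deliver strictly in that order.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the leader acts as a sequencer for total order broadcast.
// Every follower delivers messages in sequence-number order, so all nodes
// observe the same total order.
class SequencerLeader {
    private final AtomicLong nextSeq = new AtomicLong(0);
    private final List<Follower> followers;

    SequencerLeader(List<Follower> followers) {
        this.followers = followers;
    }

    // Any node forwards its message to the leader; the leader assigns the order.
    void broadcast(String payload) {
        long seq = nextSeq.getAndIncrement();
        for (Follower f : followers) {
            f.receive(seq, payload);   // in practice this would be a network send
        }
    }
}

class Follower {
    private long expectedSeq = 0;

    // Deliver only the next expected sequence number.
    void receive(long seq, String payload) {
        if (seq == expectedSeq) {
            System.out.println("delivered in total order: " + seq + " -> " + payload);
            expectedSeq++;
        }
        // A real implementation would buffer out-of-order messages and retransmit gaps.
    }
}
```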
One caveat, though: how is the system managed when the leader node fails, whether through a crash or a liveness failure?
In such cases, leader election comes into the picture.
There are many consensus algorithms for effectively selecting a leader among a group of nodes: Paxos, Multi-Paxos, ZooKeeper Atomic Broadcast (ZAB), and, more recently, Raft.
In this article, we'll try to understand Raft as a consensus algorithm, and then take a look at KRaft, the new control plane implementation in Kafka, and how it uses Raft to elect the leader of its metadata partition.
[Figure: Characteristics of a consensus algorithm]
As shown in the figure above, a consensus algorithm:
- Detects crashes or failures of the leader node (primarily via heartbeats)
- Avoids having two leaders within a single term (a term is a limited duration of time)
- Increments the term by 1 on every election
- Requires a candidate to gather a quorum of votes ((n+1)/2, i.e. a majority) to become the leader, as the sketch below illustrates
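A rough illustration of the term and quorum rules above (a sketch with invented names, not any real library's API):

```java
// Illustrative sketch of the election rules listed above (not production Raft).
class ElectionMath {
    // Majority quorum for a cluster of n voters, e.g. 2 of 3, 3 of 5.
    static int quorum(int n) {
        return n / 2 + 1;          // equivalent to (n + 1) / 2 for odd n
    }

    static boolean wins(int votesGranted, int clusterSize) {
        return votesGranted >= quorum(clusterSize);
    }
}

class CandidateState {
    long currentTerm = 0;

    // On election timeout: increment the term and start a new election.
    long startElection() {
        currentTerm += 1;          // each term has at most one leader
        return currentTerm;
    }
}
```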
[Figure: State transitions in Raft]
Start state: Follower. Each node starts a term as a follower, and on crash recovery it also comes back as a follower.
Intermediate state: Candidate. A follower can become a candidate and ask the other nodes for votes.
After receiving a quorum of votes, the candidate becomes the leader of the system and can then decide the order in which messages are consumed or delivered.
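A condensed sketch of these transitions, with hypothetical names and no timers or networking wired in:

```java
// Simplified Raft role transitions: Follower -> Candidate -> Leader,
// with a step-down back to Follower whenever a higher term is observed.
enum Role { FOLLOWER, CANDIDATE, LEADER }

class RaftNode {
    Role role = Role.FOLLOWER;   // start state, also the state after crash recovery
    long currentTerm = 0;

    // Election timeout fired without hearing from a leader.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;
        // ...send vote requests to peers...
    }

    // A quorum of votes was collected for currentTerm.
    void onQuorumGranted() {
        if (role == Role.CANDIDATE) {
            role = Role.LEADER;  // now decides the order in which entries are delivered
        }
    }

    // Saw a message with a newer term (e.g. another node already won).
    void onHigherTerm(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
        }
    }
}
```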
Kafka as a case study for Raft
Kafka is a distributed log used for asynchronous message processing. Its architecture involves two major components: the control plane and the data plane.
Control plane: responsible for managing metadata for the Kafka cluster.
Data plane: responsible for replicating and managing data for the Kafka cluster.
[Figure: Kafka control plane (KRaft, with no ZooKeeper)]
In the new control plane implementation, called KRaft, Kafka removes the ZooKeeper dependency and manages metadata using an internal single-partition topic called __cluster_metadata.
[Figure: Kafka control plane: cluster metadata management using __cluster_metadata]
A subset of the brokers in a Kafka cluster act as control plane nodes, called controllers, forming the controller pool; the controller that is the leader of the single-partition topic is called the active controller.
Note: there is no in-sync replica (ISR) maintenance for this single partition, since it is the metadata plane itself that decides ISRs for data topics.
Hence, a leader crash is a crucial scenario, and a voting technique is needed to decide a new leader in case the active controller crashes.
In KRaft:
A term is represented by the epoch maintained by each controller.
On leader failure, a follower controller transitions to the candidate state and sends a VoteRequest (a leader election request) to the other controllers.
The images below compare KRaft's vote request with the Raft algorithm's vote request:
[Figure: Similarities of KRaft and Raft (vote request)]
[Figure: Similarities of KRaft and Raft (vote request evaluation)]
When a candidate receives grants from a majority ((n+1)/2) of the nodes in the cluster, it broadcasts its leadership to the cluster.
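A hedged sketch of how a voter might evaluate such a request, following the standard Raft rules that KRaft's vote request closely mirrors (the names here are illustrative, not actual Kafka classes): grant at most one vote per term/epoch, and only to a candidate whose log is at least as up to date as the voter's.

```java
// Sketch of vote-request evaluation: one vote per epoch, and only for a
// candidate whose replicated log is at least as up to date as the voter's.
class Voter {
    long currentEpoch = 0;        // "term" in Raft, "epoch" in KRaft
    Integer votedFor = null;      // candidate id voted for in currentEpoch
    long lastLogEpoch = 0;
    long lastLogOffset = 0;

    boolean handleVoteRequest(int candidateId, long candidateEpoch,
                              long candidateLastLogEpoch, long candidateLastLogOffset) {
        if (candidateEpoch < currentEpoch) {
            return false;                      // stale candidate, reject
        }
        if (candidateEpoch > currentEpoch) {
            currentEpoch = candidateEpoch;     // move to the newer epoch
            votedFor = null;
        }
        // The candidate's log must not be behind the voter's log.
        boolean logOk = candidateLastLogEpoch > lastLogEpoch
                || (candidateLastLogEpoch == lastLogEpoch
                    && candidateLastLogOffset >= lastLogOffset);
        if (logOk && (votedFor == null || votedFor == candidateId)) {
            votedFor = candidateId;            // at most one grant per epoch
            return true;
        }
        return false;
    }
}
```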
The addition of KRaft brings a significant improvement in Kafka's design: it removes a redundant component and makes the ecosystem more consistent with Kafka's own abstractions, gaining durability through the use of internal topics.
This article tried to cover the implementation semantics at a high level, with the help of references I came across in a few courses.
References:
- Distributed Systems course by Martin Kleppmann (lecture notes)
- Kafka Internals course by Jun Rao
All images are taken from the above two references; no copyright infringement is intended.