Introduction: Why I Built a RAFT Implementation
When I started my Master's degree in Software Engineering, I kept hearing the term "RAFT" in system design discussions. Databases use it. Kubernetes uses it. Google, Meta, and every company managing millions of requests rely on it or an algorithm very much like it. But I didn't understand it, not really.
So I decided to build one from scratch.
I built a production-grade RAFT consensus implementation that goes beyond theoretical understanding. This project demonstrates:
- Complete RAFT protocol with all 10 invariants verified
- Real-world concerns: persistence, concurrent RPC handling, recovery from failures
- Production-thinking: thread safety, error handling, clean abstractions
- 3-node distributed cluster that actually works
- Real-time WebSocket visualization to see the algorithm in action
- 100% test pass rate across 10 comprehensive resilience tests
But more importantly: I now understand not just RAFT, but why distributed systems are hard.
What is RAFT? The Consensus Algorithm Behind Modern Databases
Before jumping into my implementation, let me explain RAFT clearly, because if you don't understand it, the code won't make sense.
The Core Problem
Imagine you're building a database that needs to serve 100,000 users simultaneously. You can't use a single server; it will eventually fail. So you replicate data across 3 servers (A, B, C).
But now: Which server's version of the data is "correct"?
If you write to Server A, then Server A crashes before telling B and C, what happens to your write? Did it succeed or fail? If you read from B, do you get stale data? What if B and C disagree about what the data should be?
This is the consensus problem. You need all servers to agree on the order and content of changes, even when some servers fail, get slow, or disappear from the network.
Why RAFT Solves It
RAFT is a consensus algorithm that solves this by:
- Electing a leader: One server takes charge and decides the order of writes
- Replicating to followers: The leader sends each write to all other servers
- Waiting for majority: the write counts as complete only once a majority of servers confirm it
- Applying to state machine: All servers apply the same writes in the same order
This guarantees: Even if one server crashes, the others have the data and can elect a new leader.
Real-World Usage
- Consul (HashiCorp): Service discovery, uses RAFT to keep all nodes in sync
- Etcd (Kubernetes): Configuration store, RAFT ensures cluster state is consistent
- TiDB: Distributed SQL database where RAFT ensures multiple replicas stay in sync
The RAFT paper (2014) was revolutionary because it made consensus understandable compared to older algorithms like Paxos; the authors explicitly prioritized understandability over optimality.
Why This Matters: RAFTβs Role in Modern Distributed Systems
Let me connect this to business value, because this is why companies care.
Scaling to Millions of Requests
When Instagram wants to scale to 1 billion users, a single database server isnβt enough. They need:
- Replication: Store data on multiple servers (safety)
- Consensus: Make sure all replicas agree (correctness)
- Automatic failover: If one server dies, others take over (availability)
Without RAFT (or similar consensus), your system becomes:
- Inconsistent: Different replicas have different data
- Unreliable: Data loss when servers fail
- Unpredictable: No guarantees about what data you read
With RAFT:
- Consistent: All replicas agree on every write
- Reliable: Data survives single server failures
- Predictable: Guarantees about which writes are permanent
Why Google and Meta Care
Google runs Spanner (distributed database), Meta runs similar systems. Both need:
- Consensus for metadata consistency
- Automatic leader election when servers die
- Log replication without data loss
My RAFT implementation handles all of this. When a node crashes, the cluster elects a new leader in ~5-10 seconds. No data is lost. Reads work from any node (if you want to read from a follower). This is what production systems need.
Architecture & Design Decisions
Let me show you how I designed this system and why each choice matters.
System Overview
Key Design Decisions
1. Persistent Logs on Disk
def _persist_log_entry(self, log_entry):
    """Persist a single log entry to the LOG file (append-only)"""
    log_index = len(self.raft_terms.logs) - 1  # index of the entry we just appended
    with open(self.raft_terms.logs_file_path, 'a') as f:
        entry = {
            "index": log_index,
            "term": log_entry["term"],
            "command": log_entry["command"]
        }
        f.write(json.dumps(entry) + '\n')
Why? If a node crashes and restarts, it loads logs from disk. This is a RAFT requirement: "Servers must persist their current term and vote before responding to RPCs."
Without this: Node could vote twice in the same term (breaking consensus). With this: Node recovers safely.
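For completeness, here is what _persist_state (called during elections below) has to accomplish. This is a minimal sketch, assuming a JSON state file and a state_file_path attribute that may be named differently in the actual repository:
import json
import os

def _persist_state(self):
    """Sketch: write current_term and voted_for to stable storage
    before responding to (or sending) any RPC."""
    state = {
        "current_term": self.raft_terms.current_term,
        "voted_for": self.raft_terms.voted_for,
    }
    # Write to a temp file, then atomically swap it in, so a crash
    # mid-write never leaves a half-written state file behind.
    tmp_path = self.raft_terms.state_file_path + ".tmp"  # hypothetical attribute
    with open(tmp_path, 'w') as f:
        json.dump(state, f)
    os.replace(tmp_path, self.raft_terms.state_file_path)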
2. Thread-Safe State Management
Every RAFT state change is protected by a lock:
with self.lock:
self.raft_terms.current_term += 1
self.raft_terms.state = RaftState.candidate
self.raft_terms.voted_for = self.raft_terms.id
Why? Multiple threads access RAFT state:
- Election timeout thread: checks if election should start
- Heartbeat thread: sends periodic messages
- RPC handler threads: processes incoming requests
Without this: Race condition could corrupt state. With this: All updates are atomic.
Note: I release the lock before blocking operations (RPC calls, disk I/O). This prevents starvation: other threads can proceed while one is waiting on the network.
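As a concrete illustration of that pattern (a sketch that follows the post's naming, not a verbatim method from the repository; _send_rpc and _handle_responses are hypothetical helpers):
def _broadcast_example(self):
    # 1) Critical section: snapshot what we need, then release the lock
    with self.lock:
        term = self.raft_terms.current_term
        peers = dict(self.raft_terms.peers)  # copy so we can iterate without the lock
    # 2) Blocking work (network RPCs, disk I/O) happens WITHOUT the lock,
    #    so heartbeat and RPC-handler threads never starve
    responses = [self._send_rpc(peer_id, peer, term) for peer_id, peer in peers.items()]
    # 3) Critical section again: apply the results atomically
    with self.lock:
        self._handle_responses(responses)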
3. Log Replication with Consistency Check
This is the core of RAFT's safety:
# Rule 3: Log consistency check
if health_check_arguments.prev_log_index >= 0:
if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
# We're missing entries
return {"success": False, "term": self.raft_terms.current_term}
prev_entry_local = obtain(self.raft_terms.logs[health_check_arguments.prev_log_index])
prev_term = int(prev_entry_local["term"])
if prev_term != health_check_arguments.prev_log_term:
# Term mismatch -> delete conflicting entries
self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
return {"success": False, "term": self.raft_terms.current_term}
# Rule 4: Append new entries
if health_check_arguments.entries:
insert_index = health_check_arguments.prev_log_index + 1
self.raft_terms.logs = self.raft_terms.logs[:insert_index]
for entry_ref in health_check_arguments.entries:
self.raft_terms.logs.append(entry_ref)
Why this approach?
RAFT ensures all servers have the same log history by checking the "previous entry" before appending new ones. If a follower is behind, the leader will eventually find the match point and fill in the gaps.
Why it's safe:
- If prev_log_index/prev_log_term don't match: the follower rejects (the leader will retry with an earlier index, as sketched below)
- Once a match is found: all future entries are guaranteed to match
- Deleted entries were never committed (the leader wouldn't have replicated them to a majority)
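On the leader side, that retry is typically implemented by decrementing next_index for the follower that rejected the RPC. A sketch of that reply handling, using the post's field names (the repository's exact version may differ):
def _on_append_entries_reply(self, peer_id, reply, sent_prev_log_index, sent_entries):
    """Sketch: adjust next_index/match_index based on a follower's reply."""
    with self.lock:
        if reply["term"] > self.raft_terms.current_term:
            # Someone has a newer term: step down immediately
            self.raft_terms.current_term = reply["term"]
            self.raft_terms.state = RaftState.follower
            return
        if reply["success"]:
            # Follower now matches us up to the last entry we sent
            self.match_index[peer_id] = sent_prev_log_index + len(sent_entries)
            self.next_index[peer_id] = self.match_index[peer_id] + 1
        else:
            # Consistency check failed: back up one entry and retry on the next heartbeat
            self.next_index[peer_id] = max(0, self.next_index[peer_id] - 1)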
Deep Dive: Implementation Highlights
Now let's look at the two phases of RAFT in action.
Phase 1: Leader Election (The Hard Part)
When a node doesn't hear from the leader for 5-10 seconds, it starts an election:
def request_vote(self):
"""
PHASE 1: LEADER ELECTION
1. Increment currentTerm
2. Transition to CANDIDATE
3. Vote for self
4. Send RequestVote RPCs to all peers in PARALLEL
5. Count votes
6. If majority: become LEADER
"""
with self.lock:
# Step 1: Increment term and become candidate
self.raft_terms.current_term += 1
self.raft_terms.state = RaftState.candidate
self.raft_terms.voted_for = self.raft_terms.id
self.vote_count = 1
# Step 2: Persist before sending RPCs (safety requirement)
self._persist_state()
# Step 3: Get election timeout
self.reset_election_timer()
# Step 4: Prepare vote request with log information
vote_arguments = VoteArguments(
candidate_id=self.raft_terms.id,
current_term=self.raft_terms.current_term,
last_log_index=self.raft_terms.last_log_index,
last_log_term=self.raft_terms.last_log_term
)
peers = dict(self.raft_terms.peers)
# Step 5: Send RPCs in parallel (outside lock to prevent starvation)
with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
futures = []
for peer_id, peer_value in peers.items():
if peer_id == self.raft_terms.id:
continue
future = executor.submit(self._send_vote_request_to_peer, peer_id, peer_value, vote_arguments)
futures.append(future)
concurrent.futures.wait(futures)
# Step 6: Count votes AFTER all RPCs complete
with self.lock:
total_servers = len(self.raft_terms.peers) + 1
majority = (total_servers // 2) + 1
if self.vote_count >= majority:
self.raft_terms.state = RaftState.leader
# Initialize leader state for log replication
for peer_id in self.raft_terms.peers.keys():
self.next_index[peer_id] = len(self.raft_terms.logs)
self.match_index[peer_id] = -1
The Vote Decision Logic
When a node receives a vote request, it uses this logic:
def handle_request_vote(self, vote_arguments: VoteArguments):
"""
PHASE 1: LEADER ELECTION - Vote decision
Rules:
1. Reject if candidate's term < currentTerm
2. If candidate's term > currentTerm: update and become follower
3. Grant vote if:
- Haven't voted in this term OR already voted for this candidate
- Candidate's log is at least as up-to-date as receiver's log
"""
with self.lock:
# Rule 1: Reject stale term
if vote_arguments.current_term < self.raft_terms.current_term:
return False
# Rule 2: Higher term -> become follower
if vote_arguments.current_term > self.raft_terms.current_term:
self.raft_terms.current_term = vote_arguments.current_term
self.raft_terms.state = RaftState.follower
self.raft_terms.voted_for = -1
# Rule 3: Can we vote?
can_vote = (
self.raft_terms.voted_for == -1 or
self.raft_terms.voted_for == vote_arguments.candidate_id
)
# Rule 4: Is candidate's log up-to-date?
log_is_current = self._is_log_up_to_date(
vote_arguments.last_log_term,
vote_arguments.last_log_index
)
# Grant vote if both conditions met
if can_vote and log_is_current:
self.raft_terms.voted_for = vote_arguments.candidate_id
self.reset_election_timer()
return True
return False
Why is the log check necessary?
Imagine:
- Node A has 100 log entries
- Node B has 50 log entries
- Node C has 50 log entries
If B becomes leader, it will replicate its 50-entry log to everyone, overwriting the 50 additional (possibly committed) entries on A. Disaster!
RAFT prevents this by checking: "Is your log at least as up-to-date as mine?" Specifically, it compares last_log_term and then last_log_index. With the same last term, A's 100 entries (higher index) make A's log more up-to-date than B's 50, so A refuses to vote for B.
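The handler above delegates to _is_log_up_to_date, which isn't shown in this post. The standard comparison (last term first, then index) is small enough to sketch here; the repository's version may differ in details:
def _is_log_up_to_date(self, candidate_last_log_term, candidate_last_log_index):
    """Is the candidate's log at least as up-to-date as ours?"""
    if candidate_last_log_term != self.raft_terms.last_log_term:
        # Different last terms: the higher term wins
        return candidate_last_log_term > self.raft_terms.last_log_term
    # Same last term: the longer (or equal) log wins
    return candidate_last_log_index >= self.raft_terms.last_log_index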
Phase 2: Log Replication (The Intricate Part)
This is where the magic happens. The leader continuously sends updates to followers.
The Replication Cycle
def send_health_checks(self):
"""
LEADER ONLY: Send periodic heartbeats to followers
Purpose:
1. Replicate log entries (or empty for heartbeat)
2. Prevent election timeouts (heartbeat resets timer)
3. Update follower commitIndex
"""
with self.lock:
if self.raft_terms.state != RaftState.leader:
return
followers = [peer_id for peer_id in self.raft_terms.peers.keys() if peer_id != self.raft_terms.id]
peers = dict(self.raft_terms.peers)
# Send to all followers in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
for peer_id, peer_value in peers.items():
if peer_id == self.raft_terms.id:
continue
executor.submit(self._send_health_check_to_peer, peer_id, peer_value)
Receiving and Appending Entries
When a follower receives an AppendEntries RPC:
def handle_health_check_request(self, health_check_arguments: HealthCheckArguments):
"""
FOLLOWER: Process AppendEntries RPC from leader
Steps:
1. Reject if term < currentTerm
2. Reset election timer (leader alive!)
3. Check log consistency (prev_log_index/term match)
4. Append new entries if consistent
5. Update commitIndex from leader signal
6. Apply committed entries to state machine
"""
with self.lock:
# Rule 1: Reject stale term
if health_check_arguments.current_term < self.raft_terms.current_term:
return {"success": False, "term": self.raft_terms.current_term}
# Rule 2: Valid leader -> become follower and reset election timer
if health_check_arguments.current_term >= self.raft_terms.current_term:
self.raft_terms.current_term = health_check_arguments.current_term
self.raft_terms.state = RaftState.follower
self.reset_election_timer()
# Rule 3: Log consistency check (THIS IS CRITICAL)
if health_check_arguments.prev_log_index >= 0:
if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
# Follower is missing entries
return {"success": False, "term": self.raft_terms.current_term}
# Check if the entry at prev_log_index matches term
prev_entry = self.raft_terms.logs[health_check_arguments.prev_log_index]
if prev_entry["term"] != health_check_arguments.prev_log_term:
# Mismatch! Delete conflicting entries
self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
self._truncate_log_file(health_check_arguments.prev_log_index)
return {"success": False, "term": self.raft_terms.current_term}
# Rule 4: Append new entries
if health_check_arguments.entries:
insert_index = health_check_arguments.prev_log_index + 1
self.raft_terms.logs = self.raft_terms.logs[:insert_index]
for entry in health_check_arguments.entries:
self.raft_terms.logs.append(entry)
self._persist_log_entry(entry)
# Update metadata
self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
self.raft_terms.last_log_term = self.raft_terms.logs[-1]["term"]
# Rule 5: Update commitIndex from leader
if health_check_arguments.leader_commit > self.raft_terms.commit_index:
old_commit = self.raft_terms.commit_index
self.raft_terms.commit_index = min(
health_check_arguments.leader_commit,
len(self.raft_terms.logs) - 1
)
print(f"Updated commitIndex: {old_commit} -> {self.raft_terms.commit_index}")
# Apply newly committed entries to state machine
self._apply_committed_entries()
return {"success": True, "term": self.raft_terms.current_term}
The State Machine Application (Where Data Becomes Real)
Here's the critical part: when entries actually get applied to the KV store.
def _apply_committed_entries(self):
"""
Apply all committed but not yet applied log entries to state machine.
RAFT Safety: Only apply when commitIndex > lastApplied
"""
while self.raft_terms.last_applied < self.raft_terms.commit_index:
self.raft_terms.last_applied += 1
log_entry = self.raft_terms.logs[self.raft_terms.last_applied]
print(f"Applying entry {self.raft_terms.last_applied}: '{log_entry['command']}'")
# Use state machine applier to apply to KV store
if self.state_machine_applier:
result = self.state_machine_applier.apply(log_entry["command"])
The state_machine_applier (a bridge between RAFT and KV store) deserializes and applies commands:
def apply(self, command_str: str) -> Dict[str, Any]:
"""Apply a serialized command to the state machine (KV store)."""
try:
command = deserialize_command(command_str)
cmd_type = command.get("type")
if cmd_type == CommandType.SET.value:
self.db.set_at(
command["key"],
command["field"],
command["value"],
command["timestamp"],
command.get("ttl")
)
elif cmd_type == CommandType.DELETE.value:
self.db.delete_at(
command["key"],
command["field"],
command["timestamp"]
)
return {"success": True, "operation": cmd_type}
except Exception as e:
return {"success": False, "error": str(e)}
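The post doesn't show serialize_command/deserialize_command. A plausible JSON-based sketch of that bridge format, matching the fields used above (this is an assumption about the wire format, not the repository's actual code):
import json
import time
from enum import Enum

class CommandType(Enum):
    SET = "set"
    DELETE = "delete"

def serialize_command(cmd_type, key, field, value=None, ttl=None):
    """Pack a KV operation into the string stored as a RAFT log command."""
    return json.dumps({
        "type": cmd_type.value,
        "key": key,
        "field": field,
        "value": value,
        "timestamp": int(time.time() * 1_000_000),  # microseconds, matching the API responses
        "ttl": ttl,
    })

def deserialize_command(command_str):
    """Turn the logged string back into a dict for the state machine applier."""
    return json.loads(command_str)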
Understanding the Critical Variables: A Concrete Example
This is where things get confusing, so let me walk through a real scenario:
Scenario: Client writes "SET user1=Alice" to 3-node cluster (A=leader, B=follower, C=follower)
TIME 0: Initial State
────────────────────
All nodes:
logs = []
last_log_index = -1 # No entries in log
last_log_term = 0 # No entries, so term is 0
commit_index = -1 # Nothing committed yet
last_applied = -1 # Nothing applied to KV store yet
TIME 1: Client writes to Leader A
────────────────────────────────
Leader A receives: "SET user1=Alice"
Action: Append to log
logs = [
{term: 5, command: "SET user1=Alice"}   ← INDEX 0
]
last_log_index = 0 # UPDATED: Just added entry at index 0
last_log_term = 5 # UPDATED: That entry is in term 5
commit_index = -1 # NOT CHANGED: Still not committed!
last_applied = -1 # NOT CHANGED: Still not applied!
Entry is in log
NOT committed (only leader has it)
NOT applied to KV (waiting for replication)
Leader A sets next_index[B] = 1 and next_index[C] = 1
(Next entry to send to each follower)
TIME 2: Leader A sends AppendEntries to Followers
──────────────────────────────────────────────────
Leader A creates AppendEntries RPC:
{
current_term: 5,
leader_id: A,
prev_log_index: -1, # Entry BEFORE the new ones
prev_log_term: 0, # Term of that entry
entries: [ # New entries to append
{term: 5, command: "SET user1=Alice"}
],
leader_commit: -1 # Leader's current commitIndex
}
Follower B receives:
Term 5 >= B's term 5 (valid leader)
Consistency check: prev_log_index=-1 matches (no previous entry)
Append new entry to log
logs = [
{term: 5, command: "SET user1=Alice"}   ← INDEX 0
]
last_log_index = 0 # UPDATED: Now has entry 0
last_log_term = 5 # UPDATED: Entry is in term 5
commit_index = -1 # NOT CHANGED: Waits for leader to say it's committed
last_applied = -1 # NOT CHANGED: Waiting for commit
Entry is in log
NOT committed yet (leader hasn't confirmed majority)
NOT applied to KV (waiting for commit)
Follower C receives: Same as B
logs = [{term: 5, command: "SET user1=Alice"}]
last_log_index = 0
last_log_term = 5
commit_index = -1
last_applied = -1
TIME 3: Leader Receives ACKs from Followers
───────────────────────────────────────────
Leader A receives success from B and C.
Update match_index:
match_index = {
A: 0, # I have index 0
B: 0, # B confirmed it has index 0
C: 0 # C confirmed it has index 0
}
Check for majority:
total_servers = 3 (A, B, C)
majority_needed = 2
servers_with_index_0 = [A, B, C] = 3 servers
3 >= 2 MAJORITY REACHED!
Leader A can NOW commit:
commit_index = 0 # UPDATED: Entry 0 is now committed
last_applied = -1 # Still not applied (will apply next)
Entry is committed (can't be lost)
NOT applied to KV (will happen on next apply cycle)
TIME 4: Leader Applies Entry
────────────────────────────
Leader A checks: is last_applied (-1) < commit_index (0)?
Yes! new entries are committed.
Apply loop:
while last_applied (-1) < commit_index (0):
last_applied = 0 # UPDATED
entry = logs[0] = {SET user1=Alice}
state_machine_applier.apply(entry)
→ KV store: {user1: Alice}
Entry applied to KV
Data is now durable
STATE AFTER APPLICATION:
logs = [{term: 5, command: "SET user1=Alice"}]
last_log_index = 0
last_log_term = 5
commit_index = 0 # Committed
last_applied = 0 # Applied to KV
KV store = {user1: Alice}
TIME 5: Leader Tells Followers About Commit
────────────────────────────────────────────
In next heartbeat, Leader A sends:
{
current_term: 5,
leader_id: A,
prev_log_index: 0,
prev_log_term: 5,
entries: [], # Empty (just heartbeat)
leader_commit: 0 # ← KEY: "Commit up to index 0!"
}
Follower B receives:
Check: leader_commit (0) > commit_index (-1)?
Yes!
commit_index = min(0, len(logs)-1) = 0 # UPDATED
Apply loop:
while last_applied (-1) < commit_index (0):
last_applied = 0 # UPDATED
entry = logs[0]
state_machine_applier.apply(entry)
→ KV store: {user1: Alice}
last_applied = 0 # UPDATED
Follower C receives: Same as B
commit_index = 0
last_applied = 0
KV store = {user1: Alice}
FINAL STATE (After Replication Complete):
──────────────────────────────────────────
Leader A:
logs = [{term: 5, command: "SET user1=Alice"}]
last_log_index = 0
last_log_term = 5
commit_index = 0 ← COMMITTED
last_applied = 0 ← APPLIED
KV = {user1: Alice}
Follower B:
logs = [{term: 5, command: "SET user1=Alice"}]
last_log_index = 0
last_log_term = 5
commit_index = 0 ← COMMITTED
last_applied = 0 ← APPLIED
KV = {user1: Alice}
Follower C:
logs = [{term: 5, command: "SET user1=Alice"}]
last_log_index = 0
last_log_term = 5
commit_index = 0 ← COMMITTED
last_applied = 0 ← APPLIED
KV = {user1: Alice}
ALL 3 NODES HAVE IDENTICAL STATE
Why These Variables Matter
| Variable | Meaning | When It Changes | Why It's Critical |
|---|---|---|---|
| last_log_index | "How many entries have I received?" | When the log is appended | Tells followers/candidates if they're behind |
| last_log_term | "What term was my last entry?" | When the log is appended | Ensures the log consistency check works |
| commit_index | "Up to which index can I trust?" | When the leader counts replicas | If you apply an uncommitted entry and it gets lost, consistency breaks |
| last_applied | "Up to which index have I applied to KV?" | When entries are applied | Must match commit_index eventually, or KV state diverges |
The Safety Guarantee:
- An entry is only applied when last_applied reaches it
- This only happens when commit_index reaches it
- commit_index only advances when a majority replicates
- So: only entries that can't be lost are applied
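These relationships can be written down as two invariants that hold on every node at every moment. A tiny illustrative check (a hypothetical helper, not part of the repository):
def check_raft_invariants(last_applied, commit_index, last_log_index):
    # Applied entries are a prefix of committed entries...
    assert last_applied <= commit_index, "applied an entry that was never committed"
    # ...and committed entries are a prefix of the local log.
    assert commit_index <= last_log_index, "committed an entry the log does not contain"

# Example: the final state from the walkthrough above, identical on all three nodes
check_raft_invariants(last_applied=0, commit_index=0, last_log_index=0)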
Health Checks & Fault Tolerance
RAFT handles failures through two mechanisms:
1. Election Timeout (Detects Dead Leaders)
def _ticker(self):
"""
Election timeout thread
If heartbeat arrives before timeout: reset and wait again
If timeout fires: start election (become candidate)
"""
while not self.killed():
self.election_timer_event.clear()
# Randomize 5000-10000 ms to prevent ties
new_timeout_ms = random.uniform(5000, 10000)
self.raft_terms.election_timer = new_timeout_ms / 1000.0
# Wait for heartbeat OR timeout
heartbeat_received = self.election_timer_event.wait(
timeout=self.raft_terms.election_timer
)
# If heartbeat arrived, reset and retry
if heartbeat_received:
continue
# Timeout fired -> start election
self.request_vote()
Why randomization? Without it, all followers time out simultaneously and you get split votes. With randomization, one node times out first and wins the election.
2. Heartbeat Timeout (Followers Detect Stale Leader)
Every 3 seconds, the leader sends heartbeats:
def _heartbeat_ticker(self):
"""
Heartbeat thread (leader only)
Sends AppendEntries RPCs every 3000ms
"""
while not self.killed():
if self.raft_terms.state == RaftState.leader:
self.send_health_checks()
self.heartbeat_timer_event.wait(timeout=3) # 3s interval (for clear visualization)
If a leader crashes, followers don't receive heartbeats and the election timeout fires. A new leader is elected within ~5-10 seconds.
Handling Real-World Failures
Building RAFT taught me that the algorithm's real strength isn't in the happy path; it's in how it handles failures that would crash other systems. Let me walk through the 4 most critical failure scenarios and show exactly how my implementation prevents them.
Failure #1: Split Brain - Multiple Leaders Elected Simultaneously
The Problem
Imagine this nightmare scenario:
- You have 5 nodes in your cluster
- Due to network latency, nodes A and B think C is dead
- They start elections simultaneously
- Both get 2 votes (A votes for itself, B votes for itself)
- Both think they're leaders
- Now your system has two sources of truth
Writes go to Leader A, other writes go to Leader B. Data diverges. Consistency breaks. Database corrupts.
How My Code Prevents It
The answer is majority voting. A node can only become leader if it gets votes from MORE than half the cluster.
def request_vote(self):
"""
PHASE 1: LEADER ELECTION
Only candidates with majority votes become leaders
"""
with self.lock:
# Step 1: Increment term and become candidate
self.raft_terms.current_term += 1
self.raft_terms.state = RaftState.candidate
self.raft_terms.voted_for = self.raft_terms.id
self.vote_count = 1 # Vote for self
# ... send vote requests to all peers ...
# Step 2: Count votes AFTER all RPCs complete
with self.lock:
total_servers = len(self.raft_terms.peers) + 1
majority = (total_servers // 2) + 1 # β KEY: Strict majority needed
print(f"Election result: {self.vote_count}/{total_servers} votes (need {majority})")
# Step 3: Only become leader if majority
if self.vote_count >= majority and self.raft_terms.state == RaftState.candidate:
self.raft_terms.state = RaftState.leader
Why this works:
In a 5-node cluster:
- Majority needed = (5 // 2) + 1 = 3 votes
- If Node A gets 2 votes (itself + one other): NOT leader
- If Node B gets 2 votes (itself + one other): NOT leader
- Worst case: the vote splits 2-2. Neither reaches 3, no leader is elected this term, and a new (randomized) election starts. Still no split brain
In a 3-node cluster:
- Majority needed = (3 // 2) + 1 = 2 votes
- Only one node can get 2+ votes (the other node can't get 2 if the first already got 2)
- Exactly one leader per term guaranteed
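The same arithmetic, spelled out in a few lines so the "at most one winner per term" argument is concrete (illustrative only):
def majority(total_servers):
    # Strict majority: more than half of the cluster
    return total_servers // 2 + 1

for n in (3, 5):
    m = majority(n)
    # Two different winners would need two disjoint groups of m voters each,
    # i.e. at least 2*m voters, which is more servers than the cluster has.
    print(f"{n}-node cluster: {m} votes needed; two winners would require {2 * m} > {n} voters")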
Real-World Impact
Without this: Instagram's database could have two leaders writing conflicting data. With this: Mathematically impossible.
Failure #2: Node Crashes Before Sending Requests to Followers
The Problem
A client writes to the leader:
- Leader appends to its log
- Leader returns success to client ("your write is saved!")
- Leader crashes before sending to followers
- Followers never get the entry
- New leader elected
- Entry is gone
Client thinks data is saved, but it's lost.
How My Code Prevents It
The answer is: don't commit (and don't acknowledge to the client) until a majority replicates.
def append_log_entries(self, command):
"""
CLIENT WRITE: Only leader accepts writes
But leader doesn't acknowledge until replicated to majority
"""
with self.lock:
# Only leader accepts
if self.raft_terms.state != RaftState.leader:
return {
"success": False,
"error": "NOT_LEADER",
"message": "Redirect to actual leader"
}
# Append to leader's log (safe, it's the leader)
log_entry = {
"term": self.raft_terms.current_term,
"command": str(command)
}
self.raft_terms.logs.append(log_entry)
log_index = len(self.raft_terms.logs) - 1
# Persist to disk (safety requirement)
self._persist_log_entry(log_entry)
# Update metadata
self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
self.raft_terms.last_log_term = self.raft_terms.current_term
print(f"Leader appended entry at index {log_index}")
# Return to client: entry is in log, waiting for replication
return {
"success": True,
"index": log_index,
"term": self.raft_terms.current_term,
"message": "Appended to leader log, waiting for replication"
}
Client gets success, but this means "appended to the leader's log, replication in progress."
Then the leader sends to followers in parallel:
def send_health_checks(self):
"""
LEADER ONLY: Send AppendEntries RPCs to all followers
This replicates log entries to majority
"""
with self.lock:
if self.raft_terms.state != RaftState.leader:
return
peers = dict(self.raft_terms.peers)
# Send to all followers IN PARALLEL
with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
for peer_id, peer_value in peers.items():
if peer_id == self.raft_terms.id:
continue
# Each follower gets a separate RPC thread
executor.submit(self._send_health_check_to_peer, peer_id, peer_value)
When majority ACKs:
# After followers respond...
with self.lock:
# Update match_index for each peer that ACKed
# Example: match_index = {A: 5, B: 5, C: 5}
# Count how many nodes have replicated up to index 5
replicated_count = sum(
1 for peer_id in self.raft_terms.peers
if self.match_index.get(peer_id, -1) >= log_index
)
replicated_count += 1 # Count self
# Check if majority has it
total_servers = len(self.raft_terms.peers) + 1
majority = (total_servers // 2) + 1
if replicated_count >= majority:
# NOW it's safe to commit
self.raft_terms.commit_index = log_index
self._apply_committed_entries() # Apply to KV store
Why this works:
- Entry added to leader's log
- Leader sends to all followers in parallel
- Wait for majority ACKs
- Only then advance commit_index
- If leader crashes before majority ACKs: entry is lost, but the client was never told "persisted"
- If leader crashes after majority ACKs: the new leader has the entry (a majority has it) and replicates it
Guarantee: If client gets success with commit_index >= index, data survives any single failure.
Real-World Impact
Without this: Bank deposits might be lost even though the customer got a confirmation. With this: Confirmation is only sent when the data can't be lost.
Failure #3: Log Divergence - After Crashes, Followers Have Conflicting Logs
The Problem
Network partition scenario:
- Leader A gets 2 writes: [entry1, entry2]
- Followers B and C don't get entry2 (partition)
- Network heals
- Now logs diverge:
  - A: [entry1, entry2]
  - B: [entry1]
  - C: [entry1]
Who's right? How do they reconcile?
How My Code Prevents It
The answer is the log consistency check with prev_log_index and prev_log_term: the leader sends every batch of entries with context.
def handle_health_check_request(self, health_check_arguments: HealthCheckArguments):
"""
FOLLOWER: Process AppendEntries from leader
CRITICAL: Check that previous entry matches before appending new ones
This prevents log divergence
"""
with self.lock:
# Rule 1: Reject stale leader
if health_check_arguments.current_term < self.raft_terms.current_term:
return {"success": False, "term": self.raft_terms.current_term}
# Rule 2: Reset election timer (valid leader)
if health_check_arguments.current_term >= self.raft_terms.current_term:
self.raft_terms.current_term = health_check_arguments.current_term
self.raft_terms.state = RaftState.follower
self.reset_election_timer()
# Rule 3: LOG CONSISTENCY CHECK β KEY
# Check if we have the "previous" entry that leader is building on
if health_check_arguments.prev_log_index >= 0:
# Do we have enough entries?
if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
print(f"Log too short: have {len(self.raft_terms.logs)}, "
f"need {health_check_arguments.prev_log_index + 1}")
return {"success": False, "term": self.raft_terms.current_term}
# Does the previous entry's term match?
prev_entry = self.raft_terms.logs[health_check_arguments.prev_log_index]
if prev_entry["term"] != health_check_arguments.prev_log_term:
# Term mismatch! Our logs diverge
print(f"Log conflict at index {health_check_arguments.prev_log_index}")
# β KEY: Truncate diverging entries
self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
self._truncate_log_file(health_check_arguments.prev_log_index)
return {"success": False, "term": self.raft_terms.current_term}
# Rule 4: Logs match! Append new entries
if health_check_arguments.entries:
insert_index = health_check_arguments.prev_log_index + 1
self.raft_terms.logs = self.raft_terms.logs[:insert_index]
for entry in health_check_arguments.entries:
self.raft_terms.logs.append(entry)
self._persist_log_entry(entry)
print(f"Follower appended {len(health_check_arguments.entries)} entries")
return {"success": True, "term": self.raft_terms.current_term}
How it resolves conflicts:
Scenario: Follower has [entry1_term2, entry2_term2] but leader says prev_log_index=0, prev_log_term=3
Leader says: "Before appending new entries, check your entry at index 0"
Follower checks: "My entry at index 0 has term=2"
Leader claims: "term=3"
Mismatch! Follower realizes: "My logs diverged from leader"
Follower truncates: logs = [] (delete entry2)
Follower responds: "No, previous entry doesn't match"
Leader retries: "OK, try with earlier index"
Eventually leader finds match point (or follower has nothing)
Leader then replicates new entries
All logs converge
Why this works:
- Leader knows its own log history exactly
- Sends prev_log_index and prev_log_term with each RPC
- If mismatch: follower deletes conflicting entries
- Leader backs up and retries with earlier indices
- Eventually logs converge (all followers have leaderβs log)
- Guarantee: All servers eventually have identical logs
Real-World Impact
Without this: After network partition, some nodes have different data. With this: Logs converge automatically, consistency restored.
Failure #4: Uncommitted Data Loss - Node Crashes Before Applying to KV Store
The Problem
Data gets committed (majority replicated), but:
- Follower receives it
- Follower crashes before applying to KV store
- Follower restarts
- Did it recover the committed entry? Or is it lost?
If lost: KV store on this node diverges from others. Inconsistency.
How My Code Prevents It
The answer is persist logs to disk before applying, and recover logs on restart.
def _persist_log_entry(self, log_entry):
"""Persist a single log entry to the LOG file (append-only)"""
try:
log_index = len(self.raft_terms.logs) - 1
with open(self.raft_terms.logs_file_path, 'a') as f:
entry = {
"index": log_index,
"term": log_entry["term"],
"command": log_entry["command"]
}
f.write(json.dumps(entry) + '\n') # β Persisted to disk
except Exception as e:
print(f"Error persisting log entry: {e}")
When node restarts:
def _load_persistent_state(self):
"""Load persistent state from disk"""
if not os.path.exists(self.raft_terms.logs_file_path):
return
# Load logs from disk
with open(self.raft_terms.logs_file_path, 'r') as f:
for line in f:
entry = json.loads(line)
self.raft_terms.logs.append({
"term": entry["term"],
"command": entry["command"]
})
# Recover metadata
if self.raft_terms.logs:
self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
self.raft_terms.last_log_term = self.raft_terms.logs[-1]["term"]
Then recover the state machine:
def _recover_state_machine_from_log(self):
"""
After loading logs from disk, re-apply all committed entries
to state machine (KV store)
"""
# Re-apply all committed entries
while self.raft_terms.last_applied < self.raft_terms.commit_index:
self.raft_terms.last_applied += 1
log_entry = self.raft_terms.logs[self.raft_terms.last_applied]
# Re-apply to KV store
if self.state_machine_applier:
self.state_machine_applier.apply(log_entry["command"])
Why this works:
Scenario: Follower crashes with 10 committed entries
Before crash:
logs (disk): [entry1, entry2, ..., entry10]
KV store: {data from applied entries}
commit_index: 9 (committed up to entry9)
last_applied: 7 (only applied up to entry7)
Node crashes
Node restarts:
Load logs from disk: logs = [entry1, entry2, ..., entry10]
Load commit_index: commit_index = 9
last_applied = 7
Recovery loop:
While last_applied (7) < commit_index (9):
last_applied = 8
Apply logs[8] to KV store
last_applied = 9
Apply logs[9] to KV store
KV store now has entries 1-9 applied
Matches the other nodes
Guarantee: Even if node crashes mid-way through applying, it recovers and completes on restart.
Real-World Impact
Without this: After node crash, some replicas have different state. With this: All replicas recover to identical state automatically.
Results: What I Built
The 3-Node Cluster
Production Features Implemented
KV Store with Advanced Features:
# Basic operations
set(key="user1", field="name", value="Alice", timestamp=1234567890, ttl=3600)
get(key="user1", field="name", timestamp=1234567890)
delete(key="user1", field="email", timestamp=1234567890)
# TTL (Time-To-Live) Support
# Fields expire after TTL seconds
def is_alive(self, timestamp: int) -> bool:
if self.ttl is None:
return True
return timestamp <= self.created_at + self.ttl
# Scan operations
scan(key="user1", timestamp=1234567890) # Get all fields for a key
scan_by_prefix(key="user1", prefix="email_", timestamp=1234567890) # Prefix matching
Real-Time Dashboard:
- Shows 3-node cluster status
- Live node state (LEADER/FOLLOWER)
- Term and vote information
- Log entries with commit status
- Elections in progress
- Heartbeat activity
- State machine changes
WebSocket Visualization:
# Real-time events broadcast to connected browsers
await ws_manager.broadcast_heartbeat(...) # Leader sending heartbeats
await ws_manager.broadcast_election_result(...) # New leader elected
await ws_manager.broadcast_log_entry(...) # Entry appended
await ws_manager.broadcast_entries_committed(...) # Majority replicated
await ws_manager.broadcast_kv_store_update(...) # Applied to KV
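ws_manager is the project's own WebSocket manager; its internals aren't shown in this post. For readers who haven't built one, a generic FastAPI connection manager with a broadcast helper looks roughly like this (a sketch, not the repository's actual class):
from typing import List

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

class ConnectionManager:
    """Track connected dashboards and push RAFT events to all of them."""
    def __init__(self):
        self.active: List[WebSocket] = []

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active.append(websocket)

    def disconnect(self, websocket: WebSocket):
        self.active.remove(websocket)

    async def broadcast(self, event: dict):
        # Send the same JSON event (heartbeat, election, commit, ...) to every client
        for ws in list(self.active):
            await ws.send_json(event)

manager = ConnectionManager()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await manager.connect(websocket)
    try:
        while True:
            await websocket.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        manager.disconnect(websocket)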
Getting Started: Run This Locally
Want to see RAFT in action on your machine?
Prerequisites
# Python 3.8+
python --version
# Install dependencies
python3 -m venv raft-venv
source raft-venv/bin/activate
pip install rpyc fastapi uvicorn python-multipart
All dependencies from requirements.txt:
rpyc # Remote procedure calls
fastapi # API server
uvicorn # ASGI server
python-multipart # Form handling
Step 1: Start the RAFT Cluster
# In terminal 1
python3 start_cluster.py
# In terminal 2
cd client/raft-visualization
npm run dev
This starts:
- 3-node RAFT cluster (A, B, C)
- FastAPI server on http://localhost:8765
- WebSocket server on ws://localhost:8765/ws
Youβll see:
RAFT Cluster with WebSocket Server
WebSocket Port: 8765
React should connect to: ws://localhost:8765/ws - http://localhost:5173/
RAFT Cluster Startup - Phase 1: Initialize RPC Servers
...
RAFT Cluster Startup - Phase 2: Begin Leader Election
...
Node A elected as LEADER in term 1
Step 2: Write Data (Make Requests)
# In terminal 2, use curl to write data
# Set a key-value pair
curl -X POST http://localhost:8765/kv-store \
-H "Content-Type: application/json" \
-d '{
"type": "set",
"command": "SET",
"key": "user1",
"field": "name",
"value": "Alice"
}'
Response:
{
"success": true,
"message": "KV store entry submitted for replication",
"key": "user1",
"field": "name",
"value": "Alice",
"timestamp": 1704067890000000,
"log_index": 0,
"term": 1
}
# Read it back
curl "http://localhost:8765/kv-store?key=user1&field=name"
Response:
{
"success": true,
"message": "Data retrived successfully",
"value": "Alice"
}
# Delete
curl -X DELETE "http://localhost:8765/kv-store?key=user1&field=name"
Step 3: Visualize in Real-Time
Connect the included React dashboard to ws://localhost:8765/ws and watch:
- Heartbeats flowing (every 3sec)
- Elections triggering (when leader dies)
- Log entries replicating (client writes)
- State machine applying (entries commit)
Key Learnings & Lessons
After building this, here's what I discovered:
1. Consensus is Harder Than It Looks
The RAFT paper is 13 pages. I thought: "How hard can it be?"
Very hard. The subtleties:
- Log matching property: Followers must check prev_log_index and prev_log_term before appending. Without this, logs diverge.
- Commit safety: Only apply when a majority replicates. One node applying uncommitted data = consistency violation.
- Tie-breaking: Randomized election timeouts prevent split votes. This one line of randomness (5000-10000ms) is critical.
2. Small Bugs Cascade into Catastrophic Failures
I had a bug once: not resetting the election timer when becoming follower. Result?
- Node gets heartbeat from leader
- Becomes follower
- But doesn't reset the timer
- Timer fires after 3000ms
- Node starts election
- Split brain between the old and new leader
This bug showed me: RAFT is elegant because it prevents these cascades. The algorithm is designed so small mistakes donβt compound.
3. Threading Requires Discipline
I released locks before blocking operations (RPC calls, disk I/O). Why?
If I held the lock during RPC:
- Node sends RequestVote to peer
- Peer is slow (network latency)
- Lock is held for seconds
- Other threads (heartbeat, RPC handlers) starve
- Everything slows down
By releasing the lock:
- Lock held for microseconds (just state modification)
- RPC happens without lock
- Other threads proceed
- System stays responsive
This pattern (critical section, then blocking call, then critical section again) is subtle but essential.
4. Persistence Prevents Disasters
I persist to disk before:
- Voting in elections (can't vote twice)
- Appending entries (can't lose data on crash)
Without this: Node crashes, restarts, votes again in same term = election breaks.
5. Real-Time Visualization is a Superpower
Building the WebSocket dashboard made RAFT visible. Watching:
- Leader send heartbeats
- Followers append entries
- CommitIndex advance
- State machine apply
...made the abstract algorithm concrete. Problems jumped out immediately.
Resources & References
If you want to dive deeper into RAFT and distributed systems, here are the resources that shaped this project:
My Implementation
Complete RAFT Implementation: GitHub Repository
- Full source code for the 3-node cluster, KV store, tests, and WebSocket visualization. Feel free to fork, study, or build upon it.
Video Walkthrough: Click here to view a short video of my implementation
- A 1-minute demo showing the cluster in action: leader election, log replication, applying to state machine
Learning Materials
- MIT 6.824: Distributed Systems (YouTube Playlist): Watch the course. This course was my primary inspiration. Professor Robert Morris explains distributed systems with clarity and depth. If you're serious about understanding consensus, replication, and fault tolerance, this is mandatory viewing.
Technical References
- The RAFT Consensus Algorithm Paper: Read the paper. The original paper by Diego Ongaro and John Ousterhout. It's remarkably readable compared to other consensus algorithms. The visualizations alone are worth studying.
Let's Discuss
Got questions about RAFT? Spotted an edge case I missed? Want to discuss consensus algorithms, distributed systems, or database architecture?
I'd love to hear from you. Drop a comment below or reach out on LinkedIn.
One last thing: Building RAFT taught me that learning never truly ends in software engineering. There's always a deeper level to understand, a new failure scenario to handle, another optimization to explore.
So here's to the next challenge. See you there.