A precise overview of leader–replica replication, PSYNC-based resynchronisation, backlog buffering, and Sentinel-driven failover, explaining how modern in-memory systems maintain availability and data continuity during network failures and node crashes, as exemplified by Redis.

High availability in distributed systems is not achieved by replication alone, but by replication that can survive disconnections, partial failures, and leadership changes without data loss or long recovery pauses. The leader–replica replication model establishes the foundational contract: a single authoritative write source with one or more replicas continuously applying changes. While conceptually simple, this model becomes fragile under real network conditions where replicas disconnect, fall behind, or rejoin with inconsistent state.

This fragility is addressed through a tightly coupled set of algorithms. The PSYNC replication protocol replaces naive full synchronization with a state-aware mechanism that determines whether a replica can safely continue from its previous position or must perform a full resync. Partial resynchronization further narrows this window by allowing replicas to catch up using only the missing command stream, rather than retransferring the entire dataset. The replica backlog buffer makes this possible by retaining a bounded, in-memory history of recent write commands, effectively acting as a recovery window for transient failures.

Finally, availability is meaningless without automated recovery from leader failure. The failover election algorithm, as implemented by Sentinel, coordinates multiple observers to detect leader unavailability, reach quorum, elect a new leader, and reconfigure replicas — all while minimizing split-brain risk and write inconsistency. Together, these mechanisms form a replication system designed not for ideal networks, but for failure as the normal operating condition.

Leader–replica replication model (Redis)

A single authoritative writer (the leader) executes all mutations on a mutable in-memory dataset. The system must publish those mutations so that one or more replicas can reconstruct the same state by replay, while the leader continually accepts new writes without blocking.
The design must satisfy:

- every mutation is assigned a single place in a total order and replicas apply mutations in that exact order;
- replicas deterministically reach the same logical state as the leader for any prefix of the mutation stream they have applied;
- replicas that disconnect may later reconnect and, when possible, resume by replaying only the missing suffix (avoiding full state transfer);
- replication work must impose bounded, predictable memory/CPU/network costs on the leader (backlog size, per-replica buffers, snapshot costs); and
- the system must provide explicit, safe recovery semantics for leader crashes/restarts and network partitions.

This is fundamentally a problem about externalizing the leader's timeline of mutations as a replayable, bounded stream while protecting the leader's ability to make forward progress.

Core invariants (must hold in every execution)

- Total write order: the leader produces a strictly monotonic offset sequence 1..N for committed mutations; replicas must apply entries strictly by offset.
- Deterministic application: any nondeterministic effect is resolved at the leader and published as concrete state to be applied by replicas.
- History identity: offsets are scoped to a leader_id; offsets from different leader_id values are not directly comparable.
- Bounded replication cost: backlog_size, per_replica_outbuf_limit, and snapshot_budget are configured and enforced by the leader; replication cannot cause unbounded memory or block the writer.
- Safe resume: on reconnect the leader atomically decides whether the replica can resume from its last offset given the current backlog and leader_id or must perform a full state bootstrap.

Minimal primitives the design must implement

- Replication entry: {leader_id, offset, payload, entry_checksum}.
- Backlog: circular buffer indexed by offset, supporting read_from(offset, max_bytes).
- Per-replica outbuffer: bounded FIFO of bytes for nonblocking write scheduling.
- Snapshot exporter/streamer: produce a point-in-time snapshot at offset_snap and stream it chunked to replicas while the leader continues to append to the backlog.
- Event-driven, non-blocking I/O primitives: readiness notifications, partial writes, epoll/kqueue semantics.
- Rolling and full snapshot checksums: for divergence detection.
- Observability: metrics for backlog fill, outbuffer sizes, resume success rate, full-sync frequency.

Phase 0 — initialization (leader)

1. Generate leader_id (a unique token for the lifetime of this process).
2. Initialize current_offset = 0.
3. Create Backlog(max_bytes = backlog_size) and reserve the per-replica buffer budget.
4. Start the event loop for nonblocking I/O and background workers (snapshotter, checksum worker).

Phase 1 — write path (single writer)

On each write: assign the next offset, append to the backlog, enqueue the serialized entry to each replica outbuffer in a nonblocking O(1) step, and persist durable markers if configured.

Pseudocode: leader apply write

    function leader_apply_local_write(command):
        current_offset += 1
        entry = make_entry(leader_id, current_offset, command)
        backlog.append(entry)   // evict oldest entries deterministically if needed
        for each replica in connected_replicas:
            if replica.connected:
                ok = replica.outbuffer.enqueue(entry.serialized)
                if not ok:
                    handle_replica_backpressure(replica)
        // optionally persist the entry or a durable offset marker here
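To make the Phase 1 write path concrete, here is a minimal Python sketch under stated assumptions: the Leader and ReplicaLink classes, OUTBUF_LIMIT, and handle_backpressure are illustrative names rather than Redis internals, the backlog is a plain list, and eviction is deferred to Phase 2.

    # Minimal sketch of the single-writer write path: assign an offset, record the entry,
    # and fan it out to bounded per-replica outbuffers without ever blocking the writer.
    from collections import deque
    from dataclasses import dataclass, field

    OUTBUF_LIMIT = 64  # hypothetical per-replica outbuffer budget (entries, not bytes, for brevity)

    @dataclass
    class ReplicaLink:
        name: str
        connected: bool = True
        outbuffer: deque = field(default_factory=deque)

        def enqueue(self, payload: bytes) -> bool:
            # Non-blocking enqueue; refuse once the bounded budget is exhausted.
            if len(self.outbuffer) >= OUTBUF_LIMIT:
                return False
            self.outbuffer.append(payload)
            return True

    class Leader:
        def __init__(self, leader_id: str):
            self.leader_id = leader_id
            self.current_offset = 0
            self.backlog: list[tuple[int, bytes]] = []   # stands in for the bounded circular backlog
            self.replicas: list[ReplicaLink] = []

        def apply_local_write(self, command: bytes) -> int:
            self.current_offset += 1
            self.backlog.append((self.current_offset, command))   # eviction handled in Phase 2
            for replica in self.replicas:
                if replica.connected and not replica.enqueue(command):
                    self.handle_backpressure(replica)
            return self.current_offset

        def handle_backpressure(self, replica: ReplicaLink) -> None:
            # Placeholder for the configured policy: drop, throttle, or alert.
            replica.connected = False

    leader = Leader("leader-1")
    leader.replicas.append(ReplicaLink("replica-a"))
    leader.apply_local_write(b"SET k v")

The property worth noticing is that a full outbuffer never blocks the writer; the failure is routed to a policy hook instead.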
Phase 2 — backlog behavior (append/evict/read)

The backlog holds a contiguous window [start_offset .. end_offset]. Eviction is deterministic: when appending would exceed max_bytes, evict the oldest entry(ies) and advance start_offset. The leader exposes start_offset and end_offset for resume checks.

    class Backlog:
        constructor(max_bytes):
            allocate circular buffer of max_bytes
            start_offset = 1
            end_offset = 0

        function append(entry):
            while free_bytes() < entry.length:
                evict_oldest_entry()
            write_entry(entry)
            end_offset = entry.offset

        function read_from(offset, max_bytes):
            if offset < start_offset or offset > end_offset:
                return ERROR_OFFSET_NOT_AVAILABLE
            return read_serialized_by_offset(offset, max_bytes)

        function evict_oldest_entry():
            oldest = read_entry_at_offset(start_offset)
            advance_head_by(oldest.length)
            start_offset += 1

Phase 3 — reconnect logic (resume vs full bootstrap)

When a replica reconnects it provides (last_leader_id, last_offset) persisted locally. The leader executes the atomic decision algorithm:

Pseudocode: resume decision

    function decide_resume(replica_last_leader_id, replica_last_offset):
        if replica_last_leader_id != leader_id:
            return {resume_allowed: false, reason: "leader_id_changed"}
        if replica_last_offset > current_offset:
            return {resume_allowed: false, reason: "offset_in_future"}
        if replica_last_offset + 1 < backlog.start_offset:
            return {resume_allowed: false, reason: "backlog_evicted"}
        return {resume_allowed: true, resume_from: replica_last_offset + 1}

If resume_allowed is true, the leader will stream the suffix from resume_from using backlog reads. If false, the leader initiates snapshot streaming beginning at some offset_snap (commonly current_offset at snapshot start) and then streams the suffix starting after offset_snap.

Phase 4 — nonblocking replica send loop

Each replica has an I/O loop that writes from its outbuffer to the socket with partial writes and readiness notifications. Outbuffer size is strictly bounded. Exceeding the bound for longer than a configured grace time triggers the configured enforcement action (drop replica, throttle writers, alert).

Pseudocode: replica I/O loop

    function replica_io_loop(replica):
        while replica.connected:
            if socket_ready_for_write(replica.socket) and not replica.outbuffer.empty():
                chunk = replica.outbuffer.peek()
                n = nonblocking_write(replica.socket, chunk)
                if n > 0:
                    replica.outbuffer.consume(n)
            if replica.outbuffer.size() > per_replica_outbuf_limit:
                if time_since_exceed(replica) > outbuf_grace_seconds:
                    enforce_backpressure_policy(replica)
            wait_for_event_or_enqueue_notification(replica)

Phase 5 — snapshot streaming (full bootstrap)

The snapshot must represent the leader state at offset_snap and be accompanied by a clear plan for the suffix. Use fork+CoW or a background snapshot generator. The leader must ensure offset_snap is preserved in the backlog or recorded as the snapshot anchor.

Pseudocode: snapshot streamer

    function stream_snapshot_to_replica(replica):
        offset_snap = current_offset   // atomically captured
        pid = fork()
        if pid == 0:
            // child: produce snapshot
            for chunk in produce_snapshot_chunks():
                write_chunk_to_socket(replica.socket, chunk)
            write_snapshot_end_marker(offset_snap, snapshot_checksum)
            exit(0)
        // parent continues to accept writes; backlog retains entries beyond offset_snap
        monitor_child_snapshot(pid, replica)

After the snapshot completes, the leader streams the suffix starting at offset_snap + 1 (which must be present or produced in the backlog).
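For Phase 5, a minimal sketch of fork-based snapshot streaming, assuming a Unix host and JSON as a stand-in serialization; send_frame and stream_snapshot are hypothetical helpers, not Redis APIs.

    # Fork a child that streams a point-in-time (copy-on-write) snapshot while the
    # parent keeps accepting writes; the end marker carries the anchor offset.
    import json
    import os
    import socket
    import struct

    def send_frame(sock: socket.socket, payload: bytes) -> None:
        # Length-prefixed frame so the replica knows where each piece ends.
        sock.sendall(struct.pack(">I", len(payload)) + payload)

    def stream_snapshot(dataset: dict, current_offset: int, sock: socket.socket) -> int:
        offset_snap = current_offset              # suffix replay will start at offset_snap + 1
        pid = os.fork()
        if pid == 0:                              # child process
            try:
                send_frame(sock, json.dumps(dataset).encode())                        # snapshot body
                send_frame(sock, json.dumps({"offset_snap": offset_snap}).encode())   # anchor marker
            finally:
                os._exit(0)                       # never fall back into the parent's event loop
        return pid                                # parent monitors the child and keeps writing

    if __name__ == "__main__":
        parent_sock, replica_sock = socket.socketpair()
        pid = stream_snapshot({"user:1": "ada"}, current_offset=42, sock=parent_sock)
        for _ in range(2):                        # replica side: snapshot body, then anchor marker
            size = struct.unpack(">I", replica_sock.recv(4))[0]
            print(json.loads(replica_sock.recv(size)))
        os.waitpid(pid, 0)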
Phase 6 — checksums and divergence handling

Periodic rolling hashes of the stream or full-snapshot checksums detect silent divergence. If a mismatch is detected at offset k, force a full bootstrap for that replica and alert operators.

Pseudocode: rolling checksum compare

    function compute_rolling_hash(up_to_offset):
        H = INITIAL
        for chunk_offset from 1 to up_to_offset step CHUNK:
            entries = backlog.read_from(chunk_offset, BYTES_LIMIT)
            for entry in entries:
                H = hash_combine(H, entry.entry_checksum, entry.offset)
        return H

    function verify_replica_checksum(replica, offset, replica_hash):
        leader_hash = compute_rolling_hash(offset)
        if leader_hash != replica_hash:
            mark_replica_divergent(replica, offset)
            schedule_snapshot(replica)
            alert_operator(replica, offset)

Phase 7 — enforcement / failure reactions

The leader must choose and enforce one of these deterministic policies:

- Drop: disconnect the replica when its outbuffer limit is exceeded.
- Throttle: apply write-throttling to local writes (dangerous for latency).
- Backlog expansion request: log a suggestion to the operator; do not auto-expand unless the operator allows.

Pseudocode: backpressure enforcement

    function enforce_backpressure_policy(replica):
        if policy == "drop":
            close_connection(replica, reason="outbuf_overflow")
        else if policy == "throttle":
            enable_local_write_throttling()
        else:
            log("operator_action_required: increase backlog or reduce replica count")
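A small sketch of the Phase 6 rolling hash, assuming each backlog entry already carries an entry_checksum; hash_combine is realised here with SHA-256 and all names are illustrative.

    import hashlib

    def hash_combine(running: bytes, entry_checksum: bytes, offset: int) -> bytes:
        h = hashlib.sha256()
        h.update(running)
        h.update(offset.to_bytes(8, "big"))
        h.update(entry_checksum)
        return h.digest()

    def compute_rolling_hash(entries: list[tuple[int, bytes]], up_to_offset: int) -> bytes:
        # entries: (offset, entry_checksum) pairs in offset order, as retained by the backlog.
        running = b"\x00" * 32
        for offset, checksum in entries:
            if offset > up_to_offset:
                break
            running = hash_combine(running, checksum, offset)
        return running

    def verify_replica_checksum(leader_entries, offset: int, replica_hash: bytes) -> bool:
        # A mismatch means the replica is marked divergent and scheduled for a full bootstrap.
        return compute_rolling_hash(leader_entries, offset) == replica_hash

    entries = [(1, b"\x01"), (2, b"\x02"), (3, b"\x03")]
    print(verify_replica_checksum(entries, 2, compute_rolling_hash(entries[:2], 2)))  # True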
Uncompromising logic and trade-off analysis

Single-writer total order is mandatory. Any design allowing independent writes on two nodes requires reconciliation logic and breaks simple replay semantics. The design therefore enforces single-writer semantics and treats leader identity as authoritative.

Leader latency must be sacrosanct. Allowing replication to stall the leader compromises correctness and client experience. Therefore prefer policies that disconnect slow replicas rather than block the leader. Operators who want stronger durability must provision a larger backlog and better network links.

Backlog is a hard engineering knob. Increasing the backlog reduces the frequency of full snapshot transfers but increases memory and snapshot CoW pressure. Compute the backlog from p95 observed write rates and the target retention window, using the formula backlog_bytes = rate_cmds_per_sec × avg_serialized_bytes × retention_seconds. Use peak metrics for safety.

Nondeterminism must be resolved at the leader. Replicating instructions that produce differing results on different machines is an invitation to silent divergence. Evaluate at the leader and publish concrete payloads.

Snapshot is unavoidable for long outages or leader restarts. Treat the snapshot as first-class: ensure snapshot streaming is nonblocking and that the suffix anchor offset is registered atomically so replicas can deterministically resume after the snapshot.

Checksums are guardrails, not substitutes for correctness. Periodic rolling checks detect problems; on mismatch force a full bootstrap and investigate; do not attempt automatic merge.

Security and resource limits are operational constraints. TLS and auth are recommended but increase CPU; define a maximum number of replicas or offload TLS if necessary.

Expected outcomes (what the system guarantees when configured & operated correctly)

- replicas that reconnect within the backlog window resume by applying only the missing suffix and converge to the leader state;
- the leader never stalls due to replication activity under the configured enforcement policy;
- full snapshot transfers occur only when unavoidable (leader change or backlog eviction) and are streamed without stopping the leader;
- divergence is detected via checksums and causes safe remediation (snapshot + operator alert); and
- operators can compute numeric budgets for the backlog, per-replica buffers, and snapshot memory from measured rates.

Operational controls and numeric example

- Measure the p95 write rate and average serialized command size; choose a retention window in seconds; compute backlog bytes using the formula above.
- Set per_replica_outbuf_limit by expected network RTT and desired replica slack (common default: 16 MiB).
- Set outbuf_grace_seconds (default: 30 s).
- The snapshot memory budget should be based on expected CoW growth during the snapshot (keep below ~20% of free RAM unless using an on-disk backlog).

Example quick calculation: p95 = 2,500 cmd/s, avg size = 300 bytes → 750,000 B/s (≈ 0.75 MB/s). For 10 minutes of retention: 0.75 MB/s × 600 s = 450 MB of backlog.
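The sizing rule above can be wrapped in a trivial helper; the numbers reproduce the quick calculation in the text.

    # backlog_bytes = rate_cmds_per_sec × avg_serialized_bytes × retention_seconds
    def backlog_bytes(rate_cmds_per_sec: float, avg_serialized_bytes: float, retention_seconds: float) -> int:
        return int(rate_cmds_per_sec * avg_serialized_bytes * retention_seconds)

    if __name__ == "__main__":
        size = backlog_bytes(2_500, 300, 10 * 60)
        print(f"{size} bytes ≈ {size / 1_000_000:.0f} MB")   # 450000000 bytes ≈ 450 MB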
Redis operates as a single-writer system where all mutations must be totally ordered to preserve correctness. At the same time, it must support horizontal read scalability, redundancy, and fast recovery from failures. The challenge is to replicate a mutable in-memory dataset from one authoritative node to one or more replicas such that:

- all replicas converge to the same state as the leader;
- write ordering is preserved exactly;
- replicas can serve reads without violating the consistency guarantees chosen by the system;
- replicas can disconnect and later rejoin without full data loss when possible;
- replication overhead does not block or stall the leader's event loop;
- the system can recover from crashes, restarts, and network partitions.

A naïve approach (periodically copying the entire dataset) is too expensive and introduces long pauses and large bandwidth spikes. The system therefore needs a continuous replication mechanism that streams changes incrementally, preserves ordering, and supports recovery from partial failures.

The leader–replica replication model solves this by designating a single leader as the source of truth for all writes and streaming a deterministic sequence of state changes to replicas, which apply those changes in the same order.

Core invariants and guarantees

- Single writer: only the leader accepts write commands that mutate state; replicas never generate writes.
- Total order of writes: every mutation is assigned a strictly increasing replication offset that defines a global order.
- Deterministic replay: replicas apply the same write stream in the same order, guaranteeing state convergence.
- Append-only propagation: replication traffic is a sequential stream of commands or state deltas, never random-access updates.
- Replica lag tolerance: replicas may be behind the leader but never ahead.
- Crash safety boundary: replicas acknowledge replication offsets so the leader can reason about what data has been received.
- Resynchronization paths: replicas that fall too far behind must perform a full resynchronization; otherwise, incremental resync is used.

The model is built from a few cooperating pieces:

- The leader node maintains the authoritative dataset and executes all write commands.
- Each write command produces a serialized representation suitable for replay.
- The leader maintains a replication stream buffer indexed by offsets.
- Replicas establish a replication connection and consume the stream sequentially.
- Replicas maintain their own replication offset to track progress.

    STRUCT Leader:
        dataset
        repl_offset
        repl_backlog
        replicas

    STRUCT Replica:
        dataset
        master_offset
        state

    FUNCTION init_leader():
        L = new Leader
        L.dataset = empty_dataset()
        L.repl_offset = 0
        L.repl_backlog = empty_log()
        L.replicas = empty_list()
        RETURN L

    FUNCTION init_replica():
        R = new Replica
        R.dataset = empty_dataset()
        R.master_offset = 0
        RETURN R

    FUNCTION leader_execute_write(L, command):
        apply_command(L.dataset, command)
        serialized = serialize(command)
        L.repl_offset = L.repl_offset + length(serialized)
        append_to_backlog(L.repl_backlog, serialized, L.repl_offset)
        FOR each R IN L.replicas:
            send_to_replica(R, serialized)
        RETURN L

    FUNCTION replica_receive(R, serialized, new_offset):
        apply_serialized(R.dataset, serialized)
        R.master_offset = new_offset
        RETURN R

    FUNCTION replica_connect(R, L):
        send_psync_request(R, R.master_offset)
        L = leader_handle_psync(L, R)
        RETURN (R, L)

    FUNCTION leader_handle_psync(L, R):
        IF backlog_can_satisfy(L.repl_backlog, R.master_offset):
            data = backlog_from_offset(L.repl_backlog, R.master_offset)
            send_to_replica(R, data)
            add_replica(L.replicas, R)
        ELSE:
            snapshot = create_snapshot(L.dataset)
            send_snapshot(R, snapshot)
            R.dataset = load_snapshot(snapshot)
            R.master_offset = L.repl_offset
            add_replica(L.replicas, R)
        R.state = "online"

    FUNCTION replica_disconnect(R):
        R.state = "disconnected"

    FUNCTION leader_remove_replica(L, R):
        remove_replica(L.replicas, R)

    FUNCTION replica_apply_stream(R, stream):
        FOR each entry IN stream:
            apply_serialized(R.dataset, entry.data)
            R.master_offset = entry.offset

    FUNCTION leader_trim_backlog(L):
        min_offset = minimum_replica_offset(L.replicas)
        trim_backlog_before(L.repl_backlog, min_offset)
        RETURN L
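A compact, hypothetical Python rendering of this model: offsets advance by the byte length of each serialized command, every write fans out to connected replicas, and the backlog is trimmed no further than the slowest replica's acknowledged offset.

    class Replica:
        def __init__(self):
            self.dataset: dict[str, str] = {}
            self.master_offset = 0

        def receive(self, key: str, value: str, new_offset: int) -> None:
            self.dataset[key] = value
            self.master_offset = new_offset

    class Leader:
        def __init__(self):
            self.dataset: dict[str, str] = {}
            self.repl_offset = 0
            self.repl_backlog: list[tuple[int, str, str]] = []   # (end_offset, key, value)
            self.replicas: list[Replica] = []

        def execute_write(self, key: str, value: str) -> None:
            self.dataset[key] = value
            serialized = f"SET {key} {value}"
            self.repl_offset += len(serialized)       # byte-length accounting, as in the pseudocode
            self.repl_backlog.append((self.repl_offset, key, value))
            for r in self.replicas:
                r.receive(key, value, self.repl_offset)

        def trim_backlog(self) -> None:
            # Keep only entries not yet acknowledged by the slowest replica.
            if not self.replicas:
                return
            min_offset = min(r.master_offset for r in self.replicas)
            self.repl_backlog = [e for e in self.repl_backlog if e[0] > min_offset]

    leader = Leader()
    leader.replicas.append(Replica())
    leader.execute_write("user:1", "ada")
    leader.trim_backlog()
    print(leader.repl_offset, len(leader.repl_backlog))   # offset advanced; fully-acked entries trimmed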
Execution flow (operational sequence)

1. Initialization: the leader starts with an empty dataset and replication offset set to zero; replicas start disconnected with offset zero.
2. Write propagation: each write executed on the leader mutates the dataset, is serialized, assigned a new replication offset, appended to the backlog, and streamed to all connected replicas.
3. Replica application: replicas apply incoming serialized commands in order and advance their local offset.
4. Replica connection: when a replica connects, it sends its last known offset. If the leader's backlog still contains data after that offset, incremental resynchronization is used; otherwise a full snapshot is sent.
5. Disconnection and recovery: if a replica disconnects temporarily, it can later reconnect and resume from its last acknowledged offset if backlog coverage allows.
6. Backlog maintenance: the leader periodically trims the replication backlog based on the slowest replica to bound memory usage.

System-level rationale and trade-offs

- Single-writer discipline simplifies consistency by avoiding write conflicts and distributed consensus for normal operation.
- Streaming serialized commands rather than state diffs keeps replication simple and deterministic but couples replicas tightly to the leader's execution order.
- Replication offsets provide a linear history that enables precise recovery, lag measurement, and backlog trimming.
- Incremental resynchronization minimizes bandwidth and recovery time for transient failures, while full resynchronization provides a correctness fallback.
- Replicas can serve reads to offload the leader, but read-after-write consistency depends on replica lag and client routing.
- The model favors availability and simplicity over multi-leader write scalability; promotion and failover are handled by separate control-plane mechanisms.

Replication PSYNC algorithm

In a leader–replica replication system, replicas are not continuously connected. They may disconnect due to network interruptions, restarts, crashes, or temporary overload. When a replica reconnects, replication must continue in a way that preserves correctness.

Two fundamentally different recovery paths exist:

- Incremental resynchronisation — resume replication from the exact point where the replica stopped.
- Full resynchronisation — discard replica state and transfer the entire dataset again (snapshot plus subsequent mutations).

Incremental resynchronisation is desirable because it is cheap. Full resynchronisation is always safe but expensive. However, resuming without proof of compatibility risks silent divergence, corrupted state, and irreversible inconsistency.

The system must therefore answer a precise question at reconnect time: is the replica's current state still a valid prefix of the leader's mutation history?

This question must be answered:

- without inspecting dataset contents;
- without blocking the leader's write path;
- with a deterministic outcome;
- with correctness taking precedence over efficiency.

PSYNC is the algorithm that answers this question. Given a reconnecting replica that reports its last known replication identity and offset, determine whether replication can safely resume from that offset or must restart from a full dataset transfer, using a constant-time, deterministic decision procedure.

The safety conditions are strict:

- A replica may resume only if it belongs to the same mutation history as the leader.
- A replica may resume only if its applied state is a true prefix of the leader's current history.
- Every mutation after the replica's offset must still be available at the leader.
- If safety cannot be proven, incremental resynchronization is forbidden.
- The decision procedure must not delay or block ongoing leader operations.

The decision consumes only replication metadata: the replication identity, the replica's reported offset, the leader's current_replication_offset, and the backlog bounds. No additional state is consulted during the decision.

Step 1 — Replica records progress

During normal replication, the replica continuously updates and persists:

- the replication ID of the history it is following;
- the last offset it has applied.

This persistent state represents the replica's claim about its position in the leader's history.

Step 2 — Replica reconnects and declares its claim

Upon reconnection, the replica sends:

- its stored replication ID;
- its stored replication offset.

In effect it declares: "I believe I am synchronized through offset X of history Y."

Step 3 — Leader validates history identity

The leader compares the replica's replication ID with its own. If they differ, the replica's state was produced under a different history. Offsets across different histories have no meaning. If identities differ, incremental resynchronisation is impossible.

Step 4 — Leader validates offset plausibility

If identities match, the leader checks that the replica offset does not exceed the leader's current offset. A replica claiming to be ahead is invalid and unsafe to resume.

Step 5 — Leader validates backlog coverage

If the offset is plausible, the leader checks whether it still retains the replica's missing history:

    replica_offset >= backlog_start_offset

If required mutations have been evicted, the history suffix is incomplete and resumption is unsafe.

Step 6 — Deterministic outcome

If all checks pass → resume from replica_offset + 1. Otherwise → perform full resynchronisation. There is no intermediate or probabilistic outcome.

    function psync_decide(replica_repl_id, replica_repl_offset,
                          leader_repl_id, leader_current_offset,
                          backlog_start_offset):
        // Check history identity
        if replica_repl_id != leader_repl_id:
            return FULL_RESYNC("history_mismatch")
        // Check offset plausibility
        if replica_repl_offset > leader_current_offset:
            return FULL_RESYNC("invalid_future_offset")
        // Check history availability
        if replica_repl_offset < backlog_start_offset:
            return FULL_RESYNC("backlog_evicted")
        return PARTIAL_RESYNC(replica_repl_offset + 1)

Leader-side handling around this decision:

    FUNCTION backlog_can_serve(L, offset):
        IF offset >= L.backlog_start AND offset <= L.backlog_end:
            RETURN TRUE
        RETURN FALSE

    FUNCTION partial_resync_response(L, offset):
        stream = backlog_slice_from(L.backlog_data, offset)
        RETURN ("PSYNC_CONTINUE", L.replid, stream)

    FUNCTION full_resync_response(L):
        snapshot = generate_snapshot(L)
        RETURN ("PSYNC_FULL", L.replid, L.repl_offset, snapshot)

    FUNCTION replica_handle_psync_response(R, response):
        IF response.type == "PSYNC_CONTINUE":
            apply_stream(R, response.stream)
            R.known_replid = response.replid
            R.known_offset = last_offset_applied(R)
        IF response.type == "PSYNC_FULL":
            R.state = "loading"
            R.dataset = load_snapshot(response.snapshot)
            R.known_replid = response.replid
            R.known_offset = response.offset
        R.state = "online"

    FUNCTION leader_append_write(L, serialized_cmd):
        L.repl_offset = L.repl_offset + length(serialized_cmd)
        append_to_backlog(L, serialized_cmd)
        RETURN L

    FUNCTION append_to_backlog(L, data):
        append_bytes(L.backlog_data, data)
        L.backlog_end = L.repl_offset
        IF backlog_size_exceeded(L):
            trim_backlog(L)
        RETURN L

    FUNCTION trim_backlog(L):
        new_start = compute_new_backlog_start(L)
        discard_bytes_before(L.backlog_data, new_start)
        L.backlog_start = new_start
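A runnable sketch of the same decision procedure in Python; the Decision dataclass and the reason strings are illustrative, and only replication IDs, offsets, and the backlog bound participate.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Decision:
        full_resync: bool
        reason: str = ""
        resume_from: Optional[int] = None

    def psync_decide(replica_repl_id: str, replica_offset: int,
                     leader_repl_id: str, leader_offset: int,
                     backlog_start_offset: int) -> Decision:
        if replica_repl_id != leader_repl_id:
            return Decision(True, "history_mismatch")        # different mutation history
        if replica_offset > leader_offset:
            return Decision(True, "invalid_future_offset")   # replica claims to be ahead
        if replica_offset < backlog_start_offset:
            return Decision(True, "backlog_evicted")         # required suffix no longer retained
        return Decision(False, resume_from=replica_offset + 1)

    # Same history, offset still covered by the backlog -> partial resync from offset 901.
    print(psync_decide("abc123", 900, "abc123", 1_000, 500))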
Execution flow (operational sequence)

1. The replica reconnects and sends a PSYNC request containing its last known replication ID and offset.
2. The leader compares the requested replication ID against its current and previous replication IDs.
3. If the replication ID matches and the requested offset is still present in the backlog, the leader replies with a partial resynchronisation response and streams only the missing bytes.
4. If the replication ID does not match or the offset is no longer in the backlog, the leader replies with a full resynchronisation response, including a snapshot and the current replication offset.
5. The replica either applies the incremental stream or loads the snapshot, updates its replication metadata, and transitions to the online state.
6. Normal command streaming resumes from that point forward.

System-level rationale and trade-offs

- PSYNC minimises unnecessary full dataset transfers by reusing the replication backlog whenever possible.
- Replication IDs prevent unsafe resumption across divergent histories (for example, after failover or leader restart).
- Keeping a secondary replication ID allows safe resynchronisation across controlled leadership changes without forcing full resyncs.
- The backlog size defines a time window for recovery; larger backlogs increase memory usage but reduce the probability of full resynchronisation.
- The algorithm is strictly conservative: when in doubt, it forces a full resync to preserve correctness.
- PSYNC operates entirely on metadata and offsets before data transfer, making the decision fast and non-blocking.

Partial resynchronisation algorithm

In a leader–replica replication system, replicas frequently encounter short-lived disruptions such as brief network partitions, rapid process restarts, or transient load stalls. These disruptions interrupt the replication stream but do not automatically invalidate the replica's existing state.

Restarting replication with a full dataset transfer after every such interruption is prohibitively expensive. It triggers large bandwidth spikes, CPU contention, memory churn, and latency amplification, destabilizing the system and affecting unrelated workloads.

The system must therefore support conditional recovery:

- resume replication from a previously known-good position when it is safe;
- transfer only the mutations that the replica missed;
- guarantee that no mutation is skipped, duplicated, or reordered;
- detect situations where safe resumption cannot be proven and fall back to a full resynchronisation;
- make all decisions using minimal state and bounded memory;
- perform recovery without interfering with the leader's ability to process writes.

Partial resynchronisation is the mechanism that provides this conditional recovery by replaying a suffix of the replication stream from a bounded backlog under strict identity and offset constraints.

Given a reconnecting replica and a leader that retains only a finite window of recent replication history, determine whether replication can safely resume from a known offset and, if so, replay the missing portion of the stream exactly once; otherwise, enforce a full resynchronisation. The constraints mirror those of PSYNC:

- A replica may replay missing mutations only if its state was produced by the same mutation history as the leader's current state.
- The replica's state must correspond to a complete prefix of the leader's history, not an approximation.
- Every mutation following the replica's offset must still exist at the leader.
- Replay must begin at a precise offset boundary, not an inferred or approximate position.
- The backlog holding recent history must be strictly bounded and enforce eviction.
- Recovery logic must not stall, delay, or block the leader's main execution path.

The decision uses only the replication identity, the current_replication_offset, and a bounded backlog of replication entries. No dataset inspection, key comparison, or checksum computation participates in the decision itself.

Step-by-step solution with pseudocode

Step 1 — Replica tracks and persists progress

The replica must record its exact position after applying each mutation so that it can later prove where it stopped.

    function replica_apply_entry(entry):
        apply_to_local_state(entry)
        replica.replication_offset = entry.offset
        replica.replication_id = entry.replication_id
        persist(replica.replication_id, replica.replication_offset)

This guarantees the replica can always re-declare an exact point in history after restart.

Step 2 — Replica reconnects and declares state

Upon reconnection, the replica loads its persisted state and reports it verbatim.

    function replica_on_reconnect():
        load(replica.replication_id, replica.replication_offset)
        send_to_leader(replica.replication_id, replica.replication_offset)

This is a claim, not proof. The leader must validate it.

Step 3 — Leader validates history identity

The leader first checks whether the replica refers to the same mutation lineage.

    function check_history_identity(replica_id, leader_id):
        return replica_id == leader_id

If this fails, offsets lose all meaning. No incremental recovery is possible.

Step 4 — Leader validates offset plausibility

Even if history matches, the offset must describe a real point in that history.

    function check_offset_sanity(replica_offset, leader_current_offset):
        return replica_offset <= leader_current_offset

Step 5 — Leader validates backlog coverage

The leader then checks whether it still retains every mutation after the replica's offset.

    function check_backlog_coverage(replica_offset, backlog_start_offset):
        return replica_offset >= backlog_start_offset

If any required mutation was evicted, replay cannot be complete.

Step 6 — Deterministic recovery decision

All checks collapse into a single, deterministic outcome.

    function decide_partial_resync(replica_repl_id, replica_repl_offset,
                                   leader_repl_id, leader_current_offset,
                                   backlog_start_offset):
        if not check_history_identity(replica_repl_id, leader_repl_id):
            return FULL_RESYNC
        if not check_offset_sanity(replica_repl_offset, leader_current_offset):
            return FULL_RESYNC
        if not check_backlog_coverage(replica_repl_offset, backlog_start_offset):
            return FULL_RESYNC
        return PARTIAL_RESYNC(replica_repl_offset + 1)

There is no negotiation or fallback path.

Step 7 — Leader streams missing mutations

When partial resynchronisation is approved, the leader replays the missing suffix.

    function leader_stream_backlog(replica, resume_from):
        offset = resume_from
        while offset <= backlog_end_offset:
            entry = read_backlog_entry(offset)
            send_to_replica(replica, entry)
            offset = offset + 1

Replica backlog buffer

The bounded history that makes partial resynchronisation possible is a fixed-size, offset-indexed byte buffer maintained by the leader.

    FUNCTION backlog_covers(B, offset_req):
        IF offset_req >= B.start_offset AND offset_req <= B.end_offset:
            RETURN TRUE
        RETURN FALSE

    FUNCTION backlog_append(B, data):
        i = 0
        WHILE i < length(data):
            B.buffer[B.write_index] = data[i]
            B.write_index = (B.write_index + 1) % B.size
            B.end_offset = B.end_offset + 1
            IF B.end_offset - B.start_offset > B.size:
                B.start_offset = B.start_offset + 1
            i = i + 1
        RETURN B

    FUNCTION backlog_slice(B, from_offset):
        IF NOT backlog_covers(B, from_offset):
            RETURN ERROR
        RETURN bytes_in_range(B, from_offset, B.end_offset)

    FUNCTION backlog_trim(B, new_start_offset):
        IF new_start_offset > B.end_offset:
            B.start_offset = B.end_offset
            RETURN B
        B.start_offset = new_start_offset
        RETURN B
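A minimal Python sketch of such a fixed-size, offset-indexed backlog, assuming 1-based byte offsets; the class and method names are illustrative and the off-by-one conventions are simplified relative to any real implementation.

    class ReplBacklog:
        def __init__(self, size: int):
            self.size = size
            self.buffer = bytearray(size)
            self.write_index = 0
            self.start_offset = 1      # offset of the oldest byte still retained (1-based)
            self.end_offset = 0        # offset of the newest byte written

        def append(self, data: bytes) -> None:
            for b in data:
                self.buffer[self.write_index] = b
                self.write_index = (self.write_index + 1) % self.size
                self.end_offset += 1
                if self.end_offset - self.start_offset + 1 > self.size:
                    self.start_offset += 1            # evict the oldest byte

        def covers(self, offset: int) -> bool:
            return self.start_offset <= offset <= self.end_offset

        def slice_from(self, offset: int) -> bytes:
            # Contiguous bytes for offsets [offset .. end_offset]; otherwise a full resync is needed.
            if not self.covers(offset):
                raise ValueError("offset no longer covered; full resynchronisation required")
            count = self.end_offset - offset + 1
            out = bytearray()
            idx = (self.write_index - count) % self.size
            for _ in range(count):
                out.append(self.buffer[idx])
                idx = (idx + 1) % self.size
            return bytes(out)

    log = ReplBacklog(size=8)
    log.append(b"abcdefghij")   # 10 bytes into an 8-byte window: offsets 1-2 are evicted
    print(log.start_offset, log.end_offset, log.slice_from(log.start_offset))   # 3 10 b'cdefghij'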
Execution flow (operational sequence)

1. The leader initialises the backlog buffer with a fixed maximum size and the current replication offset.
2. Each write command executed by the leader is serialised and appended byte-by-byte to the backlog buffer.
3. As new bytes are appended, the end_offset advances monotonically.
4. When the backlog exceeds its configured size, the start_offset is advanced, evicting the oldest bytes.
5. When a replica reconnects and requests partial resynchronisation from offset X, the leader checks whether X falls within [start_offset, end_offset].
6. If valid, the leader extracts a contiguous slice from the backlog starting at X and streams it to the replica.
7. If invalid, the leader rejects partial resynchronisation and triggers a full resynchronisation.

System-level rationale and trade-offs

- The backlog buffer converts replica recovery from an expensive full snapshot transfer into a cheap byte-stream replay when offsets are still covered.
- Fixed-size circular storage provides strict memory bounds regardless of write throughput or replica count.
- Offset-based indexing avoids per-replica state coupling and simplifies correctness reasoning.
- Eviction is purely time/throughput-based (how fast the leader writes), not replica-aware; slow replicas may fall out of the recovery window.
- Larger backlog sizes increase tolerance to replica lag but consume more memory; smaller sizes reduce memory usage but increase the likelihood of full resynchronisation.
- The algorithm is optimised for the leader's fast path: append and trim are O(1) per byte and require no locking in a single-threaded model.

Failover election algorithm

Redis enforces a single-writer invariant: at any moment, exactly one node is allowed to accept writes. This constraint is absolute. Violating it creates divergent histories that cannot be reconciled automatically.

In a distributed environment, failure detection is inherently ambiguous. A leader may appear unreachable because it crashed, because the network is partitioned, or because the observer itself is impaired. No single observer can distinguish these cases reliably.

The system must therefore decide when to revoke write authority from the current leader and when to grant it to a replica, under uncertainty, partial information, and independent failures.

The failover election algorithm must solve all of the following simultaneously:

- detect leader unavailability without trusting any single observer;
- prevent two replicas from being promoted concurrently;
- tolerate Sentinel crashes and network partitions;
- allow recovery to make progress without global synchronization;
- avoid long-term blocking when coordinators fail;
- preserve the single-writer invariant at all times.

Sentinel achieves this through quorum-based failure agreement and epoch-scoped coordination.

Every failover decision balances two opposing risks: acting too early and risking data corruption from concurrent writers, or acting too late and prolonging unavailability. Sentinel deliberately prefers delay over corruption.
This choice drives every design decision that follows.

Fundamental constraints that shape the design

- Observation is local: every Sentinel sees the system through a different network lens.
- Agreement must be emergent: no Sentinel is privileged; decisions emerge from overlapping views.
- Authority must be temporary: any coordinator can fail; authority must expire automatically.
- Time must be logical, not physical: clocks are unreliable; ordering must be explicit.
- Recovery must retry, not wait: blocking on failed coordinators leads to deadlock.

Step-by-step solution with reasoning

Step 1 — Continuous leader observation (local suspicion)

Each Sentinel independently probes the leader, answering a purely local question: "From my position, is the leader responding?"

    function monitor_leader():
        if leader_unreachable_for(down_after_ms):
            subjectively_down = true
        else:
            subjectively_down = false

Step 2 — Collective confirmation (shared suspicion)

A Sentinel that marks the leader down queries others.

    function count_subjective_down_votes():
        votes = 1   // self
        for each sentinel in peers:
            if sentinel.subjectively_down:
                votes += 1
        return votes

    function evaluate_objective_down():
        if count_subjective_down_votes() >= quorum:
            objectively_down = true

Only here does the system accept that the leader is genuinely unavailable.

Step 3 — Failover attempt demarcation

Once objective down is reached, Sentinels move logical time forward, answering: "Which recovery attempt supersedes all previous ones?"

    function begin_failover_attempt():
        current_epoch += 1

Epochs ensure stale actions lose authority automatically.

Step 4 — Coordinator candidacy

Each Sentinel may attempt to coordinate the failover for the current epoch.

    function request_failover_votes(epoch):
        votes = 1   // self
        for each sentinel in peers:
            if sentinel.grant_vote(epoch):
                votes += 1
        return votes

    function try_become_coordinator(epoch):
        if request_failover_votes(epoch) >= quorum:
            coordinator = true

Each Sentinel grants at most one vote per epoch, ensuring exclusivity.

Step 5 — Handling election failure

If no Sentinel gathers quorum:

    function handle_failed_election():
        wait(randomized_delay)
        begin_failover_attempt()   // retry in a higher epoch

This prevents livelock and ensures eventual progress.

Step 6 — Replica evaluation and selection

The coordinator must choose the replica that minimizes data loss.

    function select_replica():
        candidates = replicas_sorted_by(replication_lag, priority)
        return candidates[0]

Selection affects safety margins, not correctness.

Step 7 — Promotion with scoped authority

Only the coordinator for the current epoch may promote.

    function promote_replica(replica):
        if coordinator and epoch_is_current():
            send(replica, "SLAVEOF NO ONE")
            wait_for_ack(replica)

If the coordinator crashes, authority disappears with it.

Step 8 — Reconfiguration of remaining replicas

    function reconfigure_replicas(new_leader):
        for each replica in replicas:
            if replica != new_leader:
                send(replica, "SLAVEOF", new_leader)

This restores the single-writer topology.

Step 9 — Completion and return to monitoring

    function finalize_failover():
        update_known_leader(new_leader)

Monitoring resumes immediately.
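A small Python sketch of Steps 2-4: counting subjective-down reports toward quorum, and granting at most one coordinator vote per epoch. SentinelPeer and the helper names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class SentinelPeer:
        subjectively_down: bool = False
        voted_epoch: int = 0

    def objectively_down(self_down: bool, peers: list[SentinelPeer], quorum: int) -> bool:
        # Own suspicion counts as one vote; quorum must be met by independent observers.
        if not self_down:
            return False
        votes = 1 + sum(1 for p in peers if p.subjectively_down)
        return votes >= quorum

    def grant_vote(peer: SentinelPeer, epoch: int) -> bool:
        # Each Sentinel grants at most one vote per epoch: later requests in the same epoch fail.
        if epoch > peer.voted_epoch:
            peer.voted_epoch = epoch
            return True
        return False

    def try_become_coordinator(peers: list[SentinelPeer], epoch: int, quorum: int) -> bool:
        votes = 1 + sum(1 for p in peers if grant_vote(p, epoch))
        return votes >= quorum

    peers = [SentinelPeer(subjectively_down=True), SentinelPeer(subjectively_down=True), SentinelPeer()]
    print(objectively_down(True, peers, quorum=3))           # True: 3 of 4 observers agree
    print(try_become_coordinator(peers, epoch=1, quorum=3))  # True: first candidate wins the epoch
    print(try_become_coordinator(peers, epoch=1, quorum=3))  # False: votes already spent for epoch 1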
Extended logic and trade-offs

Quorum delays action to preserve safety

The system deliberately requires agreement from multiple independent Sentinels before taking disruptive action. This means that a single Sentinel, or even a small minority, cannot trigger a leader replacement on its own.

The practical consequence is that recovery may take longer in some failure scenarios. However, this delay is intentional. Temporary unavailability is reversible; data corruption caused by two writers is not. By requiring quorum agreement, the system ensures that failover happens only when failure is visible from multiple, independent viewpoints, greatly reducing the risk of promoting a new leader while the old one is still active somewhere in the network.

This choice makes availability slightly worse in ambiguous situations, but correctness dramatically stronger.

Epochs remove the need for revocation

In distributed systems, explicitly revoking authority is difficult because the very failure that requires revocation may also prevent the revocation message from being delivered.

Epochs avoid this problem entirely. Authority is not revoked; it simply expires. Each failover attempt is tied to a unique, increasing epoch number. When a new epoch begins, all actions from older epochs automatically lose validity.

As a result, there is no need to track or cancel old coordinators. If a Sentinel was coordinating a failover and then stalls, crashes, or becomes isolated, the rest of the system moves forward. Any late or duplicated actions from the old coordinator are ignored because they belong to an obsolete epoch.

This transforms revocation from an active operation into a passive property of time progression.

Coordinator failure is expected, not exceptional

The design assumes that any Sentinel, including the one coordinating a failover, can fail at any point. This is treated as a normal condition, not an error path.

For this reason, retry behavior is central, not auxiliary. If a coordinator fails before completing promotion, other Sentinels do not wait indefinitely. They advance to a higher epoch and repeat the election process.

This ensures that progress is eventually made as long as a quorum of Sentinels remains operational. The system never becomes permanently stuck waiting for a single coordinator to recover.

Randomized delays prevent contention storms

When multiple Sentinels detect failure at roughly the same time, they may all attempt to coordinate failover concurrently. If all retries happened in lockstep, repeated elections could collide indefinitely.

Randomized delays break this symmetry. By waiting for slightly different amounts of time before retrying, Sentinels reduce the chance of repeated contention. Over time, one Sentinel gains enough separation to collect quorum votes and proceed.

This approach avoids complex coordination while still ensuring eventual convergence.

Replica choice affects data loss, not correctness

Once a coordinator is elected, promoting any single replica will preserve the single-writer rule. From a correctness standpoint, all promotions are equivalent: only one node becomes leader.

However, replicas may differ in how much data they have received from the previous leader. Choosing a replica that is more up to date reduces data loss after failover.

For this reason, replica selection is treated as an optimization problem layered on top of a safe coordination mechanism. A poor choice may lose more recent writes, but it cannot violate the core safety guarantees.
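A sketch of the selection rule described above (most replicated data first, then configured priority); the ReplicaInfo fields are illustrative.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReplicaInfo:
        name: str
        online: bool
        replication_offset: int   # highest offset received from the failed leader
        priority: int             # lower value = more preferred

    def select_replica(replicas: list[ReplicaInfo]) -> Optional[ReplicaInfo]:
        candidates = [r for r in replicas if r.online]
        if not candidates:
            return None
        # Most data first (least data loss), then configured priority, then name for determinism.
        return min(candidates, key=lambda r: (-r.replication_offset, r.priority, r.name))

    replicas = [
        ReplicaInfo("replica-a", True, 10_500, priority=10),
        ReplicaInfo("replica-b", True, 10_900, priority=50),
        ReplicaInfo("replica-c", False, 11_000, priority=1),
    ]
    print(select_replica(replicas).name)   # replica-b: most up to date among online candidates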
When the algorithm operates correctly, the following properties hold consistently:

- at any moment, only one node is allowed to accept writes;
- leader replacement occurs automatically without manual intervention;
- failures of individual Sentinels do not prevent recovery as long as quorum is available;
- overlapping or conflicting promotions cannot occur;
- behaviour during network partitions follows clear, predictable rules.

These outcomes are the direct result of quorum agreement, epoch scoping, and exclusive coordination.

Operators influence system behaviour through a small set of explicit parameters:

- Sentinel count: determines fault tolerance and the size of the observation pool.
- Quorum size: controls how many Sentinels must agree before failover begins. Larger quorums increase safety; smaller quorums increase responsiveness.
- Down-detection thresholds: define how long a Sentinel must lose contact with the leader before suspecting failure. Short thresholds react faster but risk noise; longer thresholds favor stability.
- Replica selection preferences: influence which replica is preferred during promotion, affecting data loss characteristics.
- Retry and backoff settings: shape how aggressively Sentinels reattempt coordination after failure or contention.

Together, these controls define where the deployment sits on the availability–safety spectrum.

Consider a deployment with five Sentinels and a quorum of three.

- If one Sentinel becomes isolated, only one observer reports failure. Quorum is not met, so no failover occurs.
- If two Sentinels become isolated, the remaining three still do not have enough independent agreement to declare failure.
- When three Sentinels independently conclude the leader is unavailable, quorum is satisfied and failover begins.
- If the elected coordinator crashes mid-process, the remaining Sentinels advance to a higher epoch and repeat the election.
- Once a coordinator successfully gathers quorum and promotes a replica, the system stabilizes.
- At no point can two leaders be promoted simultaneously, regardless of timing or failures.

With the Sentinel failover election designed and operated in this manner:

- write authority is never duplicated;
- recovery continues despite Sentinel crashes or partial partitions;
- no single Sentinel can unilaterally trigger a failover;
- coordination remains bounded in time and complexity;
- system behavior under failure is consistent and explainable.

In a distributed Redis deployment, a leader (primary) node may become unavailable due to crashes, network partitions, hardware failure, or overload. The system must restore write availability by promoting a replica to become the new leader.
This must be done without split-brain or conflicting promotions.

The core problem the Sentinel failover election algorithm solves is to:

- detect that a leader is genuinely unavailable, not just temporarily unreachable from one observer;
- coordinate multiple independent Sentinel processes to reach agreement;
- elect exactly one Sentinel as the failover coordinator;
- prevent simultaneous or conflicting failovers;
- tolerate partial network partitions and Sentinel failures;
- complete recovery without requiring strong consensus protocols or heavy coordination.

Sentinel achieves this using quorum-based failure detection and a time-bounded, epoch-based leader election among Sentinels.

Core invariants and guarantees

- Multiple Sentinel processes independently monitor the same leader and replicas.
- A leader is considered objectively down only when a quorum of Sentinels agrees.
- Failover coordination is delegated to exactly one Sentinel elected per epoch.
- Each failover attempt belongs to a unique, monotonically increasing epoch.
- At most one leader promotion can succeed per epoch.
- Replicas follow the Sentinel-selected leader and reconfigure accordingly.
- If the elected coordinator fails mid-process, a new election can occur in a higher epoch.

Sentinels are independent watchdog processes. Each Sentinel maintains local health judgments. Failover proceeds only after quorum agreement. Leadership is not fixed; it is elected per failover event. Elections are lightweight, time-scoped, and retryable.

    STRUCT Sentinel:
        id
        epoch
        voted_epoch
        peers
        known_leader_state
        known_replicas

    FUNCTION sentinel_ping_leader(S):
        reachable = ping_leader()
        IF NOT reachable:
            S.known_leader_state.subjective_down = TRUE
        ELSE:
            S.known_leader_state.subjective_down = FALSE

    FUNCTION exchange_down_state(S, peer):
        send(peer, S.known_leader_state.subjective_down)
        peer_state = receive(peer)
        RETURN peer_state

    FUNCTION check_objective_down(S):
        down_votes = 0
        FOR each peer IN S.peers:
            peer_state = exchange_down_state(S, peer)
            IF peer_state == TRUE:
                down_votes = down_votes + 1
        IF down_votes + 1 >= quorum():
            S.known_leader_state.objective_down = TRUE

    FUNCTION start_failover_if_needed(S):
        IF S.known_leader_state.objective_down == TRUE:
            RETURN request_failover_votes(S)

    FUNCTION request_failover_votes(S):
        votes = 1
        FOR each peer IN S.peers:
            granted = peer_vote_request(peer, S.id, S.epoch)
            IF granted:
                votes = votes + 1
        IF votes >= quorum():
            RETURN become_failover_leader(S)

    FUNCTION peer_vote_request(peer, candidate_id, epoch):
        IF epoch > peer.voted_epoch:
            peer.voted_epoch = epoch
            peer.vote = candidate_id
            RETURN TRUE
        RETURN FALSE

    FUNCTION become_failover_leader(S):
        candidate = select_best_replica(S.known_replicas)
        promote_replica(candidate)
        reconfigure_other_replicas(candidate)
        update_configuration(candidate)
        RETURN S

    FUNCTION select_best_replica(replicas):
        best = NULL
        FOR each r IN replicas:
            IF r.is_online AND r.replication_offset_is_max():
                best = r
        RETURN best

    FUNCTION promote_replica(replica):
        send(replica, "SLAVEOF NO ONE")
        wait_until_role(replica, "leader")
        RETURN

    FUNCTION reconfigure_other_replicas(new_leader):
        FOR each r IN all_replicas_except(new_leader):
            send(r, "SLAVEOF", new_leader)
        RETURN
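A sketch of the coordinator's final act, with authority scoped to the epoch in which quorum was gathered; send_command is a stand-in for whatever transport delivers commands to each node, mirroring the SLAVEOF usage in the pseudocode above.

    from typing import Callable

    def run_failover(new_leader: str,
                     replicas: list[str],
                     elected_epoch: int,
                     current_epoch: Callable[[], int],
                     send_command: Callable[..., None]) -> bool:
        # Authority is scoped to the epoch in which quorum was gathered; a stale coordinator aborts.
        if current_epoch() != elected_epoch:
            return False
        send_command(new_leader, "SLAVEOF", "NO", "ONE")       # promote the chosen replica
        for replica in replicas:
            if replica != new_leader:
                send_command(replica, "SLAVEOF", new_leader)   # point the rest at the new leader
        return True

    # Example wiring with trivial stand-ins:
    log: list[tuple] = []
    ok = run_failover("10.0.0.2:6379", ["10.0.0.2:6379", "10.0.0.3:6379"],
                      elected_epoch=7, current_epoch=lambda: 7,
                      send_command=lambda node, *args: log.append((node, args)))
    print(ok, log)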
Execution flow (operational sequence)

1. Each Sentinel continuously pings the leader and tracks local reachability.
2. If a Sentinel cannot reach the leader for a configured interval, it marks it as subjectively down.
3. Sentinels exchange their subjective views.
4. When a quorum of Sentinels agrees, the leader is marked objectively down.
5. Any Sentinel detecting an objective down increments its epoch and starts a failover election.
6. Sentinels vote once per epoch for the first valid request they receive.
7. The Sentinel that gathers quorum votes becomes the failover coordinator for that epoch.
8. The coordinator selects the most up-to-date replica and promotes it to the leader.
9. Remaining replicas are reconfigured to follow the new leader.
10. The system stabilises; normal replication resumes.

System-level rationale and trade-offs

- Quorum-based detection avoids false positives caused by asymmetric network partitions.
- Epoch-based voting ensures stale or duplicate elections cannot succeed.
- Single-winner election per epoch prevents split-brain promotions.
- No heavy consensus protocol (e.g., Paxos/Raft) is required; the algorithm favors availability and simplicity.
- Sentinel processes can fail independently without compromising correctness.
- Failover is eventually consistent: during the transition, writes may be temporarily unavailable.
- Safety is preferred over speed: when uncertainty exists, failover may be delayed rather than risking divergence.

The leader–replica model, when paired with PSYNC, partial resynchronisation, the replica backlog buffer, and Sentinel-driven failover, demonstrates a pragmatic approach to distributed system reliability. Rather than assuming stable connections or rare failures, these algorithms explicitly optimise for short outages, network jitter, and rolling failures, keeping recovery fast and predictable.

The key insight is that availability is preserved not by eliminating failure, but by containing its blast radius. Backlog buffers cap resynchronisation cost, partial resync avoids unnecessary data transfer, PSYNC enforces correctness at reconnect time, and Sentinel ensures leadership is restored through quorum-based consensus instead of ad-hoc decisions.

As a unified system, these mechanisms ensure that replication remains continuous, recoverable, and operationally manageable — even under adverse conditions. They illustrate a broader systems principle: high availability emerges from coordination protocols and recovery paths, not from replication alone.