Disks Lie: Building a WAL that actually survives

12 Dec 2025 — 6 min read

Treat your disk storage as potentially hostile

A write-ahead log (WAL) is one of those database concepts that sounds deceptively simple. You write a record to disk before applying it to your in-memory state. If you crash, you replay the log and recover. Done.

Except your disk is lying to you.

PostgreSQL, SQLite, RocksDB, Cassandra... every production system that claims to be durable relies on a WAL. It’s the fundamental contract: "Write here, and I promise your data survives." But making that promise actually stick requires understanding all the ways disks fail silently.

The Naive Approach vs Reality

Let’s say you implement a WAL like this:

write(fd, record, sizeof(record)); // Done, right... RIGHT?

In a test environment on your laptop, this works great. But when you handle millions of writes a day, one-in-a-million errors happen multiple times a day. At that scale, a WAL like this fails in ways your tests never catch:

  • The page cache problem: That write() just copied your data into the kernel’s buffer. It hasn’t touched the disk yet. Crash now, and it’s gone.
  • The disk that lies about success: Your write() returns success. The kernel tells you it’s synced. The disk firmware tells you it’s on stable storage. Then a latent sector error silently corrupts it anyway.
  • The ordering chaos: Write operation A starts. Write operation B starts. B completes first. Your recovery code sees B without A and has no idea what happened.
  • The single point of failure: One bad sector on your only copy of the WAL, and you lose everything.
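
The page cache problem is the easiest of these to illustrate. Below is a minimal sketch, assuming a POSIX system, of an append that retries short writes and only reports success after fsync() returns. The helper names write_all and wal_append_durable are hypothetical, not from any particular database. Note that fsync() only asks the kernel to flush; it does nothing about firmware that lies or sectors that rot later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append one record and report success only
 * after fsync() says the kernel has pushed it to the device. */
int wal_append_durable(int fd, const void *record, size_t len)
{
    if (write_all(fd, record, len) != 0)
        return -1;
    if (fsync(fd) != 0)
        return -1;              /* the record's fate is unknown: treat as fatal */
    return 0;
}
```

A caller would acknowledge a write to the client only once wal_append_durable() returns 0. A failed fsync() is best treated as unrecoverable for that file rather than retried, since the kernel may have already dropped the dirty pages.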

This is why people who’ve lost data in production are paranoid about durability. And rightfully so.

Building the Better Mousetrap

There are five layers of defense we can use to build a better mousetrap. Think of them as increasingly specific answers to the question: "How can this fail?"

Layer 1: Checksums (CRC32C)

Every record includes a checksum of its contents. After writing, we verify the checksum hasn’t changed. Simple, right?

Record Header (20 bytes):
[magic_num: 4][sequence_num: 8][checksum: 4]
[payload: variable]
[padding to 512 byte alignment]

Why this matters: Hardware bit flips happen. Disk firmware corrupts data. Memory buses misbehave. And here’s the kicker: none of these trigger an error flag. The I/O subsystem returns success. The data is just silently wrong. Without checksums, you discover this weeks later when you try to recover and find your log is garbage.
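
Here is one way the checksummed record could be encoded and verified, as a sketch rather than the definitive format. Assumptions are loud: the three listed fields add up to 16 bytes, so this sketch fills the stated 20 with a hypothetical 4-byte payload_len field; the magic value, little-endian byte order, and the choice to checksum everything except the checksum field itself are also assumptions. The CRC32C routine is a plain bitwise version; production code would normally use a table-driven or hardware-accelerated one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAL_MAGIC      0x57414C31u  /* hypothetical magic value */
#define WAL_HEADER_LEN 20u          /* magic 4 + seq 8 + checksum 4 + assumed payload_len 4 */
#define WAL_ALIGN      512u

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Pass 0 to start, or a previous result to continue over more data. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) p[i] = (uint8_t)(v >> (8 * i));
}

static void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) p[i] = (uint8_t)(v >> (8 * i));
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* CRC over every header and payload byte except the checksum field
 * at offset 12..15 (an assumed coverage rule). */
static uint32_t record_crc(const uint8_t *rec, uint32_t payload_len)
{
    uint32_t crc = crc32c(0, rec, 12);
    return crc32c(crc, rec + 16, 4 + (size_t)payload_len);
}

/* Encode one record into `out`, zero-padded to a 512-byte boundary.
 * Returns the padded size, or 0 if it does not fit in `cap`. */
size_t wal_encode(uint8_t *out, size_t cap, uint64_t seq,
                  const void *payload, uint32_t len)
{
    size_t raw = WAL_HEADER_LEN + (size_t)len;
    size_t padded = (raw + WAL_ALIGN - 1) / WAL_ALIGN * WAL_ALIGN;
    if (padded > cap)
        return 0;
    memset(out, 0, padded);
    put_le32(out, WAL_MAGIC);
    put_le64(out + 4, seq);
    put_le32(out + 16, len);
    if (len > 0)
        memcpy(out + WAL_HEADER_LEN, payload, len);
    put_le32(out + 12, record_crc(out, len));
    return padded;
}

/* Returns 1 if the record at `buf` has a valid magic, a length that
 * fits in `avail`, and a matching checksum; 0 otherwise. */
int wal_verify(const uint8_t *buf, size_t avail)
{
    if (avail < WAL_HEADER_LEN || get_le32(buf) != WAL_MAGIC)
        return 0;
    uint32_t len = get_le32(buf + 16);
    if (len > avail - WAL_HEADER_LEN)
        return 0;
    return record_crc(buf, len) == get_le32(buf + 12);
}
```

Recovery code would call wal_verify() on each record it reads and stop replaying, or look for another copy, at the first record that fails. Covering the sequence number and length with the checksum means a corrupted header is caught the same way as a corrupted payload.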

Layer 2: Dual WAL Files (LSE Protection)

Another solid strategy targets one specific kind of failure, the latent sector error (LSE): keep two WAL files, ideally on different disks.
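
The mechanics of the two-file scheme can be sketched roughly as follows, assuming POSIX pread/pwrite. Everything here is illustrative: wal_append_dual, wal_read_dual, and the wal_check_fn callback are hypothetical names, and toy_is_valid is a deliberately trivial stand-in for a real CRC32C check. The idea is to write every record to both files and, on read, fall back to the second copy when the first fails validation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef int (*wal_check_fn)(const unsigned char *buf, size_t len);

/* Toy validator for illustration only: the last byte must equal the
 * sum of the preceding bytes mod 256. A real WAL would check CRC32C. */
int toy_is_valid(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    if (len < 2)
        return 0;
    for (size_t i = 0; i + 1 < len; i++)
        sum += buf[i];
    return buf[len - 1] == (unsigned char)sum;
}

/* pwrite() exactly `len` bytes at `off`, retrying short writes. */
static int write_exact(int fd, const void *buf, size_t len, off_t off)
{
    const unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        p += n; off += n; len -= (size_t)n;
    }
    return 0;
}

/* pread() exactly `len` bytes at `off`; EOF before `len` is an error. */
static int read_exact(int fd, void *buf, size_t len, off_t off)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = pread(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (n == 0)
            return -1;
        p += n; off += n; len -= (size_t)n;
    }
    return 0;
}

/* Write the record to both files and fsync both.
 * Returns 0 only if both copies were written and flushed. */
int wal_append_dual(int primary_fd, int secondary_fd, off_t off,
                    const void *rec, size_t len)
{
    int fds[2] = { primary_fd, secondary_fd };
    for (int i = 0; i < 2; i++) {
        if (write_exact(fds[i], rec, len, off) != 0 || fsync(fds[i]) != 0)
            return -1;
    }
    return 0;
}

/* Read a record, preferring the primary copy.
 * Returns 0 if the primary was good, 1 if we fell back to the
 * secondary, -1 if neither copy passed validation. */
int wal_read_dual(int primary_fd, int secondary_fd, off_t off,
                  unsigned char *buf, size_t len, wal_check_fn is_valid)
{
    int fds[2] = { primary_fd, secondary_fd };
    for (int i = 0; i < 2; i++) {
        if (read_exact(fds[i], buf, len, off) == 0 && is_valid(buf, len))
            return i;
    }
    return -1;
}
```

The two descriptors are assumed to refer to files on different physical devices; two files on the same disk still help against an isolated bad sector, but not against losing the whole drive. A return value of 1 from wal_read_dual() is worth logging, since it means one copy has already gone bad.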
