Learn how to build a crash-safe Write-Ahead Log (WAL) in Go, and why CRC32 alone is not enough. We explore the durability layers UnisonDB uses to prevent corruption after crashes.
The Problem#
Every database promises durability. Write your data, get an acknowledgment, sleep well. But what happens between the write() syscall (or a memory-mapped store) and the moment electrons finally settle on persistent media?
There is a long, leaky pipeline, and every layer can betray you. A lot can go wrong:
- Power failure mid-write - The system crashes while writing, so only part of your data reaches disk.
- Bit flips - Hardware faults or random errors can silently change stored data.
- False success signals - The operating system may report success even though data is still in memory.
- Filesystem limits - Journaling keeps files intact, not your data's meaning. WAL-on-WAL is not correctness; it's wishful thinking.
- Torn writes - A single 4KB write can span multiple sectors, and only some of them may commit.
The purpose of write-ahead logging is not just to record writes in order, but to make correctness provable. After a crash, the database treats the log as evidence, replaying only records that can be proven complete and correct.
The Stakes: Streaming Replication#
The UnisonDB WAL isn't just a recovery mechanism; it's the primary source for replication. Followers continuously read from the leader's WAL segments, applying records as they arrive.
+----------+    WAL Segments     +------------+
|  Leader  | ------------------> |  Follower  |
|          |  (streaming read)   |            |
+----------+                     +------------+
This means:
- Corruption propagates - A bad record on the leader poisons followers
The WAL is our single source of truth. It had better be correct.
The Record Format#
Every record in our WAL has this structure:
+-----------------------------------------------------------------+
|                           WAL Record                            |
+----------+----------+---------------------+---------------------+
|  CRC32   |  Length  |        Data         |       Trailer       |
| 4 bytes  | 4 bytes  |       N bytes       |       8 bytes       |
+----------+----------+---------------------+---------------------+
|                    Padded to 8-byte boundary                    |
+-----------------------------------------------------------------+
Layer 1: CRC32 (Castagnoli)#
// CRC covers everything except itself
crc := crc32.Checksum(buf[4:], crc32.MakeTable(crc32.Castagnoli))
binary.LittleEndian.PutUint32(buf[0:4], crc)
CRC catches:
- Random bit flips
- Sector read errors
- Truncated data
CRC doesn't catch:
- Incomplete writes - If we crash mid-write, the CRC might be valid for the partial data
Garbage data is worse than missing data. Missing data can be retried or ignored; garbage data looks valid and gets applied. Once that happens, corruption spreads.
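To make this concrete, here is a minimal, self-contained sketch (not the UnisonDB implementation) of a CRC32-Castagnoli check over a length-prefixed record, showing that a single flipped bit is caught:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// frame builds [crc32 | length | data], with the CRC covering
// everything after the CRC field itself.
func frame(data []byte) []byte {
	buf := make([]byte, 8+len(data))
	binary.LittleEndian.PutUint32(buf[4:8], uint32(len(data)))
	copy(buf[8:], data)
	crc := crc32.Checksum(buf[4:], castagnoli)
	binary.LittleEndian.PutUint32(buf[0:4], crc)
	return buf
}

// verify recomputes the CRC and compares it to the stored value.
func verify(buf []byte) bool {
	stored := binary.LittleEndian.Uint32(buf[0:4])
	return crc32.Checksum(buf[4:], castagnoli) == stored
}

func main() {
	rec := frame([]byte("hello wal"))
	fmt.Println(verify(rec)) // true: intact record verifies

	rec[10] ^= 0x01 // flip one bit in the payload
	fmt.Println(verify(rec)) // false: the bit flip is detected
}
```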
Layer 2: The Trailer Canary#
This is where 0xDEADBEEFFEEDFACE enters the story.
const recordTrailer uint64 = 0xDEADBEEFFEEDFACE
Every record ends with this 8-byte magic value. During recovery, we verify it.
A trailer marker makes correctness provable by separating "written" from "finished."
func (seg *Segment) Read(offset int64) ([]byte, RecordPosition, error) {
...
// validate the trailer before touching the data:
// this prevents out-of-bounds access even if the length field is corrupted.
trailerOffset := offset + recordHeaderSize + dataSize
end := trailerOffset + recordTrailerMarkerSize
if end > seg.mmapSize {
return nil, NilRecordPosition, ErrIncompleteChunk
}
...
return data, next, nil
}
Why the Trailer?#
Consider this crash scenario:
Write sequence:
1. Write CRC (4 bytes) - persisted
2. Write Length (4 bytes) - persisted
3. Write Data (N bytes) - CRASH - only 50% written
4. Write Trailer - never reached
Without the trailer, recovery sees:
- Valid CRC (for partial data? unlikely but possible)
- Valid length field
- Garbage data
With the trailer, recovery sees:
- Trailer missing or wrong β incomplete write, ignore this record
This pattern was inspired by a real etcd bug (#6191) where torn writes corrupted the WAL. The trailer acts as a "commit marker" for each record.
Why This Specific Value?#
- Unlikely to appear in real data (it's a known debug pattern)
- Easy to spot in hex dumps
Layer 3: WAL Alignment: What It Guarantees (and What It Doesn't)#
UnisonDB aligns every WAL record to an 8-byte boundary. The goal isn't to "make disk writes atomic". The goal is to make WAL parsing and recovery safe.
The dangerous failure mode isn't a torn payload (CRC can catch that). It's a torn or corrupted header:
[ CRC32 (4 bytes) | Length (4 bytes) ]
This 8-byte header controls how the rest of the log is parsed:
- how many bytes to read
- where the next record begins
- when recovery should stop
If this header is partially written or misinterpreted, recovery can:
- read past the end of the file
- allocate unbounded memory
- crash before corruption is detected
That is catastrophic failure. So the core question is:
How do we ensure WAL control metadata is either read correctly or rejected, never misinterpreted?
The invariant enforced:
Every WAL record starts at an offset divisible by 8.
This is implemented directly in the write path:
func alignUp(n int64) int64 {
return (n + 7) & ^7
}
Every recordβs total size (header + payload + trailer) is rounded up using alignUp, and the write offset is advanced by that aligned size. As a result:
writeOffset % 8 == 0
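A few worked values make the rounding concrete. This standalone sketch reuses the same alignUp helper:

```go
package main

import "fmt"

// alignUp rounds n up to the next multiple of 8.
func alignUp(n int64) int64 {
	return (n + 7) &^ 7
}

func main() {
	for _, n := range []int64{1, 8, 13, 16, 17} {
		fmt.Println(n, "->", alignUp(n))
	}
	// 1 -> 8, 8 -> 8, 13 -> 16, 16 -> 16, 17 -> 24
}
```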
What 8-Byte Alignment Guarantees#
1. Headers never straddle physical boundaries - Since 512 and 4096 are multiples of 8, an 8-byte header starting at an 8-byte offset cannot cross a sector or page boundary. It effectively lives inside a single atomic write unit. This makes "torn headers" mathematically impossible.
2. Corruption is detectable, not fatal - Without alignment, a torn header introduces "phantom" valid records: random bytes interpreted as a massive length field, causing OOMs or wild seeks. With alignment, a header is either correct or obviously invalid (failed CRC). We never interpret garbage metadata as valid control instructions.
3. Safe termination - Recovery becomes a simple state machine: read until the first error, then stop. Because the structure of the log remains intact, the reader never crashes while trying to determine where the log ends. This makes recovery deterministic.
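The "cannot straddle" claim is also easy to check exhaustively: this small sketch verifies that an 8-byte header starting at any 8-aligned offset never crosses a 512-byte sector boundary:

```go
package main

import "fmt"

func main() {
	const sector = 512
	const headerSize = 8
	straddles := 0
	// Check every 8-aligned header start across several sectors' worth of offsets.
	for off := int64(0); off < 16*sector; off += 8 {
		// A header straddles if its first and last byte land in different sectors.
		if off/sector != (off+headerSize-1)/sector {
			straddles++
		}
	}
	fmt.Println("straddling headers:", straddles) // straddling headers: 0
}
```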
What Alignment Does Not Guarantee#
8-byte alignment does not:
- guarantee atomic disk writes
- make payload writes safe
- replace fsync / msync
- eliminate the need for trailers or CRCs
Alignment protects interpretation, not persistence. Without alignment, a corrupted length field can crash recovery before corruption is detected. With alignment, corruption is contained:
- recovery either proceeds correctly
- or stops safely
This does not guarantee atomicity, but it dramatically reduces risk.
Each layer protects a different invariant#
| Layer | Protects Against |
|---|---|
| Alignment | Partially written headers and invalid length fields |
| CRC32 | Data corruption from bit flips or torn payload reads |
| Trailer | Records that were not fully written before a crash |
Together, they ensure that:
- We never read beyond intended bounds
- We can detect corruption
- We can safely stop at the first incomplete record
Layer 4: Directory Sync#
Why Directory Sync?#
On Linux, calling fsync() on a file only guarantees that the file's data and metadata are durable. It does not guarantee that the filesystem has persisted the directory entry that makes the file visible.
As the Linux manual explicitly warns:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that, an explicit fsync() on a file descriptor for the directory is also needed.
If the system crashes at the wrong moment:
- The WAL segment was written
- The file itself was fsync'd
- But the directory update was never persisted
Consider this crash scenario without directory sync:
1. Create new segment file - file exists in memory
2. Write segment header - data in page cache
3. fsync segment file - data on disk
4. [CRASH]
5. Recovery: "where's segment 3?" - directory entry never persisted!
After recovery, the file simply does not exist.
This is especially dangerous for systems with automatic WAL segment rotation, where new segment files are created continuously as the log grows. Losing a segment file during rotation means losing part of the log, even though all writes were "successful".
With directory sync:
1. Create new segment file - file exists in memory
2. Write segment header - data in page cache
3. fsync segment file - data on disk
4. fsync directory - directory entry on disk
5. [CRASH]
6. Recovery: segment 3 exists and is valid
type DirectorySyncer struct {
dirFd *os.File
}
func (d *DirectorySyncer) Sync() error {
return d.dirFd.Sync() // fsync on directory fd
}
By calling fsync() on the directory:
- The filesystem is forced to persist directory entries
- Newly created segment files are guaranteed to exist after a crash
- Segment rotation becomes crash-safe
When We Use Directory Sync#
We only need this at structural boundaries, such as:
- Creating a new WAL segment
- Finalizing segment rotation
- Renaming or deleting segment files
This keeps the hot write path fast while ensuring the WAL layout itself is durable.
Layer 5: Conservative Recovery#
Our recovery philosophy: when in doubt, stop.
func (w *WALog) recoverSegments() error {
segments := listSegmentFiles(w.dir)
sort.Sort(segments) // By segment ID
for _, seg := range segments {
if err := seg.recover(); err != nil {
// Don't try to be clever - stop at first corruption
w.logger.Error("segment corrupt, truncating",
"segment", seg.ID,
"error", err)
return seg.truncateAtLastGoodRecord()
}
}
return nil
}
We don't try to skip corrupted records and continue. Why?
- Ordering matters - A gap in the log might hide a critical transaction
- Corruption often spreads - If one record is bad, neighbors might be too
- Raft requires contiguity - Log indices must be sequential
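Within a single segment, the same stop-at-first-error philosophy can be sketched as a linear scan. The helper names below (scanSegment, makeRecord) are illustrative; the real Segment type carries more state:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const trailer uint64 = 0xDEADBEEFFEEDFACE

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

func alignUp(n int64) int64 { return (n + 7) &^ 7 }

// scanSegment walks records of the form [crc|len|data|trailer] (8-byte
// aligned) and returns the offset just past the last provably complete
// record. Everything after that point is discarded by recovery.
func scanSegment(buf []byte) int64 {
	off := int64(0)
	for {
		if off+8 > int64(len(buf)) {
			return off // not even a full header left
		}
		n := int64(binary.LittleEndian.Uint32(buf[off+4 : off+8]))
		end := off + 8 + n + 8
		if end > int64(len(buf)) {
			return off // record runs past the buffer: incomplete
		}
		if binary.LittleEndian.Uint64(buf[off+8+n:end]) != trailer {
			return off // trailer missing: torn write, stop here
		}
		stored := binary.LittleEndian.Uint32(buf[off : off+4])
		if crc32.Checksum(buf[off+4:off+8+n], castagnoli) != stored {
			return off // CRC mismatch: corruption, stop here
		}
		off = alignUp(end)
	}
}

// makeRecord frames data as [crc|len|data|trailer] with zero padding
// up to the next 8-byte boundary.
func makeRecord(data []byte) []byte {
	raw := 8 + int64(len(data)) + 8
	buf := make([]byte, alignUp(raw))
	binary.LittleEndian.PutUint32(buf[4:8], uint32(len(data)))
	copy(buf[8:], data)
	binary.LittleEndian.PutUint64(buf[8+len(data):], trailer)
	crc := crc32.Checksum(buf[4:8+len(data)], castagnoli)
	binary.LittleEndian.PutUint32(buf[0:4], crc)
	return buf
}

func main() {
	// One complete record followed by garbage bytes.
	rec := makeRecord([]byte("put k=v"))
	buf := append(rec, 0xFF, 0xFF, 0xFF)
	fmt.Println(scanSegment(buf) == int64(len(rec))) // true
}
```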
Write#
func (seg *Segment) Write(data []byte, logIndex uint64) (RecordPosition, error) {
if seg.closed.Load() || seg.state.Load() != StateOpen {
return NilRecordPosition, ErrClosed
}
seg.writeMu.Lock()
defer seg.writeMu.Unlock()
flags := binary.LittleEndian.Uint32(seg.mmapData[40:44])
if IsSealed(flags) {
return NilRecordPosition, ErrSegmentSealed
}
offset := seg.writeOffset.Load()
seg.writeFirstIndexEntry(logIndex)
headerSize := int64(recordHeaderSize)
dataSize := int64(len(data))
trailerSize := int64(recordTrailerMarkerSize)
rawSize := headerSize + dataSize + trailerSize
entrySize := alignUp(rawSize)
if offset+entrySize > seg.mmapSize {
return NilRecordPosition, errors.New("write exceeds Segment size")
}
binary.LittleEndian.PutUint32(seg.header[4:8], uint32(len(data)))
sum := crc32Checksum(seg.header[4:], data)
binary.LittleEndian.PutUint32(seg.header[:4], sum)
copy(seg.mmapData[offset:], seg.header[:])
copy(seg.mmapData[offset+recordHeaderSize:], data)
canaryOffset := offset + headerSize + dataSize
copy(seg.mmapData[canaryOffset:], trailerMarker)
paddingStart := offset + rawSize
paddingEnd := offset + entrySize
// ensuring alignment to 8 bytes
for i := paddingStart; i < paddingEnd; i++ {
seg.mmapData[i] = 0
}
newOffset := offset + entrySize
seg.writeOffset.Store(newOffset)
binary.LittleEndian.PutUint64(seg.mmapData[24:32], uint64(newOffset))
prevCount := binary.LittleEndian.Uint64(seg.mmapData[32:40])
binary.LittleEndian.PutUint64(seg.mmapData[32:40], prevCount+1)
binary.LittleEndian.PutUint64(seg.mmapData[16:24], uint64(time.Now().UnixNano()))
crc := crc32.Checksum(seg.mmapData[0:56], crcTable)
binary.LittleEndian.PutUint32(seg.mmapData[56:60], crc)
seg.appendIndexEntry(offset, uint32(len(data)))
// MSync if option is set
if seg.syncOption == MsyncOnWrite {
if err := seg.mmapData.Flush(); err != nil {
return NilRecordPosition, fmt.Errorf("mmap flush error after write: %w", err)
}
}
return RecordPosition{
SegmentID: seg.id,
Offset: offset,
}, nil
}
Read#
func (seg *Segment) Read(offset int64) ([]byte, RecordPosition, error) {
if seg.closed.Load() {
return nil, NilRecordPosition, ErrClosed
}
if offset+recordHeaderSize > seg.mmapSize {
return nil, NilRecordPosition, io.EOF
}
header := seg.mmapData[offset : offset+recordHeaderSize]
length := binary.LittleEndian.Uint32(header[4:8])
dataSize := int64(length)
rawSize := int64(recordHeaderSize) + dataSize + recordTrailerMarkerSize
entrySize := alignUp(rawSize)
if length > uint32(seg.WriteOffset()-offset-recordHeaderSize) {
return nil, NilRecordPosition, ErrCorruptHeader
}
if offset+entrySize > seg.WriteOffset() {
return nil, NilRecordPosition, io.EOF
}
// validate the trailer before touching the data:
// this prevents out-of-bounds access even if the length field is corrupted.
trailerOffset := offset + recordHeaderSize + dataSize
end := trailerOffset + recordTrailerMarkerSize
if end > seg.mmapSize {
return nil, NilRecordPosition, ErrIncompleteChunk
}
// previously we compared the trailer byte-by-byte, which showed up in
// pprof as runtime.memequal; a single uint64 comparison removed it altogether.
word := binary.LittleEndian.Uint64(seg.mmapData[trailerOffset:end])
// validating trailer marker to detect torn/incomplete writes.
if word != trailerWord {
return nil, NilRecordPosition, ErrIncompleteChunk
}
data := seg.mmapData[offset+recordHeaderSize : offset+recordHeaderSize+dataSize]
// sealed segments are immutable and may have been recovered
// from disk after a crash or shutdown. CRC validation ensures that data
// persisted to disk is still intact and wasn't partially written or corrupted.
// for the active segment, the data still lives in this process's memory,
// so corruption of those bytes is very unlikely unless caused by external forces;
// we do one validation at startup instead. validating the CRC in the hot path
// is CPU intensive, and most reads target the tail.
if seg.isSealed.Load() && !seg.inMemorySealed.Load() {
savedSum := binary.LittleEndian.Uint32(header[:4])
computedSum := crc32Checksum(header[4:], data)
if savedSum != computedSum {
return nil, NilRecordPosition, ErrInvalidCRC
}
}
next := RecordPosition{
SegmentID: seg.id,
Offset: offset + entrySize,
}
return data, next, nil
}
Lessons Learned#
- CRC alone isn't enough - You need to detect incomplete writes too
- fsync isn't enough - You need directory sync for metadata
- mmap is tricky - msync semantics vary by OS; always fsync the fd
- Alignment matters - 8-byte alignment reduces torn write risk
- Be conservative in recovery - Stop at first corruption, don't guess
- Test failure modes - If you haven't tested it, it doesn't work
Conclusion#
A WAL is deceptively simple: append bytes, sync, done. The complexity hides in the failure modes:
- What if the write is torn?
- What if the sync lies?
- What if a bit flips?
- What if the directory entry isnβt persisted?
Each layer of our design (CRC, trailer, alignment, header checksum, directory sync, conservative recovery) addresses a specific failure we've either experienced or studied in others' post-mortems.
The 0xDEADBEEFFEEDFACE trailer is a perfect metaphor: it looks like a joke, but it's deadly serious. In distributed systems, the boundary between working and broken is measured in these small details.
Build your WAL like you don't trust anything, because you shouldn't.
GitHub: https://github.com/ankur-anand/unisondb
Appendix: Some Discussions#
- https://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic
- https://en.wikipedia.org/wiki/Solid-state_drive
- HN discussion: "Every HDD since the 1980s has guaranteed atomic sector writes" - the LMDB author has posted some interesting insights on this.
The UnisonDB WAL is open source. Star us on GitHub if this saved you from learning these lessons the hard way.