Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
In week 2, you used JSON as the SSTable format. That works for document databases, but the overhead of this serialization format makes it a poor fit for your storage engine:
Best case: You stream the file and linearly scan entries until you find the key, but a miss means scanning the entire file.
Worst case: You read the whole file and parse everything, then search for the key.
This week, you will switch to block-based SSTables. Data will be chunked into fixed-size blocks designed to fit within a single disk page. The main benefits:
Efficient I/O: Each lookup can fetch a complete block with a single page read.
Predictable latency: Since every block maps to exactly one page, each read involves a fixed, bounded amount of I/O, improving latency consistency.
Smaller on disk: Binary encoding typically compresses better than JSON.
Integrity: Per-block checksums detect corruption without requiring a re-read of the file.
Caching: Hot SSTable blocks are cached in a memory-based block cache to reduce I/O and decompression overhead.
Alongside the data blocks, you will maintain a small index that stores the first key of each block and its corresponding offset, allowing lookups to jump directly to the relevant block without scanning all of them.
💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server (#kv-store-engine channel):
Fixed 64-byte keys and values: This removes much of the logic needed to keep blocks fixed-size, making the implementation easier to write and reason about.
Because of the week 1 assumption (keys are lowercase ASCII strings), each character is one byte, which also makes the implementation easier.
A block-based SSTable will be composed of:
One index block (first 4 KB page)
Multiple data blocks (each 4 KB)
Each block has a fixed size of 4 KB. Aligning blocks to 4 KB means a disk read can fetch a block in one page. If blocks are not aligned, a read may span two pages.
Here’s the file layout at a glance:
Offset 0 4096 8192 ...
┌─────────────────────┬────────────────────┬────────────────────┬──────
│ Index block (4 KB) │ Data block 0 (4 KB)│ Data block 1 (4 KB)│ ...
└─────────────────────┴────────────────────┴────────────────────┴──────
The layout of an index block (4 KB):
blockCount: The number of data blocks in the SSTable.
A set of key entries (64 B), each being the first key of the corresponding data block. Entries are sorted by key and used to decide which block to fetch during a lookup.
To fit into a single 4 KB page, the index can hold at most 63 entries: 4096 / 64 = 64 slots of 64 bytes, and the first slot is taken by blockCount and its padding.
Here’s the layout (note this is a binary layout; newlines are used only for the representation):
<blockCount><pad>
|----1B----||63B|
# First index entry
<firstKey> # Right-padded with 0x00
|--64B--|
# Second index entry
<firstKey> # Right-padded with 0x00
|--64B--|
# Third index entry, etc.
# Padding with 0x00 to reach exactly 4096 bytes
NOTE: If you’re not familiar with the concept of padding: it’s filling unused bytes (here with 0x00) so fields and blocks have fixed sizes.
blockCount has a value between 0 and 63. If you encoded 63 as text, you would need two bytes ('6' = 0x36 and '3' = 0x33). Instead, you can store it as a binary integer so it fits in one byte: 0x3f.
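The difference is easy to check in Python (assuming you build the engine in Python; the stdlib `struct` module handles fixed-width binary integers):

```python
import struct

# One byte is enough for blockCount when stored as a binary integer.
assert struct.pack("B", 63) == b"\x3f"
# The same number encoded as ASCII text needs two bytes.
assert str(63).encode("ascii") == b"63"
```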
Same layout, with explicit offsets:
Offset Size Field
0 1 blockCount
1 63 padding (0x00) to 64-byte alignment
64 64 Entry 0: firstKey of data block 0 (right-padded to 64B)
128 64 Entry 1: firstKey of data block 1 (right-padded to 64B)
192 64 Entry 2: firstKey of data block 2 (right-padded to 64B)
... 64 ...
64+64*n 64 Entry n: firstKey of data block n (right-padded to 64B)
# If fewer than 63 entries, fill remaining bytes with 0x00 up to 4096B.
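As a sketch, the index block could be built like this in Python (function and constant names are my own, not part of the format):

```python
import struct

BLOCK_SIZE = 4096
KEY_SIZE = 64
MAX_INDEX_ENTRIES = 63

def encode_index_block(first_keys: list[bytes]) -> bytes:
    """Build the 4 KB index block: 1-byte blockCount, 63 bytes of
    padding, then one 64-byte right-padded firstKey slot per data block."""
    assert len(first_keys) <= MAX_INDEX_ENTRIES
    out = struct.pack("B", len(first_keys)) + b"\x00" * 63
    for key in first_keys:
        assert len(key) <= KEY_SIZE
        out += key.ljust(KEY_SIZE, b"\x00")
    return out.ljust(BLOCK_SIZE, b"\x00")  # tail padding to exactly 4 KB
```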
An example of the index block for an SSTable with three data blocks, hence three index entries. Remember: this is binary; newlines are for readability only:
0x03<63 0x00> # 1B blockCount + 63B pad = 64B slot
aaa<61 0x00> # 3B key + 61B pad = 64B slot
everyLittleIslandSeemsEmpty<37 0x00> # 27B key + 37B pad = 64B slot
hello<59 0x00> # 5B key + 59B pad = 64B slot
<3840 0x00> # Tail pad: 4096 - (4*64)
This index block indicates:
Block 0 starts with the key aaa.
Block 1 starts with the key everyLittleIslandSeemsEmpty.
Block 2 starts with the key hello.
You don’t need to store per-block offsets. Because the index is stored on a 4 KB page and every data block is exactly 4 KB and written contiguously, offsets can be calculated this way (blockId starts at 0):
offset(blockId) = (blockId + 1) × 4096
Therefore:
Block 0 starts at offset 4096.
Block 1 starts at offset 8192.
Block 2 starts at offset 12288.
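Under this fixed-layout assumption, the offset computation is a one-liner (Python sketch, name illustrative):

```python
BLOCK_SIZE = 4096

def block_offset(block_id: int) -> int:
    # Page 0 holds the index, so data block i begins at page i + 1.
    return (block_id + 1) * BLOCK_SIZE
```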
Now, let’s focus on data blocks.
In addition to the key-value entries, reserve 8 bytes at the start of each block for a CRC computed over entryCount and all entries; this lets you verify data integrity on read.
The layout of a data block (4 KB per block):
Header (128 B):
CRC-64 (8 B): A checksum computed over bytes [8..4096). You can choose any standard variant (e.g., CRC-64/ECMA-182).
entryCount (1 B): the number of entries in this block (0..31).
Padding (119 B).
Entries area (31 x 128 B = 3968 B), each entry is:
key (64 B, right-padded).
value (64 B, right-padded).
The last data block may contain fewer than 31 entries (entryCount < 31), but always pad with zeros to reach exactly 4 KB. This guarantees one-page reads and prevents errors across read modes (e.g., SIGBUS with mmap).
The layout of a data block (again, newlines are used only for the representation):
<CRC64><entryCount><pad>
|--8B--|----1B----|119B|
<key><value> # Keys and values are right-padded with 0x00
|-64B|-64B|
<key><value> # Keys and values are right-padded with 0x00
|-64B|-64B|
<key><value> # Keys and values are right-padded with 0x00
|-64B|-64B|
# ...
# Zero-fill unused 128B entry slots so block size = 4096B
Same layout, with explicit offsets:
Offset Size Field
0 8 CRC64: CRC over [8..4096)
8 1 entryCount: 0..31 valid entries
9 119 padding
# Entries area (31 slots × 128B = 3968B)
128 64 Entry 0: key (right-padded to 64B)
192 64 Entry 0: value (right-padded to 64B)
256 64 Entry 1: key (right-padded to 64B)
320 64 Entry 1: value (right-padded to 64B)
384 64 Entry 2: key (right-padded to 64B)
448 64 Entry 2: value (right-padded to 64B)
... ... ...
# If entryCount < 31, zero-fill remaining 128B slots so block size = 4096B.
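A data block can be assembled the same way. A minimal Python sketch (names are mine; it substitutes the stdlib's CRC-32, widened to 8 bytes, for the spec's CRC-64 to stay dependency-free):

```python
import struct
import zlib

BLOCK_SIZE = 4096
MAX_ENTRIES = 31

def encode_data_block(entries: list[tuple[bytes, bytes]]) -> bytes:
    """Build a 4 KB data block: 8-byte checksum, 1-byte entryCount,
    119 bytes of padding, then 128-byte key/value slots."""
    assert len(entries) <= MAX_ENTRIES
    body = struct.pack("B", len(entries)) + b"\x00" * 119
    for key, value in entries:
        body += key.ljust(64, b"\x00") + value.ljust(64, b"\x00")
    body = body.ljust(BLOCK_SIZE - 8, b"\x00")  # zero-fill unused slots
    # Checksum over bytes [8..4096); CRC-32 stands in for CRC-64 here.
    checksum = struct.pack(">Q", zlib.crc32(body))
    return checksum + body
```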
An example of a block composed of three key-value pairs:
0x42F0E1EBA9EA3693<entryCount=3><119 0x00>
alastor<57 0x00>foo<61 0x00>
aristide<56 0x00>bar<61 0x00>
armelle<57 0x00>z<63 0x00>
<3584 0x00> # (31 - 3) × 128B = 3584B
Note that because the index block holds at most 63 key entries, an SSTable can have at most 63 data blocks. With 31 entries per block, that caps an SSTable at 63 × 31 = 1,953 entries.
A tombstone is represented by a value of 64 bytes all set to 0x00. Due to this sentinel, the all-zero value is reserved and cannot be used as an application value from this week onward.
Searching for a value doesn’t change (memtable → L0 → L1, etc.). What changes is how you read one SSTable (remember: from L1, you only need to read one SSTable per level because of non-overlapping key ranges).
The process to read from an SSTable:
1. Read blockCount.
2. Binary search the index in [0, blockCount-1] to find the largest firstKey ≤ key and get blockId. If not found (e.g., the first index key is bac and your key is aaa), return a miss for this SSTable.
3. Compute the block offset: blockOffset(blockId) = (blockId + 1) × 4096.
4. Fetch the corresponding 4 KB block.
5. Verify the CRC before using the block: compute CRC-64 over bytes [8..4096) and compare it with the 8-byte CRC stored at offsets 0..7. If they don't match, fail the read for this SSTable.
6. Read entryCount.
7. Binary search the entries in [0, entryCount-1] for the key.
8. Return the corresponding value or a miss.
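The steps above can be sketched end to end in Python (illustrative names; as in the earlier sketch, CRC-32 widened to 8 bytes stands in for the CRC-64 the format calls for):

```python
import bisect
import struct
import zlib

BLOCK_SIZE = 4096

def read_page(f, page_id: int) -> bytes:
    f.seek(page_id * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)

def sstable_get(f, key: bytes):
    """Return the value for `key`, b"" for a tombstone, or None on a miss."""
    index = read_page(f, 0)
    block_count = index[0]
    first_keys = [
        index[(i + 1) * 64:(i + 2) * 64].rstrip(b"\x00")
        for i in range(block_count)
    ]
    # Largest firstKey <= key; if key sorts before every block, it's a miss.
    block_id = bisect.bisect_right(first_keys, key) - 1
    if block_id < 0:
        return None
    block = read_page(f, block_id + 1)  # blockOffset = (blockId + 1) * 4096
    stored = struct.unpack(">Q", block[:8])[0]
    if stored != zlib.crc32(block[8:]):  # CRC-32 stand-in for CRC-64
        raise IOError("checksum mismatch: fail the read for this SSTable")
    entry_count = block[8]
    lo, hi = 0, entry_count - 1
    while lo <= hi:  # binary search over the 128-byte entry slots
        mid = (lo + hi) // 2
        off = 128 + mid * 128
        k = block[off:off + 64].rstrip(b"\x00")
        if k == key:
            # An all-zero value is the tombstone sentinel; it strips to b"".
            return block[off + 64:off + 128].rstrip(b"\x00")
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```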
Last week, you split at 2,000 entries during the compaction process. This week, because a single SSTable is limited to 1,953 entries, change the split threshold to 1,953.
There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.
Drop the 64-byte constraint: store a length-prefixed key and value per entry (short header with key length and value length).
Keep entries sorted and include the lengths in your checksum.
Tombstones are currently represented by a sentinel value (a 64-byte all-zero value), which prevents storing an actual empty value.
Instead, avoid reserving any value for deletes: add an explicit entry type per record (value or tombstone).
Now that the format is binary, compression becomes more effective and saves more space.
As an optional task, compress each data block independently so lookups still touch only one block:
Record each block’s offset and compressed size in the index.
Read just those bytes, decompress, and search.
This packs more logical blocks into each cached page, raising cache hit rates, reducing pages touched during scans, and smoothing read latency.
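A sketch of the idea with zlib (any block compressor works; the names are illustrative, and a real implementation would persist the extended index on disk):

```python
import zlib

def write_compressed_blocks(raw_blocks: list[bytes]):
    """Compress each 4 KB block independently; return the packed bytes
    plus an index of (offset, compressed_size) pairs, one per block."""
    packed = bytearray()
    index = []
    for block in raw_blocks:
        comp = zlib.compress(block)
        index.append((len(packed), len(comp)))
        packed += comp
    return bytes(packed), index

def read_compressed_block(packed: bytes, index, block_id: int) -> bytes:
    # A lookup reads and decompresses exactly one block's bytes.
    offset, size = index[block_id]
    return zlib.decompress(packed[offset:offset + size])
```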
That’s it for this week! You implemented block-based SSTables and indexing, gaining benefits like more efficient I/O, smaller files, and per-block integrity checks.
In two weeks, you will focus on improving read performance by adding a layer that can tell whether an SSTable is worth parsing, and say goodbye to your hashtable-based memtable, replacing it with a more efficient data structure.
For a production-grade implementation of block-based SSTables, see RocksDB’s block-based SSTable format. It details block layout, per-block compression, and how the index stores offsets and sizes.
You can also check out ScyllaDB’s SSTables v3 docs. ScyllaDB maintains a small in-memory summary of sampled keys to narrow the search, then uses the on-disk index to locate the exact block. This provides a nice contrast to our single-page index and illustrates how to scale when SSTables grow large.
For a deeper look at how things work in practice in terms of directory structure, you can explore the ScyllaDB SSTables directory structure, which shows how metadata and data are organized on disk.
Regarding CRC read failures, we mentioned that a checksum mismatch should simply cause the read to fail for that SSTable. In real systems, databases rely on replication to handle corruption. When multiple replicas exist, a system can recover by using data from an intact replica if one becomes corrupted or unavailable. Upon detecting a checksum mismatch, the system discards the corrupt replica and rebuilds it from a healthy one. This approach only works as long as a valid replica exists, which is why frequent checksum verification is critical: it ensures corruption is caught and repaired as early as possible, before it propagates.
❤️ If you enjoyed this post, please hit the like button.