Visualizing K-Way Merge: An Interactive Guide to Database Sorting (opens in new tab)

aerial view of three empty tennis courts

Photo by Aleksandr Galichkin on Unsplash

In this post, inspired by a video by Tigerbeetle engineers Tobias Ziegler (@toziegler) and Alex Kladov (@matklad), I’ll walk through four algorithms for merging multiple sorted runs of data; a problem that shows up in modern database internals. You will be able to use an interactive web application to explore and understand each one.

The title image isn’t just a pretty picture of tennis courts; it’s the inspiration for the most efficient external sorting algorithm: The Tournament Tree.

External sort and k-way merge

Whilst most engineers are familiar with a number of sorting algorithms that work with data sets that fit entirely in memory, there are use cases, both historical and current, that do not. In these cases you must pull in small chunks of memory and sort them before outputting to a larger sorted output.

As an extreme example take the 1960 USA Census for example. The Census Bureau’s UNIVAC 1105 was a supercomputer of the day, with up to 18 attached tape drives. Even so, it had a limited memory (less than 300kb) and so there is no way to sort the data in memory.

Instead the process used was called Balanced k-way merge. Input data was put onto one set of tapes, and another set of tapes are used as output. The computer read as many records as it could fit in memory, sorted them and wrote them to tape. The tape was then filled with these sorted runs.

The k-way merge part comes from k (the number of runs being handled at once, in this case the number of tapes, let’s assume 4). At the first step the computer reads the first records from each tape and outputs the smallest one to the output tape. Next, it gets the next record from the tape where the smallest record came from. This process continues until the records on all 4 tapes have been exported to a single sorted run on the output tape.

External sort in modern databases

Modern databases use essentially the same logic. An LSM tree (see next section) flushes "SSTables" (sorted runs) to disk (which are modern "tapes") and then performs a background k-way merge (compaction) to combine them. The physics changed (seek times vs. linear scan), but the math of the k-way merge remains exactly the same.

External sort is used for various purposes:

Loading more...

Keyboard Shortcuts

Navigation
Next / previous item
j/k
Open post
oorEnter
Preview post
v
Post Actions
Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Save / unsave
s
Recommendations
Add interest / feed
Enter
Not interested
x
Go to
Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Browse
gb
Search
/
General
Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help