When we started building TiKV nine years ago, we chose RocksDB as our storage engine. And for good reasons — it’s a highly efficient, battle-tested key-value store with a flexible LSM-tree design that’s proven at scale. It gave us a great foundation to build a distributed transactional key-value database on.
But as our users’ data kept growing — terabytes upon terabytes, millions of SST files — we began to encounter the darker side of RocksDB. Performance bottlenecks surfaced, especially under high concurrency and large-scale deployments. Over time, we developed numerous optimizations for RocksDB to make it production-ready for TiKV’s scale.
Now, after nearly a decade, it feels like the right moment to look back and document some of these efforts — not only to share what we learned, but also to celebrate the engineering journey. Some of these optimizations have since made their way upstream to the official RocksDB repository. Others remain TiKV-specific, still powering massive production clusters every day.
This post is the first in a series where I’ll highlight a few of these optimizations.
The One I Hate the Most: The Global DB Mutex
If there’s one thing that haunted us for years, it’s the global DB mutex in RocksDB.
So many performance issues can be traced back to it.
Let’s take one real case from our production experience: Optimize async snapshot duration tail latency (#18743).
At one customer site, the workload was quite typical — continuous bulk data ingestion. Over time, the dataset in RocksDB grew to around 4 TB with 290K+ SST files. The P99 latency reached 100+ ms, which was simply unacceptable.
We ran a local simulation and discovered that the major bottleneck was contention on the global mutex — particularly during the LogAndApply phase of RocksDB’s version management.
Why Does RocksDB Need a Global Lock?
RocksDB uses a global lock (DB mutex) to ensure thread-safety and consistency across many internal operations. It protects:
- Version management — creating and switching between database versions
- MemTable operations — switching or flushing
- Column families — creation, modification, and deletion
- Compaction scheduling — choosing which files to compact
- Metadata updates — keeping file and level information consistent
- Snapshot list maintenance — to prevent concurrent modifications
- Write coordination — to synchronize write batches and WAL operations
All of these happen inside one single mutex — which means any operation that needs it can block others.
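To make that concrete, here is a deliberately simplified toy in C++ (not RocksDB’s actual code): a handful of otherwise unrelated operations all funnel through a single mutex, so any one of them can stall the rest.

```cpp
// A toy illustration of the problem, not RocksDB's real DBImpl:
// one mutex serializes otherwise unrelated operations.
#include <mutex>

class ToyDB {
 public:
  void SwitchMemTable() { std::lock_guard<std::mutex> l(db_mutex_); /* swap the active memtable */ }
  void PickCompaction() { std::lock_guard<std::mutex> l(db_mutex_); /* choose files to compact */ }
  void GetSnapshot()    { std::lock_guard<std::mutex> l(db_mutex_); /* append to the snapshot list */ }
  void LogAndApply()    { std::lock_guard<std::mutex> l(db_mutex_); /* install a new version */ }

 private:
  std::mutex db_mutex_;  // guards versions, memtables, column families, snapshots, ...
};
```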
Understanding LogAndApply
The mutex contention mainly comes from RocksDB’s LogAndApply operation, which runs on every flush, compaction, ingestion, and similar event. Its cost grows proportionally with the number of SST files, which is exactly why it becomes a major performance bottleneck at large scale.
The LogAndApply process has four parts, three of which do their work while holding the mutex (a condensed sketch follows this breakdown):
PrepareApply — executed outside the mutex, so we can safely ignore it.
SaveTo (Part 1) — merges the version edits with the base LSM-tree file set to generate a new LSM-tree file set.
- CheckConsistency is skipped in release builds.
- SaveSSTFilesTo performs a merge sort and derives each file’s placement from the sorted results. Since SST files are immutable after creation, this operation doesn’t actually need to be protected by the mutex.
Unref ~Version (Part 2) — deletes the current version, including destructing the arena of the LSM-tree file set.
- ~Version needs to unlink the version from the version list, which is protected by the mutex.
- However, the subfield VersionStorageInfo of Version can be freed in a background thread, so not all parts of this process require the mutex.
AppendVersion (Part 3) — appends the new version and unreferences the current version.
- This step is hard to move out of the mutex, as it directly manipulates the shared version list and ensures version consistency.
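Here is a heavily condensed model of the in-mutex section described above. The types and function bodies are invented stand-ins for RocksDB’s real Version machinery; only the locking shape matters.

```cpp
#include <mutex>
#include <vector>

// Toy stand-in for RocksDB's Version: holds the LSM-tree file metadata.
struct Version {
  std::vector<int> files;  // stands in for per-level SST file metadata
  int refs = 0;
  void Ref() { ++refs; }
  void Unref() { if (--refs == 0) delete this; }  // Part 2: frees metadata in-mutex
};

std::mutex db_mutex;
Version* current = nullptr;

// Part 1 (SaveTo): merge the edit into the base file set.
// Its cost scales with the total number of SST files.
static Version* SaveTo(const Version* base, const std::vector<int>& edit) {
  Version* v = new Version();
  if (base != nullptr) v->files = base->files;
  v->files.insert(v->files.end(), edit.begin(), edit.end());
  return v;
}

void LogAndApply(const std::vector<int>& edit) {
  // PrepareApply would run here, outside the mutex.
  std::lock_guard<std::mutex> lock(db_mutex);
  Version* new_version = SaveTo(current, edit);  // Part 1, under the mutex
  new_version->Ref();
  if (current != nullptr) current->Unref();      // Part 2: may run ~Version here
  current = new_version;                         // Part 3: AppendVersion
}
```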
Our Optimization: Move Work Out of the Mutex
To solve this, we made VersionStorageInfo heap-allocated instead of an embedded member object, and moved the heavy part of SaveTo outside the mutex.
Specifically:
- Converted VersionStorageInfo from a member object to a pointer, allowing asynchronous background deletion
- Moved SaveTo (merge logic) outside of the mutex to reduce contention
- Added a background cleanup of VersionStorageInfo via RocksDB’s environment scheduler
This change decouples most of the CPU-heavy work from the global mutex, dramatically improving concurrency. You can see all the changes here: Optimize LogAndApply in-mutex duration
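Under the same toy model as before, the optimized flow looks roughly like this. The names are still stand-ins, and a detached thread substitutes for RocksDB’s background scheduler, but it shows the idea: do the heavy merge before taking the mutex, and hand the bulky metadata to a background thread instead of freeing it in-mutex.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// The bulky LSM-tree file metadata, now owned through a pointer so it can
// outlive ~Version and be freed on a background thread.
struct StorageInfo {
  std::vector<int> files;
};

struct Version {
  StorageInfo* storage = new StorageInfo();
  int refs = 0;
};

std::mutex db_mutex;
Version* current = nullptr;

// LogAndApply callers are already serialized (one manifest writer at a time),
// so the contention that matters is with *other* mutex users: snapshots,
// memtable switches, compaction picking, and so on.
void LogAndApplyOptimized(const std::vector<int>& edit) {
  // Heavy part of SaveTo, now outside the mutex: SST files are immutable,
  // so merging the edit with the base file set needs no lock.
  Version* new_version = new Version();
  if (current != nullptr) new_version->storage->files = current->storage->files;
  new_version->storage->files.insert(new_version->storage->files.end(),
                                     edit.begin(), edit.end());

  StorageInfo* stale = nullptr;
  {
    std::lock_guard<std::mutex> lock(db_mutex);
    new_version->refs = 1;
    if (current != nullptr && --current->refs == 0) {
      stale = current->storage;  // detach the bulky metadata...
      delete current;            // ...so the in-mutex destruction stays cheap
    }
    current = new_version;       // AppendVersion still runs in-mutex
  }

  // Background cleanup: RocksDB schedules this on its Env thread pool;
  // a detached thread stands in for that here.
  if (stale != nullptr) {
    std::thread([stale] { delete stale; }).detach();
  }
}
```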
The Result: Tail Latency Drops 100×
After applying this optimization, the p999 latency for local read async snapshots in TiKV dropped from ~100 ms to under 1 ms.
(Left: baseline, Right: after optimization)
Looking Back
This was just one of the many RocksDB optimizations we made for TiKV.
It taught us an important lesson: sometimes the bottleneck isn’t your code, it’s the lock you didn’t question.
In the next parts of this series, I’ll share other stories: separating the write mutex from the DB mutex, write-amplification-based rate limiting, performance instrumentation with PerfFlag, and more. Each comes with its own war story and engineering trade-offs. Stay tuned.
What’s Next
In our next-generation TiDB X, we’ve gone one step further — we rewrote the entire storage engine, moving beyond RocksDB.
It’s now built directly on object storage, designed from the ground up for the cloud era — more scalable, cost-efficient, and elastic than anything RocksDB could offer.
I’ll share more about TiDB X and how it reimagines cloud-native storage architecture in future posts. If you’re curious to see what TiDB can do today, try it yourself in minutes with a free TiDB Cloud cluster.