At Materialize, we often ask ourselves which parts of our system we could fundamentally change to enable new workloads. How we manage memory for maintained SQL objects is one such area. In this post, I’ll explain our previous approach, what limited its scalability, and how our new approach—swap—increases flexibility and delivers more value to our customers.
Users value Materialize for its data freshness. Results are always up-to-date, and we precisely report how quickly we respond to upstream changes. Materialize transforms SQL into differential dataflow programs that incrementally maintain results. The update cost depends on both the rate of input changes and the total data volume. We prioritize freshness, and to this end we might use more memory than absolutely needed to amortize CPU consumption as data changes.
Inside Materialize we use special indexes called arrangements to maintain data efficiently. Arrangements store data and changes in memory, similarly to a log-structured merge tree. An arrangement stores more recent updates in smaller blocks, and older updates in larger ones. This enables both low-latency updates and efficient storage of large amounts of data.
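To make the idea concrete, here is a deliberately simplified sketch in Rust: updates are kept in sorted batches, and neighboring batches of similar size are merged over time, much like compaction in a log-structured merge tree. The names and structure are illustrative only; the real arrangements in differential dataflow are considerably more sophisticated.

```rust
/// One logical change: a piece of data, the time it changed, and by
/// how much (+1 for an insert, -1 for a delete).
struct Update<D> {
    data: D,
    time: u64,
    diff: i64,
}

/// A toy arrangement: sorted batches of updates, newest last. Small,
/// recent batches are merged into larger, older ones over time, much
/// like compaction in a log-structured merge tree.
struct Arrangement<D> {
    batches: Vec<Vec<Update<D>>>,
}

impl<D: Ord> Arrangement<D> {
    fn new() -> Self {
        Arrangement { batches: Vec::new() }
    }

    /// Add a batch of recent updates and compact if needed.
    fn insert(&mut self, mut batch: Vec<Update<D>>) {
        batch.sort_by(|a, b| (&a.data, a.time).cmp(&(&b.data, b.time)));
        self.batches.push(batch);
        self.maybe_merge();
    }

    /// Merge the two newest batches while they are of comparable size,
    /// keeping a geometric size progression from newest to oldest.
    fn maybe_merge(&mut self) {
        while self.batches.len() >= 2 {
            let n = self.batches.len();
            if self.batches[n - 2].len() > 2 * self.batches[n - 1].len() {
                break;
            }
            let newer = self.batches.pop().unwrap();
            let mut older = self.batches.pop().unwrap();
            older.extend(newer);
            older.sort_by(|a, b| (&a.data, a.time).cmp(&(&b.data, b.time)));
            self.batches.push(older);
        }
    }
}

fn main() {
    let mut arranged = Arrangement::new();
    arranged.insert(vec![Update { data: "hello", time: 0, diff: 1 }]);
    arranged.insert(vec![Update { data: "hello", time: 1, diff: -1 }]);
    println!("number of batches: {}", arranged.batches.len());
}
```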
Materialize runs on regular computers with a limited amount of memory (RAM) and disk, so we must use these resources efficiently. To achieve minimal update and query latency, we would like to store hot data in memory, and only move cold data to disk. While memory offers faster access times, it’s both limited and costly. When workloads exceed available memory, we aim for graceful performance degradation by offloading portions of cold data to disk.
Phase 1: manually manage data that can spill to disk
Our previous approach to supporting larger-than-memory workloads was a custom memory allocator backed by memory-mapped files. This gave Linux the option to move data to disk when needed. The approach served us well for about two years, allowing us to handle larger-than-memory workloads with only a moderate performance impact.
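Stripped of all of lgalloc's actual machinery, the core idea looks roughly like the sketch below. It assumes the memmap2 crate and a hypothetical backing file; because allocations live in a file-backed mapping, the kernel is free to write the pages out and reclaim them when memory gets tight.

```rust
// A minimal sketch of a file-backed allocation region, not
// rust-lgalloc's actual implementation. The file path and sizing are
// hypothetical, chosen only for illustration.
use std::fs::OpenOptions;
use std::io;

use memmap2::MmapMut;

/// A fixed-size region of memory backed by a file on disk.
struct FileBackedRegion {
    mapping: MmapMut,
}

impl FileBackedRegion {
    fn new(path: &str, len: u64) -> io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;
        // Reserve `len` bytes in the file and map them into our address space.
        file.set_len(len)?;
        let mapping = unsafe { MmapMut::map_mut(&file)? };
        Ok(Self { mapping })
    }

    /// Hand out a slice of the mapped region; a real allocator would
    /// track free lists, size classes, and reuse here.
    fn slice_mut(&mut self, offset: usize, len: usize) -> &mut [u8] {
        &mut self.mapping[offset..offset + len]
    }
}

fn main() -> io::Result<()> {
    let mut region = FileBackedRegion::new("/tmp/spill.bin", 1 << 20)?;
    region.slice_mut(0, 8).copy_from_slice(&42u64.to_le_bytes());
    Ok(())
}
```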
However, Linux has inherent behaviors that restrict how effectively we can use memory with this approach.
Linux limits the amount of file data with unsaved (dirty) changes, writing content back to disk when it deems necessary. While this behavior benefits applications that require durability, it’s unnecessary for Materialize, where all of this data is ephemeral (or persisted to blob storage). As a result, we frequently write data to disk that will never be accessed again, wasting CPU time and I/O bandwidth.
Linux also reserves disk space for memory-mapped files, even when it’s not immediately needed. While this policy makes sense for most applications, it’s suboptimal for Materialize’s ephemeral data. Ideally, we would only allocate disk space under memory pressure; instead, the file system reserves space that often goes unused, causing unnecessary I/O operations.
Given these limitations, we had to decide upfront which data could spill to disk and which could not. Allowing all data to spill hurts performance, while allowing too little has diminishing returns: a workload can run out of non-spillable memory before it exhausts disk. In practice, Materialize could only spill about twice the amount of physical memory to disk.
Dividing data into spillable and non-spillable categories creates a fundamental constraint: we can only handle as much non-spillable data as we have available memory. Exceeding this limit triggers out-of-memory errors and leads to a poor user experience. This becomes particularly challenging during the hydration phase after starting a workload, when we typically load, process, and index large volumes of data. Without precise sequencing of these operations, memory consumption can quickly exceed available capacity. Unfortunately, implementing such precise sequencing is difficult in many scenarios.
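Conceptually, the phase 1 model amounts to an upfront classification like the hypothetical sketch below; the names are made up for illustration and do not reflect Materialize’s actual internals.

```rust
/// Hypothetical phase 1 placement decision: illustrative only.
enum Placement {
    /// Must stay in RAM; too much of this leads to out-of-memory errors.
    HeapOnly,
    /// Backed by the file-backed allocator and allowed to spill to disk.
    Spillable,
}

/// In phase 1 the split had to be decided upfront, per kind of data,
/// rather than based on actual memory pressure at runtime.
fn classify(is_arrangement_batch: bool) -> Placement {
    if is_arrangement_batch {
        Placement::Spillable
    } else {
        Placement::HeapOnly
    }
}

fn main() {
    for (kind, is_batch) in [("arrangement batch", true), ("operator state", false)] {
        let placement = match classify(is_batch) {
            Placement::Spillable => "may spill to disk",
            Placement::HeapOnly => "must fit in memory",
        };
        println!("{kind}: {placement}");
    }
}
```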
We published the allocator as an open-source project: rust-lgalloc.
Recently, we received a request to support a much larger workload, which prompted us to explore alternative approaches. When testing our previous approach with workloads of several TiB, we quickly discovered it couldn’t scale reliably to the required size.
Phase 2: let the operating system page memory
A significant development occurred: Kubernetes introduced new APIs for more flexible memory management. Specifically, Kubernetes graduated Linux swap support to beta, and vendors are gradually adding support. Linux swap allows the operating system to move infrequently accessed memory to disk under memory pressure, freeing space for the active parts of a workload. This process is transparent to the application.
Swap is not effective for all applications: programs that do not organize their memory carefully will slow down significantly once they hit memory limits. Our Phase 1 approach already required us to pack allocations into contiguous regions of memory, which makes swap highly effective. When related data sits close together in memory, prefetching can load what the application needs next from disk, amortizing the cost of disk access.
Testing Materialize with swap proved straightforward: it required only adjusting the Kubernetes configuration, since Materialize’s design already supported this mode of operation. Hydration and steady-state performance are better than in Phase 1, and swap allows us to provision more disk relative to memory, enabling larger workloads.
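For reference, node-level swap is controlled through the kubelet configuration, in addition to provisioning swap space on the node itself. The snippet below illustrates roughly what this looks like on recent Kubernetes versions; the exact fields, defaults, and feature gates depend on your Kubernetes version and distribution, so treat it as a sketch rather than a drop-in config.

```yaml
# Illustrative kubelet configuration enabling swap on a node; exact
# fields and defaults depend on the Kubernetes version and distribution.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to start on a node that has swap enabled.
failSwapOn: false
memorySwap:
  # With LimitedSwap, eligible pods may use a bounded amount of swap;
  # this is the beta behavior referenced above.
  swapBehavior: LimitedSwap
```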
Swap in production
We’ve successfully rolled out swap to all customers of our Cloud hosted product. The transition was seamless, and we’ve verified that performance meets the same requirements that lgalloc previously satisfied.
With swap, we observe improved hydration times, and better memory utilization. This is because we only spill to disk when needed, and swap is more efficient than memory-mapped files. In numbers, we’ve seen a 30% reduction in hydration time, and tests show that we can offer 3x more memory at negligible freshness costs for most workloads.
For Self-Managed deployments, the situation is slightly more complex. While Kubernetes offers an API that should work everywhere, new features take time to implement and may not work as expected at first. At the moment, we support swap on Amazon’s EKS (with both Bottlerocket and Amazon Linux), and we plan to support GCP and Azure later.