From time to time I get to investigate issues with some sort of a batch process. It’s getting more and more common that such processes use very high memory limits (maintenance_work_mem and work_mem). I suppose some DBAs follow the logic that “more is better”, not realizing it can hurt the performance quite a bit.
Let me demonstrate this using an example I ran across while testing a fix for parallel builds of GIN indexes. The bug is not particularly interesting or complex, but it required a fairly high value for maintenance_work_mem (the initial report used 20GB).
To validate a fix, I ran the CREATE INDEX for a range of maintenance_work_mem values, and with different numbers of parallel workers. The goal was to check if some of those still fail, but I also measured timings. And I got this chart:
This is from a D96v4 instance on Azure, with a Xeon Platinum 8573C, 384GB RAM and RAID0 on 6x NVMe drives. The data fits comfortably in RAM, so the build is fully cached and CPU-bound.
Parallelism helps quite a bit: the build is ~1.8x faster with 2 workers (the leader counts as a worker). That is an almost perfect speedup, as the final phase of the build remains serial. The speedup decreases with more workers (with 8 workers it’s only ~4.5x), but that’s expected.
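For reference, each data point in the chart came from a run along these lines. This is only a sketch: the table and column names are made up for illustration, and the actual values cycled through differed.

    -- hypothetical benchmark step, run in psql for each combination of settings
    SET maintenance_work_mem = '16GB';           -- varied from 64MB up to 16GB
    SET max_parallel_maintenance_workers = 7;    -- the leader participates, so this is "8 workers"

    \timing on
    CREATE INDEX idx_docs_tsv ON documents USING gin (tsv);  -- tsv is a hypothetical tsvector column
    DROP INDEX idx_docs_tsv;                                  -- reset for the next combination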
But the chart also shows that increasing maintenance_work_mem makes it slower! Going from 64MB to 16GB increases the duration by 30%, no matter how many workers are used.
Why is this happening?
There are likely multiple factors contributing to this behavior. Let me explain two I believe to be the most significant.
First, there’s the L3 cache size.
Not all RAM in a system is equally fast; there are several tiers with very different performance. A tiny fraction of memory is integrated into the CPU as cache, and so has very low latency. Most CPUs have only ~32-128MB of this fast on-CPU memory, but it’s about an order of magnitude faster than main memory.
When building indexes, we usually accumulate data into a buffer, process it when the buffer gets “full,” and then combine it into a final index. For GIN indexes we add entries into a hash table, which implies random memory access. When the hash table overflows the L3, it has to access the main memory more often. And that is much more expensive than accessing L3, the difference is about 20 vs. 200 cycles.
It’s often better to process data in smaller chunks, sized so that they fit into the L3 cache. Sure, we’ll need more chunks, but it may still be a win.
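As a rough illustration (not a tuning recommendation), keeping the limit in the neighborhood of the L3 size is just a matter of setting the GUC before the build; the “right” value has to be confirmed by measurement:

    -- hypothetical: keep the per-process buffer roughly within the L3 cache
    -- (64MB is a guess based on typical L3 sizes, not a universal rule)
    SET maintenance_work_mem = '64MB';
    CREATE INDEX idx_docs_tsv ON documents USING gin (tsv);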
Note: There’s a lot more interesting information about memory in the paper What Every Programmer Should Know About Memory, by Ulrich Drepper. It’s from 2007, but the fundamentals have not changed.
There are other reasons why it may be better to process data in smaller chunks. Larger chunks may increase pressure on other parts of the system, and even cause stalls.
For example, when the hash table exceeds the memory limit (as set by maintenance_work_mem), the data gets written out to disk. These are temporary files, without durability requirements, so it’s fine to write the data only to page cache.
However, the kernel also tries to keep the amount of dirty data under control, using two thresholds. After hitting vm.dirty_background_ratio it starts writing out dirty data in the background, while user writes still go to page cache. If the amount of dirty data keeps growing and hits vm.dirty_ratio, all writes are forced to be synchronous.
Ideally, the background writes are sufficient to control the amount of dirty data, and the vm.dirty_ratio threshold is never reached. How well this proactive flushing works depends on how much time the kernel has to respond. Writing data more often and in smaller chunks gives the kernel more time to flush out some of the data.
Say it takes a minute to accumulate 8GB of data in the hash table. You may wait for the full 8GB and then write it out all at once: there will be no writes most of the time, followed by a short burst of 8GB of writes. Or you may accumulate and write out 64MB chunks, in which case the writes are spread much more evenly over time, giving the kernel time to react.
Of course, in some cases it may be better to accumulate more data. It may improve compression ratio, for example. Hard to say.
Summary
All of this applies to work_mem too, of course. The only difference is that work_mem is used for regular queries, not for maintenance operations, but the reasoning is exactly the same. A hash join builds a hash table, limited by work_mem. If the hash table exceeds L3, it gets slower. The same goes for hash aggregates, sorts, and various other operations.
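As a concrete (and entirely hypothetical) example, the hash table for the join and the aggregate below is sized according to work_mem, and the same L3 considerations apply:

    -- hypothetical query; the hash join / hash aggregate are limited by work_mem
    SET work_mem = '64MB';
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.customer_id, count(*)
      FROM orders o JOIN customers c ON c.id = o.customer_id
     GROUP BY o.customer_id;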
I’m not sure what the “best” setting for maintenance_work_mem and work_mem is, but that’s not what this post is about. My point is that blindly setting a very high value may be quite harmful.
My recommendation would be to stick to modest values (e.g. 64MB), and only adjust the value if you can demonstrate the benefits.
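In practice that might look like keeping the server-wide default modest, and raising the limit only for a specific session or transaction where a benchmark shows it actually helps. A sketch (the 1GB value and the index are placeholders, not recommendations):

    -- keep the server-wide default modest
    ALTER SYSTEM SET maintenance_work_mem = '64MB';
    SELECT pg_reload_conf();

    -- raise the limit only for a specific operation, after measuring the benefit
    BEGIN;
    SET LOCAL maintenance_work_mem = '1GB';
    CREATE INDEX idx_docs_tsv ON documents USING gin (tsv);
    COMMIT;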
Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.