metapool
**lightweight, cache-friendly pool allocator with compile-time configurable layout**
🕳️ Packed memory grid inside L1/L2 cache for simulation workloads
🪦 Header-only - no external dependencies; include mtp_memory.hpp to start
🧬 std::allocator adapter for direct use with standard templates
🌀 Up to ~1300x faster than malloc, ~3.5x faster than heap-free PMR pool
🧿 Allocation trace tools to log and visualize memory usage
🔳 introduction
metapool is a lightweight, high-performance memory allocator with compile-time layout configuration and preallocated thread-local arenas, written in C++23 for a game engine.
Unlike general-purpose allocators, it uses a pool-style layout tailored to expected allocation patterns. This repository includes native containers (WIP) and benchmarks against malloc, std::allocator (ptmalloc), and std::pmr::unsynchronized_pool_resource backed by std::pmr::monotonic_buffer_resource with a thread-local upstream buffer. In these tests, metapool’s dynamic array mtp::vault reaches up to 1300x faster creation and reserve() than std::vector, and up to 3.5x faster than std::pmr::vector.
metapool implements std::allocator and std::pmr::memory_resource adapters, making it usable as a backend for standard containers and smart pointers.
🔳 benchmark
Test system:
- Arch Linux laptop (kernel 6.15.8)
- glibc 2.42 (ptmalloc)
- clang 20.1.8
- AMD Ryzen 7 PRO 8840U
- L1d: 256 KB, L1i: 256 KB, L2: 8 MB, L3: 16 MB
- 64 GB DDR5
Test cases:
- std: std::allocator + std::vector (heap)
- pmr: pmr::unsynchronized_pool_resource + std::pmr::monotonic_buffer_resource + pmr::vector (no heap, throwing upstream)
- mtp: metapool + mtp::vault (no heap)
- malloc: raw ptmalloc allocation
Optimization flags:
```
-O3 -march=native -fstrict-aliasing -flto
```
Build & run:
```sh
./build.sh clean
./build.sh run micro
./build.sh run selective
```
Results: see the output printed by the benchmark runs above.
🔳 quickstart
Standard templates:
```cpp
#include "mtp_memory.hpp"

auto vec = mtp::make_vector<int, mtp::default_set>();
auto ptr = mtp::make_unique<int, mtp::default_set>();
auto str = mtp::make_string<mtp::default_set>("hello");
auto map = mtp::make_unordered_map<int, float, mtp::default_set>();
```
Core allocator API:
```cpp
#include "mtp_memory.hpp"

// raw memory allocation (size and alignment are in bytes)
auto metapool = mtp::get_allocator<mtp::default_set>();
auto* block = metapool.alloc(size, alignment);

// metapool-native construction path (no container, efficient inlining)
auto* obj = metapool.construct<YourType>(42);
metapool.destruct(obj);

// reset all freelists (objects are lost)
metapool.reset();
```
Metaset and native containers (WIP):
```cpp
#include "mtp_memory.hpp"

// custom metaset
using custom_set = mtp::metaset<
    mtp::def<mtp::capf::mul2, 64, 8, 16, 64, 128>
>;

// dynamic array mtp::vault<T>
auto v1 = mtp::make_vault<int, custom_set>();            // no allocation
auto v2 = mtp::make_vault<int, custom_set>(10);          // reserve space
auto v3 = mtp::make_vault<YourType, custom_set>(10, 42); // construct 10 objects
v3.emplace_back(YourType{});                             // grow and emplace 11th

mtp::vault<int, custom_set> v4;
v4.reserve(10);
```
Container set selection (optional - pass via compiler flags or define before including mtp_memory.hpp):
```cpp
#define MTP_CONTAINERS_MTP  // enable mtp containers (experimental)
#define MTP_CONTAINERS_STD  // enable std containers (factory helpers)
#define MTP_CONTAINERS_BOTH // enable both mtp and std containers
#define MTP_CONTAINERS_NONE // disable all container headers (default)
```
The std::allocator and std::pmr::memory_resource adapters are compiled either way, so you can still use metapool with standard containers manually.
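A minimal sketch of such manual wiring, assuming the std::allocator adapter is exposed under a name like mtp::std_allocator (the name here is hypothetical; check mtp_memory.hpp for the actual adapter types):

```cpp
#include "mtp_memory.hpp"
#include <vector>

// NOTE: mtp::std_allocator is a placeholder name for the std::allocator
// adapter mentioned above; substitute the type declared in mtp_memory.hpp.
using int_alloc = mtp::std_allocator<int, mtp::default_set>;

std::vector<int, int_alloc> values; // standard container, metapool-backed
values.reserve(100);
```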
metapool is initialized lazily. To force thread-local initialization:
```cpp
mtp::init<mtp::default_set>();
```
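Because arenas are thread-local, a worker thread can pay the setup cost up front before entering a hot loop; for example:

```cpp
#include "mtp_memory.hpp"
#include <thread>

std::thread worker([] {
    mtp::init<mtp::default_set>(); // initialize this thread's arena eagerly
    // ... allocations on this thread now hit a pre-initialized arena ...
});
worker.join();
```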
🔳 architectural overview
Each allocator instance holds a preconfigured metapool set - metaset. Each metapool manages a range of stride sizes with fast lookup through a proxy array.
A metapool contains multiple pools, each managing blocks of a specific stride, and a fast intrusive freelist backs each pool. A stride is a size class - the block size (in bytes), always a multiple of the configured stride step. Each allocation consists of:
- 2 bytes of metadata (header), stored just before the block’s data
- the object’s memory
- optional padding for alignment
Allocated objects are aligned to at least the default alignment quantum (8 bytes). If stricter alignment is needed, the stride is increased to fit it. Since stride steps are multiples of the alignment quantum, alignment is always resolved during stride selection. There’s no need for per-block alignment logic. Maximum supported alignment is 4096 bytes. metapool is SIMD-compatible.
Each allocator uses a freelist proxy array, with one entry per stride. When allocating, the stride index is computed from the size and alignment, and used to access the corresponding proxy. The same index is stored in the 2-byte header for fast deallocation.
metapool has no global fallback - arena size and freelist block counts are defined in the metaset at compile time.
If a freelist has no free blocks, allocation steps through the next larger stride until one succeeds. Since proxies are sorted by stride, this fallback is a fast linear scan. If all eligible freelists are exhausted, the allocator fails explicitly.
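As a rough illustration of the lookup described above (a sketch with simplified stand-ins, not metapool's actual internals, and assuming strides at stride_step, 2 × stride_step, ...):

```cpp
#include <cstddef>

// Simplified stand-in for a freelist proxy entry.
struct proxy { void* freelist_head; };

constexpr std::size_t stride_step  = 8; // assumed step for this sketch
constexpr std::size_t header_bytes = 2; // per-block metadata

// Returns the proxy index to allocate from, or proxy_count on failure.
std::size_t pick_proxy(std::size_t size, std::size_t alignment,
                       const proxy* proxies, std::size_t proxy_count) {
    // Header + object, rounded up to the next stride. Rounding to a
    // multiple of max(alignment, stride_step) resolves alignment during
    // stride selection, with no per-block alignment logic.
    std::size_t step   = alignment > stride_step ? alignment : stride_step;
    std::size_t needed = size + header_bytes;
    std::size_t stride = (needed + step - 1) / step * step;
    std::size_t index  = stride / stride_step - 1; // strides assumed at k * stride_step

    // Proxies are sorted by stride, so an exhausted freelist falls
    // through to the next larger stride via a fast linear scan.
    while (index < proxy_count && proxies[index].freelist_head == nullptr)
        ++index;
    return index; // the same index is stored in the 2-byte block header
}
```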
🔳 defining metaset
Each pivot is a stride.
Each stride is a multiple of the stride step and is equal to the size of the block.
Each stride step is a power of two, with a minimum of 8 bytes and a maximum of 8 MB.
Metapool entry in a metaset:
```
def<
    capacity_function,
    base_block_count,
    stride_step,
    stride_min (pivot 0),
    strides... (pivots 1...N-1),
    stride_max (pivot N)
>
```
- capacity_function – controls how block count grows between strides
- base_block_count – blocks allocated at the first stride
- stride_step – byte spacing between consecutive strides
- stride_min – starting stride (pivot 0)
- pivot 1...N-1 – intermediate strides where block count changes
- stride_max – final supported stride (pivot N)
Each stride pivot divides a metapool’s stride range into segments. The index of a pivot defines both the segment index and the exponent used in the capacity function.
[ stride_min (pivot_0), pivot_1, pivot_2, pivot_3, ..., stride_max (pivot_N)]
If the capacity function is mul4, the base block count is 256, and the pivot index is used as the exponent, then block counts are computed as:
block_count = base_block_count × (4^pivot_index)
This produces a sequence like:
256 * 4^0, 256 * 4^1, 256 * 4^2, 256 * 4^3, ..., 256 * 4^N
Capacity functions allow you to scale block counts across a stride range without needing a separate metapool for every change. Fewer metapools means faster allocation lookup.
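For instance, the mul4 sequence above can be reproduced with a few lines (a sketch of the arithmetic only, not the library's code):

```cpp
#include <cstddef>
#include <cstdio>

// block_count = base_block_count * 4^pivot_index, as described above.
constexpr std::size_t mul4_block_count(std::size_t base, std::size_t pivot_index) {
    std::size_t count = base;
    for (std::size_t i = 0; i < pivot_index; ++i)
        count *= 4;
    return count;
}

int main() {
    for (std::size_t p = 0; p <= 3; ++p)
        std::printf("pivot %zu: %zu blocks\n", p, mul4_block_count(256, p));
    // pivot 0: 256, pivot 1: 1024, pivot 2: 4096, pivot 3: 16384
}
```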
Stride ranges across metapools must not overlap, but gaps are allowed. This helps optimize for sparse allocation patterns. Allocation sizes are rounded up to the nearest supported stride. For example, if the smallest stride is 1024 bytes and you allocate 2 bytes, the allocator will use 2 bytes for the header and waste 1020 bytes per block.
🔳 example metaset
```cpp
metaset<
    def<capf::mul2, 128, 8, 16, 64, 128, 192>, // metapool 1
    def<capf::flat, 256, 16, 256, 512>,        // metapool 2
    def<capf::flat, 64, 8, 576, 576>           // metapool 3
>;
```
Metapool 1:
Range: 16 to 192 bytes, step 8
Base block count: 128
Capacity grows at 64 and 128 using mul2
Total strides: (192 − 16) / 8 + 1 = 23
Metapool 2:
Range: 256 to 512 bytes, step 16
Constant block count: 256 (flat)
Total strides: (512 − 256) / 16 + 1 = 17
Metapool 3:
Single stride: 576 bytes
Block count: 64
Step is irrelevant since there's only one stride
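As a rough footprint check (ignoring allocator bookkeeping): metapool 3 reserves 64 blocks × 576 bytes = 36 864 bytes, about 36 KB, for its single stride. The same stride × block-count arithmetic, with the capacity function applied per segment, estimates the arena size of the other metapools.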
🔳 memory trace
To enable trace instrumentation, define MTP_ENABLE_TRACE or pass it to the compiler.
To export trace data:
```cpp
#define MTP_ENABLE_TRACE

/*
...
traced allocations
...
*/

mtp::export_trace("trace/your_traced_system.csv");
```
Then run the script (requires python + matplotlib):
```sh
python plot_trace.py trace/your_traced_system.csv
```
This prints the trace data to the terminal and generates a plot:
```
raw_size  | proxy |   count | fallbacks | raw_total | stride_total |      peak
------------------------------------------------------------------------------
8         | 0     | 1904764 | 0         | 15238112  | 30476224     | 14476224
16        | 0     | 1500000 | 0         | 24000000  | 36000000     | 12000000
32        | 1     | 3404870 | 0         | 108955840 | 136194800    | 56194800
64        | 2     | 1500002 | 0         | 96000128  | 108000144    | 36000072
128       | 3     | 3405190 | 0         | 435864320 | 463105840    | 191105704
16777216  | 11    | 6       | 0         | 100663296 | 100663344    | 50331672
33554432  | 11    | 6       | 0         | 201326592 | 201326640    | 100663320
67108864  | 11    | 4       | 0         | 268435456 | 268435488    | 134217744
134217728 | 12    | 4       | 0         | 536870912 | 536870944    | 268435472
268435456 | 13    | 2       | 0         | 536870912 | 536870928    | 268435464
```