Bf-Tree: A Modern Concurrent Larger-Than-Memory Range Index

The term "column-store" can refer to a couple of different models. Older OLAP systems do use the "index-less" DSM [1] style where every column is effectively its own index. The downside is poor write performance; slow ingestion limits practical data model sizes. (Also, some data types are not sortable, which creates other issues.) I wasn't thinking of these models.

Most modern OLAP-ish systems use "columnar-within-a-page" (CWAP) models like PAX [2]. These have write performance close to row-stores (NSM). While PAX has been replaced with more SIMD-friendly vector formats, those formats are operationally similar for these purposes. CWAP pages can be externally organized however you wish. They have no natural global organization for query purposes, so you have to provide logic…
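To make the CWAP idea concrete, here is a minimal sketch of a page that stores each column contiguously inside the page. The `PaxPage` class, its field names, and the row-count capacity are all hypothetical and only illustrate the layout; real formats track bytes, nulls, and encodings.

```python
from dataclasses import dataclass, field


@dataclass
class PaxPage:
    """Hypothetical columnar-within-a-page layout: one array per column, all in one page."""
    columns: tuple
    capacity: int  # assumed limit in rows; a real page would be limited in bytes
    data: dict = field(init=False)

    def __post_init__(self):
        self.data = {name: [] for name in self.columns}

    def __len__(self):
        return len(self.data[self.columns[0]])

    def append(self, row):
        """Append one row; a single page is touched, as in a row-store write."""
        if len(self) >= self.capacity:
            return False  # page full, caller must pick another page
        values = [row[name] for name in self.columns]  # validate before mutating
        for name, value in zip(self.columns, values):
            self.data[name].append(value)
        return True

    def scan(self, name):
        """Return one column as a contiguous list, without touching the others."""
        return self.data[name]


page = PaxPage(columns=("ts", "bytes"), capacity=2)
page.append({"ts": 1, "bytes": 10})
page.append({"ts": 2, "bytes": 20})
```

A write still lands in exactly one page, which is why write cost stays close to NSM, while a scan of `bytes` reads only that column's array within the page.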

External sharding at scale is where "mini-page" models can help. More and smaller shards improve query performance. Writes may be arbitrarily distributed across CWAP pages; mini-pages allow you to fit many times more write buffers in the same memory, greatly reducing page faults and I/O amplification. There is also often a correlation between data being written and queries of that data, so it helps with that too.
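A rough simulation of the buffer-count argument, under loudly stated assumptions: a fixed memory budget, an LRU pool of write buffers keyed by target page, and uniformly scattered writes. The sizes (64 KiB pages, 4 KiB mini-pages, 1 MiB budget) are made up for illustration, and the sketch ignores that a mini-page fills up sooner than a full page.

```python
import random
from collections import OrderedDict


def count_flushes(writes, buffer_slots):
    """Count evictions from an LRU pool of write buffers keyed by target page id."""
    pool = OrderedDict()
    flushes = 0
    for page_id in writes:
        if page_id in pool:
            pool.move_to_end(page_id)  # buffer already resident, no I/O
            continue
        if len(pool) >= buffer_slots:
            pool.popitem(last=False)  # evict least recently used buffer
            flushes += 1
        pool[page_id] = True
    return flushes


# Hypothetical sizes, chosen only to show the ratio of buffers per budget.
MEMORY_BUDGET = 1 << 20   # 1 MiB of write-buffer memory
FULL_PAGE = 64 * 1024     # buffering a whole page per target
MINI_PAGE = 4 * 1024      # buffering a mini-page per target

rng = random.Random(0)
writes = [rng.randrange(1024) for _ in range(10_000)]  # scattered over 1024 pages

full_slots = MEMORY_BUDGET // FULL_PAGE  # 16 buffers
mini_slots = MEMORY_BUDGET // MINI_PAGE  # 256 buffers
full_flushes = count_flushes(writes, full_slots)
mini_flushes = count_flushes(writes, mini_slots)
print(full_slots, mini_slots, full_flushes, mini_flushes)
```

With the same memory, the mini-page pool holds 16 times as many buffers in this setup, so more of the scattered writes find their buffer already resident and fewer evictions are forced.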

OLAP engines commonly attach min/max values to pages, shards, etc. There is no consistent terminology for this AFAICT (e.g. range filter, zone map, interval filter, constraint map). These can provide extremely good performance for some columns at low cost, which is why just about every OLAP system uses them. This is largely unrelated to the use case for mini-pages.
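For readers who have not seen one, a min/max filter can be sketched in a few lines. The `Zone` type and both helper functions are hypothetical names for illustration, not any particular engine's API.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    lo: int       # minimum value stored in the page
    hi: int       # maximum value stored in the page
    values: list  # the page's values for this column


def build_zones(values, page_size):
    """Split a column into fixed-size pages and record min/max per page."""
    zones = []
    for i in range(0, len(values), page_size):
        chunk = values[i:i + page_size]
        zones.append(Zone(min(chunk), max(chunk), chunk))
    return zones


def range_query(zones, lo, hi):
    """Return matching values and the number of pages actually scanned."""
    hits, scanned = [], 0
    for zone in zones:
        if zone.hi < lo or zone.lo > hi:
            continue  # page cannot contain a match, skip it entirely
        scanned += 1
        hits.extend(v for v in zone.values if lo <= v <= hi)
    return hits, scanned


zones = build_zones(list(range(100)), page_size=10)
hits, scanned = range_query(zones, 25, 34)
```

Here only 2 of the 10 pages are scanned. The filter works this well because the example column is sorted; on a column whose values are scattered across pages, most min/max ranges overlap the query and little is skipped, which is the "some columns" caveat above.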

There is no use case for space-filling curves (Morton/Z or Hilbert) unless you are storing your data on tape or some other medium that only supports linear search. A quad-tree is literally equivalent to a 2-dimensional space-filling curve and far more convenient. Spatial indexing is useful for OLAP but it has the same issue of writes being scattered across more pages than you have buffers.
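The quad-tree correspondence can be checked directly for the Z-order case. In this sketch (all function names are hypothetical), the Morton code of a cell is the same number you get by reading its quad-tree path, root to leaf, as base-4 digits.

```python
def morton_encode(x, y, bits):
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def quadtree_path(x, y, bits):
    """Quadrant index (0..3) chosen at each level, from the root down to the leaf."""
    path = []
    for level in range(bits - 1, -1, -1):
        qx = (x >> level) & 1
        qy = (y >> level) & 1
        path.append((qy << 1) | qx)
    return path


def path_to_code(path):
    """Read a quad-tree path as a base-4 number."""
    code = 0
    for quadrant in path:
        code = (code << 2) | quadrant
    return code


example_code = morton_encode(5, 3, bits=3)
example_path = quadtree_path(5, 3, bits=3)
```

For the cell (5, 3) on an 8x8 grid the path is [1, 2, 3] and the Morton code is 27, which is 123 in base 4. The tree gives you the same ordering plus the ability to stop at any level, which is the convenience being argued for.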

The workloads that originally motivated sub-page architectures were large-scale write-intensive analytics. My first use case for a database that worked this way was streaming ingest of 100TB-1PB per day of telemetry from mobile carrier backbones while being able to analyze that data (ad hoc SQL queries!) in real-time. This workload was bottlenecked by the rate of online updates to the indexing, which is where "mini-page" concepts really helped. That was 15 years ago!

Most of the technical complexity is making this play nicely with the physical storage layer and designing around the fact that the implementation must be memory efficient since it solves a problem largely created by not having enough memory.

[1] https://dl.acm.org/doi/epdf/10.1145/971699.318923

[2] https://www.vldb.org/conf/2001/P169.pdf
