In the past few years, we’ve seen a Cambrian explosion of new columnar formats challenging the hegemony of Parquet: Lance, FastLanes, Nimble, Vortex, AnyBlox, F3 (File Format for the Future). The thinking is that the context has changed so much that the design of yore (the previous decade) is not going to cut it moving forward. This intrigued me, especially since the main contribution of Parquet has been to provide a standard for columnar storage. Parquet is not simply a file format. As an open source project hosted by the ASF, it acts as a consensus building machine for the industry. Creating six new formats is not going to help with interoperability. I spent some time understanding how things have actually changed and how Parquet needs to adapt to meet the demands of this new era. In this post I’ll discuss my findings.
Whether we like it or not, we’re living in an AI-dominated era. Some argue that the main consumer of data will become AI and not humans, yet much of our data infrastructure was designed for a very different time.
I’m the chair of the Parquet PMC at the ASF. Like any participant in this ecosystem, I have a biased perspective, but it also gives me a useful vantage point to talk about both what we got right and where the format can be improved.
How We Got Here
Started back in 2012, Parquet has become the de facto standard for columnar storage. Even though I put a lot of effort into building the community to make that happen, I’m still amazed by the impact of what started as a side project.
At the time, the main argument that convinced us to invest more in the project was simple: the very first prototype, which wasn’t even called Parquet yet, saved 30% on storage right off the bat with very little optimization. That’s just one benefit of a columnar representation.
We were coming out of the Hadoop era, and the format was inspired by Google’s Dremel paper. At the time, if you were implementing a Google paper, you were the real deal.
Parquet enabled interoperability across this new generation of vectorized query engines that were starting to emerge based on database research (C-Store, MonetDB/X100).
The Original Trade-off Triangle
The fundamental trade-off was three-way: cost of storage versus time to decode spent by the CPU versus time to transfer on the wire.
If you compress more then it costs less to store and takes less time to transfer on the wire. However, it may take more CPU time to decompress. The question becomes: given the characteristics of CPUs at the time, what encodings and compression schemes meet the sweet spot?
The cheapest data to transfer is still the data you skip entirely. This is where the columnar layout shines. Thanks to columns, you can push down projections: only access the columns you actually need. Thanks to statistics embedded at the row group and page level, you can also prune partitions and row groups: only access the rows you need. This allows you to download a lot less data. That’s how we make OLAP work, picking sort orders and partitioning schemes so that we don’t have to scan all the data.
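As a concrete illustration, here is what both kinds of pushdown look like with pyarrow (the file and column names are invented for the example):

```python
import pyarrow.parquet as pq

# Projection pushdown: only the requested column chunks are fetched.
# Predicate pushdown: row groups whose min/max statistics rule out the
# filter are skipped without being downloaded or decoded.
table = pq.read_table(
    "sales.parquet",
    columns=["event_time", "amount"],
    filters=[("amount", ">", 100)],
)
```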

The Layout
The Parquet file structure has footer metadata at the bottom that points to row groups, which contain column chunks, which are split into pages. Pages are the unit of decoding and decompression. Back in the Hadoop days, we sized row groups based on the HDFS block size. Nowadays, in the object storage world, it is common to have one big row group per file.
The page structure has values that get encoded (for example with a dictionary) and then optionally compressed with something like Zstandard. The compression is optional because sometimes the encoding is good enough: if the encoded data is already compact, it’s faster to just decode it than to also run it through a general-purpose compression algorithm. Specialized lightweight encodings beat generic compression for speed, when you can get away with it. In the past, when networks were slow relative to CPUs, it made sense to trade more CPU cost for more compression, but the ratio has changed as networks have gotten faster.
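To make the trade-off concrete, here is a small pyarrow sketch (column and file names are invented) comparing dictionary encoding alone against dictionary encoding plus Zstandard; for low-cardinality data the extra compression pass often buys little:

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Low-cardinality column: dictionary encoding already removes most of the
# redundancy, so general-purpose compression on top adds CPU cost for a
# modest size win.
table = pa.table({"country": ["FR", "US", "DE", "FR"] * 250_000})

pq.write_table(table, "dict_only.parquet", compression="NONE")
pq.write_table(table, "dict_zstd.parquet", compression="ZSTD")

print(os.path.getsize("dict_only.parquet"), os.path.getsize("dict_zstd.parquet"))
```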

Praise: What People Like
- Columnar representation compresses a lot better than row-oriented layouts, not to mention that binary formats are more efficient than text-based formats like CSV and JSON.
- Parquet is self-contained and self-describing, embedding its schema. When exchanging data, there’s no room for interpretation or mistakes in type inference.
- Multi-level statistics allow you to skip data. Statistics at the row group level let you prune entire row groups. Statistics in each page let you implement predicate pushdown and skip decoding.
- And perhaps most importantly: wide adoption. A de facto standard with an engaged community of implementers. If you come to the Parquet sync we hold every two weeks, you’ll find folks from most of the major database vendors and open source projects: a thriving community we can rely on to evolve the standard.
Gripes: What People Don’t Like
- Not parallelizable enough: the encodings don’t take full advantage of SIMD or GPUs. I was once talking with someone implementing Parquet decoding on GPUs and they asked me why we designed a specific encoding in a way that makes it really hard to parallelize. Well, in 2012, GPUs were not as mainstream as today and that was just not an optimization criterion.
- The metadata gets large on datasets with many columns. Nowadays we have wider schemas more often. For example, people build feature stores and end up with a lot of columns. We just didn’t optimize for millions of columns.
- The format isn’t optimized for random access. When you compress pages, the minimum amount of data you need to decode and decompress is the entire page. If you have random access patterns, there’s a minimum time cost to get to a single value inside the page as you need to decode the surrounding data.
- We’re too reliant on generic compression. Zstandard is great, but we can do much better with type-specific lightweight encodings.
What Changed?
Hardware
Looking at data from 2010 to today, focusing on high-end servers, we now have 16 times more cores on average and SIMD became four times as wide. That’s a lot more available parallel processing.
Similarly, GPUs today have about 40 times more “threads” than they had 15 years ago, keeping in mind that GPUs work more like SIMD than traditional threads: they execute the same instruction across many data elements at once.
That’s the first shift: We have way more parallelism available in processing.
AI Access Patterns
The second shift is about how we’re accessing data.
When querying embeddings in a vector store, we need to retrieve documents quickly. For example, when implementing RAG for a chatbot, querying the vector store is only one step in a potentially long pipeline before we get to generating an answer. In this context every millisecond adds up, and you need to reduce latency as much as possible.
Those vector stores use columnar storage, but they also need very fast random access, not just the sequential scanning we originally optimized for. Additionally, to process data “at the speed of AI”, we also need more throughput when scanning the data, possibly even processing it where the AI lives: on the GPU.
There are a few things we can improve in Parquet to adjust for those needs.
The Healthy Pressure of Newcomers
These shifts have motivated people to create new approaches to encoding data. New research papers have been published exploring lightweight encodings (specialized encodings for certain data types that compress well but are also extremely fast) that take much better advantage of the parallelism of modern hardware: BtrBlocks, ALP (floating points), FastLanes (integers), FSST (strings).
A flurry of new file formats came out recently that challenged the status quo:
- Lance and Vortex, created by startups building vector and multi-modal stores, are optimizing for AI use cases. (Vortex was donated to LF AI & Data, where it is incubating.)
- Nimble, created by Facebook, treats the API rather than the file layout as the contract.
- FastLanes, AnyBlox and F3 (literally “File Format for the Future”) are research-driven formats.
Now we are faced with the conundrum of figuring out which ones will manage to reach broad adoption in the ecosystem. From where I’m standing it seems easier to contribute the missing bits to Parquet and build those systems on top. These formats prove that these techniques work. We can get much better performance by applying new approaches. As a community, we should take a hint and evolve what we have.
Has the design fundamentally changed?
If you look at the design of these formats, they’re fundamentally using the same columnar layout. Some things vary: they pay more attention to how they lay out metadata and they pick encodings better suited to their use cases, but the overall encapsulation format isn’t that different. They still follow PAX, have multi-level structures like row groups and pages (sometimes under different names), and embed metadata for skipping at different levels in the same way.
Some of the features provided, like table mutations and versioning (Lance is also a table format), are influencing Iceberg rather than Parquet.
Parquet as a Consensus Building Machine
So what is Parquet?
You might think Parquet is a file format. But Parquet is, first and foremost, an open-source project. It’s a consensus building machine. It pulls a huge community of open source projects and vendors in its wake.
There’s an often quoted proverb: “If you want to go fast, go alone. If you want to go far, go together.”
Parquet moves slower because it’s pulling the community behind it. That’s the hard part. Building the community and the consensus is what takes time, but the benefit is broad adoption in the ecosystem.
I find it a bit meta when research papers use some particular implementation of the Parquet format as a baseline to evaluate their results. Parquet’s encodings are based on previous research papers. Parquet is a moving target: it’s what we’ve integrated from research so far. Eventually, if your approach is much better, it will find its way into the project and become the new baseline.
We stand on the shoulders of giants: those researchers building encodings are doing the hard work of showing the way we should follow next. We make sure we get consensus in the industry to adopt their work. The trade-off is that we cannot adopt everything and need to balance things out to minimize complexity.
Case Study: The Variant Type
To give you an example of how bigger changes make their way into Parquet: about a year ago, engineers made an initial proposal to find a neutral home for the variant type, which at the time lived in Spark. Variant is akin to a binary representation of JSON. It stores the field names in one column and the values in another, and you can selectively shred a subset of the fields into their own columns. It is useful when you have unknown field cardinality or too many sparse fields in your data. The big question was whether this new type should be defined in Spark, Arrow, Iceberg or Parquet. What made the most sense, knowing that all of those projects (and more) would end up using it?
We agreed to put it in Parquet. Then we worked as a community to finalize a consensus on the spec. We needed to make sure everybody was on the same page. We changed a few things, made sure we all agreed, and then implemented it across the ecosystem. (Thanks to Gang, Aihua, Gene, Micah, Andrew, Ryan B, Ryan J, Yufei, Jiaying, Martin, Aditya, Matt, Antoine, Daniel, Russell and many others.)
The community produced multiple implementations in multiple systems, open source or not, and collaborated on cross-compatibility tests to make sure we were building compatible systems. This included individuals from Databricks, Snowflake, Google, Tabular, Datadog, CMU, InfluxData, Dremio, Columnar and more (I’m sorry if I forgot you; please reach out and I’ll add you here).
Now we know that when a Variant is written in one system, it’s going to be read correctly in another. From Databricks to Snowflake and BigQuery, and from DataFusion to DuckDB and Spark: no surprises. (And Dremio, and InfluxDB, etc.)
That’s how the consensus building machine works.
The Technical Path Forward
You might think that optimizing for large OLAP scans and optimizing for random access are at odds. But here is the good news: the characteristics that let an encoding parallelize better and decode faster are the same characteristics that enable faster random access. The trick is removing data dependencies.

You need to ensure that decoding the next value doesn’t depend on decoding the previous value. Otherwise you cannot decode values in parallel, and you also cannot decode the nth value without decoding all the previous ones.
Some of Parquet’s encodings are very data-dependent. A common encoding technique for a measure that varies by small amounts from one value to the next is delta encoding: you compute the difference between consecutive values, and these small integers can be bit-packed into a much smaller size. However, to decode the values you need to sum the deltas in order. This is not really parallelizable, and reading any value requires decoding all the values before it.
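A tiny numpy sketch of the problem (the arrays are just an illustration): encoding is trivially parallel, but decoding is a prefix sum, so value n cannot be recovered without touching every delta before it.

```python
import numpy as np

values = np.array([1000, 1002, 1001, 1005, 1007])

# Encoding: each delta only needs its immediate neighbor, easy to vectorize.
deltas = np.diff(values, prepend=0)

# Decoding: a running sum. Recovering values[n] requires every delta up to
# n, which serializes the work and rules out cheap random access.
decoded = np.cumsum(deltas)
assert np.array_equal(decoded, values)
```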
In addition, pages are, by default, compressed with a general-purpose compression algorithm (like zstd), which also needs to decompress everything that precedes the data you want to read within the page.
This limits the throughput of decoding by limiting parallelism and increases latency of random reads by increasing the amount of data that needs decoding before we can read the data we are interested in. It also doesn’t allow evaluating expressions directly on the encoded form.
The metadata in the footer is encoded in Thrift. Like Protobuf, it requires sifting through the whole data structure to decode a subset, reducing the ability to skip to the metadata of only the columns you need when schemas are very wide. Flatbuffers did not exist at the time, but there’s a proposal in progress (thanks to Alkis) that creates a new Parquet footer using Flatbuffers instead of Thrift. This backward-compatible addition allows much easier selective decoding of metadata, which is especially useful for wide schemas.
The New Encodings
The new encodings being proposed all focus on removing data dependencies and taking advantage of SIMD.
ALP (Adaptive Lossless Floating Point) turns floating-point values into integers efficiently. A lot of the floating-point values we’re storing are actually decimals, and there are good ways to encode decimals as integers. The challenge is that a decimal value typically doesn’t have an exact representation in binary floating point. The paper shows how to do this conversion really fast, taking advantage of SIMD, without losing precision.
Proposal for ALP in progress by Prateek.
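The gist of the idea, heavily simplified (real ALP samples the data, picks exponents per vector, and stores non-conforming values as exceptions), is to find a decimal scale factor under which the doubles round-trip exactly through integers:

```python
import numpy as np

def find_decimal_scale(values: np.ndarray, max_exp: int = 14):
    """Return (int64 representation, exponent) if one round-trips exactly."""
    for exp in range(max_exp + 1):
        scaled = np.round(values * 10.0**exp)
        # Only accept the scale if decoding reproduces the exact doubles.
        if np.array_equal(scaled / 10.0**exp, values):
            return scaled.astype(np.int64), exp
    return None, None  # would be stored as exceptions / raw doubles

ints, exp = find_decimal_scale(np.array([19.99, 5.10, 0.07]))
# ints == [1999, 510, 7], exp == 2; decoding is ints / 10**exp
```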
FastLanes handles integer encoding by optimizing for SIMD (single instruction, multiple data) and branch prediction. Modern processors start executing the next instruction before the previous one finishes. Every time they guess wrong on a branch, you lose cycles. The key technique is making sure there’s no branching anywhere in the decoding logic.
Initial discussion of PFOR encodings by Prateek
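To illustrate what “no branches, no data dependencies” looks like, here is a toy bit-packing codec in numpy (FastLanes itself uses a specific interleaved layout tuned for SIMD registers; this only shows that every value can be decoded from a fixed slice of the bitstream, independently of its neighbors):

```python
import numpy as np

def bitpack(values: np.ndarray, width: int) -> np.ndarray:
    # Each value occupies exactly `width` bits, most significant bit first.
    bits = (values[:, None] >> np.arange(width - 1, -1, -1)) & 1
    return np.packbits(bits.astype(np.uint8))

def bitunpack(packed: np.ndarray, width: int, count: int) -> np.ndarray:
    # Branch-free, data-independent decode: value n lives at bit offset
    # n * width, so any value (or all of them, in parallel) can be decoded
    # without looking at the others.
    bits = np.unpackbits(packed)[: count * width].reshape(count, width)
    return bits @ (1 << np.arange(width - 1, -1, -1))

vals = np.array([5, 2, 7, 1, 6], dtype=np.int64)
assert np.array_equal(bitunpack(bitpack(vals, 3), 3, len(vals)), vals)
```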
FSST is a dictionary-style encoding for substrings: you compress by building an efficient dictionary of common substrings. Two advantages: you can evaluate equality on encoded data without decoding, and you get random access, since no value depends on the previous one.
Proposal for FSST in progress by Arnav.
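A toy version of the idea (the real FSST learns an optimal symbol table of up to 255 substrings from a sample of the data; here the table is hand-picked) shows why equality predicates can run on the compressed bytes:

```python
# Hand-picked symbol table mapping common substrings to one-byte codes.
SYMBOLS = {b"http://": 0x01, b"www.": 0x02, b".com": 0x03}
ESCAPE = 0xFF  # prefix for bytes not covered by the table

def encode(s: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(s):
        for sub, code in SYMBOLS.items():
            if s.startswith(sub, i):
                out.append(code)
                i += len(sub)
                break
        else:
            out += bytes([ESCAPE, s[i]])
            i += 1
    return bytes(out)

# Encoding is deterministic, so an equality filter can encode the literal
# once and compare compressed values directly; and because each value is
# compressed independently, any row can be decoded without its neighbors.
needle = encode(b"http://www.example.com")
column = [encode(b"http://www.example.com"), encode(b"http://www.parquet.org")]
matches = [v == needle for v in column]  # [True, False]
```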
BtrBlocks describes layering lightweight encodings (very fast encodings specialized for specific types, as opposed to general-purpose compression like Zstandard).
Initial discussion on nesting encodings by Arnav.
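A minimal sketch of the cascading idea in numpy (not the actual BtrBlocks scheme, which samples the data and picks the best cascade per block):

```python
import numpy as np

# First pass: dictionary-encode the strings.
values = np.array(["FR", "US", "FR", "DE", "FR", "FR"])
dictionary, codes = np.unique(values, return_inverse=True)

# Second pass: the codes are small integers (0..2 here), so another
# lightweight encoding such as bit-packing can store them in 2 bits each,
# with no general-purpose compressor anywhere in the chain.
bits_needed = int(np.ceil(np.log2(len(dictionary))))  # 2
```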
Practical Improvements
Tuning existing Parquet settings
First, there are practical tweaks you can use today that will benefit those new use cases (a pyarrow sketch of these settings follows the list):
- More row groups cause more requests to object storage: you can have just one row group per file if you want.
- You have to decompress too many values to access one: you can make smaller pages for finer-grained access. Pages don’t need to be aligned or be the same size, either within column chunks or across row groups.
- Block compression requires decompressing the entire page: it is optional. Dictionary encoding, for example, often compresses well enough that you don’t need additional compression.
- Some encodings don’t allow random access: when writing the file, pick encodings that allow random access (dictionary) and avoid the ones that don’t (delta_string).
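Here is what those tweaks look like with pyarrow’s writer (the parameter names are pyarrow-specific; other writers expose similar knobs, and the table is just a placeholder):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})

pq.write_table(
    table,
    "tuned.parquet",
    row_group_size=len(table),  # a single row group for the whole file
    data_page_size=64 * 1024,   # smaller pages for finer-grained access
    use_dictionary=True,        # dictionary encoding keeps random access
    compression="NONE",         # skip block compression when the encoding
)                               # alone is compact enough
```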
Better use of Metadata
Sure, metadata gets big when you have a million columns, but do we really need to fit a million columns in a single file? If all the columns are dense (mostly defined, few nulls), it’s hard to store a very wide schema in a columnar layout no matter how you do it: given a partition of a certain size, you end up splitting it into many small columns and lose efficiency. More likely, only a subset of the columns in this wide schema are dense. In that case, you should probably be using the variant type to combine the sparse columns together. It gives you the flexibility of storing all the sparse data that doesn’t really work in a columnar layout into one column, a hybrid approach that avoids the degraded performance caused by many small columns. The metadata also doesn’t have to be read from the file: if you’re building a query engine on top of Parquet files, there are lots of ways to store it in a form that is faster to query in your planning phase.
The Pragmatic Path Forward
My take is that the only blocking issue is integrating those newer encodings into Parquet. Naturally, we need to agree on the right way to encode the data. Everything else (indexes, table abstractions, indexed access to metadata) can be built on top. Some other improvements will go in Iceberg and Arrow, for example storing certain columns in different files or allowing late materialization. There is no binary choice between contributing everything to existing projects and creating an entirely new stack. We can build composable systems that are modular. This is the whole point of collaborating on standard formats that enable interoperability.
As we make progress on that, we will keep closing the gap between the OLAP world and the needs of vector databases, and along the way we will break down the silos that would otherwise lose compatibility with the rest of the ecosystem.
We’re already seeing the path forward: new encodings that serve both OLAP and AI workloads, metadata improvements, pragmatic tweaks, and a community that’s engaged and ready to evolve.
Parquet will keep adopting newer advances from research. There are already ongoing efforts to integrate ALP and FSST as better parallel encodings that allow random access, and to migrate the footer from Thrift to Flatbuffers to reduce metadata overhead and facilitate random access to metadata. These efforts will keep making progress as we build shared agreement in the community on how this should happen.
If you think that contributing to Parquet looks hard, I would encourage you to try. The worst that can happen is you’ll be pleasantly surprised. Achieving broad adoption takes many years of community building. Parquet, Arrow and Iceberg have thriving communities. Consensus building takes time, but it’s worth it, creating a lot of momentum in the ecosystem.
That’s the power of an open-source project that’s fundamentally about building consensus.
And that’s where we are: adapting column storage for the AI era, one encoding at a time.
If you’re interested in contributing to Parquet or want to discuss any of these proposals, join us on the mailing list and at the bi-weekly Parquet syncs. The community is what makes this work.