1T row challenge in 76s using 10k CPUs

In this example we:

Generate 1,000 billion-row Parquet files (2.4 TB total) and store them in Google Cloud Storage (sketched in the first example after this list).

Run a DuckDB query on each file in parallel using a cluster with 10,000 CPUs (second example after this list).

Combine the resulting data locally.
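
A minimal sketch of the generation step, assuming the 1BRC-style schema of one station name and one temperature reading per row. The bucket path and the three-name station list are placeholders (the real dataset uses 413 stations), and writing straight to gs:// assumes DuckDB's httpfs extension with GCS credentials already configured:

```python
import duckdb

def generate_file(i: int):
    # One file = 1 billion rows, so 1,000 files = 1 trillion rows total.
    con = duckdb.connect()
    con.sql("INSTALL httpfs;")  # required for gs:// paths
    con.sql("LOAD httpfs;")
    con.sql(f"""
        COPY (
            SELECT
                -- Placeholder station list; the real dataset has 413 names.
                (['Hamburg', 'Bulawayo', 'Palembang'])[1 + CAST(floor(random() * 3) AS INT)] AS station,
                round(random() * 100 - 40, 1) AS measure
            FROM range(1000000000)
        ) TO 'gs://YOUR_BUCKET/measurements_{i}.parquet' (FORMAT PARQUET);
    """)
```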
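And a sketch of the fan-out and combine steps, assuming Burla's remote_parallel_map API and the same placeholder bucket. Each worker returns per-station partial aggregates; min and max merge directly, while the mean is recomputed from summed totals and row counts, since station counts differ per file and per-file means can't simply be averaged:

```python
import duckdb
from burla import remote_parallel_map  # assumed Burla entry point

FILES = [f"gs://YOUR_BUCKET/measurements_{i}.parquet" for i in range(1000)]

def query_one_file(path: str):
    # Partial aggregates for one file: min/max plus sum and count for the mean.
    con = duckdb.connect()
    con.sql("INSTALL httpfs;")
    con.sql("LOAD httpfs;")
    return con.sql(f"""
        SELECT station,
               min(measure) AS min_t,
               max(measure) AS max_t,
               sum(measure) AS sum_t,
               count(*)     AS n
        FROM read_parquet('{path}')
        GROUP BY station
    """).fetchall()

# Step 2: run the query on all 1,000 files across the cluster.
partials = remote_parallel_map(query_one_file, FILES)

# Step 3: combine the 1,000 partial results locally.
stats = {}
for rows in partials:
    for station, lo, hi, total, n in rows:
        if station not in stats:
            stats[station] = [lo, hi, total, n]
        else:
            s = stats[station]
            s[0], s[1] = min(s[0], lo), max(s[1], hi)
            s[2] += total
            s[3] += n

for station, (lo, hi, total, n) in sorted(stats.items()):
    print(f"{station}: min={lo} max={hi} mean={total / n:.1f}")
```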

Demo Video:

What is the Trillion Row Challenge?

An extension of the Billion Row Challenge, the goal of the Trillion Row Challenge is to compute the min, max, and mean temperature per weather station, for 413 unique stations, from data stored as a collection of Parquet files in blob storage. The data looks like the sample below, but with 1,000,000,000,000 rows.
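
A few illustrative rows, assuming the 1BRC-style column names (the stations and values here are taken from the original Billion Row Challenge examples):

```
station     measure
Hamburg        12.0
Bulawayo        8.9
Palembang      38.8
```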

Cluster settings:

For this challenge we used a 125-node cluster with 80 CPUs and 320 GB of RAM per node (125 × 80 = 10,000 CPUs total). Underneath these are n4-standard-80 machines. T…
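
For a quick sanity check of those figures, assuming the 76-second headline time covers scanning all 2.4 TB:

```python
# Back-of-the-envelope throughput from the stated numbers.
total_bytes = 2.4e12       # 2.4 TB of Parquet across 1,000 files
wall_seconds = 76          # headline end-to-end time
cpus = 125 * 80            # 125 nodes x 80 CPUs = 10,000

aggregate = total_bytes / wall_seconds   # ~31.6 GB/s across the cluster
per_cpu = aggregate / cpus               # ~3.2 MB/s per CPU
print(f"{aggregate / 1e9:.1f} GB/s aggregate, {per_cpu / 1e6:.2f} MB/s per CPU")
```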
