Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering work: data ingestion, and reading and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, where it serves as the internal memory format.
But, small and mighty though it might be, the pyarrow Python package is a force to be reckoned with, capable of blasting through all sorts of cloud-based datasets. It’s not so much a data transformation framework as a way to represent core datasets and shuttle data hither and thither over the wire, from one format to another.
It goes without saying that pyarrow is simple to install and use.
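For what it’s worth, here’s a minimal sketch of what “installed and working” looks like; the column names are just placeholders I made up.

```python
# pip install pyarrow
import pyarrow as pa

# Build a tiny in-memory table just to confirm the install works.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
print(table.schema)
```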

Today, I just want to bring your attention to some basic usage of PyArrow and where I’ve found it helpful, especially for reading and writing large datasets in the cloud.
PyArrow Datasets
One of my go-to use cases is the PyArrow Dataset. It is especially useful over large sets of contiguous files (CSV, Parquet, TXT, and so on), including large piles of files sitting in cloud storage like S3.

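Here’s a rough sketch of what that looks like. The bucket, prefix, and column names are all placeholders, so swap in your own.

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Hypothetical S3 bucket/prefix; point this at your own pile of Parquet files.
s3 = fs.S3FileSystem(region="us-east-1")

# One Dataset object over every file under the prefix, no manual file listing needed.
dataset = ds.dataset(
    "my-bucket/events/",   # assumption: bucket and prefix are placeholders
    filesystem=s3,
    format="parquet",
)

# Lazily read only the columns and rows you care about.
table = dataset.to_table(
    columns=["event_id", "event_ts"],  # hypothetical column names
    filter=ds.field("year") == 2024,   # hypothetical partition column
)
print(table.num_rows)
```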
Easy, peasy, lemon squeezy.

PyArrow is the GOAT at quickly reading large datasets of just about any file type from local or remote storage buckets. It makes the whole thing simple and easy, with no need to iterate through files yourself.
PyArrow RecordBatches
The next best feature of PyArrow comes into play once you’re reading or writing large datasets made up of many files, maybe even larger-than-memory datasets: batches.
I have found this helpful when working on larger-than-memory datasets on small machines and servers, where you don’t want to OOM or eat up all the resources on an Airflow worker, for example. What’s even better is that you can throw the Arrow batches into another tool, since everything speaks Arrow, and then do other processing, like inserting data into Apache Iceberg or Delta Lake.

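A rough sketch, reusing the hypothetical dataset from above; the batch size and downstream step are whatever you need them to be.

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # same placeholder bucket as before
dataset = ds.dataset("my-bucket/events/", filesystem=s3, format="parquet")

# Stream the dataset in manageable chunks instead of materializing it all at once.
for batch in dataset.to_batches(batch_size=100_000):
    # Each batch is a pyarrow.RecordBatch; hand it off to whatever comes next,
    # e.g. append it to an Iceberg or Delta table, or transform it in place.
    process(batch)  # hypothetical downstream function
```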
As you can see, there’s nothing to it. At this point, you can work on the batches however you please; I recently used this to push commits into an Iceberg table.
PyArrow is under-appreciated
Sure, PyArrow is a little rough around the edges; it doesn’t have all the features of Polars or DuckDB, but that’s ok. It’s still a great choice for very specific use cases, especially for dealing with gluts of files in cloud storage. Most modern tooling supports Arrow in a zero-copy manner, making it a great choice for transitioning between different frameworks as needed.
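As a quick illustration, here’s a sketch of handing an Arrow table to Polars; it’s just one example, and most Arrow-native tools expose something similar.

```python
import pyarrow as pa
import polars as pl

# An Arrow table produced by any of the reads above (placeholder data here).
arrow_table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Hand the same memory over to Polars; from_arrow avoids copying where it can.
df = pl.from_arrow(arrow_table)
print(df.shape)
```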