Analyzing and Optimizing a Parquet ClickHouse Ingestion Pipeline

Hey devs, Recently I got the chance to analyze an existing ingestion pipeline that loads large Parquet files into a ClickHouse database.

I didn’t write the pipeline from scratch - my role was to understand how it worked, identify bottlenecks, and figure out how it could be made faster and more scalable.

This post walks through how I analyzed the pipeline and the key performance lessons I learned.

Understanding the Pipeline

At a high level, the pipeline does the following:

Reads Parquet files from a folder
Performs on-the-go transformations and calculations
Ingests the processed data into a ClickHouse table

Simple in theory - but performance issues tend to hide in the details.

Identifying the Bottlenecks

While analyzing the pipeline, a clear pattern…

Hey devs, Recently I got the chance to analyze an existing ingestion pipeline that loads large Parquet files into a ClickHouse database.

I didn’t write the pipeline from scratch - my role was to understand how it worked, identify bottlenecks, and figure out how it could be made faster and more scalable.

This post walks through how I analyzed the pipeline and the key performance lessons I learned.

Understanding the Pipeline

At a high level, the pipeline does the following:

Reads Parquet files from a folder
Performs on-the-go transformations and calculations
Ingests the processed data into a ClickHouse table

Simple in theory - but performance issues tend to hide in the details.

Identifying the Bottlenecks

While analyzing the pipeline, a clear pattern stood out:

Row-by-row processing in Python

Many transformations were executed one row at a time
Python loops and .apply() were heavily used
Each row triggered Python-level execution

This creates a CPU-bound bottleneck - Python processes rows sequentially, which does not scale well for large datasets.

Sub-optimal ingestion patterns

Inserts were not fully optimized for batch throughput
Smaller inserts increased network and coordination overhead

ClickHouse is extremely fast - but only when it receives data the right way.

Optimization Concepts That Matter

Instead of focusing on code, these are the engineering concepts that made the biggest difference.

Vectorized Operations

Pandas and NumPy are designed to work on entire columns at once, not row by row.

Vectorized operations run in optimized C code
They avoid Python’s loop overhead
They can be orders of magnitude faster

Whenever possible, column-level operations should replace .apply() and explicit loops.

Batch Inserts into ClickHouse

Databases perform better when data arrives in large batches.

Fewer round trips
Less connection overhead
Better compression and write efficiency

Batching is essential to fully utilize ClickHouse’s ingestion speed.

Parallel Processing for CPU-Heavy Work

Some transformations are inherently CPU-heavy.

These tasks benefit from multiprocessing
Work can be split across CPU cores
Total processing time drops significantly

This is especially important when transformations cannot be fully vectorized.

Smarter File Handling

File-based ingestion pipelines can suffer from operational overhead:

Repeatedly scanning large folders
Tracking which files are processed

A clean workflow that separates incoming, processing, and processed files reduces both complexity and I/O overhead.

Key Lessons Learned

From analyzing this pipeline, a few clear lessons stood out:

Python row-by-row processing does not scale
Vectorization should be the default approach
Batch inserts unlock database performance
Multiprocessing helps with CPU-bound workloads
Simple file-handling strategies prevent long-term pain

You don’t need to rewrite everything - understanding where time is spent is often enough to unlock major gains.

Final Thoughts

You don’t always need to build systems from scratch to learn deeply.

Analyzing an existing ETL pipeline - understanding where it slows down and why - can teach you just as much about performance, scalability, and real-world data engineering.

These lessons apply far beyond Parquet and ClickHouse - they’re useful for any large-scale ingestion pipeline.

About Me

Mohamed Hussain S

🔗 LinkedIn • GitHub

I’m still learning and figuring things out as a associate data engineer.

If you like real-world content - analyzing systems, understanding bottlenecks, and learning from actual pipelines instead of theory-only stuff - feel free to follow me.

Understanding the Pipeline

Identifying the Bottlenecks

Understanding the Pipeline

Identifying the Bottlenecks

Row-by-row processing in Python

Sub-optimal ingestion patterns

Optimization Concepts That Matter

Vectorized Operations

Batch Inserts into ClickHouse

Parallel Processing for CPU-Heavy Work

Smarter File Handling

Key Lessons Learned

Final Thoughts

About Me

Similar Posts