Last Friday at PyData Seattle 2025, I gave a talk called “Taming the Data Tsunami.” Room full of data engineers who’ve lived the same journey I have—Pandas on our laptops, then Dask clusters, then figuring out how to make everything talk to each other with Arrow.
The nods of recognition when I showed our successes? Expected. The uncomfortable silence when I showed what we’ve accidentally built? That caught me off guard.
We’ve been so good at democratizing data analysis that we’ve created an entirely different problem. Most of us haven’t noticed yet.

**The Journey From Here to There**

If you’ve worked with data in the past 15 years, you know this story.
**2008: [Pandas](https://pandas.pydata.org/?ref=distributedthoughts.org) changed everything.** Wes McKinney, frustrated with the tools available for data work at AQR Capital Management, built something better. Suddenly your entire dataset fit in memory and you could manipulate it like a spreadsheet that didn’t crash. No batch jobs. No Excel limitations. Just you, data, and a Jupyter notebook.
Exploratory analysis went from monologue to conversation.
**2014: [Dask](https://www.dask.org/?ref=distributedthoughts.org) and [Spark](https://spark.apache.org/?ref=distributedthoughts.org) gave us scale.** Data outgrew laptops. Single-machine ceilings became real problems. These frameworks solved it: partition your data, parallelize computation, process terabytes without waiting days. The Pandas API we loved now ran on clusters.
**2016: [Arrow](https://arrow.apache.org/?ref=distributedthoughts.org) gave us a common language.** The serialization tax between tools was brutal. Moving data from Pandas to Spark to your ML library meant conversion costs each time. Arrow provided a zero-copy, columnar memory format that every tool could speak. Language barriers disappeared.
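A tiny, illustrative sketch of that handoff (not code from the talk), using pyarrow and pandas:

```python
import pandas as pd
import pyarrow as pa

# A small DataFrame standing in for the output of an analysis step.
df = pd.DataFrame({"sensor_id": [1, 2, 3], "reading_c": [20.4, 21.7, 19.9]})

# Convert to an Arrow table. For numeric columns this can wrap the existing
# buffers rather than copying them.
table = pa.Table.from_pandas(df)

# Any Arrow-aware tool (Spark, DuckDB, Polars, an ML library) can consume
# `table` directly, or receive it over Arrow IPC without a bespoke serializer.
with pa.OSFile("readings.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```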
Each step solved a real bottleneck. Each unlocked new possibilities. The Python ecosystem became the standard because we kept solving the next problem.

**What We Built Instead**

Here’s what made that room go quiet: we’ve made “ingest-it-all-first” the default pattern. The numbers are bad.
IDC reports 175 zettabytes of global data this year. Not a typo. Meanwhile Gartner says only 29% of enterprise data has value for ANY business function.
Do the math: we’re paying to move, store, and process 100% of that data when only 29% of it matters.
Your Airflow DAGs, dbt models, and carefully tuned ML pipelines are all forced to waste cycles filtering noise to find signal. Meanwhile, 60% of enterprise data is unstructured and generated outside data centers, and 73% of organizations can’t keep up with processing demand.
This isn’t optimization. It’s structural.

**Three Physics Problems We Keep Ignoring**

This goes beyond cloud bills. There are three constraints our “centralize everything” approach pretends don’t exist.

**Data Is Everywhere**

Data is generated across multiple clouds, on-premises systems, and edge devices. A central warehouse can’t be the only answer.
Moving ALL of it breaks down:

- Egress fees compound
- Network latency kills queries
- Managing dozens of ETL pipelines is brittle
The logistics of centralization don’t work once you calculate the actual costs.

**The Speed of Light Is Constant**

You can’t make decisions faster than the data can travel, and if you wait for it to arrive, you’ve already lost.
IoT and log data arrive as massive, noisy streams at the edge. Today we pay to transport the whole firehose, filter it down to the 5% that matters, and only then take action.
With that architecture, real-time is impossible by definition. You pay to transport junk, and you over-provision central clusters to absorb the flood.

**Regulations Have Teeth**

GDPR and CCPA don’t care about your architecture. Ship EU customer data to a US cluster for “cleaning” and you’ve committed a violation before you’ve done anything useful.
The result is “toxic data lakes”: PII lands unredacted in raw stores, creating a high-value attack target and an insider-threat nightmare, with compliance patched in afterward instead of built in.
“Almost compliant” means completely exposed.

**Not Replacing, Adding a Layer**

Let me be careful here: this is not about replacing Pandas, Dask, or Arrow. They are revolutionary tools, and they’re still essential. They gave us the primitives. What we need now is orchestration.
Pandas gave us the verbs for manipulation, Dask gave us scale, and Arrow gave us a grammar for communication. What’s missing is the sentence structure: a framework for deciding what data deserves to be part of the conversation at all.
The Python ecosystem gave us the tools to analyze data anywhere. Now we have to get smart about WHICH data and WHERE.

**Three Implementation Patterns**

At PyData, I walked through three patterns that shift control upstream: filtering, transformation, and governance at the source, before data ever hits expensive central infrastructure.

**Playbook #1: Distributed Warehousing**

Problem: Data is scattered across regions and clouds, and moving all of it just to query it is slow and expensive.
Pattern: Store data locally in open formats, query it federatively, and return only the results.
Stack:

- Iceberg, Delta Lake, or Hudi for transactional tables on object storage
- Trino and DuckDB for local queries that return only aggregated results
- Arrow for in-memory transport without serialization
Move the query to the data instead of the data to the query. Compute runs “near” where the data lives, and only the final result, orders of magnitude smaller, crosses the network.
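Here’s a minimal sketch of that idea using the thinnest slice of the stack above, DuckDB plus Arrow; the bucket path and column names are made up, and S3 credentials are assumed to be configured where the query runs:

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read Parquet straight from object storage in the region
# where the data already lives.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Run the aggregation next to the data. Only the grouped result, a handful of
# rows rather than the raw terabytes, ever leaves this process.
result = con.execute(
    """
    SELECT device_region,
           count(*)           AS readings,
           avg(temperature_c) AS avg_temp
    FROM read_parquet('s3://iot-telemetry-eu/2025/*.parquet')
    GROUP BY device_region
    """
).arrow()  # hand back an Arrow table so downstream tools skip serialization
```

The same query could just as easily run under Trino against Iceberg tables; the point is that only `result`, not the underlying files, crosses the network.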
IoT data lands in local Iceberg tables, gets processed in place, and only summary statistics come back: reduced storage, streamlined ingestion, and queries that actually complete.

**Playbook #2: Streamlined Pipelines**

Problem: Log, metric, and IoT data generate massive amounts of noise, and we pay to transport and store all of it before filtering.
Pattern: Filter and aggregate at the source. Ship the answer, not the firehose.
Stack:

- Vector or Benthos for edge collection and transformation
- Embedded query engines (DuckDB, SQLite) running locally, handling complex SQL before transmission
- Stream processors for real-time transformation
Data gets processed locally at the point of generation, and only small, high-signal results travel. Central clusters handle final aggregation, not the raw stream.
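As a hedged sketch of that local-first step, here is roughly what an edge aggregation job could look like with DuckDB embedded on the source machine; the log path, field names, and ingest endpoint are hypothetical, and a production deployment would typically run this inside an agent like Vector or Benthos rather than a standalone script:

```python
import duckdb
import requests  # assumes an existing HTTP ingest endpoint for summaries

# Aggregate the raw logs on the machine that produced them: one row per
# (minute, level) instead of the full firehose. The `ts` and `level` fields
# are assumed, not prescribed.
summary = duckdb.sql(
    """
    SELECT CAST(date_trunc('minute', CAST(ts AS TIMESTAMP)) AS VARCHAR) AS minute,
           level,
           count(*) AS events
    FROM read_json_auto('/var/log/app/*.json')
    WHERE level IN ('WARN', 'ERROR')   -- drop the noise before it ever travels
    GROUP BY 1, 2
    ORDER BY 1
    """
).df()

# Only the small, high-signal summary crosses the network.
requests.post(
    "https://ingest.example.com/v1/log-summaries",  # placeholder endpoint
    json=summary.to_dict(orient="records"),
    timeout=10,
)
```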
The traditional approach in the deck: more than 1 GB per day per server in storage and egress, S3 as interim storage, high-throughput compute, and in-database processing costs that compound with scale.
The upstream approach: windowing, aggregation, and filtering BEFORE transport, using lightweight agents on the source machines. The ETL pipeline itself is unchanged; it just runs faster and cheaper.
The benchmark numbers: $2.5M → $18K in annual cost, a reduction of more than 98%.

**Playbook #3: Upstream Governance**

Problem: Sensitive PII crosses network boundaries before redaction. “Almost compliant” means exposure windows that regulations don’t forgive.
Pattern: Apply policy at the source. Sanitize data before it leaves the region.
Stack:

- Open Policy Agent for declarative policy-as-code
- Vector and Benthos with processors for obfuscation, hashing, and PII filtering
- A control plane (Expanso, Bacalhau) to deploy and audit globally
Transformations are defined as declarative config and deployed to edge agents, so only SOC-compliant, sanitized data ever enters the pipeline.
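To make the idea concrete, here is a minimal Python sketch of source-side sanitization; the field names and key handling are hypothetical, and in the deployments described above this logic would live in a Vector or Benthos processor driven by policy-as-code rather than a hand-rolled script:

```python
import hashlib
import hmac

# Key held in the local region's secret store; it never travels with the data.
REDACTION_KEY = b"replace-with-a-regional-secret"

HASH_FIELDS = {"email", "phone", "user_id"}       # pseudonymize but keep joinable
DROP_FIELDS = {"free_text_notes", "ip_address"}   # never leaves the region

def sanitize(record: dict) -> dict:
    """Return a copy of `record` that is safe to ship to the central store."""
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # drop outright
        if key in HASH_FIELDS:
            # Keyed hash: stable enough for joins, not reversible without the key.
            digest = hmac.new(REDACTION_KEY, str(value).encode(), hashlib.sha256)
            clean[key] = digest.hexdigest()
        else:
            clean[key] = value
    return clean

# Only the sanitized form ever crosses the region boundary.
event = {"user_id": 42, "email": "a@example.com", "country": "DE", "ip_address": "10.0.0.1"}
print(sanitize(event))
```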
Multi-region compliance follows: EU data stays GDPR-compliant within EU boundaries, only safe, compliant results reach central stores, and the data lake is no longer a security risk by default.
I showed how the “almost compliant” pattern of centralize first, clean later fails on eight fronts: data deluge, privacy pitfalls, compliance chaos, audit nightmares, resource drain, performance lag, regulatory roulette, and insight delays. The upstream approach addresses them structurally.

**The Mindset Shift**

The technology isn’t the hard part. The tools exist and are battle-tested. The hard part is changing how we think.
The Python ecosystem trained us for a decade: “Get the data somewhere I can work with it, THEN analyze.” Pandas reinforced that habit, Dask scaled it, and Arrow optimized it.
The new model: “Decide what’s worth analyzing BEFORE paying to move and store it.”
It’s the difference between a funnel and a filter. Funnels collect everything and narrow later; filters make intelligent decisions at the source.

**Where It Goes**

When the talk ended, the questions weren’t about whether this works; the cost numbers are too compelling. They were about how to retrofit existing systems, convince teams, and prove value before committing.
That’s the right concern, and the answer is that this isn’t rip-and-replace. It’s an additional layer that makes existing infrastructure more efficient. Your Airflow DAGs still run, your dbt models still compile, and your ML training still executes; they just operate on 50-70% less data, process faster, and cost a fraction of what they did before.
The Python data stack democratized analysis. Pandas, Dask, and Arrow made data work accessible and powerful. And like all successful systems, they optimized for the problems we had, and created new ones in the process.
The next evolution isn’t replacing what works. It’s adding a control layer that makes that success sustainable.
Because the alternative, continuing to ingest it all first while data doubles every two years, isn’t actually an alternative.
Full [PyData Seattle talk details](https://pydata.org/seattle2025?ref=distributedthoughts.org). Open-source playbook at [Expanso](https://expanso.io/?ref=distributedthoughts.org) and [Bacalhau](https://docs.bacalhau.org/?ref=distributedthoughts.org). Want to discuss specific data pipeline challenges? [Email me](mailto:aronchick@expanso.io); I’m offering $50 gift cards for strategy sessions.
Originally published at Distributed Thoughts.