I kept seeing the same pattern across very different teams.
Whenever people needed to answer a new question, they reached for a spreadsheet. Not because it was the "right" tool, but because it was the fastest one.
Need an extra field? Add a column. Need a new calculation? Copy a formula. Need to compare versions? Duplicate the file.
At first, this works remarkably well.
Spreadsheets are flexible, forgiving, and accessible. They let people move forward without waiting for schemas, migrations, or approvals.
Over time, though, something interesting happens.
The spreadsheet stops being just a spreadsheet. It starts storing large datasets. It accumulates historical versions. It encodes business logic in formulas. It becomes the place where "the data lives".
At that point, the spreadsheet has quietly become a database, a transformation layer, and an analytical engine — all at once.
This usually isn’t a conscious design choice. It’s a pragmatic response to friction elsewhere.
I lived this firsthand. Every month I had to roll financial data forward — roughly 50,000 rows carried from the previous month into the current one. The workflow looked like this: pull the new data, remap columns, build intermediate summary tables, feed the main dashboard. Repeat across multiple files.
The actual analysis took maybe 10% of my time. The other 90% was just reworking the data — copy, paste, adjust references, wait for recalculations, pray nothing broke.
A task that should have taken minutes regularly consumed hours.
Where spreadsheets technically break
The problems that appear at this stage are often described as "Excel issues". In reality, they’re technical limitations.
Spreadsheets were never designed to handle analytical workloads at scale:
- largely memory-bound execution
- limited parallelism
- full dependency graph recalculation
- fragile cross-sheet references
- file-based versioning
- no lineage or reproducibility
As datasets grow, performance degrades non-linearly. Small changes trigger full recalculations. Files become slow to open, slow to save, sometimes impossible to recover.
Debugging becomes guesswork. Was a number wrong because the data changed? Because a formula broke? Because a reference silently shifted?
None of this is a UX problem. These are compute problems.
Why SQL databases don’t fully solve this
The usual recommendation: "If spreadsheets don’t scale, put the data in a database."
Often correct. SQL databases provide structure, consistency, and governance.
But they don’t solve the problem that pushed people toward spreadsheets in the first place.
The issue isn’t storage. It’s iteration speed.
Adding a field requires schema updates. Changing a calculation affects downstream models. Historical data needs backfilling. Pipelines must be tested and redeployed.
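As a rough illustration of that cascade (using Python's built-in sqlite3 as a stand-in for any SQL store; the table, column, and view names are hypothetical):

```python
import sqlite3

# A hypothetical warehouse table, simulated locally with SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL, month TEXT)")
con.execute("INSERT INTO sales VALUES (1, 120.0, '2024-01'), (2, 80.0, '2024-02')")

# 1. Adding a field means a schema migration...
con.execute("ALTER TABLE sales ADD COLUMN region TEXT")

# 2. ...then a backfill of historical rows...
con.execute("UPDATE sales SET region = 'unknown' WHERE region IS NULL")

# 3. ...then every downstream model reading the table has to be updated,
#    re-tested, and redeployed (a single view stands in for that chain here).
con.execute("""
    CREATE VIEW monthly_sales AS
    SELECT month, region, SUM(amount) AS total
    FROM sales
    GROUP BY month, region
""")
print(con.execute("SELECT * FROM monthly_sales").fetchall())
```

Each step is small on its own; the friction comes from the fact that none of them can be skipped.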
For exploratory or time-sensitive analysis, this friction matters. A question raised during a meeting might need an answer within days — but even small changes in a warehouse setup can take weeks.
SQL itself also becomes a barrier. Many people understand data perfectly well but don’t want to write queries, manage joins, or reason about execution plans. Analytical work gets centralized around a few specialists.
Ironically, spreadsheets reappear as a workaround — people rebuild local copies of the same datasets just to regain iteration speed.
The real problem: analytical compute
The pattern becomes clearer. The data often already exists. Storage is cheap. Tools are available.
What keeps breaking down is compute.
Analytical workflows require repeated scans, aggregations, recalculations, and comparisons. These operations are compute-heavy and highly iterative.
Users aren’t asking for permanent transformations. They’re asking to recompute logic quickly, repeatedly, and predictably.
Most frustrations around data tooling come from the absence of a fast analytical compute layer. Not better dashboards. Not more storage. Faster execution.
Why DuckDB fits this gap
This is where DuckDB becomes interesting.
DuckDB is an in-process analytical database optimized for OLAP workloads. Like SQLite, it runs embedded in the host process — but it's designed for analytical compute rather than transactions.
That design has important implications:
- no network latency
- no server to manage
- direct access to local files
- synchronous execution
Technically, DuckDB combines columnar execution, vectorized processing, efficient aggregations, and direct querying of Parquet and CSV.
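A minimal sketch of what that looks like from Python, assuming a hypothetical monthly CSV export as the input:

```python
import duckdb

# DuckDB runs in-process: no server to start, no network hop.
# It can query a CSV (or Parquet) file directly, without loading it
# into a database first. The file name and columns here are hypothetical.
result = duckdb.sql("""
    SELECT month, category, SUM(amount) AS total
    FROM 'exports/2024-05.csv'
    GROUP BY month, category
    ORDER BY total DESC
""")

print(result)             # pretty-printed result table
rows = result.fetchall()  # plain Python tuples; result.df() hands it to pandas
```

That is the whole setup: a library import and a query over a file that already exists on disk.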
For analytical workloads, performance isn’t incremental — it’s often orders of magnitude faster than spreadsheets. More importantly, it’s predictable. No fragile dependency graph, no cascading recalculation, no sudden performance cliff when one more column is added.
DuckDB doesn’t replace data warehouses. It complements them by focusing on fast, local analytical compute — a missing middle layer between spreadsheets and traditional databases.
The challenge: exposing power without exposing SQL
At first glance, SQL solves everything. But most users don’t actually want SQL.
They don’t want queries. They want results. Reusable datasets. Versioned calculations. Comparable outputs.
SQL is an implementation detail.
Exposing raw SQL limits powerful engines to technical users. Hiding too much creates rigidity.
The challenge is finding an abstraction where computation remains explicit, results are reproducible, logic can evolve incrementally — and non-technical users don’t need to think in queries.
In that model, SQL exists as an internal representation, not a user-facing interface. Users define transformations. The engine handles execution.
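A toy sketch of that idea, with entirely hypothetical names: a transformation described as data, compiled to a SQL string, and handed to DuckDB for execution.

```python
import duckdb

# A transformation defined declaratively, without the user writing SQL.
# The spec format, field names, and source file are a hypothetical
# illustration, not an existing library.
spec = {
    "source": "exports/2024-05.parquet",
    "group_by": ["region"],
    "metrics": {"revenue": "SUM(amount)", "orders": "COUNT(*)"},
    "filter": "status = 'closed'",
}

def compile_to_sql(spec: dict) -> str:
    """Turn the declarative spec into a SQL string for the engine."""
    metrics = ", ".join(f"{expr} AS {name}" for name, expr in spec["metrics"].items())
    keys = ", ".join(spec["group_by"])
    return (
        f"SELECT {keys}, {metrics} "
        f"FROM '{spec['source']}' "
        f"WHERE {spec['filter']} "
        f"GROUP BY {keys}"
    )

# SQL stays internal; the user only ever sees the spec and the result.
print(duckdb.sql(compile_to_sql(spec)))
```

The spec is what gets versioned, compared, and evolved; the SQL is regenerated on every run.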
What I learned building around DuckDB
DuckDB delivers remarkable performance with very little infrastructure. For many workloads, a single embedded engine replaced setups that previously required pipelines and services.
At the same time, its limits are real. Memory pressure matters. Concurrency must be managed explicitly. Schema evolution requires careful design. Versioning is harder than it looks.
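Some of that management comes down to explicit configuration. A small sketch, assuming DuckDB's Python API and hypothetical file paths:

```python
import duckdb

# Memory pressure and parallelism don't manage themselves; cap them explicitly.
con = duckdb.connect(
    "analytics.duckdb",                            # a single local database file
    config={"memory_limit": "2GB", "threads": 4},  # hard limits for this process
)

# Queries that exceed the memory limit need somewhere to spill.
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# Concurrency is the application's problem: only one process can write to
# this file at a time, so writers have to be coordinated outside the engine.
con.execute("CREATE TABLE IF NOT EXISTS runs (run_id INTEGER, created_at TIMESTAMP)")
```

None of this is hard, but it has to be designed rather than assumed.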
DuckDB provides compute — not semantics. It’s closer to a compiler backend than a full data platform. Once treated that way, its strengths and limitations become much easier to reason about.
Closing thoughts
Not every problem needs a warehouse. Not every dataset needs pipelines. Not every user needs SQL.
Sometimes what’s missing is simply fast analytical execution with minimal overhead.
Spreadsheets will likely remain the primary interface for many teams — and that’s fine. The goal isn’t to replace Excel. It’s to stop asking it to do compute it was never designed for, and let results flow back into the tools people already use.
I’ve been building around these ideas — happy to share more if there’s interest.