In 2014, my last startup was acquired. We joined a fast-growing organization with a top-notch data team. They had invested heavily in data infrastructure. Data was strategic. They had "the hub," a Hadoop cluster built on HDFS. I thought: here’s a company doing things right.
Then I tried to analyze monthly active users. I found no fewer than 20 paths containing some version of MAU or "monthly active user" in the name. Some were kilobytes, many were multi-megabytes. All had different shapes and schemas when introspected. Dates were inconsistent or downright lying. Who created them? How were they computed? Which ones were the actual metrics being used by the business?
After some digital archeology, I eventually found some definitions. But reconstructing values to build trust proved impossible. Many input data sources were ephemeral, long gone by the time I wanted to understand and verify computations. This is the swamp: lots of data, none of it useful or trustworthy.
What We Were Promised
When James Dixon coined the term "data lake" in 2010, he articulated a powerful idea: store data in its raw form and apply structure only when reading it. Schema-on-read, not schema-on-write. Use cheap storage and defer the costly data modeling that traditionally preceded warehouse loading.
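To make the distinction concrete, here is a minimal Python sketch of schema-on-read: raw JSON events are stored untouched, and each consumer applies its own schema only at read time. The event fields and helper names are illustrative, not tied to any particular system.

```python
import json

# Schema-on-write: structure is enforced before storage; records that don't
# fit the declared columns are rejected or reshaped at load time.
# Schema-on-read: raw events are stored as-is and a schema is applied only
# when a consumer reads them, so different readers can impose different views.

raw_events = [
    '{"user_id": 42, "event": "login", "ts": "2014-03-01T10:00:00Z"}',
    '{"user_id": 42, "event": "purchase", "ts": "2014-03-01T10:05:00Z", "amount": 9.99}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    filling in None where a record doesn't carry them."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Two consumers, two schemas, one raw store.
activity_view = list(read_with_schema(raw_events, ["user_id", "ts"]))
revenue_view = list(read_with_schema(raw_events, ["user_id", "amount"]))
```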
This concept promised infinite flexibility. Organizations could preserve data without committing to specific use cases upfront, enabling adaptation as requirements evolved.
But without metadata or governance, lakes became swamps. The industry’s response was misguided. We decided to turn data lakes into databases (and called them lakehouses). We introduced table formats, enforced schemas at write time, and built query engines to recreate analytical databases on object storage, often poorly.
Purpose-built analytical databases already existed and in most cases performed better. By centralizing everything into a single database model, we recreated the exact problem the data lake was meant to solve. We gave up the flexibility. We’re building one-size-fits-all systems in an industry that seems to repeatedly forget a fundamental engineering truth: the best database depends on your workload.
Back To The Original Promise
Matterbeam returns to the original vision of the data lake while solving its operational challenges. Three core concepts work together to make this practical:
Intermediate structure with schema inference. All data can be collected as streams, without pre-specified schemas. Because the system stores everything immutably, it observes the actual shape of your data over time and infers schemas and schema evolution automatically when source systems don’t provide them. The data is stored in an intermediate format that preserves the original structure while adding the metadata necessary for discovery and understanding. Data quality can be enforced flexibly: the system watches for data that deviates from inferred schemas, without dropping records and without constraining upstream systems.
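As a rough illustration of the idea (a toy sketch, not Matterbeam’s actual inference logic), schema inference over a stream of records can be as simple as tracking which fields and value types have been observed, then flagging, rather than rejecting, records that deviate:

```python
from collections import defaultdict

def infer_schema(records):
    """Infer a loose schema by observing which fields appear and what
    Python types their values take across the records seen so far."""
    schema = defaultdict(set)
    for record in records:
        for key, value in record.items():
            schema[key].add(type(value).__name__)
    return dict(schema)

def deviations(record, schema):
    """Flag fields that are new or carry an unseen type, without
    rejecting or losing the record itself."""
    issues = []
    for key, value in record.items():
        seen = schema.get(key)
        if seen is None:
            issues.append(f"new field: {key}")
        elif type(value).__name__ not in seen:
            issues.append(f"unexpected type for {key}: {type(value).__name__}")
    return issues

observed = [
    {"user_id": 1, "event": "login"},
    {"user_id": 2, "event": "purchase", "amount": 9.99},
]
schema = infer_schema(observed)
print(deviations({"user_id": "3", "event": "login"}, schema))
# -> ['unexpected type for user_id: str']
```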
The intermediate format itself isn’t optimized for query performance. It’s not a table format. It’s optimized for rapid materialization into whatever structures your use cases demand: relational tables for analytics, event streams for real-time processing, document structures for search, training datasets for ML. The system then keeps those materializations in sync. The format serves as a high-fidelity source that can generate, and regenerate, fit-for-purpose views.
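For instance, the same intermediate records might be materialized as flat rows for an analytical table and as per-user documents for a search or serving index. A toy sketch, with made-up record shapes and helper names:

```python
# One intermediate record stream, several fit-for-purpose materializations.
records = [
    {"user_id": 1, "event": "login", "ts": "2024-05-01"},
    {"user_id": 1, "event": "purchase", "ts": "2024-05-02", "amount": 20.0},
    {"user_id": 2, "event": "login", "ts": "2024-05-02"},
]

def to_relational_rows(records, columns):
    """Flatten records into fixed-width rows for an analytical table."""
    return [tuple(r.get(c) for c in columns) for r in records]

def to_documents_by_user(records):
    """Regroup the same records into per-user documents for search or serving."""
    docs = {}
    for r in records:
        doc = docs.setdefault(r["user_id"], {"user_id": r["user_id"], "events": []})
        doc["events"].append({k: v for k, v in r.items() if k != "user_id"})
    return list(docs.values())

rows = to_relational_rows(records, ["user_id", "event", "ts", "amount"])
docs = to_documents_by_user(records)
```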
Immutability as the foundation. Classic lakes quietly became mutable systems. Data was overwritten, compacted, remodeled, and reingested until the raw truth disappeared. Matterbeam’s immutable log preserves data as it arrived. This enables schema inference to work reliably. With complete history, the system learns schemas from actual data rather than requiring upfront declarations. Immutability also guarantees that materializations remain reproducible.
Replay as a first-class citizen. When your first materialization isn’t quite right, replay lets you regenerate it. The intermediate format translates quickly into new structures precisely because it preserved the original shape. Immutability ensures the source remains intact for replay. Together, these properties create a feedback loop: materialize, learn, refine, replay. Time becomes an asset, not a constraint.
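A toy sketch of that loop, assuming nothing about Matterbeam’s internals: an append-only log that is never updated in place, plus a materialize step that can be rerun with a new definition at any time.

```python
class ImmutableLog:
    """Append-only record store: records are never updated or deleted,
    so any downstream view can be rebuilt from the beginning."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(dict(record))

    def replay(self):
        # Yield copies so consumers cannot mutate the stored history.
        for record in self._records:
            yield dict(record)

def materialize(log, transform):
    """Rebuild a downstream view from scratch by replaying the full log."""
    return [transform(record) for record in log.replay()]

log = ImmutableLog()
log.append({"user_id": 1, "event": "login", "ts": "2024-05-01T09:00:00Z"})
log.append({"user_id": 1, "event": "login", "ts": "2024-05-02T09:30:00Z"})

# First attempt at a view, then a refined definition: replay regenerates it
# without touching the source records.
v1 = materialize(log, lambda r: (r["user_id"], r["ts"]))
v2 = materialize(log, lambda r: {"user": r["user_id"], "day": r["ts"][:10]})
```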
This approach enables multiple truths to coexist without conflict. Teams evolve independently. This version of a data lake supports exploration and experimentation, not just predetermined dashboards. Views can be changed and replayed to map onto new realities and new use cases.
The original data lake promised "store everything now, decide later." That vision devolved into swamps, and into vendor lock-in to half-built databases that try to do everything but do nothing best in class.
Matterbeam delivers a better promise: store everything immutably, decide repeatedly, and never lose the past. Use the right system or database when you need it, with the data in whatever shape you need, as soon as you need it.