The Itch That Needed Scratching
After 20+ years writing software, I’ve integrated with dozens of databases. PostgreSQL for relational data, Redis for caching, Elasticsearch for search, Kafka for events. Each solves specific problems brilliantly, but here’s what bothered me: every new project starts with the same architectural discussion about which combination of these tools to deploy.
What if we could build something that understands the patterns underneath? A distributed database that leverages Elixir’s OTP primitives for fault tolerance while using battle-tested storage engines like RocksDB for persistence.
This isn’t about replacing PostgreSQL or MongoDB in production tomorrow. It’s about understanding distributed systems from first principles and exploring whether Elixir’s concurrency model offers a fundamentally different approach to building databases.
Why Elixir for Databases?
The functional programming community often overlooks databases as a domain. But Elixir gives us:
- OTP for free: Supervision trees, process isolation, and fault tolerance are language primitives
- Distribution built-in: Erlang’s distribution protocol handles node communication
- Immutability: Makes reasoning about distributed state much simpler
- Hot code reloading: Deploy fixes without downtime
Most databases bolt on distribution as an afterthought. What happens when you start with it?
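To make the OTP point concrete, here's a minimal sketch of a key-value shard as a supervised GenServer. The module and process names are illustrative, not from the actual project:

```elixir
# A key-value shard as a supervised GenServer: process isolation and
# restarts come from the language runtime, not an external process manager.
defmodule MiniDB.Shard do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, %{}, name: Keyword.fetch!(opts, :name))
  end

  def put(shard, key, value), do: GenServer.call(shard, {:put, key, value})
  def get(shard, key), do: GenServer.call(shard, {:get, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.fetch(state, key), state}
  end
end

# If a shard crashes, the supervisor restarts it; in a real database the
# restarted process would reload its state from the storage engine.
children = [
  %{id: :shard_a, start: {MiniDB.Shard, :start_link, [[name: :shard_a]]}}
]

Supervisor.start_link(children, strategy: :one_for_one)
```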
The Known Challenges (And Why They’re Interesting)
Building a distributed database isn’t just hard. It’s a buffet of computer science problems:
1. CAP Theorem Isn’t Optional
You can’t have Consistency, Availability, and Partition tolerance all at once: when a partition happens (and it will), you must choose between consistency and availability. Every design decision is a trade-off. Do you accept stale reads for availability? Do you block writes during network splits?
For my project, I’m exploring tunable consistency, letting applications choose their trade-offs per query rather than at the database level.
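As a sketch of the API shape I have in mind (the function and option names are hypothetical, and the internals are stubbed out):

```elixir
defmodule MiniDB do
  # The caller picks the consistency level per query instead of the
  # database fixing one trade-off globally.
  def read(key, opts \\ []) do
    case Keyword.get(opts, :consistency, :quorum) do
      :one    -> read_one(key)     # fastest replica, possibly stale
      :quorum -> read_quorum(key)  # majority agreement, latest committed value
    end
  end

  # Stubs standing in for the real replica-selection logic.
  defp read_one(_key), do: {:ok, :possibly_stale_value}
  defp read_quorum(_key), do: {:ok, :consistent_value}
end

# A latency-sensitive dashboard tolerates staleness; a balance check doesn't:
MiniDB.read("metrics:today", consistency: :one)
MiniDB.read("account:42:balance", consistency: :quorum)
```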
2. Consensus Is Expensive
Raft, Paxos, and their variants solve distributed consensus, but every agreement requires multiple network round trips. When do you need consensus vs. eventual consistency?
I’m investigating how Elixir’s process model can minimize consensus overhead for operations that don’t strictly need it.
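One way this could look (a sketch of the idea, not the project's actual design): give each key a single owning process, registered cluster-wide with Erlang's built-in `:global` registry. Writes to that key serialize through one mailbox with no cross-node consensus on the happy path; agreement is only needed when ownership has to move.

```elixir
defmodule MiniDB.KeyOwner do
  use GenServer

  # {:global, term} names are visible across all connected nodes, so at most
  # one owner process per key exists in the cluster at a time.
  def start_link(key) do
    GenServer.start_link(__MODULE__, key, name: {:global, {:owner, key}})
  end

  def write(key, value) do
    GenServer.call({:global, {:owner, key}}, {:write, value})
  end

  @impl true
  def init(key), do: {:ok, %{key: key, value: nil}}

  @impl true
  def handle_call({:write, value}, _from, state) do
    # Serialized by the mailbox: no locks, no replica round trips.
    {:reply, :ok, %{state | value: value}}
  end
end
```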
3. Node Failures Are the Norm, Not the Exception
In a distributed system, nodes will fail. Disks die, networks drop, processes crash. The question isn’t “if” but “when” and “how often.”
What happens when a node crashes mid-write? How do you detect failures vs. slow nodes? How quickly can you promote replicas? And the trickiest part: what happens when a “failed” node comes back online with stale data?
Elixir’s supervision trees give us a foundation, but we still need to handle the following (detection is sketched after the list):
- Detection mechanisms (heartbeats, gossip protocols)
- Quorum-based operations to tolerate failures
- Automatic failover without losing committed data
- Recovery and catch-up protocols
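For the detection piece, here's a minimal sketch built on the runtime's own node monitoring. Real detectors layer heartbeats and gossip on top, since a `:nodedown` event can't distinguish a dead node from a slow or unreachable one:

```elixir
defmodule MiniDB.NodeWatcher do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # Subscribe to the runtime's node connectivity events.
    :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new(Node.list())}
  end

  @impl true
  def handle_info({:nodeup, node}, up) do
    # A returning node may hold stale data; kick off a catch-up protocol here.
    {:noreply, MapSet.put(up, node)}
  end

  def handle_info({:nodedown, node}, up) do
    # Down, or merely slow/partitioned? The runtime can't tell us;
    # quorum logic has to make the call.
    {:noreply, MapSet.delete(up, node)}
  end
end
```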
4. Conflict Resolution — When Reality Branches
Two nodes receive conflicting writes during a network partition. Both accept the writes because availability matters. The partition heals. Now what?
This is where things get philosophically interesting:
- Last-write-wins: Simple but loses data
- Vector clocks: Track causality, but who resolves conflicts?
- CRDTs: Conflict-free by design, but limited operations
- Application-level resolution: Most flexible, most complex
Different data types need different strategies. A counter can merge automatically. A text document needs human judgment. Your database needs to handle both.
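Here's a minimal vector clock sketch (plain maps of node name to counter) to make the causality point concrete. Note that it only detects concurrent writes; it doesn't resolve them:

```elixir
defmodule MiniDB.VClock do
  def new, do: %{}

  # Each node increments its own entry when it accepts a write.
  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

  # :before / :after / :equal / :concurrent. Concurrent means a real conflict
  # that some policy (LWW, CRDT merge, the application) must resolve.
  def compare(a, b) do
    a_leq_b = Enum.all?(a, fn {n, c} -> c <= Map.get(b, n, 0) end)
    b_leq_a = Enum.all?(b, fn {n, c} -> c <= Map.get(a, n, 0) end)

    cond do
      a_leq_b and b_leq_a -> :equal
      a_leq_b -> :before
      b_leq_a -> :after
      true -> :concurrent
    end
  end
end

# Two writes on different nodes during a partition:
a = MiniDB.VClock.new() |> MiniDB.VClock.increment(:node_a)
b = MiniDB.VClock.new() |> MiniDB.VClock.increment(:node_b)
MiniDB.VClock.compare(a, b)  # => :concurrent
```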
5. Data Placement and Sharding
Where does each piece of data live? Consistent hashing, range partitioning, and virtual nodes all have trade-offs. Rebalancing data without downtime when nodes join or leave is its own nightmare.
And it’s not just about even distribution; you also need to consider the following (a consistent-hashing sketch comes after the list):
- Data locality for queries
- Replication factor (how many copies?)
- Hot shards that get disproportionate load
- Geographic distribution for latency
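To illustrate the placement piece, here's a toy consistent-hashing ring built on `:erlang.phash2`. A production ring would add virtual nodes so that adding or removing a node moves far less data:

```elixir
defmodule MiniDB.Ring do
  # Sort nodes by their hash to form the ring.
  def new(nodes), do: Enum.sort_by(nodes, &:erlang.phash2/1)

  # The first node clockwise from the key's hash owns the key;
  # wrap around to the start of the ring if none is found.
  def owner(ring, key) do
    h = :erlang.phash2(key)
    Enum.find(ring, List.first(ring), fn node -> :erlang.phash2(node) >= h end)
  end

  # Replication factor n: the owner plus the next n - 1 nodes clockwise.
  def replicas(ring, key, n) do
    idx = Enum.find_index(ring, &(&1 == owner(ring, key)))
    for i <- 0..(n - 1), do: Enum.at(ring, rem(idx + i, length(ring)))
  end
end

ring = MiniDB.Ring.new([:"db@host1", :"db@host2", :"db@host3"])
MiniDB.Ring.replicas(ring, "user:42", 2)  # owner plus one backup
```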
6. Transactions Across Nodes
Single-node transactions are a solved problem. Distributed transactions require two-phase commit protocols that can fail in creative ways. Network partitions during commit are especially fun.
The coordinator node can fail between prepare and commit phases. Do other nodes wait forever? Timeout and risk inconsistency? This is where consensus protocols become necessary, adding latency to every write.
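Here's a deliberately naive two-phase commit sketch that shows exactly where the fragility lives. The `prepare` and `apply` callbacks stand in for RPCs to participant nodes; a real coordinator also needs a durable log, timeouts, and a recovery path for the crash-between-phases case described above:

```elixir
defmodule MiniDB.TwoPC do
  # participants: list of node identifiers
  # prepare: fn participant, txn -> :vote_commit | :vote_abort end
  # apply:   fn participant, {:commit | :abort, txn} -> :ok end
  def run(participants, txn, prepare, apply) do
    # Phase 1: every participant must vote to commit.
    if Enum.all?(participants, &(prepare.(&1, txn) == :vote_commit)) do
      # Danger zone: if the coordinator dies right here, participants are
      # stuck holding locks with no way to learn the outcome.
      Enum.each(participants, &apply.(&1, {:commit, txn}))
      :committed
    else
      Enum.each(participants, &apply.(&1, {:abort, txn}))
      :aborted
    end
  end
end
```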
7. Network Partitions — The Split-Brain Problem
Your cluster splits into two groups that can’t communicate. Both think they’re the legitimate cluster. Both accept writes. Congratulations, you have two divergent realities.
Solutions include:
- Quorum systems (majority must agree; the arithmetic is sketched below)
- Fencing tokens to prevent zombie writes
- Human intervention (least automated, most painful)
The Brazilian fintech space taught me this the hard way: network reliability in data centers isn’t guaranteed, and mobile deployments are worse.
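The quorum arithmetic itself is small enough to sketch in full. With n replicas, a write acknowledged by w nodes and a read from r nodes are guaranteed to overlap whenever w + r > n, and a minority side of a split can never assemble a write quorum:

```elixir
defmodule MiniDB.Quorum do
  # Reads and writes share at least one replica when w + r > n.
  def overlapping?(n, w, r), do: w + r > n

  # Strict majority: the smallest quorum a split cluster can't have twice.
  def majority(n), do: div(n, 2) + 1
end

MiniDB.Quorum.majority(5)            # => 3, so a 2-node side of a split can't write
MiniDB.Quorum.overlapping?(3, 2, 2)  # => true, a common default for n = 3
```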
8. Replication Lag and Read-Your-Writes
You write data to the primary node. You immediately read it back but hit a replica. The data isn’t there yet. Your user sees their action fail.
How do you guarantee read-your-writes consistency without forcing all reads to the primary? Session affinity? Version vectors? Waiting for replication confirmation?
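One approach, sketched below with illustrative names: every write returns a monotonically increasing version, the client's session carries the highest version it has written, and a replica only answers a read once it has replicated past that version:

```elixir
defmodule MiniDB.Session do
  def new, do: %{min_version: 0}

  # Client side: after a write, remember the version the primary assigned.
  def after_write(session, version), do: %{session | min_version: version}

  # Replica side: serve the read only if local replication has caught up;
  # otherwise the router retries another replica or falls back to the primary.
  def read(replica_version, session, value) do
    if replica_version >= session.min_version do
      {:ok, value}
    else
      {:error, :replica_behind}
    end
  end
end

session = MiniDB.Session.new() |> MiniDB.Session.after_write(17)
MiniDB.Session.read(12, session, :whatever)  # => {:error, :replica_behind}
MiniDB.Session.read(19, session, "v")        # => {:ok, "v"}
```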
9. Query Optimization in Distributed Context
Traditional query optimizers assume data locality. When tables are sharded across nodes, moving data vs. moving computation becomes critical. The network becomes your bottleneck, not the disk.
Do you push computation to the data? Pull data to a coordinator? How do you handle joins across shards? What about aggregations?
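Pushing computation to the data can be sketched with plain distributed Erlang: each node computes a partial aggregate over its local shard, and only the small partial results cross the network. `MiniDB.Shard.local_count/1` is a hypothetical per-node function; `:erpc.call/4` is the standard OTP remote call (OTP 23+):

```elixir
defmodule MiniDB.Query do
  # Scatter the count to every node, gather and sum the partials.
  def count(nodes, table) do
    nodes
    |> Enum.map(fn node ->
      Task.async(fn -> :erpc.call(node, MiniDB.Shard, :local_count, [table]) end)
    end)
    |> Task.await_many()
    |> Enum.sum()
  end
end
```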
10. Time and Ordering
Distributed systems have no global clock. Event ordering becomes ambiguous. Did A happen before B, or is that just network delay?
Physical clocks drift. Logical clocks (Lamport, vector clocks) help but don’t solve everything. Hybrid logical clocks try to give you both. Each adds complexity.
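The simplest of these, a Lamport clock, fits in a few lines: a counter that ticks on every local event and jumps forward on every receive, giving an ordering consistent with causality but unrelated to wall-clock time:

```elixir
defmodule MiniDB.Lamport do
  def new, do: 0

  # Local event or send: tick.
  def tick(clock), do: clock + 1

  # On receive: jump past whatever the sender had seen, then tick.
  def receive_at(clock, remote_clock), do: max(clock, remote_clock) + 1
end

c = MiniDB.Lamport.new() |> MiniDB.Lamport.tick()  # => 1, a local write
MiniDB.Lamport.receive_at(c, 41)                   # => 42, after a message from a busier node
```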
11. Operational Complexity
Monitoring, debugging, and diagnosing issues in distributed systems is exponentially harder. When a query is slow, is it the network? A slow node? Lock contention? All three?
You need:
- Distributed tracing across nodes
- Metrics aggregation
- Log correlation
- Anomaly detection
- Capacity planning that accounts for replication overhead
And you need all of this before you have users, because debugging under load is too late.
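For the metrics piece, the Elixir ecosystem's convention is the `:telemetry` package. Assuming it as a dependency, instrumenting the query path could look like this sketch (the event names are made up for illustration):

```elixir
defmodule MiniDB.Metrics do
  # Attach a handler that flags slow queries and tags them with the node
  # that served them, so slowness can be correlated across the cluster.
  def setup do
    :telemetry.attach("minidb-query-logger", [:minidb, :query, :stop],
      &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:minidb, :query, :stop], %{duration_us: d}, meta, _config) do
    if d > 50_000, do: IO.puts("slow query on #{meta.node}: #{d}us")
  end
end

# Emitted at each query site:
:telemetry.execute([:minidb, :query, :stop], %{duration_us: 1234}, %{node: node()})
```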
What Success Looks Like
I’m not trying to build the next Cassandra or CockroachDB. Success for this project means:
- Learning: Deeply understanding distributed systems trade-offs
- Working code: A functional proof-of-concept that handles basic operations across multiple nodes, survives node failures, and resolves conflicts sensibly
- Documentation: Sharing the journey and lessons learned
- Reusable patterns: Code that others can learn from and fork
- Battle scars: Real experience with the edge cases that only appear when things break
The goal is exploration, not production-readiness. But who knows? Sometimes side projects become serious tools. And even if this never sees production, understanding these problems deeply makes you a better engineer.
Why This Matters (Beyond the Tech)
In the Brazilian market, I’ve seen companies pay massive costs for database infrastructure that doesn’t fit their actual needs. Understanding these trade-offs means making better architectural decisions even when using existing databases.
Plus, there’s something deeply satisfying about building something from scratch, watching it break in interesting ways, and making it work again.
Next Steps
In the next post, I’ll dive into the architecture fundamentals, how Elixir’s process model maps to database operations, and why OTP supervision trees might be the secret weapon for fault-tolerant storage.
Have you built distributed systems? What challenges surprised you most? Which of these problems seems most interesting to you? Let me know in the comments.
Just don’t wait on me; building the database is probably going to be enough of a challenge. The post will be out when it’s supposed to be out.