The Itch That Needed Scratching
After 20+ years writing software, I’ve integrated with dozens of databases. PostgreSQL for relational data, Redis for caching, Elasticsearch for search, Kafka for events. Each solves specific problems brilliantly, but here’s what bothered me: every new project starts with the same architectural discussion about which combination of these tools to deploy.
What if we could build something that understands the patterns underneath? A distributed database that leverages Elixir’s OTP primitives for fault tolerance while using battle-tested storage engines like RocksDB for persistence.
This isn’t about replacing PostgreSQL or MongoDB in production tomorrow. It’s about understanding distributed systems from first principles and exploring whether Elixir’s concurrency model offers a fundamentally different approach to building databases.
Why Elixir for Databases?
The functional programming community often overlooks databases as a domain. But Elixir gives us:
- OTP for free: Supervision trees, process isolation, and fault tolerance are language primitives
- Distribution built-in: Erlang’s distribution protocol handles node communication
- Immutability: Makes reasoning about distributed state much simpler
- Hot code reloading: Deploy fixes without downtime
Most databases bolt on distribution as an afterthought. What happens when you start with it?
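To make the OTP point concrete, here's a minimal sketch of a key-value shard as a supervised GenServer. The module and process names are illustrative, not from the actual project:

```elixir
# A key-value shard as a supervised GenServer: process isolation and
# restarts come from the language runtime, not an external process manager.
defmodule MiniDB.Shard do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, %{}, name: Keyword.fetch!(opts, :name))
  end

  def put(shard, key, value), do: GenServer.call(shard, {:put, key, value})
  def get(shard, key), do: GenServer.call(shard, {:get, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.fetch(state, key), state}
  end
end

# If a shard crashes, the supervisor restarts it; in a real database the
# restarted process would reload its state from the storage engine.
children = [
  %{id: :shard_a, start: {MiniDB.Shard, :start_link, [[name: :shard_a]]}}
]

Supervisor.start_link(children, strategy: :one_for_one)
```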
The Known Challenges (And Why They’re Interesting)
Building a distributed database isn’t just hard. It’s a buffet of computer science problems:
1. CAP Theorem Isn’t Optional
You can’t have Consistency, Availability, and Partition tolerance all at once: when a partition happens (and it will), you must choose between consistency and availability. Every design decision is a trade-off. Do you accept stale reads for availability? Do you block writes during network splits?
For my project, I’m exploring tunable consistency, letting applications choose their trade-offs per query rather than at the database level.
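As a sketch of the API shape I have in mind (the function and option names are hypothetical, and the internals are stubbed out):

```elixir
defmodule MiniDB do
  # The caller picks the consistency level per query instead of the
  # database fixing one trade-off globally.
  def read(key, opts \\ []) do
    case Keyword.get(opts, :consistency, :quorum) do
      :one    -> read_one(key)     # fastest replica, possibly stale
      :quorum -> read_quorum(key)  # majority agreement, latest committed value
    end
  end

  # Stubs standing in for the real replica-selection logic.
  defp read_one(_key), do: {:ok, :possibly_stale_value}
  defp read_quorum(_key), do: {:ok, :consistent_value}
end

# A latency-sensitive dashboard tolerates staleness; a balance check doesn't:
MiniDB.read("metrics:today", consistency: :one)
MiniDB.read("account:42:balance", consistency: :quorum)
```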
2. Consensus Is Expensive
Raft, Paxos, and their variants solve distributed consensus, but every agreement requires multiple network round trips. When do you need consensus vs. eventual consistency?
I’m investigating how Elixir’s process model can minimize consensus overhead for operations that don’t strictly need it.
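One way this could look (a sketch of the idea, not the project's actual design): give each key a single owning process, registered cluster-wide with Erlang's built-in `:global` registry. Writes to that key serialize through one mailbox with no cross-node consensus on the happy path; agreement is only needed when ownership has to move.

```elixir
defmodule MiniDB.KeyOwner do
  use GenServer

  # {:global, term} names are visible across all connected nodes, so at most
  # one owner process per key exists in the cluster at a time.
  def start_link(key) do
    GenServer.start_link(__MODULE__, key, name: {:global, {:owner, key}})
  end

  def write(key, value) do
    GenServer.call({:global, {:owner, key}}, {:write, value})
  end

  @impl true
  def init(key), do: {:ok, %{key: key, value: nil}}

  @impl true
  def handle_call({:write, value}, _from, state) do
    # Serialized by the mailbox: no locks, no replica round trips.
    {:reply, :ok, %{state | value: value}}
  end
end
```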
3. Node Failures Are the Norm, Not the Exception
In a distributed system, nodes will fail. Disks die, networks drop, processes crash. The question isn’t “if” but “when” and “how often.”
What happens when a node crashes mid-write? How do you detect failures vs. slow nodes? How quickly can you promote replicas? And the trickiest part: what happens when a “failed” node comes back online with stale data?
Elixir’s supervision trees give us a foundation, but we still need to handle the following (detection is sketched after the list):
- Detection mechanisms (heartbeats, gossip protocols)
- Quorum-based operations to tolerate failures
- Automatic failover without losing committed data
- Recovery and catch-up protocols
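For the detection piece, here's a minimal sketch built on the runtime's own node monitoring. Real detectors layer heartbeats and gossip on top, since a `:nodedown` event can't distinguish a dead node from a slow or unreachable one:

```elixir
defmodule MiniDB.NodeWatcher do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # Subscribe to the runtime's node connectivity events.
    :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new(Node.list())}
  end

  @impl true
  def handle_info({:nodeup, node}, up) do
    # A returning node may hold stale data; kick off a catch-up protocol here.
    {:noreply, MapSet.put(up, node)}
  end

  def handle_info({:nodedown, node}, up) do
    # Down, or merely slow/partitioned? The runtime can't tell us;
    # quorum logic has to make the call.
    {:noreply, MapSet.delete(up, node)}
  end
end
```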
4. Conflict Resolution — When Reality Branches
Two nodes receive conflicting writes during a network partition. Both accept the writes because availability matters. The partition heals. Now what?
This is where things get philosophically interesting:
- Last-write-wins: Simple but loses data
- Vector clocks: Track causality, but who resolves conflicts?
- CRDTs: Conflict-free by design, but limited operations
- Application-level resolution: Most flexible, most complex
Different data types need different strategies. A counter can merge automatically. A text document needs human judgment. Your database needs to handle both.
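Here's a minimal vector clock sketch (plain maps of node name to counter) to make the causality point concrete. Note that it only detects concurrent writes; it doesn't resolve them:

```elixir
defmodule MiniDB.VClock do
  def new, do: %{}

  # Each node increments its own entry when it accepts a write.
  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

  # :before / :after / :equal / :concurrent. Concurrent means a real conflict
  # that some policy (LWW, CRDT merge, the application) must resolve.
  def compare(a, b) do
    a_leq_b = Enum.all?(a, fn {n, c} -> c <= Map.get(b, n, 0) end)
    b_leq_a = Enum.all?(b, fn {n, c} -> c <= Map.get(a, n, 0) end)

    cond do
      a_leq_b and b_leq_a -> :equal
      a_leq_b -> :before
      b_leq_a -> :after
      true -> :concurrent
    end
  end
end

# Two writes on different nodes during a partition:
a = MiniDB.VClock.new() |> MiniDB.VClock.increment(:node_a)
b = MiniDB.VClock.new() |> MiniDB.VClock.increment(:node_b)
MiniDB.VClock.compare(a, b)  # => :concurrent
```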
5. Data Placement and Sharding
Where does each piece of data live? Consistent hashing, range partitioning, and virtual nodes all have trade-offs. Rebalancing data without downtime when nodes join or leave is its own nightmare.
And it’s not just about even distribution; you also need to consider the following (a consistent-hashing sketch comes after the list):
- Data locality for queries
- Replication factor (how many copies?)
- Hot shards that get disproportionate load
- Geographic distribution for latency
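To illustrate the placement piece, here's a toy consistent-hashing ring built on `:erlang.phash2`. A production ring would add virtual nodes so that adding or removing a node moves far less data:

```elixir
defmodule MiniDB.Ring do
  # Sort nodes by their hash to form the ring.
  def new(nodes), do: Enum.sort_by(nodes, &:erlang.phash2/1)

  # The first node clockwise from the key's hash owns the key;
  # wrap around to the start of the ring if none is found.
  def owner(ring, key) do
    h = :erlang.phash2(key)
    Enum.find(ring, List.first(ring), fn node -> :erlang.phash2(node) >= h end)
  end

  # Replication factor n: the owner plus the next n - 1 nodes clockwise.
  def replicas(ring, key, n) do
    idx = Enum.find_index(ring, &(&1 == owner(ring, key)))
    for i <- 0..(n - 1), do: Enum.at(ring, rem(idx + i, length(ring)))
  end
end

ring = MiniDB.Ring.new([:"db@host1", :"db@host2", :"db@host3"])
MiniDB.Ring.replicas(ring, "user:42", 2)  # owner plus one backup
```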
6. Transactions Across Nodes
Single-node transactions are a solved problem. Distributed transactions require two-phase commit protocols that can fail in creative ways. Network partitions during commit are especially fun.
The coordinator node can fail between prepare and commit phases. Do other nodes wait forever? Timeout and risk inconsistency? This is where consensus protocols become necessary, adding latency to every write.
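Here's a deliberately naive two-phase commit sketch that shows exactly where the fragility lives. The `prepare` and `apply` callbacks stand in for RPCs to participant nodes; a real coordinator also needs a durable log, timeouts, and a recovery path for the crash-between-phases case described above:

```elixir
defmodule MiniDB.TwoPC do
  # participants: list of node identifiers
  # prepare: fn participant, txn -> :vote_commit | :vote_abort end
  # apply:   fn participant, {:commit | :abort, txn} -> :ok end
  def run(participants, txn, prepare, apply) do
    # Phase 1: every participant must vote to commit.
    if Enum.all?(participants, &(prepare.(&1, txn) == :vote_commit)) do
      # Danger zone: if the coordinator dies right here, participants are
      # stuck holding locks with no way to learn the outcome.
      Enum.each(participants, &apply.(&1, {:commit, txn}))
      :committed
    else
      Enum.each(participants, &apply.(&1, {:abort, txn}))
      :aborted
    end
  end
end
```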
7. Network Partitions — The Split-Brain Problem
Your cluster splits into two groups that can’t communicate. Both think they’re the legitimate cluster. Both accept writes. Congratulations, you have two divergent realities.
Solutions include:
- Quorum systems (majority must agree; the arithmetic is sketched below)
- Fencing tokens to prevent zombie writes
- Human intervention (least automated, most painful)
The Brazilian fintech space taught me this the hard way: network reliability in data centers isn’t guaranteed, and mobile deployments are worse.
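The quorum arithmetic itself is small enough to sketch in full. With n replicas, a write acknowledged by w nodes and a read from r nodes are guaranteed to overlap whenever w + r > n, and a minority side of a split can never assemble a write quorum:

```elixir
defmodule MiniDB.Quorum do
  # Reads and writes share at least one replica when w + r > n.
  def overlapping?(n, w, r), do: w + r > n

  # Strict majority: the smallest quorum a split cluster can't have twice.
  def majority(n), do: div(n, 2) + 1
end

MiniDB.Quorum.majority(5)            # => 3, so a 2-node side of a split can't write
MiniDB.Quorum.overlapping?(3, 2, 2)  # => true, a common default for n = 3
```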
8. Replication Lag and Read-Your-Writes
You write data to the primary node. You immediately read it back but hit a replica. The data isn’t there yet. Your user sees their action fail.
How do you guarantee read-your-writes consistency without forcing all reads to the primary? Session affinity? Version vectors? Waiting for replication confirmation?
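One approach, sketched below with illustrative names: every write returns a monotonically increasing version, the client's session carries the highest version it has written, and a replica only answers a read once it has replicated past that version:

```elixir
defmodule MiniDB.Session do
  def new, do: %{min_version: 0}

  # Client side: after a write, remember the version the primary assigned.
  def after_write(session, version), do: %{session | min_version: version}

  # Replica side: serve the read only if local replication has caught up;
  # otherwise the router retries another replica or falls back to the primary.
  def read(replica_version, session, value) do
    if replica_version >= session.min_version do
      {:ok, value}
    else
      {:error, :replica_behind}
    end
  end
end

session = MiniDB.Session.new() |> MiniDB.Session.after_write(17)
MiniDB.Session.read(12, session, :whatever)  # => {:error, :replica_behind}
MiniDB.Session.read(19, session, "v")        # => {:ok, "v"}
```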
9. Query Optimization in Distributed Context
Traditional query optimizers assume data locality. When tables are sharded across nodes, moving data vs. moving computation becomes critical. The network becomes your bottleneck, not the disk.
Do you push computation to the data? Pull data to a coordinator? How do you handle joins across shards? What about aggregations?
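Pushing computation to the data can be sketched with plain distributed Erlang: each node computes a partial aggregate over its local shard, and only the small partial results cross the network. `MiniDB.Shard.local_count/1` is a hypothetical per-node function; `:erpc.call/4` is the standard OTP remote call (OTP 23+):

```elixir
defmodule MiniDB.Query do
  # Scatter the count to every node, gather and sum the partials.
  def count(nodes, table) do
    nodes
    |> Enum.map(fn node ->
      Task.async(fn -> :erpc.call(node, MiniDB.Shard, :local_count, [table]) end)
    end)
    |> Task.await_many()
    |> Enum.sum()
  end
end
```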
10. Time and Ordering
Distributed systems have no global clock. Event ordering becomes ambiguous. Did A happen before B, or is that just network delay?
Physical clocks drift. Logical clocks (Lamport, vector clocks) help but don’t solve everything. Hybrid logical clocks try to give you both. Each adds complexity.
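The simplest of these, a Lamport clock, fits in a few lines: a counter that ticks on every local event and jumps forward on every receive, giving an ordering consistent with causality but unrelated to wall-clock time:

```elixir
defmodule MiniDB.Lamport do
  def new, do: 0

  # Local event or send: tick.
  def tick(clock), do: clock + 1

  # On receive: jump past whatever the sender had seen, then tick.
  def receive_at(clock, remote_clock), do: max(clock, remote_clock) + 1
end

c = MiniDB.Lamport.new() |> MiniDB.Lamport.tick()  # => 1, a local write
MiniDB.Lamport.receive_at(c, 41)                   # => 42, after a message from a busier node
```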
11. Operational Complexity
Monitoring, debugging, and diagnosing issues in distributed systems is exponentially harder. When a query is slow, is it the network? A slow node? Lock contention? All three?
You need:
- Distributed tracing across nodes
- Metrics aggregation
- Log correlation
- Anomaly detection
- Capacity planning that accounts for replication overhead
And you need all of this before you have users, because debugging under load is too late.
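For the metrics piece, the Elixir ecosystem's convention is the `:telemetry` package. Assuming it as a dependency, instrumenting the query path could look like this sketch (the event names are made up for illustration):

```elixir
defmodule MiniDB.Metrics do
  # Attach a handler that flags slow queries and tags them with the node
  # that served them, so slowness can be correlated across the cluster.
  def setup do
    :telemetry.attach("minidb-query-logger", [:minidb, :query, :stop],
      &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:minidb, :query, :stop], %{duration_us: d}, meta, _config) do
    if d > 50_000, do: IO.puts("slow query on #{meta.node}: #{d}us")
  end
end

# Emitted at each query site:
:telemetry.execute([:minidb, :query, :stop], %{duration_us: 1234}, %{node: node()})
```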
What Success Looks Like
I’m not trying to build the next Cassandra or CockroachDB. Success for this project means:
- Learning: Deeply understanding distributed systems trade-offs
- Working code: A functional proof-of-concept that handles basic operations across multiple nodes, survives node failures, and resolves conflicts sensibly
- Documentation: Sharing the journey and lessons learned
- Reusable patterns: Code that others can learn from and fork
- Battle scars: Real experience with the edge cases that only appear when things break
The goal is exploration, not production-readiness. But who knows? Sometimes side projects become serious tools. And even if this never sees production, understanding these problems deeply makes you a better engineer.
Why This Matters (Beyond the Tech)
In the Brazilian market, I’ve seen companies pay massive costs for database infrastructure that doesn’t fit their actual needs. Understanding these trade-offs means making better architectural decisions even when using existing databases.
Plus, there’s something deeply satisfying about building something from scratch, watching it break in interesting ways, and making it work again.
Next Steps
In the next post, I’ll dive into the architecture fundamentals, how Elixir’s process model maps to database operations, and why OTP supervision trees might be the secret weapon for fault-tolerant storage.
Have you built distributed systems? What challenges surprised you most? Which of these problems seems most interesting to you? Let me know in the comments.
Just don’t wait on me; building the database is probably going to be enough of a challenge. The post will be out when it’s supposed to be out.