The BEAM: A Virtual Machine Built for Distribution
Before we dive into database specifics, let’s talk about why Elixir is uniquely positioned for this challenge. Elixir runs on the BEAM (Bogdan/Björn’s Erlang Abstract Machine), the Erlang virtual machine that grew out of Ericsson’s work in the late 1980s on telecom systems that could never go down.
That heritage matters. Telecom switches need:
- Massive concurrency (handling millions of simultaneous calls)
- Fault isolation (one dropped call shouldn’t crash the system)
- Hot code reloading (you can’t shut down a phone network for upgrades)
- Distribution (switches are inherently distributed)
Sound familiar? These are exactly the properties we need for a distributed database.
Processes: Lightweight and Isolated
In Elixir, everything runs in processes. Not OS processes — these are BEAM processes, which are:
- Extremely lightweight: ~2KB of memory each, you can run millions on a single machine
- Completely isolated: one process crashing doesn’t affect others
- Share nothing: communication happens via message passing
- Scheduled fairly: the BEAM preempts long-running processes
For a database, this means:
- Each client connection can be its own process
- Each shard or partition can be managed by dedicated processes
- Query execution can be parallelized trivially
- Failures are isolated — a bad query crashes its process, not your database
```elixir
# A simple example: each key-value pair could be its own process.
# KVRegistry is a Registry started elsewhere, e.g.
# {Registry, keys: :unique, name: KVRegistry}
defmodule KVStore do
  use GenServer

  def start_link(key) do
    GenServer.start_link(__MODULE__, %{}, name: via_tuple(key))
  end

  def put(key, value) do
    GenServer.call(via_tuple(key), {:put, value})
  end

  def get(key) do
    GenServer.call(via_tuple(key), :get)
  end

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call({:put, value}, _from, _current), do: {:reply, :ok, value}
  def handle_call(:get, _from, current), do: {:reply, {:ok, current}, current}

  defp via_tuple(key), do: {:via, Registry, {KVRegistry, key}}
end
```
OTP: The Framework We Don’t Have to Build
OTP (Open Telecom Platform — ignore the name, it’s generic now) is the secret sauce. It provides battle-tested patterns for building fault-tolerant systems.
Supervision Trees: Fault Tolerance by Design
A supervisor is a process that monitors other processes and restarts them when they crash. You compose these into trees that define your system’s fault tolerance strategy.
```elixir
defmodule MyDB.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def init(_init_arg) do
    children = [
      # Storage layer processes
      {MyDB.Storage.Supervisor, []},
      # Cluster membership and gossip
      {MyDB.Cluster.Membership, []},
      # Metadata Raft cluster (CP for schemas, collections, ring state)
      {MyDB.Metadata.RaftCluster, []},
      # Query coordinator
      {MyDB.Query.Coordinator, []},
      # Replication manager (AP for data plane)
      {MyDB.Replication.Manager, []},
      # Anti-entropy (for eventual consistency)
      {MyDB.AntiEntropy, []}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```
What’s beautiful here is the restart strategies:
- :one_for_one: if one child dies, restart only that child
- :one_for_all: if one child dies, restart all children
- :rest_for_one: if a child dies, restart it and all children started after it
For databases, this means defining exactly how your system recovers from failures. Storage process crashes? Restart it and reload from disk. Entire shard coordinator fails? Restart the whole coordination subsystem.
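For example, a storage subsystem might choose :rest_for_one so that anything depending on the disk writer gets restarted along with it. This is a sketch with illustrative module names, not the project’s actual layout:

```elixir
defmodule MyDB.Storage.Supervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    children = [
      # Started first; the WAL and cache below depend on it
      {MyDB.Storage.DiskWriter, []},
      {MyDB.Storage.WAL, []},
      {MyDB.Storage.Cache, []}
    ]

    # :rest_for_one - if DiskWriter dies, WAL and Cache are restarted too;
    # if only Cache dies, only Cache is restarted
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```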
GenServer: State Machines Made Easy
GenServer (Generic Server) is the pattern for building stateful processes. Most database components map naturally to GenServers:
```elixir
defmodule MyDB.Shard do
  use GenServer

  # Shard state includes:
  # - Local RocksDB handle
  # - Replication log position
  # - Read/write locks
  # - Pending transactions

  def init(shard_id) do
    {:ok, db} = RocksDB.open("/data/shard_#{shard_id}")

    state = %{
      shard_id: shard_id,
      db: db,
      replicas: [],
      version_vector: %{}
    }

    {:ok, state}
  end

  def handle_call({:get, key}, _from, state) do
    case RocksDB.get(state.db, key) do
      {:ok, value} -> {:reply, {:ok, value}, state}
      :not_found -> {:reply, :not_found, state}
    end
  end

  def handle_call({:put, key, value}, _from, state) do
    :ok = RocksDB.put(state.db, key, value)
    new_state = update_version_vector(state)
    replicate_to_peers(key, value, state.replicas)
    {:reply, :ok, new_state}
  end
end
```
Native Clustering: Distribution Without Libraries
Here’s where it gets interesting. The BEAM has built-in distribution. You can connect nodes together and they can communicate transparently:
```elixir
# Start a node
iex --name db1@192.168.1.10 --cookie secret_token

# Connect to another node
Node.connect(:"db2@192.168.1.11")

# Now you can call functions on remote nodes
Node.list()
#=> [:"db2@192.168.1.11", :"db3@192.168.1.12"]

# Spawn a process on a remote node
pid = Node.spawn(:"db2@192.168.1.11", fn ->
  IO.puts("Running on db2")
end)

# Send messages to processes anywhere in the cluster
send({:shard_manager, :"db2@192.168.1.11"}, {:replicate, key, value})
```
This is huge for building distributed databases. You don’t need to:
- Serialize/deserialize messages manually
- Implement RPC protocols
- Manage connection pools
- Handle reconnection logic (the BEAM does it)
The BEAM’s distribution gives you:
- Transparent remote calls: GenServer.call({:shard_1, :node2}, :get) works whether the process is local or remote
- Location transparency: processes have PIDs that work across nodes
- Automatic reconnection: nodes reconnect automatically after network partitions
- Built-in authentication: the shared cookie mechanism
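To make the “transparent” part concrete, here’s what addressing a named process on another node looks like; the module name, node names, and data are made up for illustration:

```elixir
# {registered_name, node} reaches a named process on another node;
# the call looks exactly like a local GenServer.call/2
GenServer.call({MyDB.ShardManager, :"db2@192.168.1.11"}, {:get, "user:42"})

# PIDs are location-transparent: a pid from any node can receive
# messages directly, no explicit RPC layer needed
remote_pid =
  Node.spawn(:"db2@192.168.1.11", fn ->
    receive do
      msg -> IO.inspect(msg, label: "replicated")
    end
  end)

send(remote_pid, {:replicate, "user:42", %{name: "Ada"}})
```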
The Ecosystem: Standing on Giants’ Shoulders
Elixir’s ecosystem has matured with libraries specifically for distributed systems:
libcluster: Automatic Cluster Formation
libcluster handles cluster topology and node discovery:
```elixir
config :libcluster,
  topologies: [
    mydb_cluster: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        multicast_addr: "230.1.1.251"
        # Or use DNS, Kubernetes, Consul, etc.
      ]
    ]
  ]
```
This solves the “how do nodes find each other?” problem. In production, you might use:
- Kubernetes: nodes discover via K8s API
- DNS: query for service records
- Consul/etcd: service discovery
- Gossip: multicast for local networks
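Swapping strategies is just configuration. For instance, a Kubernetes setup might look roughly like this (the selector, basename, and namespace are placeholders for your deployment):

```elixir
config :libcluster,
  topologies: [
    mydb_cluster: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :dns,
        kubernetes_node_basename: "mydb",
        kubernetes_selector: "app=mydb",
        kubernetes_namespace: "databases"
      ]
    ]
  ]
```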
Phoenix.PubSub: Cluster-Wide Messaging
Phoenix.PubSub provides pub/sub across your cluster:
```elixir
# Subscribe to replication events
Phoenix.PubSub.subscribe(MyDB.PubSub, "shard:#{shard_id}")

# Publish to all nodes
Phoenix.PubSub.broadcast(
  MyDB.PubSub,
  "shard:#{shard_id}",
  {:replicate, key, value, version}
)
```
This is perfect for:
- Change data capture (CDC)
- Notifying replicas about writes
- Coordinating distributed transactions
- Gossip protocols for cluster state
The beautiful part: it works the same whether you have 2 nodes or 200.
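On the receiving side, a replica just subscribes and handles the message in handle_info/2. A rough sketch, where MyDB.Storage.apply_write/3 stands in for whatever the local storage layer exposes:

```elixir
defmodule MyDB.Replica.Listener do
  use GenServer

  def start_link(shard_id) do
    GenServer.start_link(__MODULE__, shard_id)
  end

  @impl true
  def init(shard_id) do
    # Every subscriber to this topic, on every node, receives the broadcast
    Phoenix.PubSub.subscribe(MyDB.PubSub, "shard:#{shard_id}")
    {:ok, %{shard_id: shard_id}}
  end

  @impl true
  def handle_info({:replicate, key, value, version}, state) do
    # Placeholder for the local storage call
    MyDB.Storage.apply_write(key, value, version)
    {:noreply, state}
  end
end
```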
Other Ecosystem Gems
- :pg: Built-in process groups for clustering
- Horde: Distributed process registry with CRDT-based conflict resolution (though I’m not using it — more on that below)
- Swarm: Another distributed process registry with handoff support
- Riak Core: The framework from Riak DB, implementing consistent hashing and vnodes
- Nebulex: Distributed caching library with various backends
- ExHashRing: Consistent hashing implementations
- Partisan: Alternative distribution layer for large clusters (not using yet, but keeping an eye on it)
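As a quick taste of :pg (built into OTP since version 23), process groups take only a few calls; the group name here is arbitrary:

```elixir
# Start the default :pg scope (normally placed in a supervision tree)
{:ok, _pg} = :pg.start_link()

# Join the current process to a cluster-wide group
:ok = :pg.join(:shard_owners, self())

# Any connected node can list the members, local and remote
:pg.get_members(:shard_owners)
#=> [#PID<0.245.0>]
```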
My Architecture Decisions: Custom Gossip and Hybrid Consistency
I’m taking a somewhat unconventional approach that’s worth explaining:
Custom Gossip Instead of Horde
While Horde is excellent for distributed process registries, I decided to build a custom gossip protocol for sharing ring data across the cluster. Why?
Control over metadata: I need fine-grained control over schema information, collection metadata, and ring state. Using a custom gossip lets me optimize for my specific needs.
Multi-AZ future: I’m currently using :rpc for inter-node communication, but I’ve abstracted it behind an interface. This will let me swap in different transport mechanisms for multi-availability-zone deployments where cross-AZ latency matters.
```elixir
defmodule MyDB.Cluster.Gossip do
  # Custom gossip spreads ring state, schema changes, and membership.
  # Communication layer is abstracted for future multi-AZ support.

  def propagate_ring_update(ring_state) do
    peers = get_gossip_peers()

    Enum.each(peers, fn peer ->
      # Currently :rpc, but abstracted for future transport changes
      MyDB.Transport.call(peer, __MODULE__, :receive_ring_update, [ring_state])
    end)
  end
end
```
This is essentially rebuilding parts of Riak Core, but with better control over metadata handling — something Riak struggled with.
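Since the gossip code above goes through MyDB.Transport, here’s a minimal sketch of that abstraction with :rpc as the only implementation for now; the exact callback shape is illustrative:

```elixir
defmodule MyDB.Transport do
  # Keeping call sites transport-agnostic; a multi-AZ transport can
  # implement the same callback later
  @callback call(node(), module(), atom(), list()) :: term()

  # Current implementation: Erlang's built-in :rpc
  def call(node, module, fun, args) do
    :rpc.call(node, module, fun, args)
  end
end
```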
Partisan: Not Yet, But Maybe Later
For now, I’m sticking with Erlang’s built-in distribution. It works well up to ~100 nodes with full mesh connectivity. Beyond that, the network chatter becomes problematic — every node talking to every other node doesn’t scale linearly.
Partisan offers pluggable topologies (partial mesh, client-server, etc.) that would be necessary for very large deployments. It’s on the roadmap for when/if we need to scale past 100 nodes, but premature optimization is the root of all evil. Start simple, scale when needed.
Hybrid CP/AP: The Best of Both Worlds?
Here’s the core architectural decision: the metadata plane is CP (consistent), the data plane is AP (available).
Metadata Plane (CP via Raft):
- Schema definitions
- Collection/table structures
- Ring state and token assignments
- Cluster membership authoritative state
I’m using a Raft cluster for metadata consensus. This avoids the problems Cassandra has where schema changes can conflict during network partitions. When you need to add a column or change a schema, consistency matters more than availability.
Data Plane (AP):
- Actual document reads and writes
- Eventual consistency with tunable quorums
- Vector clocks for conflict resolution
- High availability even during partitions
For data operations, availability wins. If a network partition happens, both sides should continue serving requests. We’ll reconcile conflicts later using vector clocks and CRDTs where applicable.
This hybrid approach means:
- Schema changes are safe and consistent
- Data operations are always available
- No single point of failure for reads/writes
- Clear separation of concerns
How This Maps to Database Operations
Let’s connect this to actual database functionality:
Write Path (Data Plane — AP)
- Client connects to any node (libcluster ensures cluster membership)
- Consistent hashing determines which vnodes own the key
- Custom gossip provides up-to-date ring information
- :rpc call to vnode processes (abstracted for future multi-AZ)
- Vnodes write to local RocksDB with vector clocks
- Replicas receive writes asynchronously
- Quorum of replicas acknowledge (tunable: W=1, W=2, W=all)
- Return success to client
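Here’s a rough sketch of the quorum step, assuming each replica exposes a MyDB.Shard.write/3 entry point and reusing the transport layer from earlier; this is illustrative, not the actual coordinator:

```elixir
defmodule MyDB.Query.QuorumWrite do
  # Fan the write out to every replica, then count acknowledgements
  # until the configured write quorum `w` is met.
  # MyDB.Shard.write/3 is assumed to be the per-replica entry point.
  def write(replica_nodes, key, value, clock, w) do
    tasks =
      Enum.map(replica_nodes, fn node ->
        Task.async(fn ->
          MyDB.Transport.call(node, MyDB.Shard, :write, [key, value, clock])
        end)
      end)

    acks =
      tasks
      |> Task.yield_many(5_000)
      |> Enum.count(fn {_task, result} -> result == {:ok, :ok} end)

    if acks >= w, do: :ok, else: {:error, :quorum_not_met}
  end
end
```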
Read Path (Data Plane — AP)
- Client connects to any node
- Consistent hashing finds responsible vnodes
- Read from R replicas (tunable: R=1 for speed, R=quorum for consistency)
- If conflicts are detected (divergent vector clocks), resolve via last-write-wins or application logic
- Return data to client
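Conflict detection in step 4 comes down to comparing vector clocks: if neither clock dominates the other, the writes were concurrent. A minimal sketch, representing clocks as maps of node name to counter:

```elixir
defmodule MyDB.VectorClock do
  # Clock `a` dominates `b` if every counter in `a` is >= the one in `b`
  # and at least one is strictly greater
  def dominates?(a, b) do
    keys = Enum.uniq(Map.keys(a) ++ Map.keys(b))

    Enum.all?(keys, fn k -> Map.get(a, k, 0) >= Map.get(b, k, 0) end) and
      Enum.any?(keys, fn k -> Map.get(a, k, 0) > Map.get(b, k, 0) end)
  end

  # Concurrent (i.e. conflicting) writes: different clocks, neither dominates
  def concurrent?(a, b) do
    a != b and not dominates?(a, b) and not dominates?(b, a)
  end
end

# Usage: if concurrent?, fall back to last-write-wins or application logic
MyDB.VectorClock.concurrent?(%{n1: 2, n2: 1}, %{n1: 1, n2: 2})
#=> true
```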
Schema Change (Metadata Plane — CP)
- Client submits schema change to any node
- Node forwards to Raft leader
- Raft cluster achieves consensus (majority agreement required)
- Schema change is committed to Raft log
- Change propagated to all nodes via Raft
- Custom gossip ensures all nodes have updated ring/schema state
- System continues with new schema
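To make that concrete, the metadata API I’m aiming for funnels every mutation through the Raft cluster; command/1 and query/1 below are hypothetical wrappers, shown only to illustrate where consensus sits:

```elixir
defmodule MyDB.Metadata do
  # Hypothetical: command/1 forwards to the current Raft leader and
  # returns only after a majority has committed the log entry
  def add_field(collection, field, type) do
    MyDB.Metadata.RaftCluster.command({:add_field, collection, field, type})
  end

  # Reads can be served from the locally replicated state machine
  def schema(collection) do
    MyDB.Metadata.RaftCluster.query({:schema, collection})
  end
end
```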
Handling Node Failure
- BEAM’s monitoring detects node down
- Gossip protocol updates cluster membership
- Raft cluster elects new leader if needed (for metadata)
- Vnodes’ data remains available on replica nodes
- Hinted handoff ensures failed node catches up when it returns
- System keeps serving requests with degraded redundancy
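Step 1 costs almost nothing to set up: any process can ask the kernel for node up/down notifications. A sketch of how the membership process might wire that in (the gossip call is a placeholder):

```elixir
defmodule MyDB.Cluster.Membership do
  use GenServer

  def start_link(arg) do
    GenServer.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    # Ask the BEAM to deliver {:nodeup, node} / {:nodedown, node} messages
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new(Node.list())}
  end

  @impl true
  def handle_info({:nodedown, node}, members) do
    # Placeholder: let the gossip layer spread the updated membership view
    MyDB.Cluster.Gossip.propagate_ring_update({:node_down, node})
    {:noreply, MapSet.delete(members, node)}
  end

  def handle_info({:nodeup, node}, members) do
    {:noreply, MapSet.put(members, node)}
  end
end
```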
The Trade-offs: What Elixir Doesn’t Give You
Let’s be honest about the limitations:
Performance ceiling: Erlang isn’t as fast as C++ or Rust for raw throughput. But for I/O-bound workloads (which databases are), it’s plenty fast, and the fault tolerance wins.
Memory overhead: Those millions of processes each use some memory. For very memory-constrained environments, this matters.
Learning curve: OTP patterns are different. Supervision trees and GenServers take time to understand deeply.
Serialization overhead: Remote calls serialize/deserialize data. For very large payloads, this adds latency.
Default distribution limits: Erlang distribution fully connects all nodes. Past ~100 nodes, Partisan or custom solutions become necessary to reduce network chatter.
The common thread? There’s no perfect database. Every choice trades off consistency, availability, latency, operational complexity, and developer experience. Understanding these trade-offs deeply — by building a database from scratch — makes you better equipped to make those architectural decisions.
Why This Matters for Database Building
Traditional databases are written in C/C++/Java/Go and implement their own:
- Thread pools
- Connection management
- Process supervision
- RPC protocols
- Cluster membership
- Failure detection
With Elixir, you get all of this for free. You can focus on:
- Storage engine integration
- Query optimization
- Conflict resolution strategies
- Consistency models
- The actual hard problems
That’s the difference between a 2-year project and a 6-month one.
Next Steps
In the next post, we’ll dive into the storage layer — why RocksDB, how to integrate it with Elixir processes, and the trade-offs between different storage engines.
We’ll also look at actual code: wrapping RocksDB in a GenServer, handling write-ahead logs, and implementing proper shutdown procedures so we don’t corrupt data.
Questions for Discussion:
- Have you used Elixir for stateful systems? What surprised you?
- What do you think about the hybrid CP/AP approach? Is separating metadata and data planes the right call?
- Have you scaled databases past 100 nodes? What were the bottlenecks?
- Is rebuilding Riak Core worth it for better metadata control, or should I just fork Riak Core?
Let me know in the comments!