The BEAM: A Virtual Machine Built for Distribution
Before we dive into database specifics, let’s talk about why Elixir is uniquely positioned for this challenge. Elixir runs on the BEAM (Bogdan/Björn’s Erlang Abstract Machine), the Erlang virtual machine that grew out of Ericsson’s work in the late 1980s on telecom systems that could never go down.
That heritage matters. Telecom switches need:
- Massive concurrency (handling millions of simultaneous calls)
- Fault isolation (one dropped call shouldn’t crash the system)
- Hot code reloading (you can’t shut down a phone network for upgrades)
- Distribution (switches are inherently distributed)
Sound familiar? These are exactly the properties we need for a distributed database.
Processes: Lightweight and Isolated
In Elixir, everything runs in processes. Not OS processes — these are BEAM processes, which are:
- Extremely lightweight: ~2KB of memory each, you can run millions on a single machine
- Completely isolated: one process crashing doesn’t affect others
- Share nothing: communication happens via message passing
- Scheduled fairly: the BEAM preempts long-running processes
For a database, this means:
- Each client connection can be its own process
- Each shard or partition can be managed by dedicated processes
- Query execution can be parallelized trivially
- Failures are isolated — a bad query crashes its process, not your database
```elixir
# A simple example: each key-value pair could be its own process.
# KVRegistry is a Registry started elsewhere, e.g.
# {Registry, keys: :unique, name: KVRegistry}
defmodule KVStore do
  use GenServer

  def start_link(key) do
    GenServer.start_link(__MODULE__, %{}, name: via_tuple(key))
  end

  def put(key, value) do
    GenServer.call(via_tuple(key), {:put, value})
  end

  def get(key) do
    GenServer.call(via_tuple(key), :get)
  end

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call({:put, value}, _from, _current), do: {:reply, :ok, value}
  def handle_call(:get, _from, current), do: {:reply, {:ok, current}, current}

  defp via_tuple(key), do: {:via, Registry, {KVRegistry, key}}
end
```
OTP: The Framework We Don’t Have to Build
OTP (Open Telecom Platform — ignore the name, it’s generic now) is the secret sauce. It provides battle-tested patterns for building fault-tolerant systems.
Supervision Trees: Fault Tolerance by Design
A supervisor is a process that monitors other processes and restarts them when they crash. You compose these into trees that define your system’s fault tolerance strategy.
```elixir
defmodule MyDB.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def init(_init_arg) do
    children = [
      # Storage layer processes
      {MyDB.Storage.Supervisor, []},
      # Cluster membership and gossip
      {MyDB.Cluster.Membership, []},
      # Metadata Raft cluster (CP for schemas, collections, ring state)
      {MyDB.Metadata.RaftCluster, []},
      # Query coordinator
      {MyDB.Query.Coordinator, []},
      # Replication manager (AP for data plane)
      {MyDB.Replication.Manager, []},
      # Anti-entropy (for eventual consistency)
      {MyDB.AntiEntropy, []}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```
What’s beautiful here is the restart strategies:
- :one_for_one: if one child dies, restart only that child
- :one_for_all: if one child dies, restart all children
- :rest_for_one: if a child dies, restart it and all children started after it
For databases, this means defining exactly how your system recovers from failures. Storage process crashes? Restart it and reload from disk. Entire shard coordinator fails? Restart the whole coordination subsystem.
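For example, a storage subsystem might choose :rest_for_one so that anything depending on the disk writer gets restarted along with it. This is a sketch with illustrative module names, not the project’s actual layout:

```elixir
defmodule MyDB.Storage.Supervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    children = [
      # Started first; the WAL and cache below depend on it
      {MyDB.Storage.DiskWriter, []},
      {MyDB.Storage.WAL, []},
      {MyDB.Storage.Cache, []}
    ]

    # :rest_for_one - if DiskWriter dies, WAL and Cache are restarted too;
    # if only Cache dies, only Cache is restarted
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```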
GenServer: State Machines Made Easy
GenServer (Generic Server) is the pattern for building stateful processes. Most database components map naturally to GenServers:
```elixir
defmodule MyDB.Shard do
  use GenServer

  # Shard state includes:
  # - Local RocksDB handle
  # - Replication log position
  # - Read/write locks
  # - Pending transactions

  def init(shard_id) do
    {:ok, db} = RocksDB.open("/data/shard_#{shard_id}")

    state = %{
      shard_id: shard_id,
      db: db,
      replicas: [],
      version_vector: %{}
    }

    {:ok, state}
  end

  def handle_call({:get, key}, _from, state) do
    case RocksDB.get(state.db, key) do
      {:ok, value} -> {:reply, {:ok, value}, state}
      :not_found -> {:reply, :not_found, state}
    end
  end

  def handle_call({:put, key, value}, _from, state) do
    :ok = RocksDB.put(state.db, key, value)
    new_state = update_version_vector(state)
    replicate_to_peers(key, value, state.replicas)
    {:reply, :ok, new_state}
  end
end
```
Native Clustering: Distribution Without Libraries
Here’s where it gets interesting. The BEAM has built-in distribution. You can connect nodes together and they can communicate transparently:
```elixir
# Start a node
iex --name db1@192.168.1.10 --cookie secret_token

# Connect to another node
Node.connect(:"db2@192.168.1.11")

# Now you can call functions on remote nodes
Node.list()
#=> [:"db2@192.168.1.11", :"db3@192.168.1.12"]

# Spawn a process on a remote node
pid = Node.spawn(:"db2@192.168.1.11", fn ->
  IO.puts("Running on db2")
end)

# Send messages to processes anywhere in the cluster
send({:shard_manager, :"db2@192.168.1.11"}, {:replicate, key, value})
```
This is huge for building distributed databases. You don’t need to:
- Serialize/deserialize messages manually
- Implement RPC protocols
- Manage connection pools
- Handle reconnection logic (the BEAM does it)
The BEAM’s distribution gives you:
- Transparent remote calls: GenServer.call({:shard_1, :node2}, :get) works whether the process is local or remote
- Location transparency: processes have PIDs that work across nodes
- Automatic reconnection: nodes reconnect automatically after network partitions
- Built-in authentication: the shared cookie mechanism
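To make the “transparent” part concrete, here’s what addressing a named process on another node looks like; the module name, node names, and data are made up for illustration:

```elixir
# {registered_name, node} reaches a named process on another node;
# the call looks exactly like a local GenServer.call/2
GenServer.call({MyDB.ShardManager, :"db2@192.168.1.11"}, {:get, "user:42"})

# PIDs are location-transparent: a pid from any node can receive
# messages directly, no explicit RPC layer needed
remote_pid =
  Node.spawn(:"db2@192.168.1.11", fn ->
    receive do
      msg -> IO.inspect(msg, label: "replicated")
    end
  end)

send(remote_pid, {:replicate, "user:42", %{name: "Ada"}})
```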
The Ecosystem: Standing on Giants’ Shoulders
Elixir’s ecosystem has matured with libraries specifically for distributed systems:
libcluster: Automatic Cluster Formation
libcluster handles cluster topology and node discovery:
```elixir
config :libcluster,
  topologies: [
    mydb_cluster: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        multicast_addr: "230.1.1.251"
        # Or use DNS, Kubernetes, Consul, etc.
      ]
    ]
  ]
```
This solves the “how do nodes find each other?” problem. In production, you might use:
- Kubernetes: nodes discover via K8s API
- DNS: query for service records
- Consul/etcd: service discovery
- Gossip: multicast for local networks
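Swapping strategies is just configuration. For instance, a Kubernetes setup might look roughly like this (the selector, basename, and namespace are placeholders for your deployment):

```elixir
config :libcluster,
  topologies: [
    mydb_cluster: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :dns,
        kubernetes_node_basename: "mydb",
        kubernetes_selector: "app=mydb",
        kubernetes_namespace: "databases"
      ]
    ]
  ]
```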
Phoenix.PubSub: Cluster-Wide Messaging
Phoenix.PubSub provides pub/sub across your cluster:
```elixir
# Subscribe to replication events
Phoenix.PubSub.subscribe(MyDB.PubSub, "shard:#{shard_id}")

# Publish to all nodes
Phoenix.PubSub.broadcast(
  MyDB.PubSub,
  "shard:#{shard_id}",
  {:replicate, key, value, version}
)
```
This is perfect for:
- Change data capture (CDC)
- Notifying replicas about writes
- Coordinating distributed transactions
- Gossip protocols for cluster state
The beautiful part: it works the same whether you have 2 nodes or 200.
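On the receiving side, a replica just subscribes and handles the message in handle_info/2. A rough sketch, where MyDB.Storage.apply_write/3 stands in for whatever the local storage layer exposes:

```elixir
defmodule MyDB.Replica.Listener do
  use GenServer

  def start_link(shard_id) do
    GenServer.start_link(__MODULE__, shard_id)
  end

  @impl true
  def init(shard_id) do
    # Every subscriber to this topic, on every node, receives the broadcast
    Phoenix.PubSub.subscribe(MyDB.PubSub, "shard:#{shard_id}")
    {:ok, %{shard_id: shard_id}}
  end

  @impl true
  def handle_info({:replicate, key, value, version}, state) do
    # Placeholder for the local storage call
    MyDB.Storage.apply_write(key, value, version)
    {:noreply, state}
  end
end
```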
Other Ecosystem Gems
- :pg: Built-in process groups for clustering
- Horde: Distributed process registry with CRDT-based conflict resolution (though I’m not using it — more on that below)
- Swarm: Another distributed process registry with handoff support
- Riak Core: The framework from Riak DB, implementing consistent hashing and vnodes
- Nebulex: Distributed caching library with various backends
- ExHashRing: Consistent hashing implementations
- Partisan: Alternative distribution layer for large clusters (not using yet, but keeping an eye on it)
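As a quick taste of :pg (built into OTP since version 23), process groups take only a few calls; the group name here is arbitrary:

```elixir
# Start the default :pg scope (normally placed in a supervision tree)
{:ok, _pg} = :pg.start_link()

# Join the current process to a cluster-wide group
:ok = :pg.join(:shard_owners, self())

# Any connected node can list the members, local and remote
:pg.get_members(:shard_owners)
#=> [#PID<0.245.0>]
```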
My Architecture Decisions: Custom Gossip and Hybrid Consistency
I’m taking a somewhat unconventional approach that’s worth explaining:
Custom Gossip Instead of Horde
While Horde is excellent for distributed process registries, I decided to build a custom gossip protocol for sharing ring data across the cluster. Why?
Control over metadata: I need fine-grained control over schema information, collection metadata, and ring state. Using a custom gossip lets me optimize for my specific needs.
Multi-AZ future: I’m currently using :rpc for inter-node communication, but I’ve abstracted it behind an interface. This will let me swap in different transport mechanisms for multi-availability-zone deployments where cross-AZ latency matters.
```elixir
defmodule MyDB.Cluster.Gossip do
  # Custom gossip spreads ring state, schema changes, and membership.
  # Communication layer is abstracted for future multi-AZ support.

  def propagate_ring_update(ring_state) do
    peers = get_gossip_peers()

    Enum.each(peers, fn peer ->
      # Currently :rpc, but abstracted for future transport changes
      MyDB.Transport.call(peer, __MODULE__, :receive_ring_update, [ring_state])
    end)
  end
end
```
This is essentially rebuilding parts of Riak Core, but with better control over metadata handling — something Riak struggled with.
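Since the gossip code above goes through MyDB.Transport, here’s a minimal sketch of that abstraction with :rpc as the only implementation for now; the exact callback shape is illustrative:

```elixir
defmodule MyDB.Transport do
  # Keeping call sites transport-agnostic; a multi-AZ transport can
  # implement the same callback later
  @callback call(node(), module(), atom(), list()) :: term()

  # Current implementation: Erlang's built-in :rpc
  def call(node, module, fun, args) do
    :rpc.call(node, module, fun, args)
  end
end
```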
Partisan: Not Yet, But Maybe Later
For now, I’m sticking with Erlang’s built-in distribution. It works well up to ~100 nodes with full mesh connectivity. Beyond that, the network chatter becomes problematic — every node talking to every other node doesn’t scale linearly.
Partisan offers pluggable topologies (partial mesh, client-server, etc.) that would be necessary for very large deployments. It’s on the roadmap for when/if we need to scale past 100 nodes, but premature optimization is the root of all evil. Start simple, scale when needed.
Hybrid CP/AP: The Best of Both Worlds?
Here’s the core architectural decision: the metadata plane is CP (consistent), the data plane is AP (available).
Metadata Plane (CP via Raft):
- Schema definitions
- Collection/table structures
- Ring state and token assignments
- Cluster membership authoritative state
I’m using a Raft cluster for metadata consensus. This avoids the problems Cassandra has where schema changes can conflict during network partitions. When you need to add a column or change a schema, consistency matters more than availability.
Data Plane (AP):
- Actual document reads and writes
- Eventual consistency with tunable quorums
- Vector clocks for conflict resolution
- High availability even during partitions
For data operations, availability wins. If a network partition happens, both sides should continue serving requests. We’ll reconcile conflicts later using vector clocks and CRDTs where applicable.
This hybrid approach means:
- Schema changes are safe and consistent
- Data operations are always available
- No single point of failure for reads/writes
- Clear separation of concerns
How This Maps to Database Operations
Let’s connect this to actual database functionality:
Write Path (Data Plane — AP)
- Client connects to any node (libcluster ensures cluster membership)
- Consistent hashing determines which vnodes own the key
- Custom gossip provides up-to-date ring information
- :rpc call to vnode processes (abstracted for future multi-AZ)
- Vnodes write to local RocksDB with vector clocks
- Replicas receive writes asynchronously
- Quorum of replicas acknowledge (tunable: W=1, W=2, W=all)
- Return success to client
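Here’s a rough sketch of the quorum step, assuming each replica exposes a MyDB.Shard.write/3 entry point and reusing the transport layer from earlier; this is illustrative, not the actual coordinator:

```elixir
defmodule MyDB.Query.QuorumWrite do
  # Fan the write out to every replica, then count acknowledgements
  # until the configured write quorum `w` is met.
  # MyDB.Shard.write/3 is assumed to be the per-replica entry point.
  def write(replica_nodes, key, value, clock, w) do
    tasks =
      Enum.map(replica_nodes, fn node ->
        Task.async(fn ->
          MyDB.Transport.call(node, MyDB.Shard, :write, [key, value, clock])
        end)
      end)

    acks =
      tasks
      |> Task.yield_many(5_000)
      |> Enum.count(fn {_task, result} -> result == {:ok, :ok} end)

    if acks >= w, do: :ok, else: {:error, :quorum_not_met}
  end
end
```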
Read Path (Data Plane — AP)
- Client connects to any node
- Consistent hashing finds responsible vnodes
- Read from R replicas (tunable: R=1 for speed, R=quorum for consistency)
- If conflicts are detected (divergent vector clocks), resolve via last-write-wins or application logic
- Return data to client
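Conflict detection in step 4 comes down to comparing vector clocks: if neither clock dominates the other, the writes were concurrent. A minimal sketch, representing clocks as maps of node name to counter:

```elixir
defmodule MyDB.VectorClock do
  # Clock `a` dominates `b` if every counter in `a` is >= the one in `b`
  # and at least one is strictly greater
  def dominates?(a, b) do
    keys = Enum.uniq(Map.keys(a) ++ Map.keys(b))

    Enum.all?(keys, fn k -> Map.get(a, k, 0) >= Map.get(b, k, 0) end) and
      Enum.any?(keys, fn k -> Map.get(a, k, 0) > Map.get(b, k, 0) end)
  end

  # Concurrent (i.e. conflicting) writes: different clocks, neither dominates
  def concurrent?(a, b) do
    a != b and not dominates?(a, b) and not dominates?(b, a)
  end
end

# Usage: if concurrent?, fall back to last-write-wins or application logic
MyDB.VectorClock.concurrent?(%{n1: 2, n2: 1}, %{n1: 1, n2: 2})
#=> true
```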
Schema Change (Metadata Plane — CP)
- Client submits schema change to any node
- Node forwards to Raft leader
- Raft cluster achieves consensus (majority agreement required)
- Schema change is committed to Raft log
- Change propagated to all nodes via Raft
- Custom gossip ensures all nodes have updated ring/schema state
- System continues with new schema
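To make that concrete, the metadata API I’m aiming for funnels every mutation through the Raft cluster; command/1 and query/1 below are hypothetical wrappers, shown only to illustrate where consensus sits:

```elixir
defmodule MyDB.Metadata do
  # Hypothetical: command/1 forwards to the current Raft leader and
  # returns only after a majority has committed the log entry
  def add_field(collection, field, type) do
    MyDB.Metadata.RaftCluster.command({:add_field, collection, field, type})
  end

  # Reads can be served from the locally replicated state machine
  def schema(collection) do
    MyDB.Metadata.RaftCluster.query({:schema, collection})
  end
end
```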
Handling Node Failure
- BEAM’s monitoring detects node down
- Gossip protocol updates cluster membership
- Raft cluster elects new leader if needed (for metadata)
- Vnodes’ data remains available on replica nodes
- Hinted handoff ensures failed node catches up when it returns
- System keeps serving requests with degraded redundancy
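Step 1 costs almost nothing to set up: any process can ask the kernel for node up/down notifications. A sketch of how the membership process might wire that in (the gossip call is a placeholder):

```elixir
defmodule MyDB.Cluster.Membership do
  use GenServer

  def start_link(arg) do
    GenServer.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    # Ask the BEAM to deliver {:nodeup, node} / {:nodedown, node} messages
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new(Node.list())}
  end

  @impl true
  def handle_info({:nodedown, node}, members) do
    # Placeholder: let the gossip layer spread the updated membership view
    MyDB.Cluster.Gossip.propagate_ring_update({:node_down, node})
    {:noreply, MapSet.delete(members, node)}
  end

  def handle_info({:nodeup, node}, members) do
    {:noreply, MapSet.put(members, node)}
  end
end
```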
The Trade-offs: What Elixir Doesn’t Give You
Let’s be honest about the limitations:
Performance ceiling: Erlang isn’t as fast as C++ or Rust for raw throughput. But for I/O-bound workloads (which databases are), it’s plenty fast, and the fault tolerance wins.
Memory overhead: Those millions of processes each use some memory. For very memory-constrained environments, this matters.
Learning curve: OTP patterns are different. Supervision trees and GenServers take time to understand deeply.
Serialization overhead: Remote calls serialize/deserialize data. For very large payloads, this adds latency.
Default distribution limits: Erlang distribution fully connects all nodes. Past ~100 nodes, Partisan or custom solutions become necessary to reduce network chatter.
The common thread? There’s no perfect database. Every choice trades off consistency, availability, latency, operational complexity, and developer experience. Understanding these trade-offs deeply — by building a database from scratch — makes you better equipped to make those architectural decisions.
Why This Matters for Database Building
Traditional databases are written in C/C++/Java/Go and implement their own:
- Thread pools
- Connection management
- Process supervision
- RPC protocols
- Cluster membership
- Failure detection
With Elixir, you get all of this for free. You can focus on:
- Storage engine integration
- Query optimization
- Conflict resolution strategies
- Consistency models
- The actual hard problems
That’s the difference between a 2-year project and a 6-month one.
Next Steps
In the next post, we’ll dive into the storage layer — why RocksDB, how to integrate it with Elixir processes, and the trade-offs between different storage engines.
We’ll also look at actual code: wrapping RocksDB in a GenServer, handling write-ahead logs, and implementing proper shutdown procedures so we don’t corrupt data.
Questions for Discussion:
- Have you used Elixir for stateful systems? What surprised you?
- What do you think about the hybrid CP/AP approach? Is separating metadata and data planes the right call?
- Have you scaled databases past 100 nodes? What were the bottlenecks?
- Is rebuilding Riak Core worth it for better metadata control, or should I just fork Riak Core?
Let me know in the comments!