When starting a distributed query engine project, it’s easy to get caught up in building out SQL parsers, optimizers, or connectors first.
I chose a different approach.
Before implementing any of these higher-level components, I built a minimal execution proof of concept (POC) in Rust. The goal wasn’t to build a full engine, but to test the core execution model itself.
Why Start with the Execution Layer?
In distributed query engines, execution semantics are critical. If the execution model doesn’t work, everything else falls apart.
I wanted to make sure the execution layer could be split across multiple workers, handle shuffle, and be stable before adding any complexity. Here’s what I aimed to verify with the POC:
- Queries should be modeled as execution plans, not SQL strings
- Execution should naturally split across workers
- Shuffle should prove unavoidable once aggregation is involved
Why Rust for This POC?
Rust’s unique features make it a great fit for building distributed systems like this one:
- Concurrency safety: Execution layers are inherently concurrent. Rust’s memory model helps prevent common concurrency issues at compile time.
- Async capabilities: Rust’s async model is well-suited for query execution, which is often I/O-bound and relies on non-blocking computation.
- Long-term maintainability: A distributed query engine is a long-term project. Rust helps ensure that code remains clean and safe as the system evolves.
What This POC Does (and Doesn’t Do)
What it does:
- Simulates a simple execution model: queries are modeled as operators (e.g., Scan, Filter, Aggregate).
- Splits execution across multiple workers: a coordinator manages worker tasks to simulate distributed execution.
- Handles shuffle: a minimal shuffle implementation is included to simulate data movement across workers.
What it doesn’t do:
- SQL parsing
- Cost-based optimization
- Network RPC
- Full fault tolerance
These features are vital for a real engine, but they aren’t necessary to validate the execution model.
A Minimal Execution Model
In this POC, I modeled queries as execution operators:
    // Each variant is one operator in the execution plan.
    enum Query {
        Scan { source: String },
        Filter { predicate: String },
        Aggregate { key: String, agg: String },
    }
The execution flow is deliberately constrained:
Scan -> Filter -> Shuffle -> Aggregate
This approach keeps the focus on execution semantics and avoids getting sidetracked by syntax or optimization.
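To make the constrained flow concrete, here is a minimal sketch of how a plan built from these operators could be expanded into ordered stages, with a Shuffle placed in front of each Aggregate. The helpers `physical_stages` and `example_plan`, the sample source name, and the predicate text are hypothetical and exist only for illustration; they are not taken from the POC.

```rust
// Mirrors the Query enum from the text; the derives are added for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Query {
    Scan { source: String },
    Filter { predicate: String },
    Aggregate { key: String, agg: String },
}

/// Hypothetical helper: expands a plan into ordered stage names,
/// inserting a Shuffle stage in front of every Aggregate.
fn physical_stages(plan: &[Query]) -> Vec<String> {
    let mut stages = Vec::new();
    for op in plan {
        match op {
            Query::Scan { source } => stages.push(format!("Scan({})", source)),
            Query::Filter { predicate } => stages.push(format!("Filter({})", predicate)),
            Query::Aggregate { key, agg } => {
                // Rows must be grouped by key before they can be aggregated.
                stages.push(format!("Shuffle(hash({}))", key));
                stages.push(format!("Aggregate({} by {})", agg, key));
            }
        }
    }
    stages
}

/// Hypothetical example plan: count "ok" events per user.
fn example_plan() -> Vec<Query> {
    vec![
        Query::Scan { source: "events".to_string() },
        Query::Filter { predicate: "status = 'ok'".to_string() },
        Query::Aggregate { key: "user_id".to_string(), agg: "count".to_string() },
    ]
}
```

Calling `physical_stages(&example_plan())` yields four stages in the order Scan, Filter, Shuffle, Aggregate. Note that no SQL string is involved at any point: the plan is plain data.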
Simulating Distribution Without Networking
To avoid overcomplicating things, the POC uses:
- A Coordinator to manage worker tasks
- Multiple async workers for parallel execution
- Channels for passing intermediate results
Here’s the basic layout:
    Coordinator
      |
      |-- Worker 1: Scan + Filter
      |-- Worker 2: Scan + Filter
      |
      +--> Aggregate
Even without networking, the execution model stays realistic: workers operate independently, and results are aggregated only after shuffling.
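The layout above can be sketched with the standard library alone. This version deliberately swaps the POC's async workers for OS threads and `std::sync::mpsc` channels so that it runs without an async runtime. The row shape, the `min_value` filter, and the names `run_worker` and `run_coordinator` are assumptions made for the example, not the POC's actual code.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// One input row as (key, value). The row shape is an assumption of this sketch.
type Row = (String, i64);

/// Worker task: scan its own partition, keep rows whose value is at least
/// `min_value` (a stand-in filter), and send the survivors downstream.
fn run_worker(partition: Vec<Row>, min_value: i64, out: mpsc::Sender<Row>) {
    for row in partition {
        if row.1 >= min_value {
            // A send error only means the receiver is gone, so it is ignored here.
            let _ = out.send(row);
        }
    }
}

/// Coordinator: one thread per partition, then a single step that sums values by key.
fn run_coordinator(partitions: Vec<Vec<Row>>, min_value: i64) -> HashMap<String, i64> {
    let (tx, rx) = mpsc::channel::<Row>();
    let mut handles = Vec::new();
    for partition in partitions {
        let tx = tx.clone();
        handles.push(thread::spawn(move || run_worker(partition, min_value, tx)));
    }
    // Drop the coordinator's own sender so the receive loop ends
    // once every worker has finished and dropped its clone.
    drop(tx);

    let mut totals: HashMap<String, i64> = HashMap::new();
    for (key, value) in rx {
        *totals.entry(key).or_insert(0) += value;
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    totals
}
```

The channel is the only thing workers and the aggregate step share, which is what would make it plausible to replace it later with a network transport without touching the operator logic.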
Why Shuffle is Unavoidable
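Even in this minimal POC, shuffle cannot be skipped: after Scan and Filter, rows with the same key can sit on different workers, and an aggregate is only correct once all rows for a key end up in the same place.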
I used a simple rule for shuffle:
hash(key) % worker_count
This introduces:
- Data redistribution
- Synchronization points
- Performance and resource trade-offs
It’s clear that shuffle is an inherent part of the system, not just an optimization.
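As a rough illustration of the `hash(key) % worker_count` rule, the sketch below routes (key, value) rows into one bucket per worker. The function names `partition_for` and `shuffle` and the row shape are hypothetical. `DefaultHasher` is used only for convenience: its algorithm is not guaranteed to stay the same across Rust releases, so a real engine would likely pin a specific hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Picks the target worker for a key: hash(key) % worker_count.
fn partition_for(key: &str, worker_count: usize) -> usize {
    assert!(worker_count > 0, "worker_count must be positive");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % worker_count as u64) as usize
}

/// Redistributes rows into one bucket per worker, so that every row
/// with the same key lands in the same bucket.
fn shuffle(rows: Vec<(String, i64)>, worker_count: usize) -> Vec<Vec<(String, i64)>> {
    let mut buckets: Vec<Vec<(String, i64)>> = vec![Vec::new(); worker_count];
    for (key, value) in rows {
        let target = partition_for(&key, worker_count);
        buckets[target].push((key, value));
    }
    buckets
}
```

The synchronization point mentioned above shows up here as well: an aggregate over one bucket is only final once every upstream worker has finished routing its rows.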
What This POC Proves
This POC isn’t about building a full engine. It’s about validating these core assumptions:
- The execution layer is the hardest part of a distributed query engine
- Shuffle is a system-level concern, not an optimization detail
- Rust is a good fit for these kinds of systems because it enforces clarity early on
Why Starting with a Minimal POC Matters
In distributed systems, mistakes in the execution layer lead to much bigger problems down the road.
By starting with a minimal execution model, I could identify these issues early and prevent them from accumulating as the project grew.
Conclusion
This minimal POC has helped me validate the foundational assumptions of the query engine. It’s a small step, but it’s crucial for ensuring that the execution model works before adding complexity.
If you’re building a distributed query engine or any similar system, starting small is often the most efficient way to avoid later headaches.