Jan 04 ·
11 Min Read
It’s 3am. Production is down. You’re staring at a log line that says:
Error: serialization error: expected ',' or '}' at line 3, column 7
You know the JSON is broken. But you have zero idea why, where, or who caused it. Was it the config loader? The user API? The webhook consumer?
The error has successfully bubbled up through 20 layers of your stack, preserving its original message perfectly, yet losing every scrap of meaning along the way.
We have a name for this. We call it “Error Handling.” But in reality, it’s just Error Forwarding. We treat errors like hot potatoes—catch them, wrap them (maybe), and throw them up the stack as fast as possible.
You add a println!, restart the service, wait for the bug to reproduce. It’s going to be a long night.
As noted in a detailed analysis of error handling in a large Rust project:
“There’re tons of opinionated articles or libraries promoting their best practices, leading to an epic debate that never ends. We were all starting to notice that there was something wrong with the error handling practices, but pinpointing the exact problems is challenging.”
What’s Wrong with Current Practices
The std::error::Error Trait: A Noble but Flawed Abstraction
Rust’s std::error::Error trait assumes errors form a chain—each error has an optional source() pointing to the underlying cause. This works for most cases; the vast majority of errors have no source or a single one.
But as a standard library abstraction, it’s too opinionated. It categorically excludes cases where sources form a tree: a validation error with multiple field failures, a timeout with partial results. These scenarios exist, and the standard trait offers no way to represent them.
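To make the limitation concrete, here is a minimal std-only sketch. `FieldError`, `ValidationError`, and `causes_visible_via_source` are hypothetical names invented for this illustration. The validation error holds two independent causes, but a consumer that only walks `source()` can reach just one of them.

```rust
use std::error::Error;
use std::fmt;

/// One failed field in a form (hypothetical example type).
#[derive(Debug)]
pub struct FieldError {
    pub field: &'static str,
    pub reason: &'static str,
}

impl fmt::Display for FieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.field, self.reason)
    }
}

impl Error for FieldError {}

/// A validation error with several independent causes.
#[derive(Debug)]
pub struct ValidationError {
    pub failures: Vec<FieldError>,
}

impl fmt::Display for ValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "validation failed for {} field(s)", self.failures.len())
    }
}

impl Error for ValidationError {
    // `source()` can return at most one cause, so we are forced to pick one.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.failures.first().map(|e| e as &(dyn Error + 'static))
    }
}

/// Counts the causes reachable by following the standard `source()` chain.
pub fn causes_visible_via_source(err: &(dyn Error + 'static)) -> usize {
    let mut count = 0;
    let mut current = err.source();
    while let Some(cause) = current {
        count += 1;
        current = cause.source();
    }
    count
}

/// Builds an error with two failed fields.
pub fn sample_error() -> ValidationError {
    ValidationError {
        failures: vec![
            FieldError { field: "email", reason: "missing '@'" },
            FieldError { field: "age", reason: "must be positive" },
        ],
    }
}
```

The second failure is still stored in `failures`, but any generic reporter that relies on `source()` alone will never print it.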
Backtraces: Expensive Medicine for the Wrong Disease
Rust’s std::backtrace::Backtrace was meant to improve error observability. Backtraces are better than nothing, but they have serious limitations:
In async code, they’re nearly useless. Your backtrace will contain 49 stack frames, of which 12 are calls to GenFuture::poll(). The Async Working Group notes that suspended tasks are invisible to traditional stack traces.
They only show the origin, not the path. A backtrace tells you where the error was created, not the logical path it took through your application. It won’t tell you “this was the request handler for user X, calling service Y, with parameters Z.”
Capturing backtraces is expensive. The standard library documentation acknowledges: “Capturing a backtrace can be a quite expensive runtime operation.”
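As a rough illustration of the second and third points, the sketch below stores a backtrace inside an error. `ParseError` and `load_settings` are hypothetical; the `Backtrace` calls are the standard library's. `Backtrace::capture()` only does the work when `RUST_BACKTRACE` or `RUST_LIB_BACKTRACE` enables it, while `Backtrace::force_capture()` always walks the stack.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};
use std::fmt;

/// Hypothetical error type that records a backtrace when it is created.
#[derive(Debug)]
pub struct ParseError {
    pub message: String,
    pub backtrace: Backtrace,
}

impl ParseError {
    /// Captures only if RUST_BACKTRACE or RUST_LIB_BACKTRACE enables it.
    pub fn new(message: &str) -> Self {
        ParseError { message: message.to_string(), backtrace: Backtrace::capture() }
    }

    /// Always walks the stack, regardless of environment variables.
    pub fn new_forced(message: &str) -> Self {
        ParseError { message: message.to_string(), backtrace: Backtrace::force_capture() }
    }

    /// True when a backtrace was actually recorded.
    pub fn has_backtrace(&self) -> bool {
        self.backtrace.status() == BacktraceStatus::Captured
    }
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl std::error::Error for ParseError {}

/// The backtrace records the stack where `ParseError` was built. The business
/// context known here (`user_id`) is not part of it unless we add it ourselves.
pub fn load_settings(user_id: u64, raw: &str) -> Result<u32, ParseError> {
    let _ = user_id; // available to this function, invisible to the error
    raw.parse::<u32>()
        .map_err(|_| ParseError::new_forced("expected a number"))
}
```

Even with a full backtrace attached, the error message says nothing about which user was being processed.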
The Provide/Request API: Overengineering in Action
The Provider API (RFC 3192) and generic member access (RFC 2895) add dynamic type-based data access to errors:
    fn provide<'a>(&'a self, request: &mut Request<'a>) {
        request.provide_ref::<Backtrace>(&self.backtrace);
    }
The unstable Provide/Request API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime.
This sounds powerful. In practice, it introduces new problems:
Unpredictability: Your error might provide an HTTP status code. Or it might not. You won’t know until runtime.
Complexity: The API is subtle enough that LLVM struggles to optimize multiple provide calls.
Sometimes, a simple struct with named fields is better than a clever abstraction.
thiserror: Categorizing by Origin, Not by Action
thiserror makes it easy to define error enums:
    #[derive(Debug, thiserror::Error)]
    pub enum DatabaseError {
        #[error("connection failed: {0}")]
        Connection(#[from] ConnectionError),
        #[error("query failed: {0}")]
        Query(#[from] QueryError),
        #[error("serialization failed: {0}")]
        Serde(#[from] serde_json::Error),
    }
This looks reasonable. But notice how this common practice categorizes errors: by origin, not by what the caller can do about it.
When you receive a DatabaseError::Query, what should you do? Retry? Report to the user? Log and continue? The error doesn’t tell you. It just tells you which dependency failed.
As one blogger aptly put it: “This error type does not tell the caller what problem you are solving but how you solve it.”
anyhow: So Convenient You’ll Forget to Add Context
anyhow takes the opposite approach: type erasure. Just use anyhow::Result<T> everywhere and propagate with ?. No more enum variants, no more #[from] annotations.
The problem? It’s too convenient.
    fn process_request(req: Request) -> anyhow::Result<Response> {
        let user = db.get_user(req.user_id)?;
        let data = fetch_external_api(user.api_key)?;
        let result = compute(data)?;
        Ok(result)
    }
Every ? is a missed opportunity to add context. What was the user ID? What API were we calling? What computation failed? The error knows none of this.
The anyhow documentation encourages using .context() to add information. But .context() is optional—the type system doesn’t require it. “I’ll add context later” is the easiest lie to tell yourself. Later means never—until 3am when production is on fire.
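To show what a bare `?` loses, here is a std-only sketch. It does not use anyhow itself: `WithContext` and `ResultExt::with_context` are hypothetical stand-ins that imitate the idea of attaching a message to an underlying cause, and `parse_quota` is an invented example function.

```rust
use std::error::Error;
use std::fmt;

/// A message layered on top of an underlying cause (stand-in, not anyhow).
#[derive(Debug)]
pub struct WithContext {
    context: String,
    source: Box<dyn Error + Send + Sync + 'static>,
}

impl fmt::Display for WithContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl Error for WithContext {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Extension trait that adds a lazily built context message to any error.
pub trait ResultExt<T> {
    fn with_context<F: FnOnce() -> String>(self, f: F) -> Result<T, WithContext>;
}

impl<T, E: Error + Send + Sync + 'static> ResultExt<T> for Result<T, E> {
    fn with_context<F: FnOnce() -> String>(self, f: F) -> Result<T, WithContext> {
        self.map_err(|e| WithContext { context: f(), source: Box::new(e) })
    }
}

fn parse_quota(raw: &str) -> Result<u32, std::num::ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Bare `?`: compiles fine, but the error only says what the parser saw.
pub fn quota_bare(raw: &str) -> Result<u32, Box<dyn Error + Send + Sync>> {
    Ok(parse_quota(raw)?)
}

/// Same call with context: the error now names the user being processed.
pub fn quota_with_context(user_id: u64, raw: &str) -> Result<u32, WithContext> {
    parse_quota(raw).with_context(|| format!("failed to read quota for user {user_id}"))
}
```

Both functions compile, and nothing in the type system prefers the second over the first. That is the gap the paragraph above describes.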
The Problem: Error Handling Without Purpose
Consider this common pattern in Rust codebases:
    #[derive(thiserror::Error, Debug)]
    pub enum ServiceError {
        #[error("database error: {0}")]
        Database(#[from] sqlx::Error),
        #[error("http error: {0}")]
        Http(#[from] reqwest::Error),
        #[error("serialization error: {0}")]
        Serde(#[from] serde_json::Error),
        // ... ten more variants
    }
This looks reasonable. But ask yourself:
1. What can the caller do with ServiceError::Database? Can they retry? Should they show the raw SQL error to users? The error type doesn’t help answer these questions.
2. When debugging at 3 AM, does “serialization error: expected , or }” tell you which request, which field, which code path led here?
This is the fundamental disconnect in how we think about error handling. We focus on propagating errors exactly, on making the types line up, on satisfying the compiler. But we forget that errors are messages—messages that will eventually be read by either a machine trying to recover, or a human trying to debug.
The “Library vs Application” Myth
You’ve probably heard the conventional wisdom: “Use thiserror for libraries, anyhow for applications.”
It’s a nice, simple rule, just not quite right. As Luca Palmieri notes: “It is not the right framing. You need to reason about intent.”
The real question isn’t whether you’re writing a library or an application. The real question is: what do you expect the caller to do with this error?
Two Audiences, Two Needs
Let’s be explicit about who consumes errors and what they need:
| Audience | Goal | Needs |
|---|---|---|
| Machines | Automated recovery | Flat structure, clear error kinds, predictable codes |
| Humans | Debugging | Rich context, call path, business-level information |
When a retry middleware receives an error, it doesn’t care about your beautifully nested error chain. It just needs to know: is this retryable? A simple boolean or enum variant suffices.
When you’re debugging at 3am, you don’t need to know that somewhere deep in the stack there was an io::Error. You need to know: which file, which user, which request, what were we trying to do?
Most error handling designs optimize for neither audience. They optimize for the compiler.
For Machines: Flat, Actionable, Kind-Based
When errors need to be handled programmatically, complexity is the enemy. Your retry logic doesn’t want to traverse a nested error chain checking for specific variants. It wants to ask: is_retryable()?
Here’s a pattern that works, drawn from Apache OpenDAL’s error design:
    pub struct Error {
        kind: ErrorKind,
        message: String,
        status: ErrorStatus,
        operation: &'static str,
        context: Vec<(&'static str, String)>,
        source: Option<anyhow::Error>,
    }

    pub enum ErrorKind {
        NotFound,
        PermissionDenied,
        RateLimited,
        // ... categorized by what the caller CAN DO
    }

    pub enum ErrorStatus {
        Permanent,  // Don't retry
        Temporary,  // Safe to retry
        Persistent, // Was retried, still failing
    }
This design enables clear decision-making:
    // Caller can make informed decisions
    match result {
        Err(e) if e.kind() == ErrorKind::RateLimited && e.is_temporary() => {
            sleep(Duration::from_secs(1)).await;
            retry().await
        }
        Err(e) if e.kind() == ErrorKind::NotFound => {
            create_default().await
        }
        Err(e) => return Err(e),
        Ok(v) => v,
    }
Notice the key design decisions:
ErrorKind is categorized by response, not origin. NotFound means “the thing doesn’t exist, don’t retry.” RateLimited means “slow down and try again.” The caller doesn’t need to know whether it was an S3 404 or a filesystem ENOENT—they need to know what to do about it.
ErrorStatus is explicit. Instead of guessing retryability from error types, it’s a first-class field. Services can mark errors as temporary when they know a retry might help.
One Error type per library. Instead of scattering error enums across modules, a single flat structure keeps things simple. The context field provides all the specificity you need without type proliferation.
No more traversing error chains, no more guessing from error types. Just ask the error directly.
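The design decisions above can be sketched as a small std-only, synchronous program. This is an illustration under assumptions, not OpenDAL's actual API: the constructor and builder names (`new`, `temporary`, `persist`, `with_context`) and the `retry` helper are hypothetical, and the `source` field is omitted to keep the example dependency-free.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind { NotFound, PermissionDenied, RateLimited }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorStatus { Permanent, Temporary, Persistent }

#[derive(Debug)]
pub struct Error {
    kind: ErrorKind,
    message: String,
    status: ErrorStatus,
    operation: &'static str,
    context: Vec<(&'static str, String)>,
}

impl Error {
    /// Errors start out permanent; callers opt in to retryability.
    pub fn new(kind: ErrorKind, operation: &'static str, message: &str) -> Self {
        Error {
            kind,
            message: message.to_string(),
            status: ErrorStatus::Permanent,
            operation,
            context: Vec::new(),
        }
    }
    pub fn temporary(mut self) -> Self { self.status = ErrorStatus::Temporary; self }
    pub fn persist(mut self) -> Self { self.status = ErrorStatus::Persistent; self }
    pub fn with_context(mut self, key: &'static str, value: &str) -> Self {
        self.context.push((key, value.to_string()));
        self
    }
    pub fn kind(&self) -> ErrorKind { self.kind }
    pub fn status(&self) -> ErrorStatus { self.status }
    pub fn is_temporary(&self) -> bool { self.status == ErrorStatus::Temporary }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?} ({:?}) at {}: {}", self.kind, self.status, self.operation, self.message)?;
        for (key, value) in &self.context {
            write!(f, ", {key}: {value}")?;
        }
        Ok(())
    }
}

impl std::error::Error for Error {}

/// Retries only while the error says a retry may help. Once attempts run
/// out, the error is marked persistent so outer layers do not retry again.
pub fn retry<T>(max_attempts: usize, mut op: impl FnMut() -> Result<T, Error>) -> Result<T, Error> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if e.is_temporary() && attempt < max_attempts => attempt += 1,
            Err(e) if e.is_temporary() => return Err(e.persist()),
            Err(e) => return Err(e),
        }
    }
}
```

The retry helper never inspects where the error came from. It asks one question, `is_temporary()`, and a `NotFound` error passes straight through without a single retry.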
For Humans: Low-Friction Context Capture
The biggest enemy of good error context isn’t capability—it’s friction. If adding context is annoying, developers won’t do it.
The exn library (841 lines of Rust, zero dependencies) demonstrates one approach: errors form a tree of frames, each automatically capturing its source location via #[track_caller]. Unlike linear error chains, trees can represent multiple causes—useful when parallel operations fail or validation produces multiple errors.
Here’s what we need:
Automatic location capture. Instead of expensive backtraces, use #[track_caller] to capture file/line/column at negligible runtime cost. Every error frame should know where it was created.
Ergonomic context addition. The API for adding context should be so natural that not adding it feels wrong:
    fetch_user(user_id)
        .or_raise(|| AppError(format!("failed to fetch user {user_id}")))?;
Compare this to thiserror, where adding the same context requires defining a new variant and manual wrapping:
    #[derive(thiserror::Error, Debug)]
    pub enum AppError {
        #[error("failed to fetch user {user_id}: {source}")]
        FetchUser {
            user_id: String,
            #[source]
            source: DbError,
        },
        // ... one variant per call site that needs context
    }

    fn fetch_user(user_id: &str) -> Result<User, AppError> {
        // Return the mapped Result directly; a trailing `?` here would
        // unwrap it to a `User` and fail to type-check.
        db.query(user_id).map_err(|e| AppError::FetchUser {
            user_id: user_id.to_string(),
            source: e,
        })
    }
Enforce context at module boundaries. This is where exn differs critically from anyhow. With anyhow, every error is erased to anyhow::Error, so you can always use ? and move on; the type system won’t stop you. The context methods exist, but nothing prevents you from ignoring them.
exn takes a different approach: Exn<E> preserves the outermost error type. If your function returns Result<T, Exn<ServiceError>>, you can’t directly ? a Result<U, Exn<DatabaseError>>—the types don’t match. The compiler forces you to call or_raise() and provide a ServiceError, which is exactly the moment you should be adding context about what your module was trying to do.
    // This won't compile--type mismatch forces you to add context
    pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
        let user = db.query(user_id)?; // Error: expected Exn<ServiceError>, found Exn<DbError>
        Ok(user)
    }

    // You must provide context at the boundary
    pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
        let user = db.query(user_id)
            .or_raise(|| ServiceError(format!("failed to query user {user_id}")))?; // Now it compiles
        Ok(user)
    }
The type system becomes your ally: it won’t let you be lazy at module boundaries.
Here’s what this looks like in practice:
    pub async fn execute(&self, task: Task) -> Result<Output, ExecutorError> {
        let make_error = || ExecutorError(format!("failed to execute task {}", task.id));

        let user = self.fetch_user(task.user_id)
            .await
            .or_raise(make_error)?;

        let result = self.process(user)
            .or_raise(make_error)?;

        Ok(result)
    }
Every ? has context. When this fails at 3am, instead of the cryptic serialization error, you see:
    failed to execute task 7829, at src/executor.rs:45:12
    |
    |-> failed to fetch user "John Doe", at src/executor.rs:52:10
    |
    |-> connection refused, at src/client.rs:89:24
Now you know: it was task 7829, we were fetching a user, and the connection was refused. You can grep for that task ID in your request logs and find everything you need.
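For readers who want to see the moving parts, here is a std-only sketch of the two mechanisms described above: location capture through `#[track_caller]`, and a typed wrapper that makes a bare `?` fail to compile across module boundaries. It is a simplified linear model, not exn's implementation. `Traced`, `Frame`, `OrRaise`, `DbError`, and `ServiceError` are hypothetical names chosen for this example.

```rust
use std::fmt;
use std::panic::Location;

/// An inner layer, already rendered to text, plus where it was raised.
#[derive(Debug)]
pub struct Frame {
    message: String,
    location: &'static Location<'static>,
}

/// The outermost typed error `E`, its location, and the layers beneath it.
#[derive(Debug)]
pub struct Traced<E> {
    error: E,
    location: &'static Location<'static>,
    causes: Vec<Frame>, // ordered from outermost to innermost
}

impl<E: fmt::Display> Traced<E> {
    /// `#[track_caller]` makes `Location::caller()` report the caller's line.
    #[track_caller]
    pub fn new(error: E) -> Self {
        Traced { error, location: Location::caller(), causes: Vec::new() }
    }

    fn into_frames(self) -> Vec<Frame> {
        let mut frames = vec![Frame { message: self.error.to_string(), location: self.location }];
        frames.extend(self.causes);
        frames
    }
}

impl<E: fmt::Display> fmt::Display for Traced<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}, at {}", self.error, self.location)?;
        for frame in &self.causes {
            write!(f, "\n|\n|-> {}, at {}", frame.message, frame.location)?;
        }
        Ok(())
    }
}

pub trait OrRaise<T> {
    fn or_raise<E2: fmt::Display>(self, make: impl FnOnce() -> E2) -> Result<T, Traced<E2>>;
}

impl<T, E: fmt::Display> OrRaise<T> for Result<T, Traced<E>> {
    #[track_caller]
    fn or_raise<E2: fmt::Display>(self, make: impl FnOnce() -> E2) -> Result<T, Traced<E2>> {
        // Read the location here, not inside a closure: closures do not
        // inherit `#[track_caller]` from the enclosing function.
        let location = Location::caller();
        match self {
            Ok(value) => Ok(value),
            Err(inner) => Err(Traced { error: make(), location, causes: inner.into_frames() }),
        }
    }
}

#[derive(Debug)]
pub struct DbError(pub String);
impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { write!(f, "{}", self.0) }
}

#[derive(Debug)]
pub struct ServiceError(pub String);
impl fmt::Display for ServiceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { write!(f, "{}", self.0) }
}

/// Stand-in for a database call that always fails.
fn query_user(_user_id: u64) -> Result<String, Traced<DbError>> {
    Err(Traced::new(DbError("connection refused".to_string())))
}

pub fn fetch_user(user_id: u64) -> Result<String, Traced<ServiceError>> {
    // `query_user(user_id)?` would not compile here: there is no conversion
    // from `Traced<DbError>` to `Traced<ServiceError>`.
    query_user(user_id).or_raise(|| ServiceError(format!("failed to fetch user {user_id}")))
}
```

Printing the error returned by `fetch_user` yields one line per layer, each ending in the `file:line:column` where that layer was raised, without capturing a backtrace.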
Putting It Together
In real systems, you often need both: machine-readable errors for automated recovery, and human-readable context for debugging. The pattern: use a flat, kind-based error type (like Apache OpenDAL’s) for the structured data, and wrap it in a context-tracking mechanism for propagation.
    // Machine-oriented: flat struct with status
    pub struct StorageError {
        pub status: ErrorStatus,
        pub message: String,
    }

    // Human-oriented: propagate with context at each layer
    pub async fn save_document(doc: Document) -> Result<(), Exn<StorageError>> {
        let data = serialize(&doc)
            .or_raise(|| StorageError::permanent("serialization failed"))?;

        storage.write(&doc.path, data)
            .await
            .or_raise(|| StorageError::temporary("write failed"))?;

        Ok(())
    }
At the boundary, walk the error tree to find the structured error:
    // Extract a typed error from anywhere in the tree
    // (`T: 'static` is required for the `Any`-based downcast)
    fn find_error<T: 'static>(exn: &Exn<impl Error>) -> Option<&T> {
        fn walk<T: 'static>(frame: &Frame) -> Option<&T> {
            if let Some(e) = frame.as_any().downcast_ref::<T>() {
                return Some(e);
            }
            frame.children().iter().find_map(walk)
        }
        walk(exn.as_frame())
    }

    match save_document(doc).await {
        Ok(()) => Ok(()),
        Err(report) => {
            // For humans: log the full context tree
            log::error!("{:?}", report);

            // For machines: find and handle the structured error
            if let Some(err) = find_error::<StorageError>(&report) {
                if err.status == ErrorStatus::Temporary {
                    return queue_for_retry(report);
                }
                return Err(map_to_http_status(err.kind));
            }
            Err(StatusCode::INTERNAL_SERVER_ERROR)
        }
    }
Yes, you need to walk the tree—but compare this to the Provide/Request API. Here, you’re looking for a concrete type like StorageError. It has named fields. It has documentation. Your IDE can autocomplete it. No guessing, no runtime surprises—just a well-defined struct you can reason about and maintain.
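The tree walk itself needs nothing beyond the standard library. The sketch below models the error tree with a hypothetical `Node` type (not exn's `Frame`) and uses `downcast_ref` on `dyn Error` to pull out a concrete `StorageError`. `Note`, `sample_tree`, and the tree's shape are invented for illustration.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorStatus { Permanent, Temporary }

/// The structured, machine-oriented error we want to find.
#[derive(Debug)]
pub struct StorageError {
    pub status: ErrorStatus,
    pub message: String,
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl Error for StorageError {}

/// Plain human-oriented context with no structure.
#[derive(Debug)]
pub struct Note(pub &'static str);

impl fmt::Display for Note {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for Note {}

/// One node of an error tree: an error plus any number of causes.
pub struct Node {
    error: Box<dyn Error + 'static>,
    children: Vec<Node>,
}

impl Node {
    pub fn new(error: impl Error + 'static, children: Vec<Node>) -> Self {
        Node { error: Box::new(error), children }
    }
}

/// Depth-first search for the first error of concrete type `T`.
pub fn find_error<T: Error + 'static>(node: &Node) -> Option<&T> {
    if let Some(found) = node.error.downcast_ref::<T>() {
        return Some(found);
    }
    node.children.iter().find_map(|child| find_error::<T>(child))
}

/// Builds a small tree with a `StorageError` buried two levels down.
pub fn sample_tree() -> Node {
    let storage = StorageError {
        status: ErrorStatus::Temporary,
        message: "write failed".to_string(),
    };
    Node::new(
        Note("failed to save document"),
        vec![
            Node::new(Note("thumbnail skipped"), vec![]),
            Node::new(Note("upload failed"), vec![Node::new(storage, vec![])]),
        ],
    )
}
```

The caller gets back a plain `&StorageError` and reads `status` like any other field, while a type that is absent from the tree simply yields `None`.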
Conclusion
The next time you write a function, look at the Result return type.
Don’t think of it as “I might fail.” Think of it as “I might need to explain myself.”
If your error type can’t answer “Should I retry?”—you failed the Machine. If your error logs don’t answer “Which user was it?”—you failed the Human.
Errors aren’t just failure modes to be propagated. They’re communication. They’re the messages your system sends when things go wrong. And like any communication, they deserve to be designed.
Stop forwarding errors. Start designing them.
Resources
- OpenDAL Error Design RFC
- OpenDAL’s Error Handling Practices
- exn: Context-aware errors for Rust
- Error Handling in Large Rust Projects (GreptimeDB)
- A Guide to Error Handling that Just Works
- Study of std::io::Error
- Error Handling In Rust - A Deep Dive
- Tracking Issue for Provider API
- Async Stack Traces Working Group
Last edited Jan 04