Commit d5ac24f

chore: minor changes on wording (#15)
1 parent ae510c8 commit d5ac24f

1 file changed

Lines changed: 32 additions & 48 deletions

src/content/post/stop-forwarding-errors-start-designing-them.mdx

@@ -11,7 +11,7 @@ It's 3am. Production is down. You're staring at a log line that says:
 Error: serialization error: expected ',' or '}' at line 3, column 7
 ```
 
-You know JSON is broke. But you have zero idea *why*, *where*, or *who* caused it. Was it the config loader? The user API? The webhook consumer?
+You know JSON is broken. But you have zero idea *why*, *where*, or *who* caused it. Was it the config loader? The user API? The webhook consumer?
 
 The error has successfully bubbled up through 20 layers of your stack, preserving its original message perfectly, yet losing every scrap of meaning along the way.
 
@@ -29,15 +29,15 @@ As noted in a [detailed analysis of error handling in a large Rust project](http
 
 ### The `std::error::Error` Trait: A Noble but Flawed Abstraction
 
-Rust's `std::error::Error` trait assumes errors form a chain--each error has an optional `source()` pointing to the underlying cause. This works for most cases; the vast majority of errors have no source or a single one.
+The standard `Error` trait is built around `source()`: one error optionally points to another. That matches a lot of failures.
 
-But as a *standard library* abstraction, it's too opinionated. It categorically excludes cases where sources form a tree: a validation error with multiple field failures, a timeout with partial results. These scenarios exist, and the standard trait offers no way to represent them.
+But some of the nastiest problems aren’t a single line of causality. Validation can fail in five places at once. A batch operation can partially succeed. Timeouts can come with partial results. Those want something closer to a set or a tree of causes, not a single chain.
 
 ### Backtraces: Expensive Medicine for the Wrong Disease
 
-Rust's `std::backtrace::Backtrace` was meant to improve error observability. They're better than nothing. But they have serious limitations:
+Rust's `std::backtrace::Backtrace` was meant to improve error observability. It's better than nothing. But they have serious limitations:
 
-**In async code, they're nearly useless.** Your backtrace will contain [49 stack frames, of which 12 are calls to `GenFuture::poll()`](https://github.com/rust-lang/rust/issues/74779). The [Async Working Group notes](https://rust-lang.github.io/wg-async/design_docs/async_stack_traces.html) that suspended tasks are invisible to traditional stack traces.
+**In async code, they can be noisy or misleading.** Your backtrace will contain [49 stack frames, of which 12 are calls to `GenFuture::poll()`](https://github.com/rust-lang/rust/issues/74779). The [Async Working Group notes](https://rust-lang.github.io/wg-async/design_docs/async_stack_traces.html) that suspended tasks are invisible to traditional stack traces.
 
 **They only show the origin, not the path.** A backtrace tells you where the error was *created*, not the logical path it took through your application. It won't tell you "this was the request handler for user X, calling service Y, with parameters Z."
 
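The new wording in this hunk argues that some failures need a set or tree of causes rather than the single chain that `source()` exposes. Below is a minimal std-only sketch of that limitation. The `ValidationError` and `FieldError` types and the `chain_messages` helper are hypothetical and do not come from the post. An error with two field failures can surface only one of them through `source()`, so a generic chain walker never sees the second.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical leaf error: one field that failed validation.
#[derive(Debug)]
pub struct FieldError {
    pub field: &'static str,
    pub reason: &'static str,
}

impl fmt::Display for FieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "field `{}`: {}", self.field, self.reason)
    }
}

impl Error for FieldError {}

/// Hypothetical validation error that failed in several places at once.
#[derive(Debug)]
pub struct ValidationError {
    pub failures: Vec<FieldError>,
}

impl fmt::Display for ValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "validation failed ({} fields)", self.failures.len())
    }
}

impl Error for ValidationError {
    // `source()` returns at most one error, so the best a single-chain API
    // can do is expose the first failure; the rest are invisible to it.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.failures.first().map(|e| e as &(dyn Error + 'static))
    }
}

/// Collects the message of every error reachable by following `source()`.
pub fn chain_messages(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(e) = current {
        messages.push(e.to_string());
        current = e.source();
    }
    messages
}
```

If you walk a two-failure `ValidationError` with `chain_messages`, you get the summary and the first failure only. The second failure still exists in the value but is absent from the chain. A tree-shaped representation is meant to close that gap.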
@@ -53,15 +53,13 @@ fn provide<'a>(&'a self, request: &mut Request<'a>) {
 }
 ```
 
-The unstable `Provide`/`Request` API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime.
-
-This sounds powerful. In practice, it introduces new problems:
+The unstable `Provide`/`Request` API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime. In practice, it introduces new problems:
 
 **Unpredictability**: Your error *might* provide an HTTP status code. Or it might not. You won't know until runtime.
 
 **Complexity**: The API is subtle enough that [LLVM struggles to optimize multiple provide calls](https://github.com/rust-lang/rfcs/pull/3192#issuecomment-1018020335).
 
-Sometimes, a simple struct with named fields is better than a clever abstraction.
+Most of the time, a boring struct with named fields is still the thing you want.
 
 ### `thiserror`: Categorizing by Origin, Not by Action
 
@@ -79,17 +77,17 @@ pub enum DatabaseError {
 }
 ```
 
-This looks reasonable. But notice how this common practice categorizes errors: by *origin*, not by *what the caller can do about it*.
+This looks reasonable. But notice how this common practice categorizes errors: by origin, not by what the caller can do about it.
 
-When you receive a `DatabaseError::Query`, what should you do? Retry? Report to the user? Log and continue? The error doesn't tell you. It just tells you which dependency failed.
+When you receive a `DatabaseError::Query`, what should you do? Retry? Report raw SQL to the user? The error doesn't tell you. It just tells you which dependency failed.
 
 As one blogger [aptly put it](https://mmapped.blog/posts/12-rust-error-handling): "This error type does not tell the caller what problem you are solving but how you solve it."
 
 ### `anyhow`: So Convenient You'll Forget to Add Context
 
 `anyhow` takes the opposite approach: type erasure. Just use `anyhow::Result<T>` everywhere and propagate with `?`. No more enum variants, no more `#[from]` annotations.
 
-The problem? It's *too* convenient.
+The problem is that it's *too* convenient.
 
 ```rust
 fn process_request(req: Request) -> anyhow::Result<Response> {
@@ -102,7 +100,7 @@ fn process_request(req: Request) -> anyhow::Result<Response> {
 
 Every `?` is a missed opportunity to add context. What was the user ID? What API were we calling? What computation failed? The error knows none of this.
 
-The `anyhow` documentation encourages using `.context()` to add information. But `.context()` is optional--the type system doesn't require it. "I'll add context later" is the easiest lie to tell yourself. Later means never--until 3am when production is on fire.
+The `anyhow` documentation encourages using `.context()` to add information. But `.context()` is optional--the type system doesn't require it. And "I'll add context later" is the easiest lie to tell yourself.
 
 ---
 
@@ -123,13 +121,13 @@ pub enum ServiceError {
 }
 ```
 
-This looks reasonable. But ask yourself:
+It looks neat, well-structured, and it compiles. But pause and ask:
 
-1. **What can the caller do with `ServiceError::Database`?** Can they retry? Should they show the raw SQL error to users? The error type doesn't help answer these questions.
+- If you are holding a `DatabaseError::Query`, is it retryable? Should you show the raw SQL error to users? The error type doesn't help answer these questions.
 
-2. **When debugging at 3 AM**, does "serialization error: expected `,` or `}`" tell you which request, which field, which code path led here?
+- When debugging, does "serialization error: expected `,` or `}`" tell you which request, which field, which code path led here?
 
-This is the fundamental disconnect in how we think about error handling. We focus on *propagating* errors exactly, on making the types line up, on satisfying the compiler. But we forget that errors are messages--messages that will eventually be read by either a machine trying to recover, or a human trying to debug.
+This is the fundamental disconnect in how we think about error handling. We focus on *propagating* errors exactly, on making the types line up. But we forget that errors are messages--messages that will eventually be read by either a machine trying to recover, or a human trying to debug.
 
 ## The "Library vs Application" Myth
 
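The rewritten bullets in this hunk ask two questions about a `DatabaseError::Query`: is it retryable, and is its raw SQL text safe to show to users? One way to make the type answer both is to store the answers as named fields and keep the raw driver text out of `Display`. The sketch below uses std only. The `QueryError` type, its fields, and `log_line` are hypothetical and are not taken from the post.

```rust
use std::fmt;

/// Hypothetical replacement for an origin-based `DatabaseError::Query`:
/// the fields answer the caller's questions directly.
#[derive(Debug)]
pub struct QueryError {
    /// Can the same call be retried as-is (for example after a deadlock)?
    pub retryable: bool,
    /// Text that is safe to show to an end user.
    pub public_message: &'static str,
    /// Raw driver text, which may contain SQL; meant for logs only.
    detail: String,
}

impl QueryError {
    pub fn new(retryable: bool, public_message: &'static str, detail: impl Into<String>) -> Self {
        QueryError { retryable, public_message, detail: detail.into() }
    }

    /// Full text for operators, including the raw driver detail.
    pub fn log_line(&self) -> String {
        format!("{} (retryable: {}): {}", self.public_message, self.retryable, self.detail)
    }
}

/// `Display` shows only the public message, so formatting the error for a
/// user cannot leak the raw SQL text.
impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.public_message)
    }
}

impl std::error::Error for QueryError {}
```

Because `detail` is private and absent from `Display`, a user-facing handler that formats the error cannot leak the query text. Operators still get the full message from `log_line()`. Whether a given driver error counts as retryable still has to be decided where the error is constructed.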
@@ -141,24 +139,18 @@ The real question isn't whether you're writing a library or an application. The
 
 ## Two Audiences, Two Needs
 
-Let's be explicit about who consumes errors and what they need:
-
 | Audience | Goal | Needs |
 |----------|------|-------|
 | **Machines** | Automated recovery | Flat structure, clear error kinds, predictable codes |
 | **Humans** | Debugging | Rich context, call path, business-level information |
 
-When a retry middleware receives an error, it doesn't care about your beautifully nested error chain. It just needs to know: *is this retryable?* A simple boolean or enum variant suffices.
-
-When you're debugging at 3am, you don't need to know that somewhere deep in the stack there was an `io::Error`. You need to know: *which file, which user, which request, what were we trying to do?*
-
-Most error handling designs optimize for neither audience. They optimize for *the compiler*.
+Most error handling designs optimize for neither. They optimize for *the compiler*.
 
 ### For Machines: Flat, Actionable, Kind-Based
 
 When errors need to be handled programmatically, complexity is the enemy. Your retry logic doesn't want to traverse a nested error chain checking for specific variants. It wants to ask: `is_retryable()?`
 
-Here's a pattern that works, drawn from [Apache OpenDAL's error design](https://github.com/apache/opendal/pull/977):
+[Apache OpenDAL's error design](https://github.com/apache/opendal/pull/977) shows one way to do this:
 
 ```rust
 pub struct Error {
@@ -184,10 +176,9 @@ pub enum ErrorStatus {
 }
 ```
 
-This design enables clear decision-making:
+Then the call site stays straightforward:
 
 ```rust
-// Caller can make informed decisions
 match result {
 Err(e) if e.kind() == ErrorKind::RateLimited && e.is_temporary() => {
 sleep(Duration::from_secs(1)).await;
@@ -201,7 +192,7 @@ match result {
 }
 ```
 
-Notice the key design decisions:
+A few things to note:
 
 **ErrorKind is categorized by response, not origin.** `NotFound` means "the thing doesn't exist, don't retry." `RateLimited` means "slow down and try again." The caller doesn't need to know whether it was an S3 404 or a filesystem ENOENT--they need to know what to do about it.
 
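The hunks above quote OpenDAL's split between two parts of an error. An `ErrorKind` says what happened from the caller's point of view, and an `ErrorStatus` says whether retrying could help. The std-only sketch below follows that idea but is not OpenDAL's actual type. The names `Error`, `ErrorKind`, and `ErrorStatus` mirror the post. `Decision` and `decide` are hypothetical, added only to show that a retry policy needs nothing beyond kind and status.

```rust
use std::fmt;

/// Categorized by what the caller should do, not by which backend failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    NotFound,
    RateLimited,
    Unexpected,
}

/// Whether retrying the same operation might succeed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorStatus {
    Permanent,
    Temporary,
}

/// Flat error: kind and status for machines, message and context for humans.
#[derive(Debug)]
pub struct Error {
    kind: ErrorKind,
    status: ErrorStatus,
    message: String,
    context: Vec<(&'static str, String)>,
}

impl Error {
    pub fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Error { kind, status: ErrorStatus::Permanent, message: message.into(), context: Vec::new() }
    }

    pub fn set_temporary(mut self) -> Self {
        self.status = ErrorStatus::Temporary;
        self
    }

    pub fn with_context(mut self, key: &'static str, value: impl Into<String>) -> Self {
        self.context.push((key, value.into()));
        self
    }

    pub fn kind(&self) -> ErrorKind {
        self.kind
    }

    pub fn is_temporary(&self) -> bool {
        self.status == ErrorStatus::Temporary
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?} ({:?}): {}", self.kind, self.status, self.message)?;
        for (key, value) in &self.context {
            write!(f, ", {key}: {value}")?;
        }
        Ok(())
    }
}

impl std::error::Error for Error {}

/// Hypothetical retry policy: a flat match on kind and status.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    TreatAsMissing,
    RetryAfterBackoff,
    GiveUp,
}

pub fn decide(err: &Error) -> Decision {
    match err.kind() {
        ErrorKind::NotFound => Decision::TreatAsMissing,
        _ if err.is_temporary() => Decision::RetryAfterBackoff,
        _ => Decision::GiveUp,
    }
}
```

`decide` never inspects where the error came from. A throttled S3 request and a local rate limiter would both arrive as `RateLimited` plus `Temporary`. The `context` pairs appear only in `Display`, which is the human-facing half.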
@@ -217,7 +208,7 @@ The biggest enemy of good error context isn't capability--it's friction. If addi
 
 The [exn](https://github.com/fast/exn) library (294 lines of Rust, zero dependencies) demonstrates one approach: errors form a *tree* of frames, each automatically capturing its source location via `#[track_caller]`. Unlike linear error chains, trees can represent multiple causes--useful when parallel operations fail or validation produces multiple errors.
 
-Here's what we need:
+The key ingredients:
 
 **Automatic location capture.** Instead of expensive backtraces, use `#[track_caller]` to capture file/line/column at **zero cost**. Every error frame should know where it was created.
 
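This hunk keeps the post's point that `#[track_caller]` lets each error frame record its file, line, and column. Below is a std-only sketch of the mechanism. The `Frame` type is hypothetical and is not exn's API. Inside a `#[track_caller]` function, `std::panic::Location::caller()` returns the location of the call site rather than a line in the function body. No stack walk or unwinding is involved.

```rust
use std::fmt;
use std::panic::Location;

/// Hypothetical error frame: a message plus the place it was created.
#[derive(Debug)]
pub struct Frame {
    pub message: String,
    pub location: &'static Location<'static>,
}

impl Frame {
    /// Because of `#[track_caller]`, `Location::caller()` reports the line
    /// that called `Frame::new`, not a line inside this function.
    #[track_caller]
    pub fn new(message: impl Into<String>) -> Self {
        Frame { message: message.into(), location: Location::caller() }
    }
}

impl fmt::Display for Frame {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let loc = self.location;
        write!(f, "{}, at {}:{}:{}", self.message, loc.file(), loc.line(), loc.column())
    }
}
```

Displaying a frame produces text in the same shape as the post's sample output line `failed to execute task 7829, at src/executor.rs:45:12`. The caller location only passes through functions that also carry `#[track_caller]`. If a helper without the attribute calls `Frame::new`, the frame records the helper's line instead.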
@@ -250,7 +241,7 @@ fn fetch_user(user_id: &str) -> Result<User, AppError> {
 }
 ```
 
-**Enforce context at module boundaries.** This is where exn differs critically from `anyhow`. With `anyhow`, every error is erased to `anyhow::Error`, so you can always use `?` and move on--the type system won't stop you. The context methods exist, but but *nothing* prevents you from ignoring them.
+**Enforce context at module boundaries.** This is where exn differs critically from `anyhow`. With `anyhow`, every error is erased to `anyhow::Error`, so you can always use `?` and move on--the type system won't stop you. The context methods exist, but *nothing* prevents you from ignoring them.
 
 exn takes a different approach: `Exn<E>` preserves the outermost error type. If your function returns `Result<T, Exn<ServiceError>>`, you can't directly `?` a `Result<U, Exn<DatabaseError>>`--the types don't match. The compiler *forces* you to call `or_raise()` and provide a `ServiceError`, which is exactly the moment you should be adding context about what your module was trying to do.
 
@@ -271,14 +262,18 @@ pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
 
 The type system becomes your ally: it won't let you be lazy at module boundaries.
 
-Here's what this looks like in practice:
+In practice:
 
 ```rust
 pub async fn execute(&self, task: Task) -> Result<Output, ExecutorError> {
 let make_error = || ExecutorError(format!("failed to execute task {}", task.id));
 
-let user = self.fetch_user(task.user_id).await.or_raise(make_error)?;
-let result = self.process(user).or_raise(make_error)?;
+let user = self.fetch_user(task.user_id)
+.await
+.or_raise(make_error)?;
+
+let result = self.process(user)
+.or_raise(make_error)?;
 
 Ok(result)
 }
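The hunks above describe how `Exn<E>` keeps the outermost error type. Because of that, `?` across a module boundary does not compile until `or_raise()` supplies a new error. The sketch below imitates the idea with std only. `Exn`, `OrRaise`, `ClientError`, `ServiceError`, `connect`, and `fetch_user` are hypothetical stand-ins written for illustration. They are not exn's real definitions or signatures.

```rust
use std::fmt;
use std::panic::Location;

/// Hypothetical stand-in for exn's `Exn<E>` (not its real definition): the
/// outermost error keeps its concrete type `E`; earlier frames are kept as
/// already-rendered lines, nearest cause first.
#[derive(Debug)]
pub struct Exn<E> {
    error: E,
    location: &'static Location<'static>,
    causes: Vec<String>,
}

impl<E: fmt::Display> Exn<E> {
    /// Starts a new error, recording the caller's file and line.
    #[track_caller]
    pub fn new(error: E) -> Self {
        Exn { error, location: Location::caller(), causes: Vec::new() }
    }

    /// The typed outermost error, for programmatic handling.
    pub fn error(&self) -> &E {
        &self.error
    }

    fn head(&self) -> String {
        format!("{}, at {}:{}", self.error, self.location.file(), self.location.line())
    }
}

impl<E: fmt::Display> fmt::Display for Exn<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.head())?;
        for cause in &self.causes {
            write!(f, "\n|-> {cause}")?;
        }
        Ok(())
    }
}

/// The only way from `Exn<Inner>` to `Exn<Outer>` is to build an `Outer`,
/// so crossing a boundary forces the caller to say what it was doing.
pub trait OrRaise<T> {
    fn or_raise<Outer, F>(self, make: F) -> Result<T, Exn<Outer>>
    where
        Outer: fmt::Display,
        F: FnOnce() -> Outer;
}

impl<T, Inner: fmt::Display> OrRaise<T> for Result<T, Exn<Inner>> {
    #[track_caller]
    fn or_raise<Outer, F>(self, make: F) -> Result<T, Exn<Outer>>
    where
        Outer: fmt::Display,
        F: FnOnce() -> Outer,
    {
        match self {
            Ok(value) => Ok(value),
            Err(inner) => {
                let mut causes = vec![inner.head()];
                causes.extend(inner.causes);
                Err(Exn { error: make(), location: Location::caller(), causes })
            }
        }
    }
}

/// Hypothetical client-level and service-level errors.
#[derive(Debug)]
pub struct ClientError(pub String);
#[derive(Debug)]
pub struct ServiceError(pub String);

impl fmt::Display for ClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl fmt::Display for ServiceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

/// Hypothetical client call that always fails.
pub fn connect() -> Result<(), Exn<ClientError>> {
    Err(Exn::new(ClientError("connection refused".to_string())))
}

pub fn fetch_user(user_id: u64) -> Result<String, Exn<ServiceError>> {
    // `connect()?` alone would not compile here: nothing converts
    // `Exn<ClientError>` into `Exn<ServiceError>`.
    connect().or_raise(|| ServiceError(format!("failed to fetch user {user_id}")))?;
    Ok(format!("user-{user_id}"))
}
```

Rendering the error from `fetch_user(42)` gives two lines in the same shape as the post's sample output: `failed to fetch user 42, at <file>:<line>` followed by `|-> connection refused, at <file>:<line>`. This sketch flattens causes into a `Vec<String>`, so it can only hold a chain. The post describes exn as keeping a tree of frames, which this simplification does not reproduce.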
@@ -294,8 +289,6 @@ failed to execute task 7829, at src/executor.rs:45:12
 |-> connection refused, at src/client.rs:89:24
 ```
 
-Now you know: it was task 7829, we were fetching user data, and the connection was refused. You can grep for that task ID in your request logs and find everything you need.
-
 ---
 
 ## Putting It Together
@@ -349,29 +342,20 @@ match save_document(doc).await {
 }
 return Err(map_to_http_status(err.kind));
 }
-
 Err(StatusCode::INTERNAL_SERVER_ERROR)
 }
 }
 ```
 
-Yes, you still need to walk the tree. But unlike the `Provide`/`Request` API, you end up with a concrete type like `StorageError`—a documented struct with named fields that your IDE can autocomplete. No guessing, no runtime surprises—just something you can reason about and maintain.
+You do have to walk the tree—but compare that to the Provide/Request API. Here you’re searching for a concrete type, like `StorageError`: it has named fields, it’s documented, and your IDE can autocomplete it. No guesswork, no runtime surprises—just a well-defined struct you can understand and maintain.
 
 ---
 
-## Conclusion
-
-The next time you write a function, look at the `Result` return type.
-
-Don't think of it as "I might fail."
-Think of it as "I might need to explain myself."
-
-If your error type can't answer "Should I retry?"--you failed the Machine.
-If your error logs don't answer "Which user was it?"--you failed the Human.
+## Closing thought
 
-Errors aren't just failure modes to be propagated. They're communication. They're the messages your system sends when things go wrong. And like any communication, they deserve to be designed.
+Propagating errors is easy in Rust. Explaining them is the part we tend to postpone.
 
-Stop forwarding errors. Start designing them.
+Next time you return a `Result`, take 30 seconds to ask: “If this fails in production, what would I wish the log said?” Then make it say that.
 
 ## Resources
 
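The rewritten paragraph in the last hunk says you still walk the tree, but what you search for is a concrete type such as `StorageError`, with named fields. Below is a std-only sketch of that search. It assumes a hypothetical `Node` tree of boxed `std::error::Error` values; exn's own tree and traversal API may differ. The walk is depth-first and returns the first node whose error downcasts to the requested type. `StorageError`'s fields and the `Message` type are illustrative only.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error tree node: one error plus any number of child causes.
pub struct Node {
    pub error: Box<dyn Error + Send + Sync + 'static>,
    pub children: Vec<Node>,
}

impl Node {
    pub fn leaf(error: impl Error + Send + Sync + 'static) -> Self {
        Node { error: Box::new(error), children: Vec::new() }
    }

    /// Depth-first search for the first error whose concrete type is `T`.
    pub fn find<T: Error + 'static>(&self) -> Option<&T> {
        if let Some(found) = self.error.downcast_ref::<T>() {
            return Some(found);
        }
        self.children.iter().find_map(|child| child.find::<T>())
    }
}

/// Hypothetical storage failure with named fields a caller can act on.
#[derive(Debug)]
pub struct StorageError {
    pub bucket: String,
    pub retryable: bool,
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "storage error in bucket {}", self.bucket)
    }
}

impl Error for StorageError {}

/// Hypothetical plain-message error for the other nodes.
#[derive(Debug)]
pub struct Message(pub &'static str);

impl fmt::Display for Message {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.0)
    }
}

impl Error for Message {}
```

The first match wins. If several branches carry a `StorageError`, `find` returns the one reached first in depth-first order. Collecting every match instead would be a small change to the same walk.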