src/content/post/stop-forwarding-errors-start-designing-them.mdx
Lines changed: 32 additions & 48 deletions
@@ -11,7 +11,7 @@ It's 3am. Production is down. You're staring at a log line that says:
 Error: serialization error: expected ',' or '}' at line 3, column 7
 ```

-You know JSON is broke. But you have zero idea *why*, *where*, or *who* caused it. Was it the config loader? The user API? The webhook consumer?
+You know JSON is broken. But you have zero idea *why*, *where*, or *who* caused it. Was it the config loader? The user API? The webhook consumer?

 The error has successfully bubbled up through 20 layers of your stack, preserving its original message perfectly, yet losing every scrap of meaning along the way.
@@ -29,15 +29,15 @@ As noted in a [detailed analysis of error handling in a large Rust project](http
 ### The `std::error::Error` Trait: A Noble but Flawed Abstraction

-Rust's `std::error::Error` trait assumes errors form a chain--each error has an optional `source()` pointing to the underlying cause. This works for most cases; the vast majority of errors have no source or a single one.
+The standard `Error` trait is built around `source()`: one error optionally points to another. That matches a lot of failures.

-But as a *standard library* abstraction, it's too opinionated. It categorically excludes cases where sources form a tree: a validation error with multiple field failures, a timeout with partial results. These scenarios exist, and the standard trait offers no way to represent them.
+But some of the nastiest problems aren’t a single line of causality. Validation can fail in five places at once. A batch operation can partially succeed. Timeouts can come with partial results. Those want something closer to a set or a tree of causes, not a single chain.
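
To make the chain-versus-tree point concrete, here is a minimal sketch using only the standard library. `FieldError`, `ValidationError`, and `chain_len` are hypothetical names invented for this illustration, not types from the post. The validation error holds two causes, but `source()` can expose only one of them.

```rust
use std::error::Error;
use std::fmt;

/// One failed field (hypothetical type, for illustration only).
#[derive(Debug)]
pub struct FieldError {
    pub field: &'static str,
    pub reason: &'static str,
}

impl fmt::Display for FieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.field, self.reason)
    }
}

impl Error for FieldError {}

/// A validation failure with several independent causes.
#[derive(Debug)]
pub struct ValidationError {
    pub causes: Vec<FieldError>,
}

impl fmt::Display for ValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "validation failed in {} field(s)", self.causes.len())
    }
}

impl Error for ValidationError {
    // `source()` returns at most one cause, so only the first field
    // failure is visible to code that walks the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.causes.first().map(|e| e as &(dyn Error + 'static))
    }
}

/// Counts how many errors are reachable by following `source()`.
pub fn chain_len(err: &(dyn Error + 'static)) -> usize {
    let mut n = 1;
    let mut cur = err.source();
    while let Some(e) = cur {
        n += 1;
        cur = e.source();
    }
    n
}
```

With two field failures stored, a chain walker still sees only the outer error and the first cause. The second cause is reachable only by code that knows about the `causes` field.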

 ### Backtraces: Expensive Medicine for the Wrong Disease

-Rust's `std::backtrace::Backtrace` was meant to improve error observability. They're better than nothing. But they have serious limitations:
+Rust's `std::backtrace::Backtrace` was meant to improve error observability. It's better than nothing. But backtraces have serious limitations:

-**In async code, they're nearly useless.** Your backtrace will contain [49 stack frames, of which 12 are calls to `GenFuture::poll()`](https://github.com/rust-lang/rust/issues/74779). The [Async Working Group notes](https://rust-lang.github.io/wg-async/design_docs/async_stack_traces.html) that suspended tasks are invisible to traditional stack traces.
+**In async code, they can be noisy or misleading.** Your backtrace will contain [49 stack frames, of which 12 are calls to `GenFuture::poll()`](https://github.com/rust-lang/rust/issues/74779). The [Async Working Group notes](https://rust-lang.github.io/wg-async/design_docs/async_stack_traces.html) that suspended tasks are invisible to traditional stack traces.

 **They only show the origin, not the path.** A backtrace tells you where the error was *created*, not the logical path it took through your application. It won't tell you "this was the request handler for user X, calling service Y, with parameters Z."
-The unstable `Provide`/`Request` API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime.
-
-This sounds powerful. In practice, it introduces new problems:
+The unstable `Provide`/`Request` API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime. In practice, it introduces new problems:

 **Unpredictability**: Your error *might* provide an HTTP status code. Or it might not. You won't know until runtime.

 **Complexity**: The API is subtle enough that [LLVM struggles to optimize multiple provide calls](https://github.com/rust-lang/rfcs/pull/3192#issuecomment-1018020335).

-Sometimes, a simple struct with named fields is better than a clever abstraction.
+Most of the time, a boring struct with named fields is still the thing you want.

 ### `thiserror`: Categorizing by Origin, Not by Action
@@ -79,17 +77,17 @@ pub enum DatabaseError {
 }
 ```

-This looks reasonable. But notice how this common practice categorizes errors: by *origin*, not by *what the caller can do about it*.
+This looks reasonable. But notice how this common practice categorizes errors: by origin, not by what the caller can do about it.

-When you receive a `DatabaseError::Query`, what should you do? Retry? Report to the user? Log and continue? The error doesn't tell you. It just tells you which dependency failed.
+When you receive a `DatabaseError::Query`, what should you do? Retry? Report raw SQL to the user? The error doesn't tell you. It just tells you which dependency failed.

 As one blogger [aptly put it](https://mmapped.blog/posts/12-rust-error-handling): "This error type does not tell the caller what problem you are solving but how you solve it."

 ### `anyhow`: So Convenient You'll Forget to Add Context

 `anyhow` takes the opposite approach: type erasure. Just use `anyhow::Result<T>` everywhere and propagate with `?`. No more enum variants, no more `#[from]` annotations.
 Every `?` is a missed opportunity to add context. What was the user ID? What API were we calling? What computation failed? The error knows none of this.

-The `anyhow` documentation encourages using `.context()` to add information. But `.context()` is optional--the type system doesn't require it. "I'll add context later" is the easiest lie to tell yourself. Later means never--until 3am when production is on fire.
+The `anyhow` documentation encourages using `.context()` to add information. But `.context()` is optional--the type system doesn't require it. And "I'll add context later" is the easiest lie to tell yourself.
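
As a rough illustration of what a bare `?` loses, here is a std-only sketch that contrasts plain propagation with attaching context at the call site. It deliberately avoids `anyhow` itself. `LoadConfigError`, `load_config_bare`, and `load_config` are hypothetical names made up for this example.

```rust
use std::fmt;
use std::fs;
use std::io;

/// Hypothetical error that records what was being attempted.
#[derive(Debug)]
pub struct LoadConfigError {
    pub path: String,
    pub source: io::Error,
}

impl fmt::Display for LoadConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to load config from {}", self.path)
    }
}

impl std::error::Error for LoadConfigError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        Some(&self.source)
    }
}

/// Bare propagation: the caller only ever sees the raw `io::Error`.
pub fn load_config_bare(path: &str) -> Result<String, io::Error> {
    let text = fs::read_to_string(path)?;
    Ok(text)
}

/// Propagation with context: the path travels with the error.
pub fn load_config(path: &str) -> Result<String, LoadConfigError> {
    fs::read_to_string(path).map_err(|source| LoadConfigError {
        path: path.to_owned(),
        source,
    })
}
```

Both functions compile and both fail on a missing file. Only the second one can tell you which file it was, and nothing in the first signature reminds you that the information is missing.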

 ---
@@ -123,13 +121,13 @@ pub enum ServiceError {
 }
 ```

-This looks reasonable. But ask yourself:
+It looks neat, well-structured, and it compiles. But pause and ask:

-1. **What can the caller do with `ServiceError::Database`?** Can they retry? Should they show the raw SQL error to users? The error type doesn't help answer these questions.
+- If you are holding a `DatabaseError::Query`, is it retryable? Should you show the raw SQL error to users? The error type doesn't help answer these questions.

-2. **When debugging at 3 AM**, does "serialization error: expected `,` or `}`" tell you which request, which field, which code path led here?
+- When debugging, does "serialization error: expected `,` or `}`" tell you which request, which field, which code path led here?

-This is the fundamental disconnect in how we think about error handling. We focus on *propagating* errors exactly, on making the types line up, on satisfying the compiler. But we forget that errors are messages--messages that will eventually be read by either a machine trying to recover, or a human trying to debug.
+This is the fundamental disconnect in how we think about error handling. We focus on *propagating* errors exactly, on making the types line up. But we forget that errors are messages--messages that will eventually be read by either a machine trying to recover, or a human trying to debug.

 ## The "Library vs Application" Myth
@@ -141,24 +139,18 @@ The real question isn't whether you're writing a library or an application. The
 ## Two Audiences, Two Needs

-Let's be explicit about who consumes errors and what they need:
 |**Humans**| Debugging | Rich context, call path, business-level information |

-When a retry middleware receives an error, it doesn't care about your beautifully nested error chain. It just needs to know: *is this retryable?* A simple boolean or enum variant suffices.
-
-When you're debugging at 3am, you don't need to know that somewhere deep in the stack there was an `io::Error`. You need to know: *which file, which user, which request, what were we trying to do?*
-
-Most error handling designs optimize for neither audience. They optimize for *the compiler*.
+Most error handling designs optimize for neither. They optimize for *the compiler*.

 ### For Machines: Flat, Actionable, Kind-Based

 When errors need to be handled programmatically, complexity is the enemy. Your retry logic doesn't want to traverse a nested error chain checking for specific variants. It wants to ask: `is_retryable()?`

-Here's a pattern that works, drawn from [Apache OpenDAL's error design](https://github.com/apache/opendal/pull/977):
+[Apache OpenDAL's error design](https://github.com/apache/opendal/pull/977) shows one way to do this:
 **ErrorKind is categorized by response, not origin.** `NotFound` means "the thing doesn't exist, don't retry." `RateLimited` means "slow down and try again." The caller doesn't need to know whether it was an S3 404 or a filesystem ENOENT--they need to know what to do about it.
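
A minimal sketch of that idea, assuming a hypothetical `StorageError` with a three-variant `ErrorKind`. This is not OpenDAL's actual API, and the constructors `from_http_status` and `from_io` are invented for the example. The point is that two unrelated origins collapse into one kind the caller can act on.

```rust
use std::fmt;

/// Kinds named after what the caller should do (hypothetical set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    NotFound,
    RateLimited,
    Unexpected,
}

/// A flat error: one kind, one message, no nested variants.
#[derive(Debug)]
pub struct StorageError {
    pub kind: ErrorKind,
    pub message: String,
}

impl StorageError {
    pub fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Self { kind, message: message.into() }
    }

    /// Retry logic asks one question instead of matching on origins.
    pub fn is_retryable(&self) -> bool {
        matches!(self.kind, ErrorKind::RateLimited)
    }
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}: {}", self.kind, self.message)
    }
}

impl std::error::Error for StorageError {}

/// An HTTP-backed store maps status codes to kinds.
pub fn from_http_status(status: u16) -> StorageError {
    let kind = match status {
        404 => ErrorKind::NotFound,
        429 => ErrorKind::RateLimited,
        _ => ErrorKind::Unexpected,
    };
    StorageError::new(kind, format!("HTTP status {status}"))
}

/// A filesystem-backed store maps `io::ErrorKind` to the same kinds.
pub fn from_io(err: &std::io::Error) -> StorageError {
    let kind = match err.kind() {
        std::io::ErrorKind::NotFound => ErrorKind::NotFound,
        _ => ErrorKind::Unexpected,
    };
    StorageError::new(kind, err.to_string())
}
```

An HTTP 404 and a filesystem `NotFound` end up with the same `kind`, so a retry middleware can call `is_retryable()` without knowing which backend produced the error.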
@@ -217,7 +208,7 @@ The biggest enemy of good error context isn't capability--it's friction. If addi
 The [exn](https://github.com/fast/exn) library (294 lines of Rust, zero dependencies) demonstrates one approach: errors form a *tree* of frames, each automatically capturing its source location via `#[track_caller]`. Unlike linear error chains, trees can represent multiple causes--useful when parallel operations fail or validation produces multiple errors.

-Here's what we need:
+The key ingredients:

 **Automatic location capture.** Instead of expensive backtraces, use `#[track_caller]` to capture file/line/column at **zero cost**. Every error frame should know where it was created.
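
Location capture with `#[track_caller]` needs nothing beyond the standard library. The sketch below uses a hypothetical `Frame` type (not exn's real frame type) that records `std::panic::Location::caller()` at construction time. `two_frames` is a helper invented to show that each call site gets its own location.

```rust
use std::fmt;
use std::panic::Location;

/// An error frame that remembers where it was created (hypothetical type).
#[derive(Debug)]
pub struct Frame {
    pub message: String,
    pub location: &'static Location<'static>,
}

impl Frame {
    /// `#[track_caller]` makes `Location::caller()` report the call site
    /// of `Frame::new`, not a line inside this constructor.
    #[track_caller]
    pub fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            location: Location::caller(),
        }
    }
}

impl fmt::Display for Frame {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}, at {}:{}:{}",
            self.message,
            self.location.file(),
            self.location.line(),
            self.location.column()
        )
    }
}

/// Creates two frames on consecutive lines to show that locations differ.
pub fn two_frames() -> (Frame, Frame) {
    let first = Frame::new("connection refused");
    let second = Frame::new("failed to fetch user data");
    (first, second)
}
```

Each frame prints as `message, at file:line:column`, and the two frames report different lines even though they share a constructor.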
-**Enforce context at module boundaries.** This is where exn differs critically from `anyhow`. With `anyhow`, every error is erased to `anyhow::Error`, so you can always use `?` and move on--the type system won't stop you. The context methods exist, but but *nothing* prevents you from ignoring them.
+**Enforce context at module boundaries.** This is where exn differs critically from `anyhow`. With `anyhow`, every error is erased to `anyhow::Error`, so you can always use `?` and move on--the type system won't stop you. The context methods exist, but *nothing* prevents you from ignoring them.

 exn takes a different approach: `Exn<E>` preserves the outermost error type. If your function returns `Result<T, Exn<ServiceError>>`, you can't directly `?` a `Result<U, Exn<DatabaseError>>`--the types don't match. The compiler *forces* you to call `or_raise()` and provide a `ServiceError`, which is exactly the moment you should be adding context about what your module was trying to do.
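
The type-level trick can be sketched without exn. The wrapper `Traced<E>` and the trait `OrRaise` below are hypothetical, heavily simplified stand-ins: the method name `or_raise` is borrowed from the text, but the signature is an assumption, and the cause is stored as a debug string instead of a tree of frames. What the sketch does show is that, with no `From` conversion between the two wrapper types, `?` alone cannot cross the module boundary.

```rust
use std::fmt::Debug;

#[derive(Debug)]
pub struct DatabaseError(pub String);

#[derive(Debug)]
pub struct ServiceError(pub String);

/// Hypothetical stand-in for a typed error wrapper. It keeps the outermost
/// error type `E` and stores the lower-level cause as debug text.
#[derive(Debug)]
pub struct Traced<E> {
    pub error: E,
    pub cause: Option<String>,
}

/// Extension trait: the only way here to turn `Traced<A>` into `Traced<B>`.
pub trait OrRaise<T> {
    fn or_raise<B, F: FnOnce() -> B>(self, make: F) -> Result<T, Traced<B>>;
}

impl<T, A: Debug> OrRaise<T> for Result<T, Traced<A>> {
    fn or_raise<B, F: FnOnce() -> B>(self, make: F) -> Result<T, Traced<B>> {
        self.map_err(|lower| Traced {
            error: make(),
            cause: Some(format!("{lower:?}")),
        })
    }
}

/// Lower layer: always fails, to keep the example short.
fn query_user(id: u64) -> Result<String, Traced<DatabaseError>> {
    Err(Traced {
        error: DatabaseError(format!("connection refused for user {id}")),
        cause: None,
    })
}

pub fn load_profile(id: u64) -> Result<String, Traced<ServiceError>> {
    // `query_user(id)?` alone would not compile: there is no
    // `From<Traced<DatabaseError>>` impl for `Traced<ServiceError>`.
    let name = query_user(id)
        .or_raise(|| ServiceError(format!("failed to load profile {id}")))?;
    Ok(name)
}
```

The compiler error you would get from a bare `?` lands exactly where the module should say what it was trying to do.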
@@ -294,8 +289,6 @@ failed to execute task 7829, at src/executor.rs:45:12
 |-> connection refused, at src/client.rs:89:24
 ```

-Now you know: it was task 7829, we were fetching user data, and the connection was refused. You can grep for that task ID in your request logs and find everything you need.
-
 ---

 ## Putting It Together
@@ -349,29 +342,20 @@ match save_document(doc).await {
 }
 return Err(map_to_http_status(err.kind));
 }
-
 Err(StatusCode::INTERNAL_SERVER_ERROR)
 }
 }
 ```

-Yes, you still need to walk the tree. But unlike the `Provide`/`Request` API, you end up with a concrete type like `StorageError`--a documented struct with named fields that your IDE can autocomplete. No guessing, no runtime surprises--just something you can reason about and maintain.
+You do have to walk the tree--but compare that to the `Provide`/`Request` API. Here you’re searching for a concrete type, like `StorageError`: it has named fields, it’s documented, and your IDE can autocomplete it. No guesswork, no runtime surprises--just a well-defined struct you can understand and maintain.
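
For a feel of what searching for a concrete type looks like, here is a sketch that walks the standard `source()` chain and uses `downcast_ref`. It stands in for exn's tree walk, whose actual API is not shown here. `Kind`, `StorageError`, `HandlerError`, and `find_storage_error` are hypothetical names for this example.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    NotFound,
    RateLimited,
}

/// The concrete, documented type the handler is looking for.
#[derive(Debug)]
pub struct StorageError {
    pub kind: Kind,
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "storage error: {:?}", self.kind)
    }
}

impl Error for StorageError {}

/// An outer error that wraps any lower-level cause.
#[derive(Debug)]
pub struct HandlerError {
    pub source: Box<dyn Error + 'static>,
}

impl fmt::Display for HandlerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "request handler failed")
    }
}

impl Error for HandlerError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Follows `source()` links until a `StorageError` is found, if any.
pub fn find_storage_error<'a>(err: &'a (dyn Error + 'static)) -> Option<&'a StorageError> {
    let mut cur = Some(err);
    while let Some(e) = cur {
        if let Some(found) = e.downcast_ref::<StorageError>() {
            return Some(found);
        }
        cur = e.source();
    }
    None
}
```

Once the walk returns a `&StorageError`, everything after that is ordinary field access such as `found.kind`.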

 ---

-## Conclusion
-
-The next time you write a function, look at the `Result` return type.
-
-Don't think of it as "I might fail."
-Think of it as "I might need to explain myself."
-
-If your error type can't answer "Should I retry?"--you failed the Machine.
-If your error logs don't answer "Which user was it?"--you failed the Human.
+## Closing thought

-Errors aren't just failure modes to be propagated. They're communication. They're the messages your system sends when things go wrong. And like any communication, they deserve to be designed.
+Propagating errors is easy in Rust. Explaining them is the part we tend to postpone.

-Stop forwarding errors. Start designing them.
+Next time you return a `Result`, take 30 seconds to ask: “If this fails in production, what would I wish the log said?” Then make it say that.