Error Handling Strategies and Fault Tolerance

Error handling is a system-level design decision, not a per-function one. Mixing strategies in the same codebase is the usual disaster. This page covers the available mechanisms, when each is the right choice, and how to build fault-tolerant systems on top of them.

1. The Mechanisms
2. Picking One: Decision Matrix
3. std::expected (C++23)
4. std::error_code and std::system_error
5. Errors vs Programmer Bugs
6. Fault Tolerance Patterns
7. References

1. The Mechanisms

Mechanism	Cost on success	Failure model	When natural
Exceptions	~Zero (table-based unwinding)	Stack unwinds, RAII releases	Deeply nested calls; rare failures
Error codes / `errno`	One return + check	Caller branches	C interop; embedded; predictable failures
`std::error_code`	One return + check	Caller checks integer code + category	Library boundary errors
`std::expected<T,E>`	One return + check	Pattern match on result	Modern, value-semantic; expressive
`std::optional<T>`	One return + check	Has-value vs not	Failure carries no info ("not found")
`std::variant<T,Errs...>`	One return + visit	Exhaustive match	Multi-error, sum-type style
Boolean + outparam	One return + outparam	Caller checks	C-style
`abort` / `terminate`	Zero	Process dies	Bug; unrecoverable invariant violation

2. Picking One: Decision Matrix

Constraint	Choose
Real-time / embedded with `-fno-exceptions`	`std::expected` or error codes
Library exposed across ABI boundary	`std::error_code` or C-style
Application code, deep call stack, exceptions enabled	Exceptions
Performance-critical hot path, frequent expected failures	`expected`/`optional`
Programmer bug detection (e.g. failed assertion)	`assert` or `std::abort` (not exceptions)
Cross-language (Python/Rust/Go)	C-style return code

The single biggest mistake: using exceptions for expected control flow (e.g. "key not found in map"). Exceptions are zero-cost on the success path but expensive on the throw path — orders of magnitude slower than a return.
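As a rough illustration of the "key not found" case, the sketch below reports a missing key through the return value instead of letting `std::map::at` throw. The names `findPort` and `portOrDefault` are hypothetical and exist only for this example.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical lookup helper: a missing key is an expected outcome,
// so it is reported through the return value rather than by throwing.
std::optional<int> findPort(const std::map<std::string, int>& ports,
                            const std::string& service) {
    if (auto it = ports.find(service); it != ports.end()) {
        return it->second;
    }
    return std::nullopt;  // "not found" carries no further information
}

// Caller-side fallback: an ordinary branch, no try/catch needed.
int portOrDefault(const std::map<std::string, int>& ports,
                  const std::string& service, int fallback) {
    return findPort(ports, service).value_or(fallback);
}
```

A caller that expects misses to be common pays only for a branch here, whereas the throwing `at()` path would unwind the stack on every miss.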

3. `std::expected` (C++23)

std::expected<T, E> holds either a value of type T or an error of type E.

#include <charconv>
#include <expected>
#include <iostream>
#include <string>

enum class ParseError { Empty, BadDigit, OutOfRange };

std::expected<int, ParseError> parseInt(const std::string& s) {
    if (s.empty()) return std::unexpected(ParseError::Empty);
    int value = 0;
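    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    if (ec == std::errc::invalid_argument)    return std::unexpected(ParseError::BadDigit);
    if (ec == std::errc::result_out_of_range) return std::unexpected(ParseError::OutOfRange);
    // from_chars stops at the first non-digit, so reject trailing junk such as "42abc".
    if (ptr != s.data() + s.size())           return std::unexpected(ParseError::BadDigit);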
    return value;
}

int main() {
    auto r = parseInt("42");
    if (r) std::cout << *r << "\n";
    else   std::cout << "error: " << static_cast<int>(r.error()) << "\n";
}

It also supports monadic chaining (and_then, or_else, transform):

std::string input = "42";
auto result = parseInt(input)
    .transform([](int x) { return x * 2; })
    .or_else([](ParseError) { return std::expected<int, ParseError>{0}; });
// result holds 84 on success, or 0 if parseInt failed.

For C++17/20 codebases, the third-party tl::expected library is a high-quality drop-in replacement.

4. `std::error_code` and `std::system_error`

std::error_code is a small (int + category*) value designed for the library boundary: any subsystem can define a category, and consumers can compare/categorize without knowing the source.

#include <filesystem>
#include <iostream>
#include <system_error>

int main() {
    std::error_code ec;
    auto sz = std::filesystem::file_size("foo.txt", ec);
    if (ec) {
        std::cerr << ec.message() << " (cat=" << ec.category().name() << ")\n";
    } else {
        std::cout << "size = " << sz << "\n";
    }
}

You can throw the code as std::system_error if you want exception semantics:

#include <cerrno>
#include <system_error>

void openFile() {
    throw std::system_error(errno, std::generic_category(), "open failed");
}
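The code carried by a std::system_error can be recovered at a boundary that must not leak exceptions. The following is a minimal sketch assuming exceptions are enabled; `openOrThrow` and `tryOpen` are hypothetical names, and the failure is simulated with a fixed `ENOENT` rather than a real system call.

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical low-level call that fails the way a missing file would.
void openOrThrow(const std::string& path) {
    throw std::system_error(ENOENT, std::generic_category(), "open failed: " + path);
}

// Boundary function: converts the exception back into a plain error_code.
std::error_code tryOpen(const std::string& path) {
    try {
        openOrThrow(path);
        return {};  // default-constructed error_code means "no error"
    } catch (const std::system_error& e) {
        return e.code();  // the original error_code survives the throw
    }
}
```

The returned code can then be compared against portable conditions such as `std::errc::no_such_file_or_directory` without the caller knowing which subsystem produced it.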

Custom categories let you map your domain errors into the same protocol:

#include <iostream>
#include <string>
#include <system_error>

enum class NetErr { Timeout = 1, DnsFail, Refused };

class NetErrCategory : public std::error_category {
public:
    const char* name() const noexcept override { return "net"; }
    std::string message(int v) const override {
        switch (static_cast<NetErr>(v)) {
            case NetErr::Timeout: return "timeout";
            case NetErr::DnsFail: return "dns failure";
            case NetErr::Refused: return "connection refused";
        }
        return "unknown";
    }
};

int main() {
    NetErrCategory cat;
    std::cout << cat.name() << ": "
              << cat.message(static_cast<int>(NetErr::DnsFail)) << "\n";
}
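The listing above defines a category but stops short of connecting NetErr to std::error_code. One common way to complete the wiring is to specialize std::is_error_code_enum and provide a make_error_code overload, as sketched below. The helpers `netCategory()` and `resolve()` are assumptions made for this example, not part of any library.

```cpp
#include <string>
#include <system_error>
#include <type_traits>

enum class NetErr { Timeout = 1, DnsFail, Refused };

// Opt NetErr in to the implicit conversion to std::error_code.
namespace std {
template <>
struct is_error_code_enum<NetErr> : true_type {};
}

class NetErrCategory : public std::error_category {
public:
    const char* name() const noexcept override { return "net"; }
    std::string message(int v) const override {
        switch (static_cast<NetErr>(v)) {
            case NetErr::Timeout: return "timeout";
            case NetErr::DnsFail: return "dns failure";
            case NetErr::Refused: return "connection refused";
        }
        return "unknown";
    }
};

// One shared instance: error_category objects are compared by address.
const std::error_category& netCategory() {
    static const NetErrCategory instance;
    return instance;
}

// Found by argument-dependent lookup when a NetErr becomes an error_code.
std::error_code make_error_code(NetErr e) {
    return {static_cast<int>(e), netCategory()};
}

// Hypothetical function that reports failure through the standard protocol.
std::error_code resolve(const std::string& host) {
    if (host.empty()) return NetErr::DnsFail;  // implicit conversion
    return {};                                 // no error
}
```

With this in place, callers can write `if (ec == NetErr::DnsFail)` and call `ec.message()` exactly as they would for a filesystem or OS error.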

5. Errors vs Programmer Bugs

These are not the same thing and must not share a mechanism.

	Error	Bug
Cause	Environment (network, disk, user input)	Wrong code (off-by-one, null deref, wrong invariant)
Recovery	Possible and intended	Impossible — code is wrong
Mechanism	exception / `expected` / `error_code`	`assert`, `std::abort`, contract violation
In release?	Yes, handled	Often stripped (`assert` compiles out under `NDEBUG`), but better to keep the checks enabled

Catching a bug as if it were an error hides defects and ships them to production. Conversely, abort()-ing on a network timeout is poor engineering. Distinguishing these is the foundation of robust design.
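The split might look like this in practice. It is a sketch with hypothetical function names: the internal accessor treats a bad index as a caller bug and asserts, while the user-facing wrapper treats a bad index as an ordinary error and reports it in the return type.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Bug: calling this with an out-of-range index means the caller is wrong.
// The precondition is asserted, not reported as a recoverable error.
int elementAt(const std::vector<int>& v, std::size_t i) {
    assert(i < v.size() && "index out of range: caller bug");
    return v[i];
}

// Error: user input may legitimately be out of range, so the failure is
// part of the return type and the caller is expected to handle it.
std::optional<int> elementForUserIndex(const std::vector<int>& v, long userIndex) {
    if (userIndex < 0 || static_cast<std::size_t>(userIndex) >= v.size()) {
        return std::nullopt;
    }
    return elementAt(v, static_cast<std::size_t>(userIndex));
}
```

Validation happens once, at the point where untrusted input enters. Everything behind that point can assume the invariant holds.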

6. Fault Tolerance Patterns

Per-call error handling is only the first layer. The following patterns provide fault tolerance at the system level:

Timeouts. Every call to anything external must have one; this rule has no exemptions. (See Real-Time Systems.)
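One simple way to bound a wait is to compute a deadline up front and check it on every iteration. The sketch below assumes a hypothetical non-blocking probe, `tryOnce`, and polls it until it yields a value or the deadline passes. Real code would more often pass the timeout to the I/O API itself.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <thread>

using Clock = std::chrono::steady_clock;

// Polls `tryOnce` until it yields a value or `timeout` elapses.
// `tryOnce` is a hypothetical non-blocking probe (e.g. a non-blocking read).
std::optional<int> pollWithTimeout(
        const std::function<std::optional<int>()>& tryOnce,
        std::chrono::milliseconds timeout,
        std::chrono::milliseconds interval = std::chrono::milliseconds{1}) {
    const auto deadline = Clock::now() + timeout;
    for (;;) {
        if (auto r = tryOnce()) return r;                   // success before the deadline
        if (Clock::now() >= deadline) return std::nullopt;  // timed out
        std::this_thread::sleep_for(interval);
    }
}
```

A monotonic clock (`steady_clock`) is used so that wall-clock adjustments cannot stretch or shrink the timeout.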

Retries with exponential backoff and jitter.

#include <chrono>
#include <expected>
#include <random>
#include <thread>

std::expected<int, int> call();  // returns a value or an error code

std::chrono::milliseconds jitter() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 25);
    return std::chrono::milliseconds(dist(rng));
}

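std::expected<int, int> callWithRetry() {
    constexpr int maxAttempts = 5;
    std::chrono::milliseconds delay{50};
    std::expected<int, int> last = std::unexpected(-1);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        last = call();
        if (last) return last;
        if (attempt + 1 == maxAttempts) break;  // no point sleeping after the final attempt
        std::this_thread::sleep_for(delay + jitter());
        delay *= 2;  // exponential backoff: 50, 100, 200, 400 ms
    }
    return last;  // the error from the final attempt
}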

Retry only idempotent operations, and don't retry forever: bound the number of attempts.

Circuit breaker. Track failure rate; once a threshold is crossed, fail fast for a cooling-off window instead of dogpiling a sick service. Three states: Closed (normal), Open (fail fast), Half-Open (probe).
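The three states can be sketched as a small state machine. This version is a simplification: it counts consecutive failures rather than a failure rate, takes the current time as a parameter so it can be tested, and is not thread-safe. The class and method names are illustrative.

```cpp
#include <chrono>

class CircuitBreaker {
public:
    using Clock = std::chrono::steady_clock;
    enum class State { Closed, Open, HalfOpen };

    CircuitBreaker(int failureThreshold, Clock::duration coolOff)
        : threshold_(failureThreshold), coolOff_(coolOff) {}

    // Returns true if a call may be attempted at time `now`.
    bool allow(Clock::time_point now) {
        if (state_ == State::Open && now - openedAt_ >= coolOff_) {
            state_ = State::HalfOpen;  // cooling-off window over: allow a probe
        }
        return state_ != State::Open;
    }

    void recordSuccess() {
        failures_ = 0;
        state_ = State::Closed;
    }

    void recordFailure(Clock::time_point now) {
        ++failures_;
        // A failed probe, or too many consecutive failures, opens the circuit.
        if (state_ == State::HalfOpen || failures_ >= threshold_) {
            state_ = State::Open;
            openedAt_ = now;
        }
    }

    State state() const { return state_; }

private:
    int threshold_;
    Clock::duration coolOff_;
    int failures_ = 0;
    State state_ = State::Closed;
    Clock::time_point openedAt_{};
};
```

Callers wrap each outbound request: check `allow()`, make the call only if it returns true, then report the outcome with `recordSuccess()` or `recordFailure()`.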

Bulkheads. Isolate failures: dedicate a thread pool / connection pool / memory budget per subsystem so one component going wrong can't drain shared resources.
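A lightweight form of bulkhead is a per-subsystem concurrency cap. The sketch below uses an atomic permit counter with an RAII guard. `Bulkhead` and `Permit` are hypothetical names, and a real deployment would more likely give each subsystem its own thread or connection pool.

```cpp
#include <atomic>
#include <optional>

// A fixed pool of permits for one subsystem. When the pool is exhausted,
// callers are rejected immediately instead of queueing on shared resources.
class Bulkhead {
public:
    explicit Bulkhead(int maxConcurrent) : available_(maxConcurrent) {}

    // RAII permit: returns its slot to the bulkhead on destruction.
    class Permit {
    public:
        explicit Permit(Bulkhead& owner) : owner_(&owner) {}
        Permit(Permit&& other) noexcept : owner_(other.owner_) { other.owner_ = nullptr; }
        Permit(const Permit&) = delete;
        Permit& operator=(const Permit&) = delete;
        Permit& operator=(Permit&&) = delete;
        ~Permit() { if (owner_) owner_->available_.fetch_add(1); }
    private:
        Bulkhead* owner_;
    };

    // Non-blocking: yields a permit, or std::nullopt if at capacity.
    std::optional<Permit> tryEnter() {
        int cur = available_.load();
        while (cur > 0) {
            // On failure, compare_exchange_weak reloads `cur` and we retry.
            if (available_.compare_exchange_weak(cur, cur - 1)) {
                return std::optional<Permit>(std::in_place, *this);
            }
        }
        return std::nullopt;
    }

    int available() const { return available_.load(); }

private:
    std::atomic<int> available_;
};
```

Each subsystem gets its own `Bulkhead` instance, so a slow dependency can exhaust only its own permits.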

Graceful degradation. When a non-essential subsystem fails, return a degraded result (e.g., serve cached data, skip personalization) rather than failing the whole request.
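A minimal sketch of the fallback, assuming a hypothetical page renderer whose recommendation source is non-essential and signals failure with std::nullopt:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Page {
    std::string body;
    std::vector<std::string> recommendations;
    bool degraded = false;  // true when a non-essential part was skipped
};

// Hypothetical non-essential dependency: std::nullopt means it failed.
using RecommendationSource = std::optional<std::vector<std::string>> (*)(int userId);

Page renderPage(int userId, RecommendationSource fetchRecs,
                const std::vector<std::string>& cachedRecs) {
    Page page;
    page.body = "content for user " + std::to_string(userId);  // essential part
    if (auto recs = fetchRecs(userId)) {
        page.recommendations = std::move(*recs);
    } else {
        page.recommendations = cachedRecs;  // fall back to stale-but-usable data
        page.degraded = true;
    }
    return page;
}
```

Recording the `degraded` flag lets monitoring distinguish a healthy response from one that merely succeeded.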

Idempotency keys. When retrying a write, the server must dedupe by key — otherwise a retry doubles the effect.
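Server-side deduplication can be sketched with an in-memory map from key to recorded result. `PaymentService` is a hypothetical example; a production system would persist the keys and expire them.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical in-memory payment service that deduplicates by idempotency key.
class PaymentService {
public:
    // Applies the charge once per key; a retry with the same key returns
    // the recorded result without charging again.
    long charge(const std::string& idempotencyKey, long amountCents) {
        if (auto it = results_.find(idempotencyKey); it != results_.end()) {
            return it->second;  // duplicate request: replay the earlier result
        }
        balanceCents_ += amountCents;
        results_.emplace(idempotencyKey, balanceCents_);
        return balanceCents_;
    }

    long balanceCents() const { return balanceCents_; }

private:
    long balanceCents_ = 0;
    std::unordered_map<std::string, long> results_;
};
```

The client generates the key once per logical operation and reuses it on every retry, so a response lost in transit cannot cause a double charge.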

Fail-fast vs fail-soft. Choose per subsystem. The boot path fails fast (config wrong → exit). The user request path fails soft (one feature broken → still return the page).

Crash-only design. Make recovery == startup. No "graceful shutdown" path that's only exercised in production. This forces the recovery path to actually work.

7. References

Exception Safety Guarantees
Exception Handling, noexcept
Error Code
std::expected cppreference
Release It!, Michael Nygard: the canonical book on production fault tolerance.
Designing Data-Intensive Applications, Martin Kleppmann: chapter 8 on faults and partial failures in distributed systems.
