
Commit f5886b2

feat: production-ready 1.0.0
- Fix drain_loop recursion bug (only slept 500ms, never recursed)
- Restructure shutdown: drain in-flight before deleting bulkheads
- Drain monitors bulkhead occupancy, exits early when idle
- Emit gate_not_ready telemetry event on mark_not_ready
- Add uptime_seconds and node to health report VM info
- Add gate_enabled config (false skips startup gating for dev/test)
- Add health_severity config (critical returns 503 when unhealthy)
- Add drain_poll_interval config (configurable drain polling)
- Add unknown config key warnings on startup
- Convert all binary strings to sigil syntax
- Fix unused logger.hrl, catch→try/catch, direct fun reference
- Add xref ignores for public API exports
- Suppress atom exhaustion warnings (bounded dep set)

New test suites (29 new tests, 58 total):
- nova_resilience_shutdown_SUITE (7 tests)
- nova_resilience_gate_SUITE (9 tests)
- nova_resilience_deadline_plugin_SUITE (7 tests)
- nova_resilience_telemetry_SUITE (5 tests)

New guides:
- Circuit Breakers & Bulkheads
- Deadline Propagation
- Telemetry

Updated guides: Getting Started, Adapters, Shutdown, README
1 parent 68cfa2e commit f5886b2

26 files changed, +1807 -198 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -6,3 +6,4 @@ rebar.lock
66
*.ez
77
erl_crash.dump
88
.rebar3/
9+
rebar3.crashdump

README.md

Lines changed: 104 additions & 31 deletions
@@ -2,7 +2,19 @@
22

33
Production-grade resilience patterns for [Nova](https://github.com/novaframework/nova) web applications.
44

5-
Bridges Nova and [Seki](https://github.com/Taure/seki) to provide dependency health checking, Kubernetes-ready probes, circuit breakers, bulkheads, and ordered graceful shutdown — all via declarative configuration.
5+
Bridges Nova and [Seki](https://github.com/Taure/seki) to provide dependency health checking, Kubernetes-ready probes, circuit breakers, bulkheads, deadline propagation, and ordered graceful shutdown — all via declarative configuration.
6+
7+
## Features
8+
9+
- **Health endpoints** — `/health`, `/ready`, `/live` for Kubernetes probes
10+
- **Startup gating** — traffic held until critical dependencies are healthy
11+
- **Circuit breakers** — stop calling failing dependencies, allow recovery
12+
- **Bulkheads** — limit concurrent requests per dependency
13+
- **Retry** — configurable retry with exponential backoff and jitter
14+
- **Deadline propagation** — per-request timeouts via headers or defaults
15+
- **Graceful shutdown** — ordered teardown with drain, priority groups, and LB coordination
16+
- **Telemetry** — events for all resilience operations (calls, breakers, shutdown, health)
17+
- **Pluggable adapters** — built-in support for pgo, kura, brod, or custom
618

719
## Quick start
820

@@ -16,7 +28,7 @@ Add to your deps:
1628
]}.
1729
```
1830

19-
Add to your app's `applications`:
31+
Add to your `.app.src` applications:
2032

2133
```erlang
2234
{applications, [kernel, stdlib, nova, seki, nova_resilience]}.
@@ -30,7 +42,7 @@ Register health routes in your Nova config:
3042
]}.
3143
```
3244

33-
Configure dependencies:
45+
Configure dependencies in `sys.config`:
3446

3547
```erlang
3648
{nova_resilience, [
@@ -40,33 +52,40 @@ Configure dependencies:
4052
adapter => pgo,
4153
pool => default,
4254
critical => true,
55+
breaker => #{failure_threshold => 5, wait_duration => 30000},
56+
bulkhead => #{max_concurrent => 25},
4357
shutdown_priority => 2}
4458
]}
4559
]}.
4660
```
4761

48-
That's it. Your app now has `/health`, `/ready`, and `/live` endpoints, automatic startup gating, and ordered shutdown.
62+
That's it. Your app now has `/health`, `/ready`, and `/live` endpoints, automatic startup gating, circuit breakers, bulkheads, and ordered shutdown.
4963

50-
## What it does
64+
## How it works
5165

5266
### Startup
5367

54-
1. App starts, nova_resilience provisions health checks for each dependency
55-
2. `/ready` returns **503** until all critical dependencies are healthy
56-
3. Kubernetes readiness probe detects this and holds traffic
57-
4. Once all critical deps respond, `/ready` returns **200** and traffic flows
68+
1. App starts, nova_resilience provisions seki primitives for each dependency
69+
2. Health checks run — `/ready` returns **503** until all critical deps are healthy
70+
3. Kubernetes readiness probe holds traffic until ready
71+
4. Once healthy, `/ready` returns **200** and traffic flows
5872

5973
### Running
6074

61-
Execute calls through the resilience stack:
75+
Wrap calls to external dependencies through the resilience stack:
6276

6377
```erlang
6478
case nova_resilience:call(primary_db, fun() ->
65-
pgo:query(<<"SELECT * FROM users WHERE id = $1">>, [Id])
79+
pgo:query(~"SELECT * FROM users WHERE id = $1", [Id])
6680
end) of
67-
{ok, #{rows := Rows}} -> {json, #{users => Rows}};
68-
{error, circuit_open} -> {json, 503, #{}, #{error => <<"db unavailable">>}};
69-
{error, bulkhead_full} -> {json, 503, #{}, #{error => <<"overloaded">>}}
81+
{ok, #{rows := Rows}} ->
82+
{json, #{users => Rows}};
83+
{error, circuit_open} ->
84+
{json, 503, #{}, #{error => ~"db unavailable"}};
85+
{error, bulkhead_full} ->
86+
{json, 503, #{}, #{error => ~"overloaded"}};
87+
{error, deadline_exceeded} ->
88+
{json, 504, #{}, #{error => ~"timeout"}}
7089
end.
7190
```
7291

@@ -75,42 +94,96 @@ end.
7594
On SIGTERM (or application stop):
7695

7796
1. `/ready` immediately returns **503** (load balancer stops sending traffic)
78-
2. Waits `shutdown_delay` for in-flight LB health checks to propagate
79-
3. Tears down dependencies in `shutdown_priority` order
80-
4. Nova drains HTTP connections and stops
97+
2. Waits `shutdown_delay` for LB health checks to propagate
98+
3. Drains in-flight requests (monitors bulkhead occupancy)
99+
4. Tears down dependencies in `shutdown_priority` order
100+
5. Nova drains HTTP connections and stops
81101

82102
No manual `prep_stop` calls needed — shutdown is fully automatic.
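
The drain step in the sequence above can be pictured as a polling loop. The following is a minimal sketch under stated assumptions, not the library's implementation: the module name `drain_sketch`, the function `drain/3`, and the `OccupancyFun` callback are hypothetical stand-ins. The commit only states that the drain monitors bulkhead occupancy, polls at `drain_poll_interval`, and exits early when idle.

```erlang
-module(drain_sketch).
-export([drain/3]).

%% Hypothetical sketch: poll an occupancy function until it reports zero
%% in-flight calls or the timeout elapses.
%% OccupancyFun stands in for whatever reports bulkhead occupancy; it is
%% an assumption for illustration, not a nova_resilience API.
-spec drain(fun(() -> non_neg_integer()), non_neg_integer(), pos_integer()) ->
    ok | {timeout, non_neg_integer()}.
drain(OccupancyFun, TimeoutMs, PollMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    drain_loop(OccupancyFun, Deadline, PollMs).

drain_loop(OccupancyFun, Deadline, PollMs) ->
    case OccupancyFun() of
        0 ->
            %% Idle: exit early instead of waiting out the timeout.
            ok;
        InFlight ->
            Remaining = Deadline - erlang:monotonic_time(millisecond),
            if
                Remaining =< 0 ->
                    {timeout, InFlight};
                true ->
                    timer:sleep(min(PollMs, Remaining)),
                    %% Recurse after sleeping. Sleeping once and returning
                    %% would end the drain after a single poll.
                    drain_loop(OccupancyFun, Deadline, PollMs)
            end
    end.
```

With this shape, `drain_sketch:drain(fun() -> 0 end, 15000, 100)` returns `ok` immediately, while a function that always reports work in flight yields `{timeout, N}` once the drain timeout has passed.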
83103

84104
## Health endpoints
85105

86106
| Endpoint | Purpose | Response |
87107
|----------|---------|----------|
88-
| `GET /health` | Full health report | `{"status":"healthy","dependencies":{...},"vm":{...}}` |
108+
| `GET /health` | Full diagnostic report | `{"status":"healthy","dependencies":{...},"vm":{...}}` |
89109
| `GET /ready` | Kubernetes readiness probe | 200 when ready, 503 when not |
90110
| `GET /live` | Kubernetes liveness probe | 200 if process is responsive |
91111

112+
The `/health` endpoint returns per-dependency status with circuit breaker state, bulkhead occupancy, and VM metrics (memory, process count, run queue, uptime, node).
113+
92114
## Configuration
93115

116+
### Application environment
117+
94118
```erlang
95119
{nova_resilience, [
96-
{dependencies, [...]}, %% List of dependency configs
97-
{health_check_interval, 10000}, %% ms between health checks
98-
{vm_checks, true}, %% Include BEAM VM in health report
99-
{gate_timeout, 30000}, %% Max ms to wait for deps on startup
100-
{shutdown_delay, 5000}, %% ms to wait after marking not-ready
101-
{shutdown_drain_timeout, 15000},%% Max ms to drain per priority group
102-
{health_prefix, <<"">>} %% Prefix for health routes
120+
{dependencies, [...]}, %% List of dependency configs
121+
{health_check_interval, 10000}, %% ms between health checks
122+
{vm_checks, true}, %% Include BEAM VM info in health report
123+
{gate_enabled, true}, %% false to skip startup gating (dev/test)
124+
{gate_timeout, 30000}, %% Max ms to wait for deps on startup
125+
{gate_check_interval, 1000}, %% ms between gate readiness checks
126+
{health_severity, info}, %% critical: /health returns 503 when unhealthy
127+
{shutdown_delay, 5000}, %% ms to wait after marking not-ready
128+
{shutdown_drain_timeout, 15000}, %% Max ms to drain per priority group
129+
{drain_poll_interval, 100}, %% ms between drain occupancy polls
130+
{health_prefix, ~""} %% Prefix for health routes (e.g. ~"/internal")
103131
]}.
104132
```
105133

134+
Unknown config keys are logged as warnings on startup to catch typos.
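
One way such a check could work is to compare the configured keys against a list of known ones. This is a hedged sketch only: `config_check_sketch` and `unknown_keys/2` are hypothetical names, and the actual warning logic in nova_resilience may differ.

```erlang
-module(config_check_sketch).
-export([unknown_keys/2]).

%% Hypothetical helper: return the keys of an application environment
%% proplist that do not appear in the list of known keys.
-spec unknown_keys([{atom(), term()}], [atom()]) -> [atom()].
unknown_keys(Env, Known) ->
    [Key || {Key, _Value} <- Env, not lists:member(Key, Known)].
```

The proplist could come from `application:get_all_env(nova_resilience)`, and each returned key could be reported with `logger:warning/2`. A misspelling such as `gate_timout` would then show up in the log at startup rather than being silently ignored.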
135+
136+
### Dependency config
137+
138+
```erlang
139+
#{
140+
name => atom(), %% Required — unique identifier
141+
type => database | kafka | custom, %% Optional — infers adapter
142+
adapter => pgo | kura | brod | module(), %% Optional — inferred from type
143+
critical => boolean(), %% Default: false — gates /ready
144+
shutdown_priority => non_neg_integer(), %% Default: 10 — lower = first
145+
default_timeout => pos_integer(), %% Default deadline in ms
146+
health_check => {module(), function()}, %% Override adapter health check
147+
148+
%% Circuit breaker
149+
breaker => #{
150+
failure_threshold => pos_integer(),
151+
wait_duration => pos_integer(),
152+
slow_call_duration => pos_integer(),
153+
half_open_requests => pos_integer()
154+
},
155+
156+
%% Concurrency limiter
157+
bulkhead => #{
158+
max_concurrent => pos_integer()
159+
},
160+
161+
%% Retry with backoff
162+
retry => #{
163+
max_attempts => pos_integer(),
164+
base_delay => non_neg_integer(),
165+
max_delay => non_neg_integer()
166+
}
167+
}
168+
```
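
To illustrate how `base_delay` and `max_delay` might interact, here is a common formulation of capped exponential backoff with jitter. It is a sketch under assumptions: the formula Seki actually uses is not shown in this commit, and `backoff_sketch`, `delay/3`, and `delay_with_jitter/3` are hypothetical names.

```erlang
-module(backoff_sketch).
-export([delay/3, delay_with_jitter/3]).

%% Capped exponential backoff: BaseDelay * 2^(Attempt - 1), never more
%% than MaxDelay. Attempt counts from 1. This formula is an assumption,
%% not necessarily what Seki implements.
-spec delay(pos_integer(), non_neg_integer(), non_neg_integer()) ->
    non_neg_integer().
delay(Attempt, BaseDelay, MaxDelay) when Attempt >= 1 ->
    min(MaxDelay, BaseDelay * (1 bsl (Attempt - 1))).

%% "Full jitter": pick uniformly from 0..delay/3 so that concurrent
%% retries do not all fire at the same moment.
-spec delay_with_jitter(pos_integer(), non_neg_integer(), non_neg_integer()) ->
    non_neg_integer().
delay_with_jitter(Attempt, BaseDelay, MaxDelay) ->
    rand:uniform(delay(Attempt, BaseDelay, MaxDelay) + 1) - 1.
```

With `base_delay => 100` and `max_delay => 5000`, the un-jittered delays for attempts 1, 2, and 3 would be 100, 200, and 400 ms, and later attempts would be capped at 5000 ms.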
169+
106170
## Built-in adapters
107171

108-
| Type | Adapter | Auto health check |
109-
|------|---------|-------------------|
110-
| `database` | `pgo` (default) | `SELECT 1` via pgo |
111-
| `database` | `kura` | `SELECT 1` via kura repo |
112-
| `kafka` | `brod` | `brod:get_partitions_count/2` |
113-
| any | custom module | Implement `nova_resilience_adapter` behaviour |
172+
| Type | Adapter | Health check | Shutdown |
173+
|------|---------|-------------|----------|
174+
| `database` | `pgo` (default) | `SELECT 1` via pgo pool | no-op |
175+
| `database` | `kura` | `SELECT 1` via kura repo | no-op |
176+
| `kafka` | `brod` | `brod:get_partitions_count/2` | `brod:stop_client/1` |
177+
| any | custom module | `nova_resilience_adapter` behaviour | custom |
178+
179+
## Guides
180+
181+
- [Getting Started](guides/getting-started.md) — Installation and basic setup
182+
- [Circuit Breakers & Bulkheads](guides/resilience-patterns.md) — Protecting dependencies
183+
- [Deadline Propagation](guides/deadlines.md) — Per-request timeout budgets
184+
- [Adapters](guides/adapters.md) — Built-in and custom adapters
185+
- [Graceful Shutdown](guides/shutdown.md) — Ordered teardown and Kubernetes integration
186+
- [Telemetry](guides/telemetry.md) — Observability and monitoring
114187

115188
## License
116189

guides/adapters.md

Lines changed: 70 additions & 9 deletions
@@ -32,17 +32,26 @@ The `repo` field is required — it's the kura repo module that implements `kura
3232

3333
### brod (default for `kafka` type)
3434

35-
Health check calls `brod:get_partitions_count/2` to verify broker connectivity.
35+
Health check calls `brod:get_partitions_count/2` to verify broker connectivity. Shutdown calls `brod:stop_client/1`.
3636

3737
```erlang
3838
#{name => events,
3939
type => kafka,
4040
client => my_brod_client,
41-
topic => <<"events">>}
41+
topic => ~"events"}
4242
```
4343

4444
Both `client` and `topic` are required.
4545

46+
## Adapter resolution
47+
48+
nova_resilience resolves adapters in this order:
49+
50+
1. Explicit `adapter` field → use that module
51+
2. `type => database` → `nova_resilience_adapter_pgo`
52+
3. `type => kafka` → `nova_resilience_adapter_brod`
53+
4. No type or `type => custom` → no adapter (no automatic health check)
54+
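
The resolution order above maps naturally onto function clauses, since Erlang tries them top to bottom. This is a minimal sketch, not the library's code: `adapter_resolution_sketch`, `resolve/1`, and the use of `undefined` for "no adapter" are assumptions, and any mapping of short names such as `pgo` to full module names is left out.

```erlang
-module(adapter_resolution_sketch).
-export([resolve/1]).

%% Hypothetical sketch of the documented resolution order.
%% Clause order matters: an explicit adapter wins over the type.
-spec resolve(map()) -> module() | undefined.
resolve(#{adapter := Adapter}) when is_atom(Adapter) ->
    Adapter;
resolve(#{type := database}) ->
    nova_resilience_adapter_pgo;
resolve(#{type := kafka}) ->
    nova_resilience_adapter_brod;
resolve(_Config) ->
    %% No type or type => custom: no adapter, so no automatic health check.
    undefined.
```

For example, `resolve(#{type => database, adapter => my_redis_adapter})` returns `my_redis_adapter`, because the first clause matches before the type is considered.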
4655
## Custom adapters
4756

4857
Implement the `nova_resilience_adapter` behaviour:
@@ -54,16 +63,19 @@ Implement the `nova_resilience_adapter` behaviour:
5463
-export([health_check/1, wrap_call/2, shutdown/1]).
5564

5665
health_check(#{pool := Pool}) ->
57-
case eredis:q(Pool, [<<"PING">>]) of
58-
{ok, <<"PONG">>} -> ok;
66+
case eredis:q(Pool, [~"PING"]) of
67+
{ok, ~"PONG"} -> ok;
5968
{error, Reason} -> {error, Reason}
6069
end.
6170

6271
wrap_call(_Config, Fun) ->
72+
%% Called around every nova_resilience:call/2,3
73+
%% Use for logging, tracing, connection checkout, etc.
6374
Fun().
6475

65-
shutdown(_Config) ->
66-
ok.
76+
shutdown(#{pool := Pool}) ->
77+
%% Called during graceful shutdown
78+
eredis:stop(Pool).
6779
```
6880

6981
Then reference it in your config:
@@ -77,6 +89,41 @@ Then reference it in your config:
7789
shutdown_priority => 0}
7890
```
7991

92+
### Behaviour callbacks
93+
94+
| Callback | Return | Purpose |
95+
|----------|--------|---------|
96+
| `health_check(Config)` | `ok \| {error, Reason}` | Called periodically to check dependency health |
97+
| `wrap_call(Config, Fun)` | `term()` | Wraps every call through the resilience stack |
98+
| `shutdown(Config)` | `ok` | Called during graceful shutdown |
99+
100+
The `Config` parameter is the full dependency config map, so you can pass any fields you need.
101+
102+
### wrap_call examples
103+
104+
**Connection checkout:**
105+
106+
```erlang
107+
wrap_call(#{pool := Pool}, Fun) ->
108+
case pool:checkout(Pool) of
109+
{ok, Conn} ->
110+
try Fun()
111+
after pool:checkin(Pool, Conn)
112+
end;
113+
{error, _} = Err ->
114+
Err
115+
end.
116+
```
117+
118+
**Distributed tracing:**
119+
120+
```erlang
121+
wrap_call(#{name := Name}, Fun) ->
122+
    Tracer = opentelemetry:get_application_tracer(?MODULE),
123+
    otel_tracer:with_span(Tracer, Name, #{kind => client},
124+
        fun(_SpanCtx) -> Fun() end).
125+
```
126+
80127
## Overriding health checks
81128

82129
Any dependency can override the adapter's health check with a custom `{Module, Function}` tuple:
@@ -88,7 +135,23 @@ Any dependency can override the adapter's health check with a custom `{Module, F
88135
health_check => {my_app_health, deep_db_check}}
89136
```
90137

91-
The function must return `ok | {error, Reason}`.
138+
The function must take zero arguments and return `ok | {error, Reason}`:
139+
140+
```erlang
141+
-module(my_app_health).
142+
-export([deep_db_check/0]).
143+
144+
deep_db_check() ->
145+
case pgo:query(~"SELECT count(*) FROM pg_stat_activity") of
146+
#{rows := [{Count}]} when Count < 100 -> ok;
147+
#{rows := [{Count}]} -> {error, {too_many_connections, Count}};
148+
{error, Reason} -> {error, Reason}
149+
end.
150+
```
151+
152+
## Soft dependencies
153+
154+
All built-in adapters are soft dependencies — they're only loaded when used. Your application only needs to include the adapter libraries it actually uses (pgo, kura, brod).
92155

93156
## Runtime registration
94157

@@ -103,11 +166,9 @@ nova_resilience:register_dependency(inventory_service, #{
103166
breaker => #{failure_threshold => 5, wait_duration => 30000}
104167
}).
105168

106-
%% Then use it
107169
nova_resilience:call(inventory_service, fun() ->
108170
httpc:request("http://inventory:8080/api/stock")
109171
end).
110172

111-
%% Unregister when no longer needed
112173
nova_resilience:unregister_dependency(inventory_service).
113174
```
