Commit fe05dca

docs: address review feedback on config builder, examples, and changelog
1 parent 314baf1 commit fe05dca

9 files changed

Lines changed: 103 additions & 248 deletions
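
The changelog and README hunks in this commit describe a strict `build()` and a lenient `build_or_disabled()`. As a reading aid, here is a minimal std-only sketch of that split. Every name in it (`ConfigBuilder`, `Config`, `ConfigError`) is a hypothetical stand-in and not the crate's API, and the writer probe is simplified to a parent-directory check, which is an assumption rather than what `RotatingWriter` actually does.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

// Hypothetical stand-ins for illustration; these are NOT the dial9 types.
#[derive(Debug)]
enum ConfigError {
    MissingField(&'static str),
    Io(std::io::Error),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingField(name) => write!(f, "missing required field `{name}`"),
            ConfigError::Io(err) => write!(f, "writer I/O probe failed: {err}"),
        }
    }
}

#[derive(Debug, PartialEq)]
enum Config {
    Enabled { base_path: PathBuf, max_file_size: u64 },
    Disabled,
}

#[derive(Default)]
struct ConfigBuilder {
    base_path: Option<PathBuf>,
    max_file_size: Option<u64>,
}

impl ConfigBuilder {
    fn base_path(mut self, path: impl Into<PathBuf>) -> Self {
        self.base_path = Some(path.into());
        self
    }

    fn max_file_size(mut self, bytes: u64) -> Self {
        self.max_file_size = Some(bytes);
        self
    }

    /// Strict path: report missing fields and I/O problems to the caller.
    fn build(self) -> Result<Config, ConfigError> {
        let base_path = self.base_path.ok_or(ConfigError::MissingField("base_path"))?;
        let max_file_size = self
            .max_file_size
            .ok_or(ConfigError::MissingField("max_file_size"))?;
        // Simplified probe (assumption): only check that the parent directory exists.
        let parent = base_path
            .parent()
            .filter(|p| !p.as_os_str().is_empty())
            .unwrap_or(Path::new("."));
        std::fs::metadata(parent).map_err(ConfigError::Io)?;
        Ok(Config::Enabled { base_path, max_file_size })
    }

    /// Lenient path: log the failure and fall back to a disabled config.
    fn build_or_disabled(self) -> Config {
        match self.build() {
            Ok(config) => config,
            Err(err) => {
                eprintln!("telemetry disabled: {err}");
                Config::Disabled
            }
        }
    }
}
```

The lenient method is a thin wrapper over the strict one, so both paths share a single validation routine; a caller that wants to fail fast keeps using `build()` and matches on the error.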

CHANGELOG.md

Lines changed: 17 additions & 12 deletions
@@ -9,21 +9,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
-- `Dial9Config::builder()` — a `bon`-generated fluent entry point with named setters for the required writer fields (`base_path`, `max_file_size`, `max_total_size`), stackable `with_tokio` / `with_runtime` closures, and an `.enabled(bool)` toggle that selects the no-telemetry path on the same builder. Re-exported at the crate root as `dial9_tokio_telemetry::Dial9Config` / `Dial9ConfigBuilder`. The original positional-argument API (`Dial9ConfigBuilder::new(..)` / `::disabled()` under `dial9_tokio_telemetry::config`) is unchanged and remains fully supported.
-- `Dial9ConfigBuilder::build()` returns `Result<Dial9Config, Dial9ConfigBuilderError>`. Required-field validation **and** the `RotatingWriter` transport-I/O probe both happen at config-build time, so by the time you have a `Dial9Config` the trace file has already been opened.
-- `Dial9ConfigBuilder::build_or_disabled()` — lenient counterpart to `build()`. Returns the same `Dial9Config` type; on validation or writer-I/O failure it logs at `tracing::error!(target = "dial9_telemetry")` and downgrades to a disabled config that still carries the user's `with_tokio` configurators.
-- `Dial9ConfigBuilderError` enum with `Validation(ValidationError)` and `Io(std::io::Error)` variants, both implementing `std::error::Error`. `ValidationError::fields()` returns the names of the unset required setters.
-- `TracedRuntime::new(config)` / `TracedRuntime::try_new(config)` — high-level constructors used by the `#[dial9_tokio_telemetry::main]` macro. Both accept either the new fluent `Dial9Config` or the deprecated positional `dial9_tokio_telemetry::config::Dial9Config` via `TryInto<TracedRuntime>`. `try_new` returns the conversion error (`TelemetryRuntimeError` for the fluent path, `std::io::Error` for the legacy bridge); `new` panics with that error formatted via `Display`.
-- `TelemetryRuntimeError` enum with `TokioRuntimeBuilder` and `TelemetryCore` variants — the only failure modes left after writer-I/O has been moved into the config builder.
-- `TelemetryHandle::disabled()` — explicit constructor for an inert handle whose `spawn` falls through to `tokio::spawn` and whose control methods are no-ops. `TelemetryHandle::is_enabled()` distinguishes the live and inert modes.
-- `TelemetryGuard::is_enabled()` and a no-op `Disabled` mode: a `TelemetryGuard` is now always present on a `TracedRuntime`, regardless of whether telemetry was installed. `TelemetryGuard::handle()` returns an inert `TelemetryHandle` on a disabled guard, and `graceful_shutdown()` is a successful no-op there.
+- **Inline config in the macro**: `#[dial9_tokio_telemetry::main]` now accepts a closure, so simple setups no longer need a separate config function.
+- **Fluent config builder**: `Dial9Config::builder()` with named setters, `with_tokio`/`with_runtime` closures, and `.enabled(bool)`. The original positional API under `dial9_tokio_telemetry::config` is unchanged.
+- **`build_or_disabled()`**: on config validation or writer I/O failure, logs an error and starts a plain tokio runtime instead of crashing. Use `build()` to handle failures explicitly.
+
+All three in action:
+
+```rust
+#[dial9_tokio_telemetry::main(config = || {
+    Dial9Config::builder()
+        .base_path("/tmp/trace.bin")
+        .max_file_size(64 * 1024 * 1024)
+        .max_total_size(256 * 1024 * 1024)
+        .build_or_disabled()
+})]
+async fn main() { /* ... */ }
+```
 
 ### Changed
 
-- **Breaking:** `TelemetryHandle::current()` no longer panics off-runtime. It now returns an inert handle when called from a thread that is not owned by a dial9 runtime, mirroring the always-present-guard model. Use `TelemetryHandle::is_enabled()` to branch on whether telemetry is live on the current thread.
-- **Breaking:** `TelemetryHandle::try_current()` is deprecated in favor of `current()`; call sites that relied on the `Option` ceremony can now drop it.
-- **Breaking:** `TracedRuntime::guard()` returns `&TelemetryGuard` (always-present) instead of `Option<&TelemetryGuard>`. Replace `rt.guard().is_some()` with `rt.guard().is_enabled()`.
-- **Breaking:** the high-level `TracedRuntime` type at the crate root supersedes the previous `TelemetryRuntime` (which was renamed). The `#[dial9_tokio_telemetry::main]` macro now expands to `TracedRuntime::new(...)`.
+- `TelemetryHandle::current()` no longer panics off-runtime. It returns an inert handle whose `spawn` falls through to `tokio::spawn`. Use `TelemetryHandle::is_enabled()` to check whether telemetry is live.
 
 ## [0.3.3](https://github.com/dial9-rs/dial9-tokio-telemetry/compare/dial9-tokio-telemetry-v0.3.2...dial9-tokio-telemetry-v0.3.3) - 2026-04-20

dial9-tokio-telemetry/README.md

Lines changed: 59 additions & 156 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ Without this flag, compilation will fail with errors about missing methods on `t
 
 ## Quick start
 
-There are two ways to set up dial9: the `#[main]` macro (recommended for most apps) or manual `TracedRuntime` setup. The macro handles the boilerplate of building the runtime and spawning your code as an instrumented task. Inside your `main` body, call `TelemetryHandle::current()` to get the handle for wake-event tracking. Use manual setup when you have multiple Tokio runtimes, don't own `main` (e.g., library code or embedded services), or need to integrate with existing runtime-building code.
+There are two ways to set up dial9: the `#[main]` macro (recommended for most apps) or `TracedRuntime::new()`/`try_new()` for manual control. Both use `Dial9Config::builder()` to configure telemetry. Inside your `main` body, call `TelemetryHandle::current()` to get the handle for wake-event tracking.
 
 ### Using the `#[main]` macro

@@ -41,11 +41,10 @@ fn my_config() -> Dial9Config {
         .rotation_period(std::time::Duration::from_secs(300)) // optional: rotate every 5 min (default: 60 s)
         .with_runtime(|r| r.with_runtime_name("main").with_task_tracking(true)) // TracedRuntime knobs
         .with_tokio(|t| { t.worker_threads(4); }) // tokio knobs
-        .build()
-        .expect("config build failed")
+        .build_or_disabled() // or use build() to handle config failures explicitly
 }
 
-#[dial9_tokio_telemetry::main(config = my_config)]
+#[dial9_tokio_telemetry::main(config = my_config)] // inline config function is also supported
 async fn main() {
     // your async code here
     // `TelemetryHandle::current()` returns the per-thread handle for
@@ -60,45 +59,11 @@ async fn main() {
 
 The macro automatically spawns your function body as a task, so top-level code is visible in traces (unlike plain `#[tokio::main]` where `block_on` work is invisible — see [below](#the-root-future-is-not-instrumented)). dial9 installs a `TelemetryHandle` on every runtime-owned thread via `on_thread_start`. Call `TelemetryHandle::current()` to get it for spawning wake-tracked sub-tasks.
 
-### Optional telemetry on I/O failure
-
-`build()` is strict: missing required writer fields, or an unwritable `base_path`, surface as a `Dial9ConfigBuilderError` (panicking through the macro). When telemetry is best-effort — e.g. you'd rather start the service than fail on a misconfigured trace path — finish with `build_or_disabled()` instead. It returns the same `Dial9Config` type, but:
-
-- emits a `tracing::error!` log on the `dial9_telemetry` target when validation or writer-I/O probing fails, and
-- downgrades to a disabled config that still carries your `with_tokio` configurators (worker count, thread names, etc. are preserved).
-
-```rust,no_run
-use dial9_tokio_telemetry::Dial9Config;
-
-fn my_config() -> Dial9Config {
-    Dial9Config::builder()
-        .base_path("/tmp/my_traces/trace.bin")
-        .max_file_size(1024 * 1024)
-        .max_total_size(5 * 1024 * 1024)
-        .with_tokio(|t| { t.worker_threads(4); })
-        .build_or_disabled()
-}
-
-#[dial9_tokio_telemetry::main(config = my_config)]
-async fn main() {
-    // Telemetry may or may not be active; the returned handle is inert
-    // when the lenient downgrade fired.
-    use dial9_tokio_telemetry::telemetry::TelemetryHandle;
-    let handle = TelemetryHandle::current();
-    // `handle.spawn` records wake events when telemetry is live and
-    // falls through to plain `tokio::spawn` when it is not.
-    handle.spawn(async { /* ... */ });
-    if handle.is_enabled() {
-        // any code paths specific to "telemetry on"
-    }
-}
-```
-
-See [`examples/optional_telemetry.rs`](/dial9-tokio-telemetry/examples/optional_telemetry.rs) for an end-to-end run including a `DIAL9_TRACE_PATH=/unwritable/...` mode that exercises the downgrade.
+`build_or_disabled()` returns a pass-through config on I/O or validation failure, so the service starts on a plain tokio runtime instead of crashing. `TelemetryHandle::current()` returns an inert handle in that case, and `handle.spawn` falls through to `tokio::spawn`.
 
 ### Without the macro
 
-The macro expands to `TracedRuntime::new(...).block_on(...)`. If you'd rather drive that yourself — for tests, libraries that build their own runtime, or any code that doesn't own `main` `TracedRuntime` is a public type that accepts a `Dial9Config`:
+The macro expands to `TracedRuntime::new(...).block_on(...)`. If you'd rather drive that yourself (graceful shutdown, multiple runtimes, tests, or any code that doesn't own `main`), `TracedRuntime` is a public type that accepts a `Dial9Config`:
 
 ```rust,no_run
 use dial9_tokio_telemetry::{Dial9Config, TracedRuntime};
@@ -107,47 +72,15 @@ let cfg = Dial9Config::builder()
     .base_path("/tmp/my_traces/trace.bin")
     .max_file_size(1024 * 1024)
     .max_total_size(5 * 1024 * 1024)
-    .build()
-    .expect("config build failed");
+    .build_or_disabled();
 
-let rt = TracedRuntime::try_new(cfg).expect("runtime build failed");
+let rt = TracedRuntime::try_new(cfg).expect("tokio runtime failed to start");
 rt.block_on(async {
     // body runs as a spawned, instrumented task — same as under #[main]
 });
 ```
 
-### Manual setup
-
-```rust
-use dial9_tokio_telemetry::telemetry::{RotatingWriter, TracedRuntime};
-
-fn main() -> std::io::Result<()> {
-    let writer = RotatingWriter::builder()
-        .base_path("/tmp/my_traces/trace.bin")
-        .max_file_size(100 * 1024 * 1024) // safety valve at 100 MiB per file
-        .max_total_size(500 * 1024 * 1024) // keep at most 500 MiB on disk
-        // .rotation_period(std::time::Duration::from_secs(300)) // optional: rotate every 5 min (default: 60 s)
-        .build()?;
-
-    let mut builder = tokio::runtime::Builder::new_multi_thread();
-    builder.worker_threads(4).enable_all();
-
-    let (runtime, guard) = TracedRuntime::build_and_start(builder, writer)?;
-    let handle = guard.handle();
-
-    runtime.block_on(async {
-        handle.spawn(async {
-            // your async code here will be instrumented
-        }).await.unwrap();
-    });
-
-    Ok(())
-}
-```
-
-Events are 6–16 bytes on the wire, and a typical request generates ~20–35 bytes of trace data (a few poll events plus park/unpark). At 10k requests/sec that's well under 1 MB/s — `RotatingWriter` caps total disk usage so you can leave it running indefinitely. Typical CPU overhead is under 5%.
-
-Segments rotate on size _or_ time, whichever comes first. Time boundaries are wall-clock-aligned (e.g. a 60 s period rotates at the top of each minute), which produces clean S3 key paths when using the `worker-s3` feature.
+For lower-level control (custom `TraceWriter`, multiple runtimes sharing one telemetry session, or direct access to the `TelemetryGuard`), see `TracedRuntime::builder()` and `TelemetryCore::builder()` in the API docs.
 
 ## Can I use this in prod?

@@ -223,22 +156,16 @@ To understand when Tokio itself is delaying your code (scheduler delay), you nee
 Use `handle.spawn()` instead of `tokio::spawn()`:
 
 ```rust,no_run
-# use dial9_tokio_telemetry::telemetry::{RotatingWriter, TracedRuntime};
-# fn main() -> std::io::Result<()> {
-# let writer = RotatingWriter::new("/tmp/t.bin", 1024, 4096)?;
-# let builder = tokio::runtime::Builder::new_multi_thread();
-let (runtime, guard) = TracedRuntime::build_and_start(builder, writer)?;
-let handle = guard.handle();
+use dial9_tokio_telemetry::telemetry::TelemetryHandle;
 
-runtime.block_on(async {
-    // wake events / scheduling delay captured
-    handle.spawn(async { /* ... */ });
+// Inside a dial9 runtime (macro or TracedRuntime):
+let handle = TelemetryHandle::current();
 
-    // this task is still tracked, but won't have wake events
-    tokio::spawn(async { /* ... */ });
-});
-# Ok(())
-# }
+// wake events / scheduling delay captured
+handle.spawn(async { /* ... */ });
+
+// this task is still tracked, but won't have wake events
+tokio::spawn(async { /* ... */ });
 ```
 
 For frameworks like Axum where you don't control the spawn call, you need to wrap the accept loop. See [`examples/metrics-service/src/axum_traced.rs`](/examples/metrics-service/src/axum_traced.rs) for a working example that wraps both the accept loop and per-connection futures.
@@ -301,21 +228,26 @@ Both of these events are tied to the precise instant and thread that they happen
 
 ```rust,no_run
 # #[cfg(feature = "cpu-profiling")]
-# fn main() -> std::io::Result<()> {
-# use dial9_tokio_telemetry::telemetry::{RotatingWriter, TracedRuntime};
+# mod inner {
+use dial9_tokio_telemetry::Dial9Config;
 use dial9_tokio_telemetry::telemetry::cpu_profile::{CpuProfilingConfig, SchedEventConfig};
 
-# let writer = RotatingWriter::new("/tmp/t.bin", 1024, 4096)?;
-# let builder = tokio::runtime::Builder::new_multi_thread();
-let (runtime, guard) = TracedRuntime::builder()
-    .with_task_tracking(true)
-    .with_cpu_profiling(CpuProfilingConfig::default())
-    .with_sched_events(SchedEventConfig::default().include_kernel(true))
-    .with_trace_path("/tmp/t.bin")
-    .build_and_start(builder, writer)?;
-# Ok(())
+fn my_config() -> Dial9Config {
+    Dial9Config::builder()
+        .base_path("/tmp/my_traces/trace.bin")
+        .max_file_size(100 * 1024 * 1024)
+        .max_total_size(500 * 1024 * 1024)
+        .with_runtime(|r| {
+            r.with_task_tracking(true)
+                .with_cpu_profiling(CpuProfilingConfig::default())
+                .with_sched_events(SchedEventConfig::default().include_kernel(true))
+        })
+        .build_or_disabled()
+}
+
+#[dial9_tokio_telemetry::main(config = my_config)]
+async fn main() { /* ... */ }
 # }
-# #[cfg(not(feature = "cpu-profiling"))]
 # fn main() {}
 ```

@@ -355,37 +287,14 @@ sudo sysctl kernel.kptr_restrict=0
 
 Because CPU samples are tagged with the worker thread they were collected on, and the trace records which task is being polled on each worker at each instant, the viewer can correlate samples with individual polls. When a poll takes an unusually long time (a "long poll"), the CPU samples collected during that poll show you exactly what code was running — expensive serialization, accidental blocking I/O, lock contention, etc. In the trace viewer, click on a long poll to see its flamegraph, or shift+drag to aggregate CPU samples across a time range.
 
-## Getting started
-
-`TracedRuntime::build` returns a `(Runtime, TelemetryGuard)`. The guard owns the flush thread and provides a `TelemetryHandle` for enabling/disabling recording at runtime:
-
-```rust,no_run
-# use dial9_tokio_telemetry::telemetry::{RotatingWriter, TracedRuntime};
-# fn main() -> std::io::Result<()> {
-# let writer = RotatingWriter::new("/tmp/t.bin", 1024, 4096)?;
-# let builder = tokio::runtime::Builder::new_multi_thread();
-let (runtime, guard) = TracedRuntime::builder()
-    .with_task_tracking(true)
-    .build(builder, writer)?;
-
-// start disabled, enable later
-guard.enable();
-
-// TelemetryHandle is Clone + Send — pass it around
-let handle = guard.handle();
-handle.disable();
-# Ok(())
-# }
-```
-
-### Multiple runtimes
+## Multiple runtimes
 
 For applications with multiple Tokio runtimes (e.g. thread-per-core, or separate request/IO runtimes), use `TelemetryCore` to create the telemetry session first, then attach each runtime:
 
 ```rust,no_run
 # use dial9_tokio_telemetry::telemetry::{RotatingWriter, TelemetryCore};
 # fn main() -> std::io::Result<()> {
-# let writer = RotatingWriter::new("/tmp/t.bin", 1024, 4096)?;
+# let writer = RotatingWriter::new("/tmp/t.bin", 100 * 1024 * 1024, 500 * 1024 * 1024)?;
 let guard = TelemetryCore::builder()
     .writer(writer)
     .trace_path("/tmp/t.bin")
@@ -411,11 +320,11 @@ See [`examples/thread_per_core.rs`](/dial9-tokio-telemetry/examples/thread_per_c
 
 **Shutdown**: Drop all runtimes before the `TelemetryGuard` so worker threads exit and flush their thread-local buffers. For a clean shutdown that waits for the background worker (e.g. S3 uploads) to drain, call `guard.graceful_shutdown(timeout)` instead of dropping the guard.
 
-### Writers
+## Writers
 
 `RotatingWriter` rotates files based on size and time, and evicts old ones to stay within a total size budget. By default, segments rotate every 60 seconds (wall-clock-aligned) or when they exceed `max_file_size`, whichever comes first. Time-based rotation produces clean segment boundaries (thread-local buffers are drained before sealing), so set `max_file_size` large enough that time-based rotation fires first under normal conditions (100 MiB is a good default). Size-based rotation then acts as a safety valve for unexpected data bursts. For quick experiments, use `RotatingWriter::single_file(path)` to skip rotation entirely.
 
-### Analyzing traces
+## Analyzing traces
 
 [`dial9-viewer`](/dial9-viewer) is an interactive trace viewer and S3 browser. Point it at a local directory or an S3 bucket to browse and visualize traces in the browser. [Here's a demo.](https://www.youtube.com/watch?v=zJOzU_6Mf7Q)

@@ -459,39 +368,34 @@ Only `bucket` and `service_name` are required. See `S3Config` for additional opt
 
 ```rust,no_run
 # #[cfg(feature = "worker-s3")]
-# fn main() -> std::io::Result<()> {
-use dial9_tokio_telemetry::telemetry::{RotatingWriter, TracedRuntime};
+# mod inner {
+use dial9_tokio_telemetry::Dial9Config;
 use dial9_tokio_telemetry::background_task::s3::S3Config;
 
-let trace_path = "/tmp/my_traces/trace.bin";
-let writer = RotatingWriter::builder()
-    .base_path(trace_path)
-    .max_file_size(100 * 1024 * 1024) // safety valve at 100 MiB per file
-    .max_total_size(500 * 1024 * 1024) // keep at most 500 MiB on disk
-    .build()?;
-
-let s3_config = S3Config::builder()
-    .bucket("my-trace-bucket")
-    .service_name("my-service")
-    .build();
-
-let mut builder = tokio::runtime::Builder::new_multi_thread();
-builder.worker_threads(4).enable_all();
+fn my_config() -> Dial9Config {
+    let s3_config = S3Config::builder()
+        .bucket("my-trace-bucket")
+        .service_name("my-service")
+        .build();
 
-let (runtime, guard) = TracedRuntime::builder()
-    .with_task_tracking(true)
-    .with_trace_path(trace_path)
-    .with_s3_uploader(s3_config)
-    .build_and_start(builder, writer)?;
+    Dial9Config::builder()
+        .base_path("/tmp/my_traces/trace.bin")
+        .max_file_size(100 * 1024 * 1024)
+        .max_total_size(500 * 1024 * 1024)
+        .with_tokio(|t| { t.worker_threads(4); })
+        .with_runtime(|r| {
+            r.with_task_tracking(true)
+                .with_s3_uploader(s3_config)
+        })
+        .build_or_disabled()
+}
 
-runtime.block_on(async {
+#[dial9_tokio_telemetry::main(config = my_config)]
+async fn main() {
     // your async code here
-});
-
-// guard drop: flushes, seals final segment, worker drains remaining to S3
-# Ok(())
+}
+// on shutdown: flushes, seals final segment, worker drains remaining to S3
 # }
-# #[cfg(not(feature = "worker-s3"))]
 # fn main() {}
 ```

@@ -505,7 +409,6 @@ The worker uses a circuit breaker with exponential backoff if S3 is unreachable.
 
 ```bash
 cargo run --example simple_workload # macro-based setup (start here)
-cargo run --example optional_telemetry # build_or_disabled: best-effort telemetry, plain-tokio fallback
 cargo run --example conditionally_enable # toggle telemetry via ENABLE_DIAL9 env var
 cargo run --example realistic_workload # mixed CPU/IO workload
 cargo run --example long_workload # longer run for trace analysis
