Extract "TelemetryCore" separate from TracedRuntime #193
…runtime
Introduces `TelemetryCore::builder().writer(w).build()` which produces a
`TelemetryGuard` without building any tokio runtime. Runtimes are then
attached via `guard.trace_runtime("name", builder)`.
- New `TelemetryCore` struct with `#[bon::builder]` for session config
- New `TelemetryGuard::trace_runtime()` method
- `TracedRuntime::builder()` reimplemented on top of TelemetryCore
- `build_with_reuse()` delegates to shared `attach_runtime()` helper
- Flush loop extracted to `run_flush_loop()` free function
- `SharedState` now stores `task_tracking_enabled` for use by `trace_runtime`
All 167 existing tests pass unchanged.
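The decoupling described above can be pictured with a small std-only model. This is not the crate's code: `Session`, `Guard`, `AttachedRuntime`, and `attach` are hypothetical stand-ins for `TelemetryCore`, `TelemetryGuard`, and `trace_runtime()`, and a "runtime" is reduced to a name plus a worker count. It only illustrates the shape of the design: the session owns the shared state, and each runtime attached later reserves its own block of worker IDs from it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// State shared by every runtime attached to one session (hypothetical).
struct Shared {
    next_worker_id: AtomicUsize,
}

/// Stand-in for the telemetry session; building it starts no runtime.
struct Session;

/// Stand-in for the guard returned when the session is built.
struct Guard {
    shared: Arc<Shared>,
}

/// What an attached runtime learns about itself in this model.
#[derive(Debug, PartialEq)]
struct AttachedRuntime {
    name: String,
    worker_ids: std::ops::Range<usize>,
}

impl Session {
    fn build() -> Guard {
        Guard {
            shared: Arc::new(Shared {
                next_worker_id: AtomicUsize::new(0),
            }),
        }
    }
}

impl Guard {
    /// Reserve `num_workers` consecutive IDs for a runtime called `name`.
    fn attach(&self, name: &str, num_workers: usize) -> AttachedRuntime {
        // fetch_add returns the previous value, which becomes this runtime's base.
        let base = self
            .shared
            .next_worker_id
            .fetch_add(num_workers, Ordering::Relaxed);
        AttachedRuntime {
            name: name.to_string(),
            worker_ids: base..base + num_workers,
        }
    }
}

/// Usage: one session, two runtimes, no overlapping worker IDs.
fn attach_two() -> (AttachedRuntime, AttachedRuntime) {
    let guard = Session::build();
    let main = guard.attach("main", 4);
    let io = guard.attach("io", 2);
    (main, io)
}
```

In this model `attach_two()` gives the "main" runtime IDs `0..4` and the "io" runtime IDs `4..6`, which is the property that lets several runtimes share one trace.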
Demonstrates the decoupled pattern: create telemetry session first, then attach coordinator + per-core runtimes via trace_runtime().
…ction
- multi_runtime example now uses TelemetryCore::builder() + trace_runtime()
- Added "Multiple runtimes" section to README with TelemetryCore example
- Switch TelemetryCore from `#[builder(finish_fn = build)] fn builder()` to `#[builder] fn new()`: cleaner, same call site
- Rename `build_with_reuse` → `build_and_attach_to_telemetry` for clarity
| Branch | telemetry-core |
|---|---|
| Testbed | ubuntu-latest |

| Benchmark | Latency (µs) | Result Δ% | Baseline (µs) | Upper boundary (µs) | Limit % |
|---|---|---|---|---|---|
| writer_encode/batches/1 | 7.36 | +0.20% | 7.34 | 9.18 | 80.16% |
| writer_encode/batches/10 | 73.71 | -0.45% | 74.04 | 92.55 | 79.64% |
| writer_encode/batches/100 | 755.44 | +1.20% | 746.49 | 933.11 | 80.96% |
| Branch | telemetry-core |
|---|---|
| Testbed | ubuntu-latest |

| Benchmark | Result | Result Δ% | Baseline | Boundary (upper for latency, lower for throughput) | Limit % |
|---|---|---|---|---|---|
| overhead::baseline::mean_lat_ns | 399.88 µs | +4.75% | 381.75 µs | 477.19 µs | 83.80% |
| overhead::baseline::p99_9_lat_ns | 1,711.10 µs | +3.10% | 1,659.65 µs | 2,074.56 µs | 82.48% |
| overhead::baseline::p99_lat_ns | 782.34 µs | +4.77% | 746.75 µs | 933.44 µs | 83.81% |
| overhead::baseline::throughput_rps | 124.99 ops/s x 1e3 | -6.14% | 133.16 ops/s x 1e3 | 99.87 ops/s x 1e3 | 79.90% |
| overhead::noop::mean_lat_ns | 360.26 µs | +2.31% | 352.14 µs | 440.17 µs | 81.85% |
| overhead::noop::p99_9_lat_ns | 977.92 µs | -2.23% | 1,000.19 µs | 1,250.24 µs | 78.22% |
| overhead::noop::p99_lat_ns | 679.93 µs | +2.78% | 661.57 µs | 826.96 µs | 82.22% |
| overhead::noop::throughput_rps | 138.75 ops/s x 1e3 | -3.06% | 143.13 ops/s x 1e3 | 107.34 ops/s x 1e3 | 77.37% |
| overhead::telemetry::mean_lat_ns | 363.00 µs | +3.77% | 349.79 µs | 437.24 µs | 83.02% |
| overhead::telemetry::p99_9_lat_ns | 1,015.29 µs | -4.39% | 1,061.95 µs | 1,327.44 µs | 76.49% |
| overhead::telemetry::p99_lat_ns | 697.34 µs | +3.33% | 674.88 µs | 843.60 µs | 82.66% |
| overhead::telemetry::throughput_rps | 137.69 ops/s x 1e3 | -4.40% | 144.03 ops/s x 1e3 | 108.02 ops/s x 1e3 | 78.45% |
| overhead::telemetry_p99_added_latency_ns | 18,446,744,073,709,464.00 µs | -0.00% | 18,446,744,073,709,480.00 µs | 23,058,430,092,136,848.00 µs | 80.00% |
jlizen left a comment
Just some nits
Side note: `telemetry_p99_added_latency_ns` shows 18,446,744,073,709,516.00 µs, which is probably due to underflow?
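The underflow suspicion can be checked with the p99 rows from the benchmark table: the telemetry run (697.34 µs) is faster than the baseline run (782.34 µs), so an unsigned `telemetry - baseline` in nanoseconds has no valid result. The sketch below is not the benchmark's code. The function names are hypothetical, and how the metric is really computed is an assumption. `wrapping_sub` is used to make the wraparound explicit, because a plain `-` on `u64` panics in debug builds and wraps in release builds.

```rust
/// Unsigned difference: wraps around when telemetry is faster than baseline.
fn added_latency_wrapping(telemetry_ns: u64, baseline_ns: u64) -> u64 {
    telemetry_ns.wrapping_sub(baseline_ns)
}

/// Signed difference: a negative result means telemetry was faster.
fn added_latency_signed(telemetry_ns: u64, baseline_ns: u64) -> i64 {
    telemetry_ns as i64 - baseline_ns as i64
}

/// Clamped difference: never reports less than zero added latency.
fn added_latency_saturating(telemetry_ns: u64, baseline_ns: u64) -> u64 {
    telemetry_ns.saturating_sub(baseline_ns)
}
```

With 697,340 ns and 782,340 ns the wrapping version returns 2^64 - 85,000, which is about 1.8 × 10^16 µs, the same magnitude as the reported figure. The signed version returns -85,000 and the saturating version returns 0. Either would keep the metric readable, depending on whether negative overhead should be visible.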
    // Write final metadata before sealing so single-segment
    // traces contain runtime→worker mappings.
    if let Err(e) = event_writer.write_current_segment_metadata() {
    tracing::warn!("failed to write final segment metadata: {e}");
we don't have pipeline metrics yet?
we do, good call, we can add a metric here
    let start_mono_ns = crate::telemetry::events::clock_monotonic_ns();
    let shared = Arc::new(SharedState::new(start_mono_ns));
    #[allow(unused_mut)]
    let mut event_writer = EventWriter::new(Box::new(writer));
A bit unfortunate that we accept 'static TraceWriter, but immediately box it. But, the alternatives I see are worse for users, so probably fine.
it's `'static` so that we can box it
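The exchange above comes down to a Rust default: `Box<dyn Trait>` without an explicit lifetime means `Box<dyn Trait + 'static>`, so a generic writer can only be boxed into it if the parameter is bounded by `'static`. A minimal sketch of that shape follows. The `TraceWriter` trait, `VecWriter`, `write_event`, and `start_session` are hypothetical stand-ins, not the crate's definitions.

```rust
/// Hypothetical stand-in for the PR's writer trait.
trait TraceWriter: Send {
    fn write_batch(&mut self, bytes: &[u8]) -> std::io::Result<usize>;
}

/// A trivial in-memory writer, used only to exercise the sketch.
struct VecWriter(Vec<u8>);

impl TraceWriter for VecWriter {
    fn write_batch(&mut self, bytes: &[u8]) -> std::io::Result<usize> {
        self.0.extend_from_slice(bytes);
        Ok(bytes.len())
    }
}

struct EventWriter {
    // Shorthand for Box<dyn TraceWriter + 'static>.
    inner: Box<dyn TraceWriter>,
}

impl EventWriter {
    fn new(inner: Box<dyn TraceWriter>) -> Self {
        Self { inner }
    }

    fn write_event(&mut self, bytes: &[u8]) -> std::io::Result<usize> {
        self.inner.write_batch(bytes)
    }
}

/// Generic at the API boundary, boxed once internally.
/// Removing the `'static` bound makes the `Box::new(writer)` coercion fail to compile.
fn start_session(writer: impl TraceWriter + 'static) -> EventWriter {
    EventWriter::new(Box::new(writer))
}
```

Callers pass a concrete writer by value and never see the box. The cost is one allocation and dynamic dispatch per write, which is the trade-off the reviewer describes as acceptable.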
    let base = shared
    .next_worker_id
    .fetch_add(num_workers, Ordering::Relaxed);
    ctx.metrics_and_base.set((metrics, base)).ok();
This is pre-existing, but should we warn on a double-set?
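A double-set is detectable because set-once cells return the rejected value as an `Err`, and `.ok()` discards it. The sketch below assumes `metrics_and_base` behaves like `std::sync::OnceLock`; the real field type is not shown in the diff. `RuntimeCtx`, `set_once`, and the `String` metrics handle are hypothetical, and `eprintln!` stands in for a `tracing::warn!` call to keep the example std-only.

```rust
use std::sync::OnceLock;

/// Hypothetical per-runtime context holding a (metrics handle, worker-ID base) pair.
struct RuntimeCtx {
    metrics_and_base: OnceLock<(String, usize)>,
}

impl RuntimeCtx {
    fn new() -> Self {
        Self {
            metrics_and_base: OnceLock::new(),
        }
    }

    /// The worker-ID base that was stored, if any.
    fn base(&self) -> Option<usize> {
        self.metrics_and_base.get().map(|(_, base)| *base)
    }
}

/// Returns true if the value was stored, false if the cell was already set.
fn set_once(ctx: &RuntimeCtx, metrics: String, base: usize) -> bool {
    match ctx.metrics_and_base.set((metrics, base)) {
        Ok(()) => true,
        Err((_, rejected_base)) => {
            // Surface the double-set instead of silently dropping it.
            eprintln!("metrics_and_base already set; ignoring base {rejected_base}");
            false
        }
    }
}
```

The first call wins and later calls leave the stored pair untouched, so the warning is purely diagnostic and does not change behavior.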
    // Both runtimes share a single trace file with unique worker IDs.
    // The trace viewer groups workers by runtime name.
    // Use main_handle.spawn() / io_handle.spawn() for wake-tracked futures.
    # Ok(())
nit: should we mention runtime dropping / graceful shutdown handling?
… on double-set, shutdown docs
- Add write_metadata_failed and finalize_failed booleans to FlushMetrics, emitted on the final flush so failures are observable via metrics
- Restructure exit path: metadata write and finalize happen before the metric is emitted; metric is always emitted on exit
- Replace silent .ok() on metrics_and_base.set() with tracing::warn on double-set
- Add shutdown guidance to README Multiple runtimes section
Create the FlushMetrics guard up front via append_on_drop, then mutate write_metadata_failed/finalize_failed through DerefMut on the exit path. The guard drops naturally, emitting the final metric entry.
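The guard pattern in this commit message can be sketched with std alone: a wrapper owns the metrics entry, exposes it through `Deref`/`DerefMut`, and appends it to a sink in `Drop`. This is a simplified model, not the metrics library's `append_on_drop`. The `AppendOnDrop` struct, the `Sink` alias, `new_sink`, and `exit_path` are hypothetical, and `FlushMetrics` is reduced to the two booleans named above.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Arc, Mutex};

/// Reduced model of the metrics entry: only the two failure flags.
#[derive(Clone, Debug, Default, PartialEq)]
struct FlushMetrics {
    write_metadata_failed: bool,
    finalize_failed: bool,
}

/// Hypothetical destination for emitted entries.
type Sink = Arc<Mutex<Vec<FlushMetrics>>>;

fn new_sink() -> Sink {
    Arc::new(Mutex::new(Vec::new()))
}

/// Emits its entry to the sink when dropped.
struct AppendOnDrop {
    entry: FlushMetrics,
    sink: Sink,
}

impl Deref for AppendOnDrop {
    type Target = FlushMetrics;
    fn deref(&self) -> &FlushMetrics {
        &self.entry
    }
}

impl DerefMut for AppendOnDrop {
    fn deref_mut(&mut self) -> &mut FlushMetrics {
        &mut self.entry
    }
}

impl Drop for AppendOnDrop {
    fn drop(&mut self) {
        // Move the entry out and append it; a poisoned lock is ignored.
        let entry = std::mem::take(&mut self.entry);
        if let Ok(mut sink) = self.sink.lock() {
            sink.push(entry);
        }
    }
}

/// Exit path: create the guard first, record failures, emit on scope exit.
fn exit_path(sink: Sink, metadata_ok: bool, finalize_ok: bool) {
    let mut metrics = AppendOnDrop {
        entry: FlushMetrics::default(),
        sink,
    };
    if !metadata_ok {
        metrics.write_metadata_failed = true; // field write goes through DerefMut
    }
    if !finalize_ok {
        metrics.finalize_failed = true;
    }
    // `metrics` is dropped here, which appends exactly one entry.
}
```

Because emission lives in `Drop`, every return path out of `exit_path` produces one entry, which matches the commit's goal that the metric is always emitted on exit.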
Make FlushStats a #[metrics(subfield)] and flatten it into FlushMetrics, removing the duplicated event_count/dropped_batches/cpu_flush_duration fields. Resolves the TODO on FlushStats.
Extract `TelemetryCore` so that you can spawn Dial9 separately from spawning (potentially multiple) Tokio runtimes.