Commit 183303a

Update README.md (#31)
* Update README.md make prod warning more obvious
* Update README with demo links and wake event info

  Added links to demo resources and clarified wake event tracking details.
* Update README for viewer hosting details
1 parent df47f97 commit 183303a

1 file changed

Lines changed: 20 additions & 10 deletions


dial9-tokio-telemetry/README.md

```diff
@@ -27,20 +27,26 @@ fn main() -> std::io::Result<()> {

 Events are 6–16 bytes on the wire, and a typical request generates ~20–35 bytes of trace data (a few poll events plus park/unpark). At 10k requests/sec that's well under 1 MB/s — `RotatingWriter` caps total disk usage so you can leave it running indefinitely. Typical CPU overhead is under 5%.

-> **Note:** dial9-tokio-telemetry is designed for always-on production use, but it's still early software. Measure overhead and validate behavior in your environment before deploying to production.
+## Can I use this in prod?
+dial9-tokio-telemetry is designed for always-on production use, but it's still early software. Measure overhead and validate behavior in your environment before deploying to production.

 ## Is there a demo?
 Yes, check out this [quick walkthrough (YouTube)](https://www.youtube.com/watch?v=zJOzU_6Mf7Q)!

+The [viewer](https://dial9-tokio-telemetry.russell-r-cohen.workers.dev/) (auto-deployed from the code in `main`) is hosted on Cloudflare Pages for convenience, but you can also serve it yourself with [serve.py](/dial9-tokio-telemetry/serve.py). The viewer is pure HTML and JS and runs entirely client-side.
+
+<img width="1288" height="659" alt="Screenshot 2026-03-01 at 3 52 59 PM" src="https://github.com/user-attachments/assets/77225801-70b1-4aef-b064-32bc2326b1ef" />
+
+
 ## Why dial9-tokio-telemetry?

-Understanding how Tokio is actually running your application — which tasks are slow, why workers are idle, where scheduling delays come from — is hard to do from the outside. This crate records a continuous, low-overhead trace of runtime behavior.
+Understanding the performance and behavior of async code can be hard. Dial9 records every event Tokio emits to build a detailed, microsecond-by-microsecond trace of your application's behavior that you can analyze.

-Compared to [tokio-console](https://github.com/tokio-rs/console), which is designed for live debugging, dial9-tokio-telemetry is designed for post-hoc analysis. Because traces are written to files with bounded disk usage, you can leave it running in production and come back later to deeply analyze what went wrong or why a specific request was slow. On Linux, traces include CPU profile samples and scheduler events, so you can see not just *that* a task was delayed but *what code* was running on the worker instead.
+Compared to [tokio-console](https://github.com/tokio-rs/console), which is designed for live debugging, dial9-tokio-telemetry is designed for post-hoc analysis. Because traces are written to files with bounded disk usage, you can leave it running in production and come back later to deeply analyze what went wrong or why a specific request was slow. On Linux, traces include CPU profile samples and kernel scheduling events, so you can see not just *that* a task was delayed but *what code* was running on the worker instead.

 ## What gets recorded automatically

-`TracedRuntime` installs hooks on the Tokio runtime builder. These fire for every task on the runtime with no code changes required:
+`TracedRuntime` installs hooks on the Tokio runtime. The following events are recorded out of the box:

 | Event | Fields |
 |-------|--------|
```
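The size estimate in the first context line of this hunk can be sanity-checked with plain arithmetic. This Rust sketch is not part of the crate's API: it multiplies an assumed per-request trace size by a request rate, then divides a hypothetical 1 GiB disk budget (standing in for the total cap `RotatingWriter` enforces; the crate's actual defaults are not stated here) by that rate to estimate how much history stays on disk.

```rust
/// Trace bytes written per second at a steady request rate.
fn trace_bytes_per_sec(bytes_per_request: u64, requests_per_sec: u64) -> u64 {
    bytes_per_request * requests_per_sec
}

/// Seconds of history that fit in `budget_bytes` at `bytes_per_sec`.
/// Returns None when nothing is being written (a rate of zero).
fn retention_secs(budget_bytes: u64, bytes_per_sec: u64) -> Option<u64> {
    budget_bytes.checked_div(bytes_per_sec)
}

fn main() {
    // Upper end of the README's estimate: ~35 bytes/request at 10k requests/sec.
    let rate = trace_bytes_per_sec(35, 10_000);
    println!("{rate} bytes/sec"); // prints "350000 bytes/sec" (0.35 MB/s)

    // Hypothetical total disk budget of 1 GiB.
    let budget: u64 = 1024 * 1024 * 1024;
    if let Some(secs) = retention_secs(budget, rate) {
        println!("~{} minutes of history", secs / 60); // prints "~51 minutes of history"
    }
}
```

Doubling the request rate halves the retention window for the same budget, which is the main trade-off to weigh when sizing the cap.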
```diff
@@ -51,7 +57,7 @@ Compared to [tokio-console](https://github.com/tokio-rs/console), which is desig

 ## Wake event tracking

-Wake events — which task woke which other task — are *not* captured automatically. Tokio's runtime hooks don't expose waker identity, so capturing this requires wrapping the future in `Traced<F>`, which installs a custom waker that records a `WakeEvent` before forwarding to the real waker.
+To understand when Tokio itself is delaying your code (generally referred to as scheduler delay), you need to know when your future was _ready_ to run. Wake events (which task woke which other task) are *not* captured automatically. Tokio's runtime hooks don't currently allow instrumenting wakes, so capturing them requires wrapping the future. The simplest way to do that is to use `handle.spawn` instead of `tokio::spawn`.

 Use `handle.spawn()` instead of `tokio::spawn()`:

```
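The removed paragraph in this hunk describes the mechanism behind wake tracking: wrap the future in `Traced<F>` and install a custom waker that records a `WakeEvent` before forwarding to the real waker. The std-only sketch below illustrates that general technique; it is not dial9's implementation. `RecordingWaker`, `CountWakes`, `YieldN`, and the toy `block_on` executor are hypothetical names, and a shared counter stands in for recording a `WakeEvent`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that records a wake (here: bumps a shared counter), then
/// forwards it to the waker the executor originally supplied.
struct RecordingWaker {
    inner: Waker,
    wakes: Arc<AtomicUsize>,
}

impl Wake for RecordingWaker {
    fn wake(self: Arc<Self>) {
        self.wake_by_ref();
    }
    fn wake_by_ref(self: &Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::Relaxed); // record the wake...
        self.inner.wake_by_ref(); // ...then forward to the real waker
    }
}

/// Wraps a future and polls it with a RecordingWaker.
struct CountWakes<F> {
    inner: Pin<Box<F>>,
    wakes: Arc<AtomicUsize>,
}

impl<F: Future> Future for CountWakes<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // A fresh waker per poll keeps the sketch simple.
        let waker = Waker::from(Arc::new(RecordingWaker {
            inner: cx.waker().clone(),
            wakes: Arc::clone(&self.wakes),
        }));
        let mut inner_cx = Context::from_waker(&waker);
        self.inner.as_mut().poll(&mut inner_cx)
    }
}

/// Toy executor waker: unparks the thread running block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future executor: poll, park until woken, repeat.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        thread::park();
    }
}

/// Test future: returns Pending `n` times, waking itself each time.
struct YieldN(u32);

impl Future for YieldN {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 == 0 {
            return Poll::Ready(());
        }
        self.0 -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

fn main() {
    let wakes = Arc::new(AtomicUsize::new(0));
    block_on(CountWakes { inner: Box::pin(YieldN(3)), wakes: Arc::clone(&wakes) });
    println!("recorded {} wakes", wakes.load(Ordering::Relaxed)); // prints "recorded 3 wakes"
}
```

Because the wrapper substitutes its own waker on every poll, any wake the inner future triggers, including wakes from other tasks or I/O drivers that were handed that waker, passes through `RecordingWaker` first. This illustrates why wrapping is needed: the recording has to sit between the future and the waker it is polled with.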
```diff
@@ -64,10 +70,10 @@ let (runtime, guard) = TracedRuntime::build_and_start(builder, Box::new(writer))
 let handle = guard.handle();

 runtime.block_on(async {
-    // wake events captured — uses Traced<F> wrapper
+    // wake events and scheduling delay are captured
     handle.spawn(async { /* ... */ });

-    // wake events NOT captured — still gets poll/park/queue telemetry
+    // this task is still tracked, but won't have wake events
     tokio::spawn(async { /* ... */ });
 });
 # Ok(())
```
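With wakes recorded, scheduling delay for a task can be estimated as the gap between when it was woken (became ready) and when a worker next started polling it. The sketch below assumes a hypothetical, simplified event stream (`Event::Wake` and `Event::PollStart`, each carrying a task id and a nanosecond timestamp); dial9's actual trace format and analysis may differ. It keeps the earliest wake since the task's last poll, because the task has been ready from that moment.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified trace events; timestamps in nanoseconds.
#[derive(Debug, Clone, Copy)]
enum Event {
    Wake { task: u64, ts_ns: u64 },
    PollStart { task: u64, ts_ns: u64 },
}

/// Returns (task, delay_ns) for every poll that follows a wake, where
/// delay = poll start - earliest wake since that task's previous poll.
/// Polls with no preceding wake (e.g. a task's first poll) are skipped.
fn scheduling_delays(events: &[Event]) -> Vec<(u64, u64)> {
    let mut pending_wake: HashMap<u64, u64> = HashMap::new();
    let mut delays = Vec::new();
    for ev in events {
        match *ev {
            Event::Wake { task, ts_ns } => {
                // Later wakes before the next poll don't reset the clock.
                pending_wake.entry(task).or_insert(ts_ns);
            }
            Event::PollStart { task, ts_ns } => {
                if let Some(woken_at) = pending_wake.remove(&task) {
                    delays.push((task, ts_ns.saturating_sub(woken_at)));
                }
            }
        }
    }
    delays
}

fn main() {
    let events = [
        Event::PollStart { task: 1, ts_ns: 0 }, // first poll, no wake yet
        Event::Wake { task: 1, ts_ns: 1_000 },
        Event::Wake { task: 1, ts_ns: 1_500 }, // redundant wake
        Event::PollStart { task: 1, ts_ns: 4_000 },
    ];
    println!("{:?}", scheduling_delays(&events)); // prints "[(1, 3000)]"
}
```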
```diff
@@ -82,13 +88,17 @@ Core telemetry (poll timing, park/unpark, queue depth, wake events) works on all

 On Linux, you get additional data for free:
 - **Thread CPU time** in park/unpark events via `CLOCK_THREAD_CPUTIME_ID` (vDSO, ~20–40 ns)
-- **Scheduler wait time** via `/proc/self/task/<tid>/schedstat` — shows how long the OS kept your thread off-CPU
+- **Scheduler wait time** via `/proc/self/task/<tid>/schedstat`: shows how long a Tokio worker thread was ready to run but waiting for the OS to schedule it.

 On non-Linux platforms these fields are zero.

 ### CPU profiling (Linux only)

-With the `cpu-profiling` feature, you can enable `perf_event_open`-based CPU sampling and scheduler event capture. This records stack traces attributed to specific worker threads, so you can see *what code* was running during a scheduling delay.
+With the `cpu-profiling` feature, you can enable `perf_event_open`-based CPU sampling. This gives you two key pieces of data:
+1. Stack traces of code running on the CPU (i.e. flamegraphs).
+2. Stack traces from when the kernel _deschedules_ your thread. For example, if your future calls `std::thread::sleep` or hits `std::sync::Mutex` contention, this lets you see precisely where that happens in your async code.
+
+Both kinds of events are tied to the precise instant and thread on which they occurred, so you can compare what differed between degraded and normal performance.

 ```rust,no_run
 # #[cfg(feature = "cpu-profiling")]
```
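On Linux, a per-thread `schedstat` file holds a single line of three numbers: time spent running on a CPU (ns), time spent waiting on a runqueue (ns), and the number of timeslices run. The second field is the "ready but not scheduled" time the scheduler wait time bullet refers to. The std-only sketch below parses such a line and shows how sampling it twice gives the wait accrued in between; `SchedStat`, `parse_schedstat`, and `read_current_thread` are hypothetical helpers, not part of dial9, and reading `/proc/thread-self` (rather than the `<tid>` path) is a choice made here to avoid needing the thread id.

```rust
use std::fs;

/// The three fields of a Linux per-thread schedstat file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SchedStat {
    /// Time spent running on a CPU, in nanoseconds.
    cpu_time_ns: u64,
    /// Time spent runnable but waiting on a runqueue, in nanoseconds.
    run_delay_ns: u64,
    /// Number of timeslices run on a CPU.
    timeslices: u64,
}

/// Parses a schedstat line such as "2035671 88931 47\n".
/// Returns None if a field is missing or not a number.
fn parse_schedstat(line: &str) -> Option<SchedStat> {
    let mut fields = line.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedStat {
        cpu_time_ns: fields.next()??,
        run_delay_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}

/// Reads the calling thread's schedstat. Returns None off Linux or if the
/// file is unavailable (it may be missing on kernels without scheduler stats).
fn read_current_thread() -> Option<SchedStat> {
    parse_schedstat(&fs::read_to_string("/proc/thread-self/schedstat").ok()?)
}

fn main() {
    // Made-up sample values.
    println!("{:?}", parse_schedstat("2035671 88931 47\n"));
    // prints "Some(SchedStat { cpu_time_ns: 2035671, run_delay_ns: 88931, timeslices: 47 })"

    // Runqueue wait accrued by this thread between two samples.
    if let Some(before) = read_current_thread() {
        if let Some(after) = read_current_thread() {
            let waited = after.run_delay_ns.saturating_sub(before.run_delay_ns);
            println!("waited {waited} ns between samples");
        }
    }
}
```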
```diff
@@ -137,7 +147,7 @@ handle.disable();

 ### Writers

-`RotatingWriter` is what you want for production — it rotates files and evicts old ones to stay within a total size budget. `SimpleBinaryWriter` writes a single file with no size management, useful for quick experiments. `NullWriter` measures hook overhead without doing any I/O.
+`RotatingWriter` rotates files and evicts old ones to stay within a total size budget. `SimpleBinaryWriter` writes a single file with no size management, which is useful for quick experiments.

 ### Analyzing traces

```
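`RotatingWriter` is described as rotating files and evicting old ones to stay within a total size budget. The sketch below illustrates one plausible eviction policy (drop the oldest rotated files until the total fits the budget) on an in-memory list of file sizes; it is an assumption for illustration, not the crate's implementation, and `evict_to_budget` is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Removes the oldest files (front of the queue) until the total size of the
/// remaining files is at most `budget_bytes`. The newest file (back) is never
/// evicted, so the file currently being written survives even if it alone
/// exceeds the budget. Returns the sizes of evicted files, oldest first.
fn evict_to_budget(files: &mut VecDeque<u64>, budget_bytes: u64) -> Vec<u64> {
    let mut total: u64 = files.iter().sum();
    let mut evicted = Vec::new();
    while total > budget_bytes && files.len() > 1 {
        let oldest = files.pop_front().expect("len > 1");
        total -= oldest;
        evicted.push(oldest);
    }
    evicted
}

fn main() {
    // File sizes in bytes, oldest first; the last entry is the active file.
    let mut files: VecDeque<u64> = VecDeque::from(vec![400, 300, 300, 200]);
    let evicted = evict_to_budget(&mut files, 600);
    println!("evicted {:?}, kept {:?}", evicted, files);
    // prints "evicted [400, 300], kept [300, 200]"
}
```

Evicting whole files from the oldest end keeps the most recent history intact, which suits post-hoc analysis of a recent incident.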