
Commit cc4374e

rcoh and jlizen authored
Add dial9-in-prod example (#335)
* Add dial9-in-prod example

* docs: add to readme

* fix: update dial9-viewer build.rs for renamed README section

The previous commit renamed the README heading `## Wake event tracking` to
`## Wake event tracking (opt-in)` (to match the sibling
`## Tracing span events (opt-in)`). `dial9-viewer/build.rs` matches section
headings exactly via `SETUP_SECTIONS` and panics loudly if one is renamed,
which broke clippy, ubuntu-nightly build, docs.rs check, and cargo package
CI jobs. Update the constant to the new heading.

* fix: typos and doc accuracy in production_use example

- Fix env var table: DIAL9_ENABLED only accepts true/false (not 1/yes)
  since parse_env uses bool::from_str
- Fix README typo: 'read to be polled' → 'ready to be polled'
- Fix doc comment typo: 'an even' → 'an event'
- Fix duplicate 'a' across line break in doc comment

* Add a 'getting-useful-data' section

* Refine prod guidance

* refactor(examples): simplify production_use against new Dial9Config::builder API

Drop `try_current` and `parse_or_fallback` now that `build_or_disabled`
returns a pass-through config on writer failures and
`TelemetryHandle::current` is inert when disabled, so application code runs
unchanged whether dial9 is recording or not. Fold the four cfg-gated
apply_* helpers into a single `with_runtime` closure.

---------

Co-authored-by: Jess Izen <jlizen@amazon.com>
1 parent ae993ba commit cc4374e
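
The `DIAL9_ENABLED` table fix described in the commit message depends on how `bool::from_str` treats its input. A minimal sketch using only the standard library shows the behavior; `parse_flag` is a hypothetical helper written for illustration and is not part of dial9 or clap.

```rust
use std::str::FromStr;

/// Hypothetical helper: parse a boolean exactly the way `bool::from_str` does.
/// Only the literal strings "true" and "false" are accepted.
fn parse_flag(raw: &str) -> Result<bool, String> {
    bool::from_str(raw).map_err(|e| format!("invalid boolean {raw:?}: {e}"))
}

fn main() {
    // Try a few values an operator might plausibly set in the environment.
    for raw in ["true", "false", "1", "yes", "TRUE"] {
        match parse_flag(raw) {
            Ok(value) => println!("{raw:?} -> {value}"),
            Err(err) => println!("{raw:?} -> rejected ({err})"),
        }
    }
}
```

Values such as `1`, `yes`, or `TRUE` are rejected, which is why the env var table documents `true`/`false` only.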

3 files changed

Lines changed: 294 additions & 1 deletion


dial9-tokio-telemetry/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ worker-s3 = ["dep:aws-sdk-s3-transfer-manager", "dep:aws-sdk-s3", "dep:aws-confi
5757
dial9-tokio-telemetry = { path = ".", features = ["analysis", "tracing-layer", "worker-s3"] }
5858
assert2 = { workspace = true }
5959
criterion = "0.5"
60-
clap = { version = "4", features = ["derive"] }
60+
clap = { version = "4", features = ["derive", "env"] }
6161
hdrhistogram = "7"
6262
metrique-timesource = { version = "0.1", features = ["custom-timesource", "tokio"] }
6363
metrique-writer = { version = "0.1", features = ["test-util"] }

dial9-tokio-telemetry/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,8 @@ Without this flag, compilation will fail with errors about missing methods on `t
3838

3939
## Setup
4040

41+
If you are integrating dial9 into a production service, see the [`production_use` example](https://github.com/dial9-rs/dial9-tokio-telemetry/blob/main/dial9-tokio-telemetry/examples/production_use.rs).
42+
4143
> **Note:** `#[dial9_tokio_telemetry::main]` is a **replacement** for `#[tokio::main]`, not a complement — do not use both on the same function. The macro builds and configures the Tokio runtime internally.
4244
4345
```rust,no_run
dial9-tokio-telemetry/examples/production_use.rs

Lines changed: 291 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,291 @@
1+
//! This example discusses things to consider when setting up and using dial9 in production.
2+
//!
3+
//! It also provides an opinionated set of knobs and environment variables to configure your application.
4+
//!
5+
//! Many applications can run with dial9 enabled all the time. Some applications will have worse performance, especially applications that perform
6+
//! a very small amount of "useful work" per poll. Applications that use an extremely large number of worker threads will experience higher memory usage.
7+
//! For an average production service, Dial9 produces 50-100GB / day.
8+
//!
9+
//! Since dial9 records an event on each poll (~50ns), this fixed per-poll overhead will impact your application's performance if your polls are very short.
10+
//!
11+
//! ### Enabling and disabling
12+
//!
13+
//! - Setting [`.enabled(false)`](dial9_tokio_telemetry::Dial9ConfigBuilder::enabled) on the config builder produces a
14+
//! pass-through config: the `#[main]` macro builds a plain, unmodified tokio runtime with zero dial9 overhead.
15+
//! [`TelemetryHandle::current()`](dial9_tokio_telemetry::telemetry::TelemetryHandle::current) returns an inert
16+
//! handle, and `handle.spawn` falls through to `tokio::spawn`, so application code does not need branches.
17+
//! - Alternatively, you can install dial9 but leave recording disabled at runtime via the handle's
18+
//! [`disable()`](dial9_tokio_telemetry::telemetry::TelemetryHandle::disable). The runtime hooks are installed
19+
//! but all event writes are no-ops behind a relaxed atomic read. This has slightly more overhead than
20+
//! [`.enabled(false)`](dial9_tokio_telemetry::Dial9ConfigBuilder::enabled) but lets a background task flip
21+
//! recording on from dynamic configuration later. It is a larger surface area of code, so it is higher risk.
22+
//!
23+
//! > Note! dial9 must be created _before_ your async runtime. dial9 relies on installing itself into the runtime
24+
//! > telemetry hooks to produce Tokio events.
25+
//!
26+
//! ### The overhead of running dial9
27+
//!
28+
//! 1. Dial9 allocates a 1MB buffer for each thread that records events. If you are recording events from a huge number of threads, this can bloat memory.
29+
//! 2. When Tokio telemetry is enabled, 2 dial9 events will be emitted by every poll. If your poll times are extremely short and your application is CPU bound then this overhead can be significant.
30+
//!
31+
//! Dial9 has many possible components you can enable. The more components, the more data you will produce and the more overhead your application will have.
32+
//!
33+
//! #### CPU Profiling
34+
//! Dial9 has two types of CPU profiling available:
35+
//! 1. CPU profiling (this is what you would normally consider CPU profiling): Dial9 is sampling stack traces that are running on the CPU.
36+
//! 2. Schedule Profiling: dial9 subscribes to the sched-switch Linux kernel event and can capture a stack trace when your application is descheduled by the kernel. In order
37+
//! to subscribe to these events, dial9 must open one perf fd per worker thread.
38+
//!
39+
//! #### Tokio Telemetry
40+
//! The basic Tokio telemetry will install a few hooks:
41+
//! 1. `on_before_poll` and `on_after_poll` callbacks: These add nanosecond-level overhead to each poll, just from the dynamic dispatch.
42+
//! 2. `on_worker_park/unpark`: When these events happen, dial9 calls a few kernel APIs (~1us) to help understand how Tokio is interacting with the OS.
43+
//!
44+
//! When you use dial9's spawn method, your futures are wrapped in a future that tracks `wake` events. This is done by instrumenting the waker. This is optional but
45+
//! allows you to understand scheduling delay of the tasks you are running.
46+
//!
47+
//! #### Tracing
48+
//! Dial9 can capture Tracing spans via the `TracingLayer`. By tracing standards this is fairly low overhead; however, if you have a large number of deeply nested spans, this can produce a huge amount
49+
//! of data. We recommend using a very fine-grained filter.
50+
//!
51+
//! The rest of this example shows an opinionated way to wire dial9
52+
//! so it can be enabled and tuned with CLI flags or environment variables
53+
//! (via [`clap`]). The same binary can then run in dev, staging, and prod
54+
//! with different tracing behavior, and tooling (CDK, Docker, k8s, etc.)
55+
//! can flip knobs without a rebuild.
56+
//!
57+
//! ### Getting Useful Data
58+
//!
59+
//! To get the most use out of dial9, you need application-specific events in your traces to make sense of your data. The best way to do this is to emit some sort
60+
//! of request id into your dial9 traces. There are a couple of ways to do this:
61+
//!
62+
//! 1. Use the `Dial9TokioLayer`, which is a tracing layer that allows dial9 to capture your tracing spans directly. Typically, you will
63+
//! select a narrow set of top-level spans to track.
64+
//! 2. Emit an event when your request starts and when your request stops. Because these should _normally_ always be on the same Tokio Task, we can
65+
//! correlate post-hoc to figure out exactly which polls belonged to which requests.
66+
//!
67+
//! # Configuration (CLI flags / environment variables)
68+
//!
69+
//! Every option can be set as a `--flag` or via its environment variable.
70+
//! Run with `--help` for full usage.
71+
//!
72+
//! | Name | Default | Meaning |
73+
//! | --------------------------------- | ------------------------------- | ------------------------------------------------------------- |
74+
//! | `DIAL9_ENABLED` | `false` | Master switch. `true`/`false` only (Rust `bool::from_str`). |
75+
//! | `DIAL9_TRACE_DIR` | `/tmp/dial9-traces` | Directory to write rotated trace segments into. |
76+
//! | `DIAL9_ROTATION_SECS` | `60` | Wall-clock rotation period in seconds. |
77+
//! | `DIAL9_MAX_DISK_USAGE_MB` | `1024` | Upper bound on total on-disk bytes (old files evicted). |
78+
//! | `DIAL9_S3_BUCKET` | unset / empty | When set, sealed segments are gzip-uploaded to this bucket. |
79+
//! | `DIAL9_SERVICE_NAME` | unset | Service name used in the S3 key layout (required with S3). |
80+
//! | `DIAL9_CPU_PROFILE_ENABLED` | `true` on Linux, `false` else | Enables `perf_event_open`-based CPU sampling. |
81+
//! | `DIAL9_CPU_SAMPLE_HZ` | `99` | Sampling frequency for CPU profiling. |
82+
//! | `DIAL9_SCHEDULE_PROFILE_ENABLED` | `true` on Linux, `false` else | Enables per-worker scheduler event capture (context switches).|
83+
//!
84+
//! # Invalid configuration
85+
//!
86+
//! Invalid operator input for the knobs above (unknown boolean, non-numeric
87+
//! duration, etc.) is caught by `clap` and exits with a diagnostic, as it
88+
//! would for any misconfigured CLI tool. Invalid _dial9_ configuration
89+
//! (writer I/O failure, unwritable trace directory, etc.) is handled
90+
//! lazily by [`dial9_tokio_telemetry::Dial9ConfigBuilder::build_or_disabled`]: the builder logs
91+
//! the error and falls back to a plain tokio runtime with telemetry
92+
//! disabled. A bad trace config must never take down prod.
93+
//!
94+
//! # Running the example
95+
//!
96+
//! ```sh
97+
//! # plain run: telemetry disabled
98+
//! cargo run --example production_use
99+
//!
100+
//! # basic local tracing (env var)
101+
//! DIAL9_ENABLED=true cargo run --example production_use
102+
//!
103+
//! # basic local tracing (CLI flag)
104+
//! cargo run --example production_use -- --enabled
105+
//!
106+
//! # with CPU profiling + schedule events (Linux, requires feature flag)
107+
//! DIAL9_ENABLED=true \
108+
//! cargo run --features cpu-profiling --example production_use
109+
//!
110+
//! # with S3 upload (requires feature flag and AWS creds in env)
111+
//! cargo run --features worker-s3 --example production_use -- \
112+
//! --enabled --s3-bucket my-trace-bucket --service-name my-service
113+
//! ```
114+
115+
use std::time::Duration;
116+
117+
use clap::Parser;
118+
use dial9_tokio_telemetry::Dial9Config;
119+
use dial9_tokio_telemetry::telemetry::TelemetryHandle;
120+
121+
const LINUX: bool = cfg!(target_os = "linux");
122+
123+
/// Opinionated configuration for a production dial9 deployment.
124+
///
125+
/// All fields can be set via environment variables (shown in help) or CLI flags.
126+
#[derive(Debug, Parser)]
127+
struct Dial9Opts {
128+
/// Master switch for dial9 telemetry.
129+
#[arg(long, env = "DIAL9_ENABLED", default_value_t = false)]
130+
enabled: bool,
131+
132+
/// Directory to write rotated trace segments into.
133+
#[arg(long, env = "DIAL9_TRACE_DIR", default_value = "/tmp/dial9-traces")]
134+
trace_dir: String,
135+
136+
/// Wall-clock rotation period in seconds.
137+
#[arg(long, env = "DIAL9_ROTATION_SECS", default_value_t = 60)]
138+
rotation_secs: u64,
139+
140+
/// Upper bound on total on-disk usage in MiB (old files evicted).
141+
#[arg(long, env = "DIAL9_MAX_DISK_USAGE_MB", default_value_t = 1024)]
142+
max_disk_usage_mb: u64,
143+
144+
/// S3 bucket for uploading sealed segments (requires --service-name).
145+
#[arg(long, env = "DIAL9_S3_BUCKET", requires = "service_name")]
146+
s3_bucket: Option<String>,
147+
148+
/// Service name used in the S3 key layout.
149+
#[arg(long, env = "DIAL9_SERVICE_NAME")]
150+
service_name: Option<String>,
151+
152+
/// Enable perf_event_open-based CPU sampling (Linux only).
153+
#[arg(long, env = "DIAL9_CPU_PROFILE_ENABLED", default_value_t = LINUX)]
154+
cpu_profile_enabled: bool,
155+
156+
/// Sampling frequency for CPU profiling.
157+
#[arg(long, env = "DIAL9_CPU_SAMPLE_HZ", default_value_t = 99)]
158+
cpu_sample_hz: u64,
159+
160+
/// Enable per-worker scheduler event capture (Linux only).
161+
#[arg(long, env = "DIAL9_SCHEDULE_PROFILE_ENABLED", default_value_t = LINUX)]
162+
schedule_profile_enabled: bool,
163+
}
164+
165+
impl Dial9Opts {
166+
fn rotation(&self) -> Duration {
167+
Duration::from_secs(self.rotation_secs)
168+
}
169+
170+
fn max_disk_usage_bytes(&self) -> u64 {
171+
self.max_disk_usage_mb.saturating_mul(1024 * 1024)
172+
}
173+
}
174+
175+
/// Translate parsed options into a [`Dial9Config`] the `#[main]` macro can consume.
176+
///
177+
/// `build_or_disabled()` is the important piece here: any writer I/O or
178+
/// validation failure (unwritable `trace_dir`, zero-sized budget, etc.)
179+
/// logs an error and returns a pass-through config that builds a plain
180+
/// tokio runtime. In both the disabled and fallback cases,
181+
/// `TelemetryHandle::current()` returns an inert handle and
182+
/// `handle.spawn` delegates to `tokio::spawn`, so application code does
183+
/// not need to branch on whether dial9 is running.
184+
fn configure_dial9(opts: &Dial9Opts) -> Dial9Config {
185+
if opts.enabled
186+
&& let Err(e) = std::fs::create_dir_all(&opts.trace_dir)
187+
{
188+
eprintln!("warning: could not create {}: {e}", opts.trace_dir);
189+
}
190+
191+
let base_path = format!("{}/trace.bin", opts.trace_dir.trim_end_matches('/'));
192+
let max_disk = opts.max_disk_usage_bytes();
193+
let max_file_size = (max_disk / 4).max(16 * 1024 * 1024);
194+
195+
if opts.enabled {
196+
warn_if_feature_missing(opts);
197+
}
198+
199+
#[cfg(feature = "cpu-profiling")]
200+
let (cpu_enabled, sched_enabled, cpu_hz) = (
201+
opts.cpu_profile_enabled,
202+
opts.schedule_profile_enabled,
203+
opts.cpu_sample_hz,
204+
);
205+
#[cfg(feature = "worker-s3")]
206+
let (s3_bucket, s3_service) = (opts.s3_bucket.clone(), opts.service_name.clone());
207+
208+
Dial9Config::builder()
209+
.enabled(opts.enabled)
210+
.base_path(base_path)
211+
.max_file_size(max_file_size)
212+
.max_total_size(max_disk)
213+
.rotation_period(opts.rotation())
214+
.with_runtime(move |mut r| {
215+
r = r.with_task_tracking(true);
216+
#[cfg(feature = "cpu-profiling")]
217+
{
218+
use dial9_tokio_telemetry::telemetry::cpu_profile::{
219+
CpuProfilingConfig, SchedEventConfig,
220+
};
221+
if cpu_enabled {
222+
r = r.with_cpu_profiling(CpuProfilingConfig::default().frequency_hz(cpu_hz));
223+
}
224+
if sched_enabled {
225+
r = r.with_sched_events(SchedEventConfig::default());
226+
}
227+
}
228+
#[cfg(feature = "worker-s3")]
229+
{
230+
use dial9_tokio_telemetry::background_task::s3::S3Config;
231+
if let (Some(bucket), Some(service_name)) = (&s3_bucket, &s3_service) {
232+
let s3 = S3Config::builder()
233+
.bucket(bucket.clone())
234+
.service_name(service_name.clone())
235+
.build();
236+
r = r.with_s3_uploader(s3);
237+
}
238+
}
239+
r
240+
})
241+
.build_or_disabled()
242+
}
243+
244+
/// Complain at startup when the operator asked for something a feature flag
245+
/// disables, so silent misconfiguration doesn't go unnoticed.
246+
fn warn_if_feature_missing(opts: &Dial9Opts) {
247+
if cfg!(not(feature = "cpu-profiling"))
248+
&& (opts.cpu_profile_enabled || opts.schedule_profile_enabled)
249+
{
250+
eprintln!(
251+
"warning: cpu/schedule profiling requested but --features cpu-profiling not enabled; ignoring"
252+
);
253+
}
254+
if cfg!(not(feature = "worker-s3")) && opts.s3_bucket.is_some() {
255+
eprintln!(
256+
"warning: DIAL9_S3_BUCKET set but --features worker-s3 not enabled; traces only on local disk"
257+
);
258+
}
259+
}
260+
261+
fn my_config() -> Dial9Config {
262+
let opts = Dial9Opts::parse();
263+
eprintln!(
264+
"dial9 telemetry: {}",
265+
if opts.enabled {
266+
format!("enabled, writing to {}", opts.trace_dir)
267+
} else {
268+
"disabled (set DIAL9_ENABLED=true or --enabled to enable)".into()
269+
}
270+
);
271+
configure_dial9(&opts)
272+
}
273+
274+
async fn workload_task(id: usize) {
275+
for i in 0..5 {
276+
tokio::time::sleep(Duration::from_millis(5)).await;
277+
if i == 0 && id.is_multiple_of(25) {
278+
println!("task {id} working");
279+
}
280+
}
281+
}
282+
283+
#[dial9_tokio_telemetry::main(config = my_config)]
284+
async fn main() {
285+
let handle = TelemetryHandle::current();
286+
let tasks: Vec<_> = (0..100).map(|i| handle.spawn(workload_task(i))).collect();
287+
for task in tasks {
288+
let _ = task.await;
289+
}
290+
println!("workload finished");
291+
}
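
The size budget in `configure_dial9` can be checked in isolation. The sketch below copies the arithmetic from the example (`saturating_mul` for the MiB conversion, then a quarter of the budget per file with a 16 MiB floor). `size_budget` and `MIB` are hypothetical names, and how dial9 itself treats a per-file cap larger than the total budget is not shown in this commit.

```rust
const MIB: u64 = 1024 * 1024;

/// Hypothetical helper mirroring the arithmetic in `configure_dial9`.
/// Returns `(max_file_size, max_total_size)` in bytes.
fn size_budget(max_disk_usage_mb: u64) -> (u64, u64) {
    // Same conversion as `Dial9Opts::max_disk_usage_bytes`: saturate
    // instead of overflowing on absurdly large inputs.
    let max_total = max_disk_usage_mb.saturating_mul(MIB);
    // A quarter of the budget per file, but never below 16 MiB.
    let max_file = (max_total / 4).max(16 * MIB);
    (max_file, max_total)
}

fn main() {
    for mb in [1024u64, 32, 8] {
        let (max_file, max_total) = size_budget(mb);
        println!(
            "budget {mb} MiB: max_file_size = {} MiB, max_total_size = {} MiB",
            max_file / MIB,
            max_total / MIB
        );
    }
}
```

With the default budget of 1024 the per-file cap works out to 256 MiB. For budgets below 64 the 16 MiB floor takes over, and below 16 the per-file cap is larger than the total budget, which may be worth guarding against when exposing this knob to operators.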

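
The doc comment describes the runtime `disable()` path as event writes that become no-ops behind a relaxed atomic read. A rough sketch of that pattern follows, under the assumption that a single `AtomicBool` gates recording. `Recorder` and its methods are hypothetical and do not reflect dial9's actual internals.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical recorder: the hook stays installed, but writes are
/// skipped while the flag is off.
struct Recorder {
    enabled: AtomicBool,
    events_written: AtomicU64,
}

impl Recorder {
    fn new(enabled: bool) -> Self {
        Self {
            enabled: AtomicBool::new(enabled),
            events_written: AtomicU64::new(0),
        }
    }

    /// Flip recording on or off, e.g. from a background config watcher.
    fn set_enabled(&self, on: bool) {
        self.enabled.store(on, Ordering::Relaxed);
    }

    /// Called from the poll hook. When disabled, the only cost is the
    /// relaxed load and the branch.
    fn record_poll(&self) {
        if !self.enabled.load(Ordering::Relaxed) {
            return;
        }
        // Stand-in for writing an event into a per-thread buffer.
        self.events_written.fetch_add(1, Ordering::Relaxed);
    }

    fn events(&self) -> u64 {
        self.events_written.load(Ordering::Relaxed)
    }
}

fn main() {
    let recorder = Recorder::new(false);
    recorder.record_poll(); // dropped: recording is off
    recorder.set_enabled(true);
    recorder.record_poll();
    recorder.record_poll();
    println!("events written: {}", recorder.events());
}
```

This is the trade-off the doc comment points at: the hook and the flag check remain on the hot path even when nothing is recorded, whereas `.enabled(false)` on the builder avoids installing the hooks at all.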