|
| 1 | +//! This example discusses things to consider when using and setting up dial9 in production |
| 2 | +//! |
| 3 | +//! It also provides an opinionated set of knobs and environment variables to configure your application. |
| 4 | +//! |
| 5 | +//! Many applications can run with dial9 enabled all the time. Some applications will have worse performance, especially applications that perform |
| 6 | +//! a very small amount of "useful work" per poll. Applications that use an extremely large number of worker threads will experience higher memory usage. |
| 7 | +//! For an average production service, dial9 produces 50-100 GB of trace data per day.
| 8 | +//! |
| 9 | +//! Since dial9 records an event on each poll (~50ns), very short polls make this fixed per-poll overhead a noticeable share of your application's run time.
| 10 | +//! |
| 11 | +//! ### Enabling and disabling |
| 12 | +//! |
| 13 | +//! - Setting [`.enabled(false)`](dial9_tokio_telemetry::Dial9ConfigBuilder::enabled) on the config builder produces a |
| 14 | +//! pass-through config: the `#[main]` macro builds a plain, unmodified tokio runtime with zero dial9 overhead. |
| 15 | +//! [`TelemetryHandle::current()`](dial9_tokio_telemetry::telemetry::TelemetryHandle::current) returns an inert |
| 16 | +//! handle, and `handle.spawn` falls through to `tokio::spawn`, so application code does not need branches. |
| 17 | +//! - Alternatively, you can install dial9 but leave recording disabled at runtime via the handle's |
| 18 | +//! [`disable()`](dial9_tokio_telemetry::telemetry::TelemetryHandle::disable). The runtime hooks are installed |
| 19 | +//! but all event writes are no-ops behind a relaxed atomic read. This has slightly more overhead than |
| 20 | +//! [`.enabled(false)`](dial9_tokio_telemetry::Dial9ConfigBuilder::enabled) but lets a background task flip |
| 21 | +//! recording on from dynamic configuration later. It is a larger surface area of code, so it is higher risk. |
| 22 | +//! |
| 23 | +//! > Note! dial9 must be created _before_ your async runtime. dial9 relies on installing itself into the runtime |
| 24 | +//! > telemetry hooks to produce Tokio events. |
| 25 | +//! |
| 26 | +//! ### The overhead of running dial9 |
| 27 | +//! |
| 28 | +//! 1. Dial9 allocates a 1MB buffer for each thread that records events. If you are recording events from a huge number of threads, this can bloat memory. |
| 29 | +//! 2. When Tokio telemetry is enabled, dial9 emits two events for every poll. If your poll times are extremely short and your application is CPU-bound, this overhead can be significant.
| 30 | +//! |
| 31 | +//! Dial9 has many possible components you can enable. The more components, the more data you will produce and the more overhead your application will have. |
| 32 | +//! |
| 33 | +//! #### CPU Profiling |
| 34 | +//! Dial9 has two types of CPU profiling available: |
| 35 | +//! 1. CPU profiling (what you would normally consider CPU profiling): dial9 samples the stack traces of code that is running on the CPU.
| 36 | +//! 2. Schedule profiling: dial9 subscribes to the `sched_switch` Linux kernel event and can capture a stack trace when your application is descheduled by the kernel. In order
| 37 | +//! to subscribe to these events, dial9 must open one perf fd per worker thread.
| 38 | +//! |
| 39 | +//! #### Tokio Telemetry |
| 40 | +//! The basic Tokio telemetry will install a few hooks: |
| 41 | +//! 1. `on_before_poll` and `on_after_poll` callbacks: These add nanosecond-level overhead to each poll, just from the dynamic dispatch.
| 42 | +//! 2. `on_worker_park/unpark`: When these events happen, dial9 reads a few kernel APIs (~1us) to help understand how Tokio is interacting with the OS.
| 43 | +//! |
| 44 | +//! When you use dial9's spawn method, your futures are wrapped in a future that instruments the waker to track `wake` events. This is optional, but
| 45 | +//! it lets you see the scheduling delay of the tasks you are running.
| 46 | +//! |
| 47 | +//! #### Tracing |
| 48 | +//! Dial9 can capture Tracing spans via the `TracingLayer`. By tracing standards this is fairly low overhead. However, a large number of deeply nested spans can produce a huge amount
| 49 | +//! of data. We recommend using a very fine-grained filter.
| 50 | +//! |
| 51 | +//! The rest of this example shows an opinionated way to wire dial9 |
| 52 | +//! so it can be enabled and tuned with CLI flags or environment variables |
| 53 | +//! (via [`clap`]). The same binary can then run in dev, staging, and prod |
| 54 | +//! with different tracing behavior, and tooling (CDK, Docker, k8s, etc.) |
| 55 | +//! can flip knobs without a rebuild. |
| 56 | +//! |
| 57 | +//! ### Getting Useful Data |
| 58 | +//! |
| 59 | +//! To get the most use out of dial9, you need application-specific events in your traces to make sense of your data. The best way to do this is to emit some sort |
| 60 | +//! of request id into your dial9 traces. There are a couple of ways to do this: |
| 61 | +//! |
| 62 | +//! 1. Use the `Dial9TokioLayer`, which is a tracing layer that allows dial9 to capture your tracing spans directly. Typically, you will |
| 63 | +//! select a narrow set of top-level spans to track. |
| 64 | +//! 2. Emit an event when your request starts and when your request stops. Because these should _normally_ be on the same Tokio task, we can
| 65 | +//! correlate post-hoc to figure out exactly which polls belonged to which requests.
| 66 | +//! |
| 67 | +//! # Configuration (CLI flags / environment variables) |
| 68 | +//! |
| 69 | +//! Every option can be set as a `--flag` or via its environment variable. |
| 70 | +//! Run with `--help` for full usage. |
| 71 | +//! |
| 72 | +//! | Name | Default | Meaning | |
| 73 | +//! | --------------------------------- | ------------------------------- | ------------------------------------------------------------- | |
| 74 | +//! | `DIAL9_ENABLED` | `false` | Master switch. `true`/`false` only (Rust `bool::from_str`). | |
| 75 | +//! | `DIAL9_TRACE_DIR` | `/tmp/dial9-traces` | Directory to write rotated trace segments into. | |
| 76 | +//! | `DIAL9_ROTATION_SECS` | `60` | Wall-clock rotation period in seconds. | |
| 77 | +//! | `DIAL9_MAX_DISK_USAGE_MB` | `1024` | Upper bound on total on-disk usage in MiB (old files evicted). |
| 78 | +//! | `DIAL9_S3_BUCKET` | unset / empty | When set, sealed segments are gzip-uploaded to this bucket. | |
| 79 | +//! | `DIAL9_SERVICE_NAME` | unset | Service name used in the S3 key layout (required with S3). |
| 80 | +//! | `DIAL9_CPU_PROFILE_ENABLED` | `true` on Linux, `false` else | Enables `perf_event_open`-based CPU sampling. | |
| 81 | +//! | `DIAL9_CPU_SAMPLE_HZ` | `99` | Sampling frequency for CPU profiling. | |
| 82 | +//! | `DIAL9_SCHEDULE_PROFILE_ENABLED` | `true` on Linux, `false` else | Enables per-worker scheduler event capture (context switches). |
| 83 | +//! |
| 84 | +//! # Invalid configuration |
| 85 | +//! |
| 86 | +//! Invalid operator input for the knobs above (unknown boolean, non-numeric |
| 87 | +//! duration, etc.) is caught by `clap` and exits with a diagnostic, as it |
| 88 | +//! would for any misconfigured CLI tool. Invalid _dial9_ configuration |
| 89 | +//! (writer I/O failure, unwritable trace directory, etc.) is handled |
| 90 | +//! lazily by [`dial9_tokio_telemetry::Dial9ConfigBuilder::build_or_disabled`]: the builder logs |
| 91 | +//! the error and falls back to a plain tokio runtime with telemetry |
| 92 | +//! disabled. A bad trace config must never take down prod. |
| 93 | +//! |
| 94 | +//! # Running the example |
| 95 | +//! |
| 96 | +//! ```sh |
| 97 | +//! # plain run: telemetry disabled |
| 98 | +//! cargo run --example production_use |
| 99 | +//! |
| 100 | +//! # basic local tracing (env var) |
| 101 | +//! DIAL9_ENABLED=true cargo run --example production_use |
| 102 | +//! |
| 103 | +//! # basic local tracing (CLI flag) |
| 104 | +//! cargo run --example production_use -- --enabled |
| 105 | +//! |
| 106 | +//! # with CPU profiling + schedule events (Linux, requires feature flag) |
| 107 | +//! DIAL9_ENABLED=true \ |
| 108 | +//! cargo run --features cpu-profiling --example production_use |
| 109 | +//! |
| 110 | +//! # with S3 upload (requires feature flag and AWS creds in env) |
| 111 | +//! cargo run --features worker-s3 --example production_use -- \ |
| 112 | +//! --enabled --s3-bucket my-trace-bucket --service-name my-service |
| 113 | +//! ``` |
| 114 | +
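
The "Getting Useful Data" section says that request start and stop events can be correlated post-hoc with polls on the same Tokio task. The following is a minimal sketch of that idea only. The `Poll` and `RequestSpan` types and the `polls_for_request` function are hypothetical and are not part of dial9; they assume each record carries a task id and nanosecond timestamps.

```rust
// Hypothetical record types for illustration; dial9's real trace format may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Poll {
    task_id: u64,
    start_ns: u64,
    end_ns: u64,
}

#[derive(Debug, Clone, Copy)]
struct RequestSpan {
    request_id: u64,
    task_id: u64,
    start_ns: u64,
    end_ns: u64,
}

/// Return the polls that ran on the request's task and overlap the
/// interval between its start and stop events.
fn polls_for_request(span: &RequestSpan, polls: &[Poll]) -> Vec<Poll> {
    polls
        .iter()
        .copied()
        .filter(|p| {
            p.task_id == span.task_id
                && p.end_ns >= span.start_ns
                && p.start_ns <= span.end_ns
        })
        .collect()
}

fn main() {
    let polls = [
        Poll { task_id: 1, start_ns: 50, end_ns: 120 },
        Poll { task_id: 2, start_ns: 150, end_ns: 160 },
        Poll { task_id: 1, start_ns: 200, end_ns: 250 },
        Poll { task_id: 1, start_ns: 400, end_ns: 450 },
    ];
    let span = RequestSpan { request_id: 7, task_id: 1, start_ns: 100, end_ns: 300 };
    let matched = polls_for_request(&span, &polls);
    println!("request {} matched {} polls", span.request_id, matched.len());
}
```

The overlap test (rather than strict containment) is deliberate: the start and stop events are emitted from inside a poll, so the first and last polls of a request usually straddle the span boundaries.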
|
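
The module docs quote a few numbers (about 50ns per recorded event, two events per poll, 50-100 GB per day, a default 1024 MiB disk budget). A rough back-of-the-envelope sketch of what those imply might look like the following. The function names are hypothetical, and the sketch assumes 1 GB = 10^9 bytes, a constant write rate, and that the quoted daily volume is what lands on local disk; none of that is confirmed by the source.

```rust
/// Fraction of total per-poll time spent recording events, assuming each
/// poll pays `events_per_poll * event_ns` of fixed overhead.
fn poll_overhead_fraction(poll_ns: f64, event_ns: f64, events_per_poll: u32) -> f64 {
    let overhead = event_ns * f64::from(events_per_poll);
    overhead / (poll_ns + overhead)
}

/// Seconds of trace data a disk budget can hold at a constant daily volume.
fn retention_secs(budget_bytes: f64, bytes_per_day: f64) -> f64 {
    let bytes_per_sec = bytes_per_day / 86_400.0;
    budget_bytes / bytes_per_sec
}

fn main() {
    // Assumption: ~50ns per event and two events per poll, as quoted in the docs.
    for poll_ns in [200.0, 1_000.0, 100_000.0] {
        let frac = poll_overhead_fraction(poll_ns, 50.0, 2);
        println!("poll of {poll_ns} ns: {:.2}% overhead", frac * 100.0);
    }

    // Default DIAL9_MAX_DISK_USAGE_MB = 1024 (MiB).
    let budget = 1024.0 * 1024.0 * 1024.0;
    for gb_per_day in [50.0, 100.0] {
        let secs = retention_secs(budget, gb_per_day * 1e9);
        println!("{gb_per_day} GB/day: about {:.0} minutes retained", secs / 60.0);
    }
}
```

Under these assumptions the default budget holds roughly 15 to 31 minutes of traces, which is worth keeping in mind when choosing `DIAL9_MAX_DISK_USAGE_MB` for a service without S3 upload.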
| 115 | +use std::time::Duration; |
| 116 | + |
| 117 | +use clap::Parser; |
| 118 | +use dial9_tokio_telemetry::Dial9Config; |
| 119 | +use dial9_tokio_telemetry::telemetry::TelemetryHandle; |
| 120 | + |
| 121 | +const LINUX: bool = cfg!(target_os = "linux"); |
| 122 | + |
| 123 | +/// Opinionated configuration for a production dial9 deployment. |
| 124 | +/// |
| 125 | +/// All fields can be set via environment variables (shown in help) or CLI flags. |
| 126 | +#[derive(Debug, Parser)] |
| 127 | +struct Dial9Opts { |
| 128 | + /// Master switch for dial9 telemetry. |
| 129 | + #[arg(long, env = "DIAL9_ENABLED", default_value_t = false)] |
| 130 | + enabled: bool, |
| 131 | + |
| 132 | + /// Directory to write rotated trace segments into. |
| 133 | + #[arg(long, env = "DIAL9_TRACE_DIR", default_value = "/tmp/dial9-traces")] |
| 134 | + trace_dir: String, |
| 135 | + |
| 136 | + /// Wall-clock rotation period in seconds. |
| 137 | + #[arg(long, env = "DIAL9_ROTATION_SECS", default_value_t = 60)] |
| 138 | + rotation_secs: u64, |
| 139 | + |
| 140 | + /// Upper bound on total on-disk usage in MiB (old files evicted). |
| 141 | + #[arg(long, env = "DIAL9_MAX_DISK_USAGE_MB", default_value_t = 1024)] |
| 142 | + max_disk_usage_mb: u64, |
| 143 | + |
| 144 | + /// S3 bucket for uploading sealed segments (requires --service-name). |
| 145 | + #[arg(long, env = "DIAL9_S3_BUCKET", requires = "service_name")] |
| 146 | + s3_bucket: Option<String>, |
| 147 | + |
| 148 | + /// Service name used in the S3 key layout. |
| 149 | + #[arg(long, env = "DIAL9_SERVICE_NAME")] |
| 150 | + service_name: Option<String>, |
| 151 | + |
| 152 | + /// Enable perf_event_open-based CPU sampling (Linux only). |
| 153 | + #[arg(long, env = "DIAL9_CPU_PROFILE_ENABLED", default_value_t = LINUX)] |
| 154 | + cpu_profile_enabled: bool, |
| 155 | + |
| 156 | + /// Sampling frequency for CPU profiling. |
| 157 | + #[arg(long, env = "DIAL9_CPU_SAMPLE_HZ", default_value_t = 99)] |
| 158 | + cpu_sample_hz: u64, |
| 159 | + |
| 160 | + /// Enable per-worker scheduler event capture (Linux only). |
| 161 | + #[arg(long, env = "DIAL9_SCHEDULE_PROFILE_ENABLED", default_value_t = LINUX)] |
| 162 | + schedule_profile_enabled: bool, |
| 163 | +} |
| 164 | + |
| 165 | +impl Dial9Opts { |
| 166 | + fn rotation(&self) -> Duration { |
| 167 | + Duration::from_secs(self.rotation_secs) |
| 168 | + } |
| 169 | + |
| 170 | + fn max_disk_usage_bytes(&self) -> u64 { |
| 171 | + self.max_disk_usage_mb.saturating_mul(1024 * 1024) |
| 172 | + } |
| 173 | +} |
| 174 | + |
| 175 | +/// Translate parsed options into a [`Dial9Config`] the `#[main]` macro can consume. |
| 176 | +/// |
| 177 | +/// `build_or_disabled()` is the important piece here: any writer I/O or |
| 178 | +/// validation failure (unwritable `trace_dir`, zero-sized budget, etc.) |
| 179 | +/// logs an error and returns a pass-through config that builds a plain |
| 180 | +/// tokio runtime. In both the disabled and fallback cases, |
| 181 | +/// `TelemetryHandle::current()` returns an inert handle and |
| 182 | +/// `handle.spawn` delegates to `tokio::spawn`, so application code does |
| 183 | +/// not need to branch on whether dial9 is running. |
| 184 | +fn configure_dial9(opts: &Dial9Opts) -> Dial9Config { |
| 185 | + if opts.enabled |
| 186 | + && let Err(e) = std::fs::create_dir_all(&opts.trace_dir) |
| 187 | + { |
| 188 | + eprintln!("warning: could not create {}: {e}", opts.trace_dir); |
| 189 | + } |
| 190 | + |
| 191 | + let base_path = format!("{}/trace.bin", opts.trace_dir.trim_end_matches('/')); |
| 192 | + let max_disk = opts.max_disk_usage_bytes(); |
| 193 | + let max_file_size = (max_disk / 4).max(16 * 1024 * 1024); |
| 194 | + |
| 195 | + if opts.enabled { |
| 196 | + warn_if_feature_missing(opts); |
| 197 | + } |
| 198 | + |
| 199 | + #[cfg(feature = "cpu-profiling")] |
| 200 | + let (cpu_enabled, sched_enabled, cpu_hz) = ( |
| 201 | + opts.cpu_profile_enabled, |
| 202 | + opts.schedule_profile_enabled, |
| 203 | + opts.cpu_sample_hz, |
| 204 | + ); |
| 205 | + #[cfg(feature = "worker-s3")] |
| 206 | + let (s3_bucket, s3_service) = (opts.s3_bucket.clone(), opts.service_name.clone()); |
| 207 | + |
| 208 | + Dial9Config::builder() |
| 209 | + .enabled(opts.enabled) |
| 210 | + .base_path(base_path) |
| 211 | + .max_file_size(max_file_size) |
| 212 | + .max_total_size(max_disk) |
| 213 | + .rotation_period(opts.rotation()) |
| 214 | + .with_runtime(move |mut r| { |
| 215 | + r = r.with_task_tracking(true); |
| 216 | + #[cfg(feature = "cpu-profiling")] |
| 217 | + { |
| 218 | + use dial9_tokio_telemetry::telemetry::cpu_profile::{ |
| 219 | + CpuProfilingConfig, SchedEventConfig, |
| 220 | + }; |
| 221 | + if cpu_enabled { |
| 222 | + r = r.with_cpu_profiling(CpuProfilingConfig::default().frequency_hz(cpu_hz)); |
| 223 | + } |
| 224 | + if sched_enabled { |
| 225 | + r = r.with_sched_events(SchedEventConfig::default()); |
| 226 | + } |
| 227 | + } |
| 228 | + #[cfg(feature = "worker-s3")] |
| 229 | + { |
| 230 | + use dial9_tokio_telemetry::background_task::s3::S3Config; |
| 231 | + if let (Some(bucket), Some(service_name)) = (&s3_bucket, &s3_service) { |
| 232 | + let s3 = S3Config::builder() |
| 233 | + .bucket(bucket.clone()) |
| 234 | + .service_name(service_name.clone()) |
| 235 | + .build(); |
| 236 | + r = r.with_s3_uploader(s3); |
| 237 | + } |
| 238 | + } |
| 239 | + r |
| 240 | + }) |
| 241 | + .build_or_disabled() |
| 242 | +} |
| 243 | + |
| 244 | +/// Complain at startup when the operator asked for something a feature flag |
| 245 | +/// disables, so silent misconfiguration doesn't go unnoticed. |
| 246 | +fn warn_if_feature_missing(opts: &Dial9Opts) { |
| 247 | + if cfg!(not(feature = "cpu-profiling")) |
| 248 | + && (opts.cpu_profile_enabled || opts.schedule_profile_enabled) |
| 249 | + { |
| 250 | + eprintln!( |
| 251 | + "warning: cpu/schedule profiling requested but --features cpu-profiling not enabled; ignoring" |
| 252 | + ); |
| 253 | + } |
| 254 | + if cfg!(not(feature = "worker-s3")) && opts.s3_bucket.is_some() { |
| 255 | + eprintln!( |
| 256 | + "warning: DIAL9_S3_BUCKET set but --features worker-s3 not enabled; traces only on local disk" |
| 257 | + ); |
| 258 | + } |
| 259 | +} |
| 260 | + |
| 261 | +fn my_config() -> Dial9Config { |
| 262 | + let opts = Dial9Opts::parse(); |
| 263 | + eprintln!( |
| 264 | + "dial9 telemetry: {}", |
| 265 | + if opts.enabled { |
| 266 | + format!("enabled, writing to {}", opts.trace_dir) |
| 267 | + } else { |
| 268 | + "disabled (set DIAL9_ENABLED=true or --enabled to enable)".into() |
| 269 | + } |
| 270 | + ); |
| 271 | + configure_dial9(&opts) |
| 272 | +} |
| 273 | + |
| 274 | +async fn workload_task(id: usize) { |
| 275 | + for i in 0..5 { |
| 276 | + tokio::time::sleep(Duration::from_millis(5)).await; |
| 277 | + if i == 0 && id.is_multiple_of(25) { |
| 278 | + println!("task {id} working"); |
| 279 | + } |
| 280 | + } |
| 281 | +} |
| 282 | + |
| 283 | +#[dial9_tokio_telemetry::main(config = my_config)] |
| 284 | +async fn main() { |
| 285 | + let handle = TelemetryHandle::current(); |
| 286 | + let tasks: Vec<_> = (0..100).map(|i| handle.spawn(workload_task(i))).collect(); |
| 287 | + for task in tasks { |
| 288 | + let _ = task.await; |
| 289 | + } |
| 290 | + println!("workload finished"); |
| 291 | +} |
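
`configure_dial9` derives the per-file size limit as a quarter of the disk budget with a 16 MiB floor. The sketch below reproduces that arithmetic in isolation to show one edge case: for budgets under 16 MiB the floor makes the per-file limit larger than the total budget. `clamped_max_file_size` is a hypothetical alternative, not something the example or dial9 provides, and the sketch makes no assumption about how dial9 treats such a pair of limits.

```rust
const MIB: u64 = 1024 * 1024;

/// Same arithmetic as the example: a quarter of the budget, at least 16 MiB.
fn max_file_size(max_disk_bytes: u64) -> u64 {
    (max_disk_bytes / 4).max(16 * MIB)
}

/// Hypothetical variant that never exceeds the total budget.
fn clamped_max_file_size(max_disk_bytes: u64) -> u64 {
    max_file_size(max_disk_bytes).min(max_disk_bytes)
}

fn main() {
    for budget_mib in [8u64, 64, 1024] {
        let budget = budget_mib.saturating_mul(MIB);
        println!(
            "budget {} MiB: file limit {} MiB, clamped {} MiB",
            budget_mib,
            max_file_size(budget) / MIB,
            clamped_max_file_size(budget) / MIB,
        );
    }
}
```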