Skip to content

Add dial9-in-prod example#335

Merged
jlizen merged 8 commits into
mainfrom
prod-guidance
May 2, 2026
Merged

Add dial9-in-prod example#335
jlizen merged 8 commits into
mainfrom
prod-guidance

Conversation

@rcoh
Copy link
Copy Markdown
Contributor

@rcoh rcoh commented Apr 30, 2026

I think we might hoist this into the readme at some point? for now, its linked in the readme

fixes #310

rcoh added 4 commits April 30, 2026 19:52
The previous commit renamed the README heading `## Wake event tracking`
to `## Wake event tracking (opt-in)` (to match the sibling
`## Tracing span events (opt-in)`). `dial9-viewer/build.rs` matches
section headings exactly via `SETUP_SECTIONS` and panics loudly if one
is renamed, which broke clippy, ubuntu-nightly build, docs.rs check,
and cargo package CI jobs. Update the constant to the new heading.
- Fix env var table: DIAL9_ENABLED only accepts true/false (not 1/yes)
  since parse_env uses bool::from_str
- Fix README typo: 'read to be polled' → 'ready to be polled'
- Fix doc comment typo: 'an even' → 'an event'
- Fix duplicate 'a' across line break in doc comment
Copy link
Copy Markdown
Member

@jlizen jlizen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content seems fine overall, though might be worth waiting on #256 landing since that opens up some nicer "bad startup" handling. Seems like it's pretty close. No worries if you'd rather get this out tomorrow, I believe @Fluzko is AFK until Monday.

But, I do think we should probably avoid the handrolled env parsing in favor of using Clap. Feel free to disagree, in which case I can stamp this as-is (and up to you about the other small content tweaks).

//! install the runtime hooks but they will all be no-ops. You could then set up a background task that reads dynamic configuration to enable dial9 later. This is a much larger surface area of code
//! that is enabled, so it is higher risk.
//!
//! > Note! dial9 must be created _before_ your async runtime. dial9 relies on installing itself into the runtime telemetry hooks to produce Tokio events.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that using #[dial9::main] gets you this for free?

//!
//! ### The overhead of running dial9
//!
//! 1. Dial9 allocates a 1MB buffer for each thread that record events. If you are recording events from a huge number of threads, this can bloat memory.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this risk getting stale if we state the specific size? Should we be more vague?

//! Dial9 has two types of CPU profiling available:
//! 1. CPU profiling (this is what you would normally consider CPU profiling): Dial9 is sampling stack traces that are running on the CPU.
//! 2. Schedule Profiling: dial9 subscribes to the sched-switch linux kernel event and can capture a stack trace when your application is descheduled by the kernel. In order
//! to subscribe to these events, dial9 must open one perf fd per worker thread.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting seems off here

//!
//! Dial9 has many possible components you can enable. The more components, the more data you will produce and the more overhead your application will have.
//!
//! #### CPU Profiling
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would have expected some mention of permissions here (the implications of the available paranoid settings needed to access these)

//!
//! #### Tracing
//! Dial9 can capture Tracing spans via the `TracingLayer`. On the scale of tracing, this is fairly low overhead, however, if you have a large amount of deeply nested spans, this can produce a huge amount
//! of data. We recommend using a very fine-grained filter.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth including a sample filter (perhaps filtering aws sdk stuff), since we don't have the telemetry in the example?


let base_path = format!("{}/trace.bin", opts.trace_dir.trim_end_matches('/'));

let max_file_size = (opts.max_disk_usage_bytes / 4).max(16 * 1024 * 1024);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to getting rid of this

return Dial9ConfigBuilder::disabled().build();
}

// Make sure the trace directory exists; the writer errors out otherwise.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed just because build() will panic on failure currently?

/// Parse environment variables into a [`Dial9Opts`].
///
/// Pure — no side effects, no panics, no logging.
fn from_env() -> Result<Self, Dial9EnvError> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sheer volume of parsing code makes this example more noisy than it needs to be. In practice I would probably use Clap with its envsupport for this... Pretty sure even the linux thing can be supported inline with default_value_t, else a simple enough option + unwrap_or.

WDYT about cutting over to stay focused on the business logic and opinions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that seems reasonable. Or we can just "leave it as an exercise to the reader"

I'm thinking this would actually work nicely as a _guide module instead of an example

@rcoh rcoh requested a review from jlizen May 1, 2026 17:59
@jlizen
Copy link
Copy Markdown
Member

jlizen commented May 1, 2026

@rcoh #256 is hitting main which adds the build_or_disabled() options and the nicer TelemetryHandle::current() api, ping me when you are ready for a fresh review

jlizen added 2 commits May 2, 2026 00:11
Resolve README.md conflicts by taking main's restructured layout and
adding only the production_use example pointer paragraph.
…builder API

Drop `try_current` and `parse_or_fallback` now that `build_or_disabled`
returns a pass-through config on writer failures and
`TelemetryHandle::current` is inert when disabled, so application code
runs unchanged whether dial9 is recording or not. Fold the four cfg-gated
apply_* helpers into a single `with_runtime` closure.
@jlizen jlizen enabled auto-merge May 2, 2026 00:29
@jlizen jlizen disabled auto-merge May 2, 2026 09:33
@jlizen jlizen enabled auto-merge May 2, 2026 09:33
@jlizen jlizen added this pull request to the merge queue May 2, 2026
Merged via the queue into main with commit cc4374e May 2, 2026
30 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[docs] Author a "Running dial9 in production" guide

2 participants