Add dial9-in-prod example#335
Conversation
The previous commit renamed the README heading `## Wake event tracking` to `## Wake event tracking (opt-in)` (to match the sibling `## Tracing span events (opt-in)`). `dial9-viewer/build.rs` matches section headings exactly via `SETUP_SECTIONS` and panics loudly if one is renamed, which broke clippy, ubuntu-nightly build, docs.rs check, and cargo package CI jobs. Update the constant to the new heading.
- Fix env var table: DIAL9_ENABLED only accepts true/false (not 1/yes) since parse_env uses bool::from_str - Fix README typo: 'read to be polled' → 'ready to be polled' - Fix doc comment typo: 'an even' → 'an event' - Fix duplicate 'a' across line break in doc comment
jlizen
left a comment
There was a problem hiding this comment.
Content seems fine overall, though might be worth waiting on #256 landing since that opens up some nicer "bad startup" handling. Seems like it's pretty close. No worries if you'd rather get this out tomorrow, I believe @Fluzko is AFK until Monday.
But, I do think we should probably avoid the handrolled env parsing in favor of using Clap. Feel free to disagree, in which case I can stamp this as-is (and up to you about the other small content tweaks).
| //! install the runtime hooks but they will all be no-ops. You could then set up a background task that reads dynamic configuration to enable dial9 later. This is a much larger surface area of code | ||
| //! that is enabled, so it is higher risk. | ||
| //! | ||
| //! > Note! dial9 must be created _before_ your async runtime. dial9 relies on installing itself into the runtime telemetry hooks to produce Tokio events. |
There was a problem hiding this comment.
Should we mention that using #[dial9::main] gets you this for free?
| //! | ||
| //! ### The overhead of running dial9 | ||
| //! | ||
| //! 1. Dial9 allocates a 1MB buffer for each thread that record events. If you are recording events from a huge number of threads, this can bloat memory. |
There was a problem hiding this comment.
Does this risk getting stale if we state the specific size? Should we be more vague?
| //! Dial9 has two types of CPU profiling available: | ||
| //! 1. CPU profiling (this is what you would normally consider CPU profiling): Dial9 is sampling stack traces that are running on the CPU. | ||
| //! 2. Schedule Profiling: dial9 subscribes to the sched-switch linux kernel event and can capture a stack trace when your application is descheduled by the kernel. In order | ||
| //! to subscribe to these events, dial9 must open one perf fd per worker thread. |
| //! | ||
| //! Dial9 has many possible components you can enable. The more components, the more data you will produce and the more overhead your application will have. | ||
| //! | ||
| //! #### CPU Profiling |
There was a problem hiding this comment.
i would have expected some mention of permissions here (the implications of the available paranoid settings needed to access these)
| //! | ||
| //! #### Tracing | ||
| //! Dial9 can capture Tracing spans via the `TracingLayer`. On the scale of tracing, this is fairly low overhead, however, if you have a large amount of deeply nested spans, this can produce a huge amount | ||
| //! of data. We recommend using a very fine-grained filter. |
There was a problem hiding this comment.
would it be worth including a sample filter (perhaps filtering aws sdk stuff), since we don't have the telemetry in the example?
|
|
||
| let base_path = format!("{}/trace.bin", opts.trace_dir.trim_end_matches('/')); | ||
|
|
||
| let max_file_size = (opts.max_disk_usage_bytes / 4).max(16 * 1024 * 1024); |
There was a problem hiding this comment.
Looking forward to getting rid of this
| return Dial9ConfigBuilder::disabled().build(); | ||
| } | ||
|
|
||
| // Make sure the trace directory exists; the writer errors out otherwise. |
There was a problem hiding this comment.
This is needed just because build() will panic on failure currently?
| /// Parse environment variables into a [`Dial9Opts`]. | ||
| /// | ||
| /// Pure — no side effects, no panics, no logging. | ||
| fn from_env() -> Result<Self, Dial9EnvError> { |
There was a problem hiding this comment.
The sheer volume of parsing code makes this example more noisy than it needs to be. In practice I would probably use Clap with its envsupport for this... Pretty sure even the linux thing can be supported inline with default_value_t, else a simple enough option + unwrap_or.
WDYT about cutting over to stay focused on the business logic and opinions?
There was a problem hiding this comment.
yeah that seems reasonable. Or we can just "leave it as an exercise to the reader"
I'm thinking this would actually work nicely as a _guide module instead of an example
Resolve README.md conflicts by taking main's restructured layout and adding only the production_use example pointer paragraph.
…builder API Drop `try_current` and `parse_or_fallback` now that `build_or_disabled` returns a pass-through config on writer failures and `TelemetryHandle::current` is inert when disabled, so application code runs unchanged whether dial9 is recording or not. Fold the four cfg-gated apply_* helpers into a single `with_runtime` closure.
I think we might hoist this into the readme at some point? for now, its linked in the readme
fixes #310