Skip to content

Commit f8a3f21

Browse files
committed
Add CPU profiling and scheduler event tracking
Integrates perf-self-profile into the telemetry recorder to enable: - CPU sampling with configurable frequency - Context-switch capture via perf_event_open - Callframe symbolication with source location - Binary format extensions for CPU samples and callframe definitions - Flamegraph visualization in trace viewer Key changes: - New cpu_profile.rs: CpuProfiler and SchedProfiler wrappers - Recorder: thread lifecycle hooks, profiler drain in flush loop, callframe interning - Format: new event types (CpuSample, CallframeDef) with backward compatibility - trace_viewer.html: flamegraph panel with drag-and-drop support - Updated SymbolInfo usage to match perf-self-profile API (code_info instead of object) Tests pass. Ready for review.
1 parent 10e632c commit f8a3f21

18 files changed

Lines changed: 2362 additions & 29 deletions

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

dial9-tokio-telemetry/Cargo.toml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,12 @@ pin-project-lite = "0.2"
1313
serde = { version = "1", features = ["derive"] }
1414
serde_json = "1"
1515
smallvec = "1"
16+
perf-self-profile = { path = "../perf-self-profile", optional = true }
17+
18+
[features]
19+
cpu-profiling = ["perf-self-profile"]
20+
21+
default = ["cpu-profiling"]
1622

1723
[dev-dependencies]
1824
assert2 = "0.3"
Lines changed: 204 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,204 @@
1+
# Tokio Telemetry Trace Analysis Guide
2+
3+
## Quick Start
4+
5+
```bash
6+
# 1. Get basic statistics
7+
cargo run --example analyze_trace -- trace.0.bin
8+
9+
# 2. Convert to JSONL for custom analysis
10+
cargo run --example trace_to_jsonl -- trace.0.bin trace.jsonl
11+
12+
# 3. View CPU profile
13+
cargo run --example cpu_flamegraph -- trace.0.bin > flamegraph.txt
14+
```
15+
16+
## Understanding the Output
17+
18+
### Key Metrics from `analyze_trace`
19+
20+
- **Wake→Poll delays**: Time from when a task is woken until it runs
21+
- High p99 (>10ms) indicates scheduling issues
22+
- Check local queue depths when delays are high
23+
- **Poll durations**: Time spent executing task code
24+
- High values indicate CPU-bound work
25+
- **Worker utilization**: Busy vs parked time
26+
- High utilization + high delays = queue buildup
27+
28+
### Spawn Locations
29+
30+
Tasks are grouped by where they were spawned:
31+
- Look for locations with many polls and high average poll time
32+
- These consume the most CPU
33+
34+
## Diagnosing High Latency
35+
36+
### Step 1: Identify the Problem Tasks
37+
38+
Find tasks with high wake→poll delays:
39+
40+
```python
41+
# Find tasks with >10ms delays
42+
import json
43+
from collections import defaultdict
44+
45+
task_wakes = defaultdict(list)
46+
task_polls = defaultdict(list)
47+
48+
with open('trace.jsonl') as f:
49+
for line in f:
50+
e = json.loads(line)
51+
if e['event'] == 'WakeEvent':
52+
task_wakes[e['woken_task_id']].append(e['timestamp_ns'])
53+
elif e['event'] == 'PollStart':
54+
task_polls[e['task_id']].append({
55+
'ts': e['timestamp_ns'],
56+
'worker': e['worker'],
57+
'loc': e.get('spawn_loc_id'),
58+
'local_q': e['local_q']
59+
})
60+
61+
# Match wakes to polls and find delays
62+
for task_id in task_polls:
63+
wakes = sorted(task_wakes.get(task_id, []))
64+
polls = sorted(task_polls[task_id], key=lambda x: x['ts'])
65+
# ... calculate delays
66+
```
67+
68+
**Key insight**: Check `local_q` (local queue depth) when delays are high. Values >20 indicate queue buildup.
69+
70+
### Step 2: Find What's Blocking
71+
72+
For each high-delay period, count what the worker was doing:
73+
74+
```python
75+
# During delay period (wake_ts to poll_ts on worker W):
76+
# - Count PollStart events on that worker
77+
# - Group by spawn_loc_id to see which task types dominated
78+
```
79+
80+
**Pattern**: If worker polled 300+ tasks during a 100ms delay, it was busy with other work.
81+
82+
### Step 3: Identify CPU Hotspots
83+
84+
Use CPU samples to see what code was running during blocking:
85+
86+
```rust
87+
// See examples/blocked_analysis.rs for full implementation
88+
// 1. Find blocked periods (wake→poll > threshold)
89+
// 2. Filter CPU samples to those time ranges on those workers
90+
// 3. Aggregate stack traces into flamegraph format
91+
```
92+
93+
**Look for**:
94+
- Expensive operations (crypto, parsing, serialization)
95+
- Specific libraries dominating samples
96+
- Unexpected synchronous work
97+
98+
## Common Patterns
99+
100+
### Pattern 1: Local Queue Buildup
101+
102+
**Symptoms**:
103+
- High wake→poll delays (>50ms p99)
104+
- High local queue depths (>30)
105+
- Workers are busy (>70% utilization)
106+
- Global queue depth is 0
107+
108+
**Cause**: Tasks spawned on one worker stay there, creating imbalance.
109+
110+
**Solution**:
111+
- Reduce task spawn rate
112+
- Use `spawn_blocking` for CPU-heavy work
113+
- Tune `global_queue_interval` to force work distribution
114+
115+
### Pattern 2: Priority Inversion
116+
117+
**Symptoms**:
118+
- High-priority tasks (e.g., request handlers) have high delays
119+
- CPU samples show low-priority work (e.g., background clients)
120+
- Specific spawn locations dominate CPU during blocking
121+
122+
**Cause**: FIFO local queues don't prioritize important work.
123+
124+
**Solution**:
125+
- Separate runtimes for different priority levels
126+
- Reduce concurrency of background tasks
127+
- Use semaphores to limit concurrent background work
128+
129+
### Pattern 3: CPU-Bound Tasks
130+
131+
**Symptoms**:
132+
- High average poll durations (>500µs)
133+
- CPU samples concentrated in specific functions
134+
- Low wake→poll delays
135+
136+
**Cause**: Tasks doing too much work per poll.
137+
138+
**Solution**:
139+
- Move expensive work to `spawn_blocking`
140+
- Add yield points in long-running tasks
141+
- Optimize hot functions shown in CPU samples
142+
143+
## Useful Scripts
144+
145+
### Find Tasks by Spawn Location
146+
147+
```bash
148+
# Get spawn location mapping
149+
cargo run --example analyze_trace -- trace.0.bin 2>&1 | grep "=== Spawn Locations"
150+
```
151+
152+
### Aggregate CPU by Function
153+
154+
```bash
155+
# Get flamegraph-style output
156+
cargo run --example cpu_flamegraph -- trace.0.bin | \
157+
grep "metrics_service\|aws_" | \
158+
head -20
159+
```
160+
161+
### Check Worker Balance
162+
163+
```python
164+
# Count polls per worker
165+
import json
166+
from collections import Counter
167+
168+
worker_polls = Counter()
169+
with open('trace.jsonl') as f:
170+
for line in f:
171+
e = json.loads(line)
172+
if e['event'] == 'PollStart':
173+
worker_polls[e['worker']] += 1
174+
175+
# Should be roughly equal (within 10%)
176+
```
177+
178+
## Tips
179+
180+
1. **Start with `analyze_trace`** - gives you the big picture
181+
2. **Check wake→poll delays first** - usually the root cause of p99 issues
182+
3. **Local queue depth matters** - values >20 indicate problems
183+
4. **CPU samples are sparse** - you need many blocked periods to get good data
184+
5. **Match spawn locations to code** - use line numbers to find the actual code
185+
6. **Workers should be balanced** - uneven poll counts indicate work stealing issues
186+
187+
## Example Analysis Workflow
188+
189+
```bash
190+
# 1. Get overview
191+
cargo run --example analyze_trace -- trace.0.bin
192+
193+
# 2. See high wake→poll delays? Check local queues
194+
# Write script to correlate delays with queue depths
195+
196+
# 3. Find what's blocking
197+
# Write script to count polls during high-delay periods
198+
199+
# 4. Get CPU profile of blocking work
200+
cargo run --example blocked_analysis
201+
202+
# 5. Identify root cause from CPU samples
203+
# Look for expensive operations or unexpected work
204+
```

dial9-tokio-telemetry/benches/overhead_bench.rs

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,8 @@
1414
//! Duration defaults to 30 seconds. A 3-second warmup precedes measurement.
1515
1616
use dial9_tokio_telemetry::telemetry::{
17-
NullWriter, SimpleBinaryWriter, TelemetryGuard, TelemetryHandle, TracedRuntime,
17+
CpuProfilingConfig, NullWriter, SimpleBinaryWriter, TelemetryGuard, TelemetryHandle,
18+
TracedRuntime,
1819
};
1920
use hdrhistogram::Histogram;
2021
use std::sync::Arc;
@@ -174,6 +175,8 @@ fn main() {
174175
Box::new(SimpleBinaryWriter::new("/tmp/overhead_bench_trace.bin").unwrap());
175176
let (rt, g) = TracedRuntime::builder()
176177
.with_task_tracking(true)
178+
.with_cpu_profiling(CpuProfilingConfig::default())
179+
.with_inline_callframe_symbols(true)
177180
.build_and_start(builder, writer)
178181
.unwrap();
179182
(rt, Some(g))

0 commit comments

Comments
 (0)