|
| 1 | +# Tokio Telemetry Trace Analysis Guide |
| 2 | + |
| 3 | +## Quick Start |
| 4 | + |
| 5 | +```bash |
| 6 | +# 1. Get basic statistics |
| 7 | +cargo run --example analyze_trace -- trace.0.bin |
| 8 | + |
| 9 | +# 2. Convert to JSONL for custom analysis |
| 10 | +cargo run --example trace_to_jsonl -- trace.0.bin trace.jsonl |
| 11 | + |
| 12 | +# 3. View CPU profile |
| 13 | +cargo run --example cpu_flamegraph -- trace.0.bin > flamegraph.txt |
| 14 | +``` |
| 15 | + |
| 16 | +## Understanding the Output |
| 17 | + |
| 18 | +### Key Metrics from `analyze_trace` |
| 19 | + |
| 20 | +- **Wake→Poll delays**: Time from when a task is woken until it runs |
| 21 | + - High p99 (>10ms) indicates scheduling issues |
| 22 | + - Check local queue depths when delays are high |
| 23 | +- **Poll durations**: Time spent executing task code |
| 24 | + - High values indicate CPU-bound work |
| 25 | +- **Worker utilization**: Busy vs parked time |
| 26 | + - High utilization + high delays = queue buildup |
| 27 | + |
| 28 | +### Spawn Locations |
| 29 | + |
| 30 | +Tasks are grouped by where they were spawned: |
| 31 | +- Look for locations with many polls and high average poll time |
| 32 | +- These consume the most CPU |
| 33 | + |
| 34 | +## Diagnosing High Latency |
| 35 | + |
| 36 | +### Step 1: Identify the Problem Tasks |
| 37 | + |
| 38 | +Find tasks with high wake→poll delays: |
| 39 | + |
| 40 | +```python |
| 41 | +# Find tasks with >10ms delays |
| 42 | +import json |
| 43 | +from collections import defaultdict |
| 44 | + |
| 45 | +task_wakes = defaultdict(list) |
| 46 | +task_polls = defaultdict(list) |
| 47 | + |
| 48 | +with open('trace.jsonl') as f: |
| 49 | + for line in f: |
| 50 | + e = json.loads(line) |
| 51 | + if e['event'] == 'WakeEvent': |
| 52 | + task_wakes[e['woken_task_id']].append(e['timestamp_ns']) |
| 53 | + elif e['event'] == 'PollStart': |
| 54 | + task_polls[e['task_id']].append({ |
| 55 | + 'ts': e['timestamp_ns'], |
| 56 | + 'worker': e['worker'], |
| 57 | + 'loc': e.get('spawn_loc_id'), |
| 58 | + 'local_q': e['local_q'] |
| 59 | + }) |
| 60 | + |
| 61 | +# Match wakes to polls and find delays |
| 62 | +for task_id in task_polls: |
| 63 | + wakes = sorted(task_wakes.get(task_id, [])) |
| 64 | + polls = sorted(task_polls[task_id], key=lambda x: x['ts']) |
| 65 | + # ... calculate delays |
| 66 | +``` |
| 67 | + |
| 68 | +**Key insight**: Check `local_q` (local queue depth) when delays are high. Values >20 indicate queue buildup. |
| 69 | + |
| 70 | +### Step 2: Find What's Blocking |
| 71 | + |
| 72 | +For each high-delay period, count what the worker was doing: |
| 73 | + |
| 74 | +```python |
| 75 | +# During delay period (wake_ts to poll_ts on worker W): |
| 76 | +# - Count PollStart events on that worker |
| 77 | +# - Group by spawn_loc_id to see which task types dominated |
| 78 | +``` |
| 79 | + |
| 80 | +**Pattern**: If worker polled 300+ tasks during a 100ms delay, it was busy with other work. |
| 81 | + |
| 82 | +### Step 3: Identify CPU Hotspots |
| 83 | + |
| 84 | +Use CPU samples to see what code was running during blocking: |
| 85 | + |
| 86 | +```rust |
| 87 | +// See examples/blocked_analysis.rs for full implementation |
| 88 | +// 1. Find blocked periods (wake→poll > threshold) |
| 89 | +// 2. Filter CPU samples to those time ranges on those workers |
| 90 | +// 3. Aggregate stack traces into flamegraph format |
| 91 | +``` |
| 92 | + |
| 93 | +**Look for**: |
| 94 | +- Expensive operations (crypto, parsing, serialization) |
| 95 | +- Specific libraries dominating samples |
| 96 | +- Unexpected synchronous work |
| 97 | + |
| 98 | +## Common Patterns |
| 99 | + |
| 100 | +### Pattern 1: Local Queue Buildup |
| 101 | + |
| 102 | +**Symptoms**: |
| 103 | +- High wake→poll delays (>50ms p99) |
| 104 | +- High local queue depths (>30) |
| 105 | +- Workers are busy (>70% utilization) |
| 106 | +- Global queue depth is 0 |
| 107 | + |
| 108 | +**Cause**: Tasks spawned on one worker stay there, creating imbalance. |
| 109 | + |
| 110 | +**Solution**: |
| 111 | +- Reduce task spawn rate |
| 112 | +- Use `spawn_blocking` for CPU-heavy work |
| 113 | +- Tune `global_queue_interval` to force work distribution |
| 114 | + |
| 115 | +### Pattern 2: Priority Inversion |
| 116 | + |
| 117 | +**Symptoms**: |
| 118 | +- High-priority tasks (e.g., request handlers) have high delays |
| 119 | +- CPU samples show low-priority work (e.g., background clients) |
| 120 | +- Specific spawn locations dominate CPU during blocking |
| 121 | + |
| 122 | +**Cause**: FIFO local queues don't prioritize important work. |
| 123 | + |
| 124 | +**Solution**: |
| 125 | +- Separate runtimes for different priority levels |
| 126 | +- Reduce concurrency of background tasks |
| 127 | +- Use semaphores to limit concurrent background work |
| 128 | + |
| 129 | +### Pattern 3: CPU-Bound Tasks |
| 130 | + |
| 131 | +**Symptoms**: |
| 132 | +- High average poll durations (>500µs) |
| 133 | +- CPU samples concentrated in specific functions |
| 134 | +- Low wake→poll delays |
| 135 | + |
| 136 | +**Cause**: Tasks doing too much work per poll. |
| 137 | + |
| 138 | +**Solution**: |
| 139 | +- Move expensive work to `spawn_blocking` |
| 140 | +- Add yield points in long-running tasks |
| 141 | +- Optimize hot functions shown in CPU samples |
| 142 | + |
| 143 | +## Useful Scripts |
| 144 | + |
| 145 | +### Find Tasks by Spawn Location |
| 146 | + |
| 147 | +```bash |
| 148 | +# Get spawn location mapping |
| 149 | +cargo run --example analyze_trace -- trace.0.bin 2>&1 | grep "=== Spawn Locations" |
| 150 | +``` |
| 151 | + |
| 152 | +### Aggregate CPU by Function |
| 153 | + |
| 154 | +```bash |
| 155 | +# Get flamegraph-style output |
| 156 | +cargo run --example cpu_flamegraph -- trace.0.bin | \ |
| 157 | + grep "metrics_service\|aws_" | \ |
| 158 | + head -20 |
| 159 | +``` |
| 160 | + |
| 161 | +### Check Worker Balance |
| 162 | + |
| 163 | +```python |
| 164 | +# Count polls per worker |
| 165 | +import json |
| 166 | +from collections import Counter |
| 167 | + |
| 168 | +worker_polls = Counter() |
| 169 | +with open('trace.jsonl') as f: |
| 170 | + for line in f: |
| 171 | + e = json.loads(line) |
| 172 | + if e['event'] == 'PollStart': |
| 173 | + worker_polls[e['worker']] += 1 |
| 174 | + |
| 175 | +# Should be roughly equal (within 10%) |
| 176 | +``` |
| 177 | + |
| 178 | +## Tips |
| 179 | + |
| 180 | +1. **Start with `analyze_trace`** - gives you the big picture |
| 181 | +2. **Check wake→poll delays first** - usually the root cause of p99 issues |
| 182 | +3. **Local queue depth matters** - values >20 indicate problems |
| 183 | +4. **CPU samples are sparse** - you need many blocked periods to get good data |
| 184 | +5. **Match spawn locations to code** - use line numbers to find the actual code |
| 185 | +6. **Workers should be balanced** - uneven poll counts indicate work stealing issues |
| 186 | + |
| 187 | +## Example Analysis Workflow |
| 188 | + |
| 189 | +```bash |
| 190 | +# 1. Get overview |
| 191 | +cargo run --example analyze_trace -- trace.0.bin |
| 192 | + |
| 193 | +# 2. See high wake→poll delays? Check local queues |
| 194 | +# Write script to correlate delays with queue depths |
| 195 | + |
| 196 | +# 3. Find what's blocking |
| 197 | +# Write script to count polls during high-delay periods |
| 198 | + |
| 199 | +# 4. Get CPU profile of blocking work |
| 200 | +cargo run --example blocked_analysis |
| 201 | + |
| 202 | +# 5. Identify root cause from CPU samples |
| 203 | +# Look for expensive operations or unexpected work |
| 204 | +``` |
0 commit comments