
Shows how time to first token varies with request throughput across concurrency levels. **Potentially useful for finding the sweet spot between responsiveness and capacity**: ideal configurations maintain low TTFT even at high throughput. If TTFT increases sharply at certain throughput levels, this may indicate a prefill bottleneck (batch scheduler contention or compute limitations).

Highlights optimal configurations on the Pareto frontier that maximize GPU efficiency while minimizing latency. **Points on the frontier are optimal; points below are suboptimal** configurations. Potentially useful for choosing GPU count and batch sizes to maximize hardware ROI. A steep curve may indicate opportunities to improve latency with minimal throughput loss, while a flat curve can suggest you're near the efficiency limit.
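
As a rough illustration of what "on the frontier" means (this is a sketch, not AIPerf code; the configuration names and numbers below are made up), a Pareto frontier can be extracted from per-configuration (throughput, latency) pairs like so:

```python
# Hypothetical per-configuration results: (tokens/s per GPU, p99 latency in ms).
configs = {
    "tp1-c8": (950.0, 420.0),
    "tp1-c16": (1400.0, 610.0),
    "tp2-c16": (1350.0, 380.0),
    "tp2-c32": (1800.0, 900.0),
}

def pareto_frontier(points):
    """Keep configurations that no other configuration beats on both axes."""
    frontier = []
    for name, (tps, lat) in points.items():
        dominated = any(
            other_tps >= tps and other_lat <= lat and (other_tps, other_lat) != (tps, lat)
            for other_tps, other_lat in points.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(configs))  # ['tp1-c16', 'tp2-c16', 'tp2-c32']
```

A configuration that drops out (here `tp1-c8`) is dominated: some other configuration delivers more throughput at equal or lower latency.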

Shows the trade-off between GPU efficiency and interactivity (TTFT). **Potentially useful for determining max concurrency before user experience degrades**: flat regions show where adding concurrency maintains interactivity, while steep sections may indicate diminishing returns. The "knee" of the curve can help identify where throughput gains start to significantly hurt responsiveness.
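
One simple way to locate that knee from exported numbers (a sketch using the maximum-distance-from-chord heuristic; the points below are made up):

```python
# Hypothetical (throughput tokens/s, avg TTFT ms) points, sorted by throughput.
points = [(100, 45), (200, 48), (400, 55), (800, 70), (1200, 120), (1500, 260), (1600, 600)]

def knee_index(pts):
    """Index of the point farthest from the chord joining the first and last points."""
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm for x, y in pts]
    return dists.index(max(dists))

i = knee_index(points)
print(f"knee near {points[i][0]} tokens/s (TTFT {points[i][1]} ms)")
```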
### Single-Run Analysis Mode

Time to first token for each request, revealing prefill latency patterns and potential warm-up effects. **Initial spikes may indicate cold start; stable later values show steady-state performance**. Potentially useful for determining the necessary warm-up period or identifying warm-up configuration issues. Unexpected spikes during steady-state can suggest resource contention, garbage collection pauses, or batch scheduler interference.
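
If you export the per-request TTFT values, a crude warm-up estimate is to count leading requests that sit well above the steady-state median; this sketch uses made-up values and an arbitrary 1.5x threshold:

```python
import statistics

# Hypothetical per-request TTFT values (ms), in request order.
ttfts = [900, 650, 420, 160, 150, 155, 148, 152, 149, 151, 150, 153]

# Use the back half of the run as the steady-state reference.
steady = statistics.median(ttfts[len(ttfts) // 2:])
threshold = 1.5 * steady  # treat anything 50% above steady state as warm-up

warmup_count = 0
for value in ttfts:
    if value <= threshold:
        break
    warmup_count += 1

print(f"steady-state median: {steady} ms, warm-up requests: {warmup_count}")
```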

Inter-token latency per request, showing generation performance consistency. **Consistent ITL may indicate stable generation; variance can suggest batch scheduling issues**. Potentially useful for identifying decode-phase bottlenecks separate from prefill issues. If ITL increases over time, this may indicate KV cache memory pressure or growing batch sizes causing decode slowdown.
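
A quick way to quantify that consistency is the coefficient of variation (stdev relative to mean) of each request's ITLs; the values below are made up:

```python
import statistics

# Hypothetical inter-token latencies (ms) for two requests.
requests = {
    "req-1": [11.8, 12.1, 12.0, 11.9, 12.2, 12.0],  # tight decode cadence
    "req-2": [12.0, 11.9, 30.5, 12.1, 28.9, 12.0],  # spikes hint at scheduler interference
}

for name, itl in requests.items():
    mean = statistics.mean(itl)
    cv = statistics.stdev(itl) / mean  # coefficient of variation
    print(f"{name}: mean ITL {mean:.1f} ms, CV {cv:.2f}")
```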

End-to-end latency progression throughout the run. **Overall system health check**: ramp-up at the start is normal, but sustained increases may indicate performance degradation. Potentially useful for identifying if your system maintains performance or degrades over time. Sudden jumps may correlate with other requests completing or starting, potentially revealing batch scheduling patterns.

Individual requests plotted as lines spanning their duration from start to end. **Visualizes request scheduling and concurrency patterns**: overlapping lines show concurrent execution, while gaps may indicate scheduling delays. Dense packing can suggest efficient utilization; sparse patterns may suggest underutilized capacity or rate limiting effects.
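
The same start/end spans behind this plot reduce to a concurrency profile with a simple sweep line (a sketch; the spans below are hypothetical):

```python
# Hypothetical request spans as (start_s, end_s) relative to run start.
spans = [(0.0, 2.5), (0.2, 3.1), (0.5, 1.8), (2.0, 4.0), (2.2, 4.5)]

# Sweep line: +1 at each start, -1 at each end; ties resolve ends before starts.
events = sorted([(s, 1) for s, _ in spans] + [(e, -1) for _, e in spans])
active, peak = 0, 0
for _, delta in events:
    active += delta
    peak = max(peak, active)

print(f"peak concurrency: {peak}")
```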
### Dispersed Throughput
The **Dispersed Throughput Over Time** plot uses an event-based approach. This provides a smooth, continuous representation that correlates better with server metrics like GPU utilization.

**Smooth ramps may show healthy scaling; drops can indicate bottlenecks**. Potentially useful for correlating with GPU metrics to identify whether bottlenecks are GPU-bound, memory-bound, or CPU-bound. A plateau may indicate you've reached max sustainable throughput for your configuration. Sudden drops can potentially correlate with resource exhaustion or scheduler saturation.
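
As a sketch of the event-based idea (not necessarily AIPerf's exact algorithm; the request tuples are made up), each request's output tokens can be spread evenly over its lifetime and summed per sampling window:

```python
# Hypothetical requests: (start_s, end_s, output_tokens).
requests = [(0.0, 2.0, 200), (0.5, 3.0, 300), (1.0, 4.0, 240)]
window = 1.0  # sampling window in seconds

end_time = max(end for _, end, _ in requests)
buckets = [0.0] * int(-(-end_time // window))  # ceil(end_time / window)

for start, end, tokens in requests:
    rate = tokens / (end - start)  # tokens/s, spread evenly over the request
    for i in range(len(buckets)):
        lo, hi = i * window, (i + 1) * window
        overlap = max(0.0, min(end, hi) - max(start, lo))
        buckets[i] += rate * overlap

for i, total in enumerate(buckets):
    print(f"[{i * window:.0f}s, {(i + 1) * window:.0f}s): {total / window:.0f} tokens/s")
```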
## Customization Options

## Interactive Dashboard Mode
Launch an interactive localhost-hosted dashboard for real-time exploration of profiling data with dynamic metric selection, filtering, and visualization customization.

```bash
# Launch dashboard with default settings (localhost:8050)
aiperf plot --dashboard

# Specify custom port
aiperf plot --dashboard --port 9000

# Launch with dark theme
aiperf plot --dashboard --theme dark

# Specify data paths
aiperf plot path/to/runs --dashboard
```
**Key Features:**
- **Dynamic metric switching**: Toggle between avg, p50, p90, p95, p99 statistics in real-time
- **Run filtering**: Select which runs to display via checkboxes
- **Config viewer**: Click on data points to view full run configuration
- **Custom plots**: Add new plots with custom axis selections
- **Plot management**: Hide/show plots dynamically
- **Export**: Download visible plots as PNG bundle

The dashboard automatically detects visualization mode (multi-run comparison or single-run analysis) and displays appropriate tabs and controls. Press Ctrl+C in the terminal to stop the server.
> [!TIP]
> The dashboard runs on localhost only and requires no authentication. For remote access via SSH, use port forwarding: `ssh -L 8050:localhost:8050 user@remote-host`

> [!NOTE]
> Dashboard mode and PNG mode are separate. To generate both static PNGs and launch the dashboard, run the commands separately.
## Advanced Features
### GPU Telemetry Integration
When GPU telemetry is collected (via `--gpu-telemetry` flag during profiling), plots automatically include GPU metrics.
**Multi-run plots** (when telemetry available):
- Token Throughput per GPU vs Latency
- Token Throughput per GPU vs Interactivity

**Correlates compute resources with token generation performance**. High GPU utilization with low throughput may suggest compute-bound workloads (consider optimizing model/batch size). Low utilization with low throughput can indicate bottlenecks elsewhere (KV cache, memory bandwidth, CPU scheduling). Potentially useful for targeting >80% GPU utilization for efficient hardware usage.
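
If you align telemetry samples and throughput on the same time windows, a quick correlation check can hint at which regime you are in (a sketch; the samples are made up, and `statistics.correlation` requires Python 3.10+):

```python
from statistics import correlation

# Hypothetical per-window samples aligned on the same 1 s windows.
gpu_util = [35, 52, 78, 91, 93, 92]             # percent
throughput = [400, 640, 950, 1150, 1160, 1155]  # tokens/s

r = correlation(gpu_util, throughput)
print(f"Pearson r = {r:.2f}")
# High r with utilization topping out suggests a compute-bound workload;
# low utilization alongside flat throughput points at bottlenecks elsewhere.
```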
> [!TIP]
> See the [GPU Telemetry Tutorial](gpu-telemetry.md) for setup and detailed analysis.
When timeslice data is available (via `--slice-duration` during profiling), plots include:
- Throughput Across Timeslices
- Latency Across Timeslices

**Timeslices enable easy outlier identification and bucketing analysis**. Each time window (bucket) shows avg/p50/p95 statistics, making it simple to spot which periods have outlier performance. Slice 0 often shows cold-start overhead, while later slices may reveal degradation. Flat bars across slices may indicate stable performance; increasing trends can suggest resource exhaustion. Potentially useful for quickly isolating performance issues to specific phases (warmup, steady-state, or degradation).
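
The same bucketing can be reproduced offline from raw per-request data; this sketch uses made-up timestamps and latencies with a 30-second slice:

```python
import statistics

# Hypothetical (request_start_s, request_latency_ms) samples.
samples = [(1, 820), (5, 640), (22, 415), (40, 402), (55, 398), (70, 690), (95, 940)]
slice_s = 30

slices = {}
for start, latency in samples:
    slices.setdefault(start // slice_s, []).append(latency)

for idx in sorted(slices):
    vals = sorted(slices[idx])
    p95 = vals[int(round(0.95 * (len(vals) - 1)))]  # nearest-rank style p95
    print(f"slice {idx}: avg={statistics.mean(vals):.0f} ms, "
          f"p50={statistics.median(vals):.0f} ms, p95={p95} ms")
```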
