This guide covers all aspects of analyzing your benchmark results, from launching the interactive dashboard to programmatically parsing raw data.
- Overview
- Understanding Output Structure
- Interactive Dashboard
- Command-Line Analysis
- Metrics Deep Dive
- Comparing Experiments
- Exporting Data
- Troubleshooting Analysis
The analysis toolkit provides two primary ways to analyze benchmark results:
| Tool | Use Case | Access |
|---|---|---|
| Interactive Dashboard | Visual exploration, comparing runs, real-time filtering | make dashboard |
| Programmatic Analysis | Automation, custom analysis, CI/CD integration | Python API or raw JSON |
- Dashboard: Ideal for exploring results, comparing configurations, identifying trends, and presenting findings to stakeholders
- Command-Line/Python: Best for automated analysis pipelines, custom metrics, and integration with other tools
Each benchmark run creates a directory under logs/ (or your configured output directory):
logs/
    3667_1P_4D_20251110_192145/ # Job ID + Topology + Timestamp
        3667.json # Run metadata (required)
        sa-bench_isl_1024_osl_1024/ # Profiler results directory
            concurrency_16.json # Results for each concurrency level
            concurrency_32.json
            concurrency_64.json
            ...
        watchtower-navy-cn01_prefill_w0.err # Worker stderr logs
        watchtower-navy-cn01_prefill_w0.out # Worker stdout logs
        watchtower-navy-cn02_decode_w0.err
        watchtower-navy-cn02_decode_w0.out
        watchtower-navy-cn01_prefill_config.json # Node configuration snapshots
        watchtower-navy-cn02_decode_config.json
        .cache/ # Parquet cache files (auto-generated)
            benchmark_results.parquet
            node_metrics.parquet
The run directory name encodes key information:
    {SLURM_JOB_ID}_{TOPOLOGY}_{TIMESTAMP}
          |            |          |
          |            |          +-- YYYYMMDD_HHMMSS
          |            +-- 1P_4D (1 prefill, 4 decode) or 8A (8 aggregated)
          +-- SLURM job identifier
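The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the analysis toolkit: `parse_run_dir` and `RUN_DIR_RE` are hypothetical names, and the pattern assumes a numeric job ID and the two topology shapes shown above (`1P_4D` and `8A`).

```python
import re
from datetime import datetime

# Assumed pattern: {job_id}_{topology}_{YYYYMMDD_HHMMSS}, with a numeric job ID
# and a topology of either "<n>P_<m>D" (disaggregated) or "<n>A" (aggregated).
RUN_DIR_RE = re.compile(
    r"^(?P<job_id>\d+)_(?P<topology>\d+P_\d+D|\d+A)_(?P<timestamp>\d{8}_\d{6})$"
)


def parse_run_dir(name: str) -> dict:
    """Split a run directory name into job ID, topology, and timestamp."""
    match = RUN_DIR_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized run directory name: {name!r}")
    topology = match["topology"]
    if topology.endswith("A"):
        workers = {"aggregated": int(topology[:-1])}
    else:
        prefill, decode = topology.split("_")
        workers = {"prefill": int(prefill[:-1]), "decode": int(decode[:-1])}
    return {
        "job_id": match["job_id"],
        "topology": topology,
        "workers": workers,
        "timestamp": datetime.strptime(match["timestamp"], "%Y%m%d_%H%M%S"),
    }


info = parse_run_dir("3667_1P_4D_20251110_192145")
print(info["job_id"], info["workers"], info["timestamp"].isoformat())
```

The directory name is convenient for quick filtering, but the JSON metadata file remains the authoritative record of the run configuration.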
The JSON metadata file is the source of truth for run configuration:
{
"run_metadata": {
"slurm_job_id": "3667",
"run_date": "20251110_192145",
"container": "ghcr.io/sgl-project/sglang:v0.4.1-cu121",
"prefill_nodes": 1,
"decode_nodes": 4,
"prefill_workers": 1,
"decode_workers": 4,
"gpus_per_node": 8,
"gpu_type": "NVIDIA H100 80GB HBM3",
"mode": "disaggregated",
"model_dir": "/models/DeepSeek-V3"
},
"profiler_metadata": {
"type": "sa-bench",
"isl": "1024",
"osl": "1024",
"concurrencies": "16x32x64x128x256"
},
"tags": ["baseline", "h100"]
}
Each concurrency level produces a JSON file in the profiler results directory:
{
"max_concurrency": 64,
"output_throughput": 15234.5,
"total_token_throughput": 23456.7,
"request_throughput": 14.8,
"mean_ttft_ms": 245.3,
"mean_tpot_ms": 32.1,
"mean_itl_ms": 31.8,
"mean_e2el_ms": 1456.2,
"median_ttft_ms": 198.4,
"median_tpot_ms": 28.9,
"p99_ttft_ms": 892.1,
"p99_tpot_ms": 78.4,
"total_input_tokens": 65536,
"total_output_tokens": 65536,
"completed": 1000,
"duration": 67.5
}
Worker log files contain runtime metrics:
- .err files: Application logs with batch-level metrics
- .out files: Standard output (often empty, or containing only startup info)
Example log line format:
[2025-11-04 05:31:43 DP0 TP0 EP0] Prefill batch, #new-seq: 18, #new-token: 16384,
#cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
#prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.00
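A log line in this format can be turned into a record with a regular expression. This is a sketch under the assumption that every batch line follows the example above. `parse_batch_line` and `HEADER_RE` are hypothetical names and do not describe how `NodeAnalyzer` parses logs internally.

```python
import re

# Matches "[<timestamp> DP<n> TP<n> EP<n>] <Type> batch, <key: value, ...>"
HEADER_RE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"DP(?P<dp>\d+) TP(?P<tp>\d+) EP(?P<ep>\d+)\] "
    r"(?P<batch_type>\w+) batch, (?P<fields>.*)$"
)


def parse_batch_line(line: str):
    """Return a dict of fields for one batch log line, or None if it does not match."""
    # Collapse runs of whitespace so a wrapped line is treated as a single line.
    match = HEADER_RE.match(" ".join(line.split()))
    if match is None:
        return None
    record = {
        "timestamp": match["timestamp"],
        "dp": int(match["dp"]),
        "tp": int(match["tp"]),
        "ep": int(match["ep"]),
        "batch_type": match["batch_type"],
    }
    for field in match["fields"].split(", "):
        key, sep, value = field.rpartition(": ")
        if sep:  # skip anything that is not "key: value"
            record[key.lstrip("#")] = float(value)
    return record


sample = (
    "[2025-11-04 05:31:43 DP0 TP0 EP0] Prefill batch, #new-seq: 18, #new-token: 16384, "
    "#cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, "
    "#prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.00"
)
record = parse_batch_line(sample)
print(record["batch_type"], record["new-seq"], record["new-token"])
```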
*_config.json files capture the complete node configuration at runtime:
- GPU information (count, type, memory, driver version)
- Server arguments (TP/DP/PP size, attention backend, KV cache settings)
- Environment variables (NCCL, CUDA, SGLANG settings)
- Command-line arguments actually passed
uv run streamlit run analysis/dashboard/app.py
The dashboard opens at http://localhost:8501 by default.
On the left sidebar, you will see:
- Logs Directory Path: Set the path to your outputs directory (defaults to outputs/)
The sidebar provides powerful filtering options:
Filter runs by GPU hardware (e.g., H100, A100, L40S). Useful when comparing across different hardware generations.
Filter by worker configuration:
- Disaggregated: 1P/4D, 2P/8D, etc.
- Aggregated: 4A, 8A, etc.
Filter by input/output sequence length combinations (e.g., 1024/1024, 2048/512).
Filter by container image version to compare software updates.
Filter by custom tags you have assigned to runs. Tags help organize experiments:
- baseline - Control runs
- optimized - Runs with optimizations
- production - Production-ready configurations
The dashboard has five main tabs:
Purpose: Visualize the efficiency trade-off between throughput per GPU and throughput per user.
What You See:
- X-axis: Output TPS/User - Token generation rate experienced by each user (1000/TPOT)
- Y-axis: Output TPS/GPU or Total TPS/GPU - GPU utilization efficiency
Key Features:
- Y-axis toggle: Switch between Output TPS/GPU (decode tokens only) and Total TPS/GPU (input + output)
- TPS/User cutoff line: Add a vertical line to mark your target throughput requirement
- Pareto Frontier: Highlight the efficient frontier where no other configuration is strictly better
Interpreting the Graph:
- Points up and to the right are better (higher efficiency AND higher per-user throughput)
- Points on the Pareto frontier represent optimal trade-offs
- Use the cutoff line to identify configurations meeting your latency requirements
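To make the notion of a Pareto frontier concrete, here is a minimal sketch that keeps only the points no other point beats on both axes. The function name `pareto_frontier` and the sample values are illustrative assumptions, not dashboard code or measured results.

```python
def pareto_frontier(points):
    """Return the non-dominated points.

    Each point is (tps_per_user, tps_per_gpu); higher is better on both axes.
    """
    frontier = []
    best_gpu = float("-inf")
    # Walk from highest to lowest TPS/User. A point survives only if it improves
    # on the best TPS/GPU seen so far; otherwise an earlier point dominates it.
    for user, gpu in sorted(points, key=lambda p: (-p[0], -p[1])):
        if gpu > best_gpu:
            frontier.append((user, gpu))
            best_gpu = gpu
    return frontier


# Illustrative (made-up) points: (Output TPS/User, Output TPS/GPU)
points = [(20.0, 300.0), (30.0, 250.0), (25.0, 200.0), (40.0, 100.0), (35.0, 90.0)]
frontier = pareto_frontier(points)
print(frontier)
```

In this example, (25.0, 200.0) is dropped because (30.0, 250.0) is better on both axes, and (35.0, 90.0) is dropped because (40.0, 100.0) is.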
Metric Calculations:
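Based on the metric definitions used in this guide (Output TPS/User is 1000/TPOT with TPOT in milliseconds, and TPS/GPU is throughput divided by total GPU count), the two axes can be sketched as follows. The total GPU count is computed here as (prefill nodes + decode nodes) × GPUs per node, which is an assumption about how the dashboard counts GPUs. The input values are taken from the sample JSON files shown earlier.

```python
def tps_per_user(mean_tpot_ms: float) -> float:
    """Per-user generation rate in tokens/s, from TPOT in milliseconds."""
    return 1000.0 / mean_tpot_ms


def tps_per_gpu(throughput_tps: float, total_gpus: int) -> float:
    """Throughput normalized by the number of GPUs used by the run."""
    return throughput_tps / total_gpus


# Values from the sample run metadata and concurrency_64.json above.
prefill_nodes, decode_nodes, gpus_per_node = 1, 4, 8
total_gpus = (prefill_nodes + decode_nodes) * gpus_per_node  # assumption: 40 GPUs

x_value = tps_per_user(mean_tpot_ms=32.1)        # X-axis: Output TPS/User
y_output = tps_per_gpu(15234.5, total_gpus)      # Y-axis: Output TPS/GPU
y_total = tps_per_gpu(23456.7, total_gpus)       # Y-axis: Total TPS/GPU

print(f"TPS/User={x_value:.2f}, Output TPS/GPU={y_output:.2f}, Total TPS/GPU={y_total:.2f}")
```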
Data Export: Click "Download Data as CSV" to export all data points.
Purpose: Analyze latency metrics across concurrency levels.
Graphs Displayed:
- TTFT (Time to First Token): Time from request submission to first output token
  - Critical for perceived responsiveness
  - Should remain stable under load
- TPOT (Time Per Output Token): Average time between consecutive output tokens
  - Determines streaming speed
  - Lower TPOT = faster generation
- ITL (Inter-Token Latency): Similar to TPOT but may include queueing delays
  - Useful for diagnosing scheduling issues
Summary Statistics: Table showing min/max values for each metric across selected runs.
Purpose: Deep dive into runtime behavior of individual workers.
Aggregation Modes:
- Individual nodes: See every worker separately
- Group by DP rank: Average metrics across tensor parallel workers within each data parallel group
- Aggregate all nodes: Single averaged line per run
Prefill Node Metrics:
- Input Throughput: Tokens/s being processed in prefill
- Inflight Requests: Requests sent to decode workers awaiting completion
- KV Cache Utilization: Memory pressure indicator
- Queued Requests: Backpressure indicator
Decode Node Metrics:
- Running Requests: Active generation requests
- Generation Throughput: Output tokens/s
- KV Cache Utilization: Memory pressure
- Queued Requests: Decode capacity indicator
Disaggregation Metrics (Stacked or Separate views):
- Prealloc Queue: Requests waiting for memory allocation
- Transfer Queue: Requests waiting for KV cache transfer
- Running: Requests actively generating
Purpose: Verify prefill/decode capacity balance.
Interpretation:
- Lines should align: System is balanced
- Decode consistently below prefill: Need more decode nodes
- Decode above prefill: Prefill is the bottleneck, decode underutilized
Toggle: Convert from tokens/s to requests/s using ISL/OSL for clearer comparison.
Note: This tab only applies to disaggregated runs (prefill/decode split). Aggregated runs are skipped.
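One plausible reading of the tokens/s to requests/s toggle is a division by the sequence lengths: prefill input throughput divided by ISL, and decode output throughput divided by OSL. Whether the dashboard computes exactly this is an assumption, and the throughput figures below are illustrative.

```python
def prefill_requests_per_s(input_tokens_per_s: float, isl: int) -> float:
    """Requests/s the prefill side can feed, assuming ISL input tokens per request."""
    return input_tokens_per_s / isl


def decode_requests_per_s(output_tokens_per_s: float, osl: int) -> float:
    """Requests/s the decode side can finish, assuming OSL output tokens per request."""
    return output_tokens_per_s / osl


isl, osl = 1024, 1024
prefill_rate = prefill_requests_per_s(16384.0, isl)  # illustrative throughput
decode_rate = decode_requests_per_s(12288.0, osl)    # illustrative throughput

if decode_rate < prefill_rate:
    verdict = "decode below prefill: consider more decode nodes"
elif decode_rate > prefill_rate:
    verdict = "decode above prefill: prefill is the bottleneck"
else:
    verdict = "balanced"
print(prefill_rate, decode_rate, verdict)
```

Expressing both sides in requests/s puts them on the same scale even when ISL and OSL differ.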
Purpose: Review the exact configuration of each run.
Information Displayed:
- Overview: Node count, GPU type, ISL/OSL, profiler type
- Topology: Physical node assignments, service distribution
- Node Config: Command-line arguments for each worker
- Environment: Environment variables by category (NCCL, SGLANG, CUDA, etc.)
Tags help organize and filter your experiments:
- Adding Tags: Expand a run in the sidebar Tags section, type a tag name, click "Add"
- Removing Tags: Click the "x" button next to any existing tag
- Filtering by Tags: Use the Tags filter in the Filters section
Tags are stored in the run's {jobid}.json file and persist across sessions.
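Because tags live in the metadata file, they can also be edited from a script. The sketch below assumes the top-level "tags" list shown in the sample metadata. `add_tag` is a hypothetical helper, and a running dashboard may need a reload before it picks up the change.

```python
import json
import tempfile
from pathlib import Path


def add_tag(metadata_path: Path, tag: str) -> list:
    """Append a tag to the run's metadata file if it is not already present."""
    data = json.loads(metadata_path.read_text())
    tags = data.setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
        metadata_path.write_text(json.dumps(data, indent=2))
    return tags


# Demonstrate on a temporary copy rather than a real run directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "3667.json"
    path.write_text(json.dumps({"run_metadata": {"slurm_job_id": "3667"}, "tags": ["baseline"]}))
    add_tag(path, "h100")
    add_tag(path, "h100")  # second call is a no-op
    saved_tags = json.loads(path.read_text())["tags"]

print(saved_tags)
```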
Browse directly to the profiler results:
# List all runs
ls logs/
# View a specific run's metadata
cat logs/3667_1P_4D_20251110_192145/3667.json | jq .
# List benchmark results
ls logs/3667_1P_4D_20251110_192145/sa-bench_isl_1024_osl_1024/
# View results for specific concurrency
cat logs/3667_1P_4D_20251110_192145/sa-bench_isl_1024_osl_1024/concurrency_64.json | jq .
Extract specific metrics across concurrency levels:
# Get throughput for all concurrency levels in a run
for f in logs/3667_*/sa-bench_*/concurrency_*.json; do
    # Quote "$f" so paths containing spaces are handled correctly
    echo "$(basename "$f"): $(jq '.output_throughput' "$f") TPS"
done
# Extract mean TTFT across runs
jq -r '[.max_concurrency, .mean_ttft_ms] | @tsv' logs/*/sa-bench_*/concurrency_*.json
# Find the best throughput across all runs
find logs -name "concurrency_*.json" -exec jq -r \
    '[input_filename, .output_throughput] | @tsv' {} \; | \
    sort -t$'\t' -k2 -nr | head -10
For programmatic analysis, use the RunLoader class:
from analysis.srtlog import RunLoader, NodeAnalyzer
# Load all runs from a directory
loader = RunLoader("logs")
runs = loader.load_all()
# Filter runs
h100_runs = [r for r in runs if "H100" in (r.metadata.gpu_type or "")]
# Access metadata
for run in runs:
    print(f"Job {run.job_id}: {run.metadata.topology_label}")
    print(f"  GPU: {run.metadata.gpu_type}")
    print(f"  Complete: {run.is_complete}")
    # Access benchmark results
    for i, concurrency in enumerate(run.profiler.concurrency_values):
        tps = run.profiler.output_tps[i]
        ttft = run.profiler.mean_ttft_ms[i]
        print(f"  C={concurrency}: {tps:.0f} TPS, TTFT={ttft:.1f}ms")
# Convert to DataFrame for analysis
df = loader.to_dataframe(runs)
print(df.describe())
# Analyze node-level metrics
analyzer = NodeAnalyzer()
nodes = analyzer.parse_run_logs("logs/3667_1P_4D_20251110_192145")
prefill_nodes = analyzer.get_prefill_nodes(nodes)
decode_nodes = analyzer.get_decode_nodes(nodes)
import pandas as pd
from analysis.srtlog import RunLoader
loader = RunLoader("logs")
df = loader.to_dataframe()
# Best throughput per topology
best_by_topology = df.groupby(["Prefill Workers", "Decode Workers"])["Output TPS"].max()
print(best_by_topology)
# Average latency by concurrency
avg_latency = df.groupby("Concurrency")[["Mean TTFT (ms)", "Mean TPOT (ms)"]].mean()
print(avg_latency)
# Find optimal concurrency (best TPS/GPU while meeting latency target)
target_ttft = 500 # ms
valid_points = df[df["Mean TTFT (ms)"] <= target_ttft]
best = valid_points.loc[valid_points["Output TPS/GPU"].idxmax()]
print(f"Optimal: Concurrency={best['Concurrency']}, TPS/GPU={best['Output TPS/GPU']:.1f}")
| Metric | Description | Unit |
|---|---|---|
| Output TPS | Total output tokens generated per second across all users | tokens/s |
| Total TPS | Total tokens processed (input + output) per second | tokens/s |
| Request Throughput | Number of requests completed per second | requests/s |
| Request Goodput | Successful requests per second (excludes errors) | requests/s |
| Output TPS/GPU | Output TPS divided by total GPU count | tokens/s/GPU |
| Output TPS/User | Per-user generation rate (1000/TPOT) | tokens/s |
| Metric | Description | What It Tells You |
|---|---|---|
| TTFT | Time to First Token | User-perceived responsiveness |
| TPOT | Time Per Output Token | Streaming speed during generation |
| ITL | Inter-Token Latency | Token spacing (similar to TPOT) |
| E2EL | End-to-End Latency | Total request duration |
- Mean: Average across all requests (sensitive to outliers)
- Median (p50): Middle value (50% of requests faster, 50% slower)
- p90: 90% of requests complete faster than this
- p99: 99% of requests complete faster than this (tail latency)
- Standard Deviation: Spread around the mean
Best Practices:
- Use p99 for SLA commitments
- Use median for typical user experience
- Large gap between median and p99 indicates scheduling issues or resource contention
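The effect of outliers on these statistics is easy to see on a small sample. The sketch below uses the standard library and a nearest-rank percentile, which is one of several common percentile definitions and not necessarily the one the benchmark uses. The latency values are made up for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative TTFT samples in milliseconds, with one slow outlier.
ttft_ms = [180, 190, 195, 200, 205, 210, 220, 230, 250, 900]

mean_ttft = statistics.mean(ttft_ms)
median_ttft = statistics.median(ttft_ms)
p90_ttft = percentile(ttft_ms, 90)
p99_ttft = percentile(ttft_ms, 99)

print(f"mean={mean_ttft}, median={median_ttft}, p90={p90_ttft}, p99={p99_ttft}")
```

A single slow request pulls the mean well above the median and sets the p99 outright, which is why the median-to-p99 gap is a useful contention signal.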
These are general guidelines; actual targets depend on your use case:
| Metric | Good | Acceptable | Concerning |
|---|---|---|---|
| TTFT (p99) | < 500ms | 500-1000ms | > 1000ms |
| TPOT (mean) | < 30ms | 30-50ms | > 50ms |
| Output TPS/GPU | > 200 | 100-200 | < 100 |
| KV Cache Utilization | 40-80% | 20-90% | > 95% or < 10% |
| Queue Depth | 0-10 | 10-50 | > 50 (growing) |
Note: These vary significantly by:
- Model size (larger models = slower)
- Hardware (H100 vs A100 vs L40S)
- Sequence lengths (longer = slower)
- Batch sizes and concurrency
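The first three rows of the table can be encoded as simple checks, for example in a CI gate. This is a sketch: the helper names are hypothetical, the handling of values exactly on a boundary is an assumption, and the thresholds should be replaced with targets suited to your model and hardware.

```python
def rate_latency(value_ms, good_below, concerning_above):
    """Classify a lower-is-better metric such as TTFT or TPOT."""
    if value_ms < good_below:
        return "good"
    if value_ms <= concerning_above:
        return "acceptable"
    return "concerning"


def rate_throughput(value, good_above, concerning_below):
    """Classify a higher-is-better metric such as Output TPS/GPU."""
    if value > good_above:
        return "good"
    if value >= concerning_below:
        return "acceptable"
    return "concerning"


# Inputs from the sample concurrency_64.json; 40 GPUs is an assumed total
# for the 1P/4D run with 8 GPUs per node.
report = {
    "TTFT (p99)": rate_latency(892.1, good_below=500, concerning_above=1000),
    "TPOT (mean)": rate_latency(32.1, good_below=30, concerning_above=50),
    "Output TPS/GPU": rate_throughput(15234.5 / 40, good_above=200, concerning_below=100),
}
for metric, rating in report.items():
    print(f"{metric}: {rating}")
```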
Establish a tagging convention for your team:
# Example tags
baseline-v1 # First baseline measurement
optimized-chunked # With chunked prefill
production-20251115 # Production configuration snapshot
regression-test # Automated regression tests
Compare two configurations:
- Run both configurations with identical:
  - ISL/OSL settings
  - Concurrency levels
  - Hardware (if possible)
- Tag runs appropriately (e.g., configA, configB)
- In the dashboard:
  - Filter to show only your tagged runs
  - Select both runs for side-by-side comparison
  - Use the Pareto graph to see efficiency differences
Identify regressions:
from analysis.srtlog import RunLoader
loader = RunLoader("logs")
runs = loader.load_all()
# Get baseline and current runs by tag
baseline = [r for r in runs if "baseline" in r.tags]
current = [r for r in runs if "current" in r.tags]
# Compare at same concurrency
# Note: zip() pairs runs by position, so both lists must be in matching order
for b, c in zip(baseline, current):
    for i, conc in enumerate(b.profiler.concurrency_values):
        if conc in c.profiler.concurrency_values:
            j = c.profiler.concurrency_values.index(conc)
            b_tps = b.profiler.output_tps[i]
            c_tps = c.profiler.output_tps[j]
            diff = (c_tps - b_tps) / b_tps * 100
            print(f"Concurrency {conc}: {diff:+.1f}% change")
In the dashboard sidebar:
- Use Topology filter to compare same worker ratios
- Use ISL/OSL filter to compare same workload profiles
- Use Container filter to compare software versions
- Use GPU Type filter to compare hardware
From the dashboard Pareto tab, click "Download Data as CSV" to export:
- All selected runs
- All concurrency levels
- All computed metrics (TPS, TPS/GPU, TPS/User, latencies)
Raw JSON is already available in the logs directory. To consolidate:
# Export all results to a single JSON file
python -c "
import json
from analysis.srtlog import RunLoader
loader = RunLoader('logs')
df = loader.to_dataframe()
print(df.to_json(orient='records', indent=2))
" > all_results.json
The analysis system automatically caches parsed data as Parquet files:
import pandas as pd
# Read cached benchmark results
df = pd.read_parquet("logs/3667_1P_4D_20251110_192145/.cache/benchmark_results.parquet")
# Read cached node metrics
nodes_df = pd.read_parquet("logs/3667_1P_4D_20251110_192145/.cache/node_metrics.parquet")
Grafana/InfluxDB:
from influxdb_client import InfluxDBClient
from analysis.srtlog import RunLoader
loader = RunLoader("logs")
df = loader.to_dataframe()
# Write to InfluxDB
with InfluxDBClient(url="http://localhost:8086", token="...") as client:
    write_api = client.write_api()
    # Convert DataFrame to line protocol and write
Jupyter Notebooks:
# In a Jupyter cell
from analysis.srtlog import RunLoader
import matplotlib.pyplot as plt
loader = RunLoader("logs")
df = loader.to_dataframe()
# Create custom visualizations
df.groupby("Concurrency")["Output TPS"].mean().plot(kind="bar")
plt.title("Average Throughput by Concurrency")
plt.show()
Symptoms: Dashboard shows spinner indefinitely or errors on startup
Solutions:
- Check that the logs directory exists: ls -la logs/
- Verify at least one run has {jobid}.json: ls logs/*/*.json
- Check for Python errors: uv run streamlit run analysis/dashboard/app.py 2>&1
- Clear Streamlit cache: rm -rf ~/.streamlit/cache
Symptoms: Some runs don't appear in the run selector
Causes and Solutions:
- No metadata file: Each run must have {jobid}.json
      # Check if metadata exists
      ls logs/3667_*/3667.json
- No benchmark results: Runs without profiler output are skipped
      # Check for profiler results
      ls logs/3667_*/sa-bench_*/
- Profiling jobs: torch-profiler type runs are intentionally skipped
      # Check profiler type
      jq '.profiler_metadata.type' logs/3667_*/3667.json
- Cache invalidation: Force reload by clicking "Sync Now" or restarting the dashboard
Symptoms: Dashboard shows "Job X is incomplete - Missing concurrencies: [128, 256]"
Causes:
- Benchmark timed out before completing all concurrency levels
- Job was cancelled mid-run
- Profiler crashed at higher concurrencies
Solutions:
- Check SLURM logs for timeout or OOM errors
- Re-run with longer timeout
- Reduce max concurrency for resource-constrained setups
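The same completeness check can be run from a script by comparing the "concurrencies" string in the run metadata (for example "16x32x64x128x256") against the result files actually present. This is a sketch: `missing_concurrencies` is a hypothetical helper, not the dashboard's own check, and the demo uses a temporary directory instead of a real run.

```python
import re
import tempfile
from pathlib import Path


def missing_concurrencies(expected: str, results_dir: Path) -> list:
    """Return requested concurrency levels that have no concurrency_<N>.json file."""
    wanted = [int(value) for value in expected.split("x")]
    found = set()
    for path in results_dir.glob("concurrency_*.json"):
        match = re.fullmatch(r"concurrency_(\d+)\.json", path.name)
        if match:
            found.add(int(match[1]))
    return [level for level in wanted if level not in found]


# Simulate a run that stopped after concurrency 64.
with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp)
    for level in (16, 32, 64):
        (results_dir / f"concurrency_{level}.json").write_text("{}")
    missing = missing_concurrencies("16x32x64x128x256", results_dir)

print(missing)
```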
Symptoms: Node Metrics tab shows "No log files found"
Causes:
- Log files don't match expected pattern
- Logs were not captured (stderr redirect issue)
Solutions:
- Verify log file naming: ls logs/3667_*/*_prefill_*.err
- Check file contents: head logs/3667_*/*_prefill_*.err
- Verify log format matches expected patterns (see Log Files section)
Symptoms: Dashboard takes a long time to load or refresh
Causes:
- Many runs to parse
- Cache invalidation
- Large log files
Solutions:
- Parquet caching speeds up subsequent loads automatically
- Delete old runs you no longer need
- Use filters to reduce the number of selected runs
- Increase _cache_version in components.py only when the parser changes
Symptoms: Metrics don't match expected values or show as "N/A"
Causes:
- Benchmark output format changed
- Incomplete benchmark run
- Parse error in result files
Solutions:
- Verify raw JSON is valid:
      jq . logs/3667_*/sa-bench_*/concurrency_64.json
- Check for required fields:
      jq 'keys' logs/3667_*/sa-bench_*/concurrency_64.json
- Clear parquet cache and reload:
      rm -rf logs/3667_*/.cache/
      make dashboard
      # or
      uv run streamlit run analysis/dashboard/app.py
from analysis.srtlog import RunLoader, NodeAnalyzer
# Load runs
loader = RunLoader("logs")
runs = loader.load_all()
# Get DataFrame
df = loader.to_dataframe()
# Parse node logs
analyzer = NodeAnalyzer()
nodes = analyzer.parse_run_logs("logs/3667_1P_4D_20251110_192145")
- Run metadata: logs/{run_dir}/{jobid}.json
- Benchmark results: logs/{run_dir}/{profiler}_isl_{isl}_osl_{osl}/concurrency_*.json
- Worker logs: logs/{run_dir}/*_{prefill|decode}_*.err
- Node configs: logs/{run_dir}/*_config.json
- Cache files: logs/{run_dir}/.cache/*.parquet