Analyzing Benchmark Results

This guide explains how to analyze benchmark results, from launching the interactive dashboard to parsing the raw data programmatically.

Overview
Understanding Output Structure
Interactive Dashboard
Command-Line Analysis
Metrics Deep Dive
Comparing Experiments
Exporting Data
Troubleshooting Analysis
Quick Reference

Overview

The analysis toolkit provides two primary ways to analyze benchmark results:

| Tool | Use Case | Access |
| --- | --- | --- |
| Interactive Dashboard | Visual exploration, comparing runs, real-time filtering | `make dashboard` |
| Programmatic Analysis | Automation, custom analysis, CI/CD integration | Python API or raw JSON |

When to Use Each Tool

Dashboard: Ideal for exploring results, comparing configurations, identifying trends, and presenting findings to stakeholders
Command-Line/Python: Best for automated analysis pipelines, custom metrics, and integration with other tools

Understanding Output Structure

Directory Layout

Each benchmark run creates a directory under logs/ (or your configured output directory):

logs/
  3667_1P_4D_20251110_192145/           # Job ID + Topology + Timestamp
    3667.json                            # Run metadata (required)
    sa-bench_isl_1024_osl_1024/          # Profiler results directory
      concurrency_16.json                # Results for each concurrency level
      concurrency_32.json
      concurrency_64.json
      ...
    watchtower-navy-cn01_prefill_w0.err  # Worker stderr logs
    watchtower-navy-cn01_prefill_w0.out  # Worker stdout logs
    watchtower-navy-cn02_decode_w0.err
    watchtower-navy-cn02_decode_w0.out
    watchtower-navy-cn01_prefill_config.json  # Node configuration snapshots
    watchtower-navy-cn02_decode_config.json
    .cache/                              # Parquet cache files (auto-generated)
      benchmark_results.parquet
      node_metrics.parquet

Directory Naming Convention

The run directory name encodes key information:

{SLURM_JOB_ID}_{TOPOLOGY}_{TIMESTAMP}
      |            |           |
      |            |           +-- YYYYMMDD_HHMMSS
      |            +-- 1P_4D (1 prefill, 4 decode) or 8A (8 aggregated)
      +-- SLURM job identifier
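
As a rough illustration of the naming convention above, the following sketch splits a run directory name into its parts. The helper name `parse_run_dir` and the regular expression are assumptions made for this example (including the assumption that job IDs are purely numeric); they are not part of the analysis toolkit.

```python
import re
from datetime import datetime

# Pattern for {SLURM_JOB_ID}_{TOPOLOGY}_{TIMESTAMP}. The topology is either
# "<n>P_<m>D" (disaggregated) or "<n>A" (aggregated).
RUN_DIR_RE = re.compile(
    r"^(?P<job_id>\d+)_"
    r"(?:(?P<prefill>\d+)P_(?P<decode>\d+)D|(?P<aggregated>\d+)A)_"
    r"(?P<timestamp>\d{8}_\d{6})$"
)


def parse_run_dir(name):
    """Split a run directory name into job ID, topology, and timestamp (hypothetical helper)."""
    match = RUN_DIR_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized run directory name: {name!r}")
    parts = match.groupdict()
    if parts["aggregated"] is not None:
        topology = {"mode": "aggregated", "workers": int(parts["aggregated"])}
    else:
        topology = {
            "mode": "disaggregated",
            "prefill": int(parts["prefill"]),
            "decode": int(parts["decode"]),
        }
    return {
        "job_id": parts["job_id"],
        "topology": topology,
        "timestamp": datetime.strptime(parts["timestamp"], "%Y%m%d_%H%M%S"),
    }


info = parse_run_dir("3667_1P_4D_20251110_192145")
print(info["job_id"], info["topology"], info["timestamp"].isoformat())
```

Parsing the name is convenient for quick scripts, but the metadata file remains the more reliable source for run configuration.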

Metadata File (`{jobid}.json`)

The JSON metadata file is the source of truth for run configuration:

{
  "run_metadata": {
    "slurm_job_id": "3667",
    "run_date": "20251110_192145",
    "container": "ghcr.io/sgl-project/sglang:v0.4.1-cu121",
    "prefill_nodes": 1,
    "decode_nodes": 4,
    "prefill_workers": 1,
    "decode_workers": 4,
    "gpus_per_node": 8,
    "gpu_type": "NVIDIA H100 80GB HBM3",
    "mode": "disaggregated",
    "model_dir": "/models/DeepSeek-V3"
  },
  "profiler_metadata": {
    "type": "sa-bench",
    "isl": "1024",
    "osl": "1024",
    "concurrencies": "16x32x64x128x256"
  },
  "tags": ["baseline", "h100"]
}
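
Note that in the sample above `isl`, `osl`, and `concurrencies` are stored as strings, with the concurrency levels joined by `x`. A minimal sketch of converting them to numbers, using a trimmed copy of the sample metadata inline rather than reading a real file:

```python
import json

# Trimmed copy of the sample metadata shown above.
metadata_text = """
{
  "profiler_metadata": {
    "type": "sa-bench",
    "isl": "1024",
    "osl": "1024",
    "concurrencies": "16x32x64x128x256"
  },
  "tags": ["baseline", "h100"]
}
"""

metadata = json.loads(metadata_text)
profiler = metadata["profiler_metadata"]

# The sequence lengths and the concurrency list are strings in the file.
isl = int(profiler["isl"])
osl = int(profiler["osl"])
concurrencies = [int(c) for c in profiler["concurrencies"].split("x")]

print(isl, osl, concurrencies)
```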

Benchmark Result Files

Each concurrency level produces a JSON file in the profiler results directory:

{
  "max_concurrency": 64,
  "output_throughput": 15234.5,
  "total_token_throughput": 23456.7,
  "request_throughput": 14.8,
  "mean_ttft_ms": 245.3,
  "mean_tpot_ms": 32.1,
  "mean_itl_ms": 31.8,
  "mean_e2el_ms": 1456.2,
  "median_ttft_ms": 198.4,
  "median_tpot_ms": 28.9,
  "p99_ttft_ms": 892.1,
  "p99_tpot_ms": 78.4,
  "total_input_tokens": 65536,
  "total_output_tokens": 65536,
  "completed": 1000,
  "duration": 67.5
}
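
When reading these files without the Python API, it helps to order them by the `max_concurrency` field rather than by filename, since a plain filename sort places `concurrency_128.json` before `concurrency_16.json`. The sketch below assumes only the file layout and field names shown above; `load_results` is a hypothetical helper, and the demo writes throwaway files with made-up throughput values instead of touching a real run.

```python
import json
import tempfile
from pathlib import Path


def load_results(results_dir):
    """Return the parsed concurrency_*.json files, ordered by concurrency (hypothetical helper)."""
    results = []
    for path in Path(results_dir).glob("concurrency_*.json"):
        with path.open() as f:
            results.append(json.load(f))
    # Sort numerically on the field, not lexicographically on the filename.
    return sorted(results, key=lambda r: r["max_concurrency"])


# Demo with temporary files standing in for a profiler results directory.
# The throughput values for concurrency 16 and 128 are made up.
with tempfile.TemporaryDirectory() as tmp:
    for conc, tps in [(128, 20000.0), (16, 5000.0), (64, 15234.5)]:
        Path(tmp, f"concurrency_{conc}.json").write_text(
            json.dumps({"max_concurrency": conc, "output_throughput": tps})
        )
    results = load_results(tmp)

print([(r["max_concurrency"], r["output_throughput"]) for r in results])
```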

Log Files

Worker log files contain runtime metrics:

.err files: Application logs with batch-level metrics
.out files: Standard output (often empty, or containing only startup information)

Example log line format:

[2025-11-04 05:31:43 DP0 TP0 EP0] Prefill batch, #new-seq: 18, #new-token: 16384,
#cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
#prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.00
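
For ad hoc inspection, a line in this format can be split into a dictionary of fields. The following is a sketch based solely on the example line above; `parse_batch_line` is a hypothetical helper, the real parser in the toolkit may work differently, and other log lines may not follow this layout.

```python
import re

# Matches "[<date> <time> DP<n> TP<n> EP<n>] <body>".
HEADER_RE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"DP(?P<dp>\d+) TP(?P<tp>\d+) EP(?P<ep>\d+)\] (?P<body>.*)$"
)


def parse_batch_line(line):
    """Parse one batch log line into a dict, or return None if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    # The body is "<batch type>, <key>: <value>, <key>: <value>, ...".
    batch_type, *fields = match["body"].split(", ")
    record = {
        "timestamp": match["timestamp"],
        "dp": int(match["dp"]),
        "tp": int(match["tp"]),
        "ep": int(match["ep"]),
        "batch_type": batch_type,
    }
    for field in fields:
        key, sep, value = field.rpartition(": ")
        if not sep:
            continue  # not a "key: value" pair
        record[key.lstrip("#")] = float(value)
    return record


# The example line from above, joined into a single line.
line = (
    "[2025-11-04 05:31:43 DP0 TP0 EP0] Prefill batch, #new-seq: 18, "
    "#new-token: 16384, #cached-token: 0, token usage: 0.00, #running-req: 0, "
    "#queue-req: 0, #prealloc-req: 0, #inflight-req: 0, "
    "input throughput (token/s): 0.00"
)
record = parse_batch_line(line)
print(record["batch_type"], record["new-seq"], record["new-token"])
```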

Configuration Snapshots

*_config.json files capture the complete node configuration at runtime:

GPU information (count, type, memory, driver version)
Server arguments (TP/DP/PP size, attention backend, KV cache settings)
Environment variables (NCCL, CUDA, SGLANG settings)
Command-line arguments actually passed

Interactive Dashboard

Launching the Dashboard

uv run streamlit run analysis/dashboard/app.py

The dashboard opens at http://localhost:8501 by default.

Dashboard Configuration

On the left sidebar, you will see:

Logs Directory Path: Set the path to your outputs directory (defaults to outputs/)

Run Selection

The sidebar provides the following filters:

GPU Type Filter

Filter runs by GPU hardware (e.g., H100, A100, L40S). Useful when comparing across different hardware generations.

Topology Filter

Filter by worker configuration:

Disaggregated: 1P/4D, 2P/8D, etc.
Aggregated: 4A, 8A, etc.

ISL/OSL Filter

Filter by input/output sequence length combinations (e.g., 1024/1024, 2048/512).

Container Filter

Filter by container image version to compare software updates.

Tags Filter

Filter by custom tags you have assigned to runs. Tags help organize experiments:

baseline - Control runs
optimized - Runs with optimizations
production - Production-ready configurations

Dashboard Tabs

The dashboard has five main tabs:

1. Pareto Graph Tab

Purpose: Visualize the efficiency trade-off between throughput per GPU and throughput per user.

What You See:

X-axis: Output TPS/User - Token generation rate experienced by each user (1000/TPOT)
Y-axis: Output TPS/GPU or Total TPS/GPU - GPU utilization efficiency

Key Features:

Y-axis toggle: Switch between Output TPS/GPU (decode tokens only) and Total TPS/GPU (input + output)
TPS/User cutoff line: Add a vertical line to mark your target throughput requirement
Pareto Frontier: Highlight the efficient frontier where no other configuration is strictly better

Interpreting the Graph:

Points up and to the right are better (higher efficiency AND higher per-user throughput)
Points on the Pareto frontier represent optimal trade-offs
Use the cutoff line to identify configurations meeting your latency requirements
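
To make the idea of the frontier concrete, here is a minimal sketch of how a Pareto frontier can be computed from (TPS/User, TPS/GPU) pairs. It is not the dashboard's implementation, and the sample points are made up for illustration.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is (tps_per_user, tps_per_gpu); higher is better on both axes.
    """
    frontier = []
    best_tps_per_gpu = float("-inf")
    # Walk from the highest TPS/User downward. A point belongs to the frontier
    # only if its TPS/GPU beats every point with equal or higher TPS/User.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_tps_per_gpu:
            frontier.append((x, y))
            best_tps_per_gpu = y
    return sorted(frontier)


# Made-up (Output TPS/User, Output TPS/GPU) points.
points = [(20.0, 400.0), (31.2, 380.9), (45.0, 250.0), (30.0, 300.0), (45.0, 200.0)]
frontier = pareto_frontier(points)
print(frontier)
```

In this example (30.0, 300.0) drops out because (31.2, 380.9) is better on both axes, and (45.0, 200.0) drops out because (45.0, 250.0) matches it on TPS/User with higher TPS/GPU.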

Metric Calculations:

$$\text{Output TPS/GPU} = \frac{\text{Total Output Throughput (tokens/s)}}{\text{Total Number of GPUs}}$$

$$\text{Output TPS/User} = \frac{1000}{\text{Mean TPOT (ms)}}$$
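
As a worked example, the two formulas can be applied to the sample values from the result file and metadata shown earlier. One assumption here is that the total GPU count equals (prefill nodes + decode nodes) × GPUs per node; the dashboard may count GPUs differently.

```python
# Values taken from the sample concurrency_64.json and metadata shown earlier.
output_throughput = 15234.5  # tokens/s
mean_tpot_ms = 32.1
prefill_nodes, decode_nodes, gpus_per_node = 1, 4, 8

# Assumption: every node contributes all of its GPUs to the run.
total_gpus = (prefill_nodes + decode_nodes) * gpus_per_node

output_tps_per_gpu = output_throughput / total_gpus
output_tps_per_user = 1000 / mean_tpot_ms

print(f"Output TPS/GPU:  {output_tps_per_gpu:.1f}")
print(f"Output TPS/User: {output_tps_per_user:.1f}")
```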

Data Export: Click "Download Data as CSV" to export all data points.

2. Latency Analysis Tab

Purpose: Analyze latency metrics across concurrency levels.

Graphs Displayed:

TTFT (Time to First Token): Time from request submission to first output token
- Critical for perceived responsiveness
- Should remain stable under load
TPOT (Time Per Output Token): Average time between consecutive output tokens
- Determines streaming speed
- Lower TPOT = faster generation
ITL (Inter-Token Latency): Similar to TPOT but may include queueing delays
- Useful for diagnosing scheduling issues

Summary Statistics: Table showing min/max values for each metric across selected runs.

3. Node Metrics Tab

Purpose: Deep dive into runtime behavior of individual workers.

Aggregation Modes:

Individual nodes: See every worker separately
Group by DP rank: Average metrics across tensor parallel workers within each data parallel group
Aggregate all nodes: Single averaged line per run

Prefill Node Metrics:

Input Throughput: Tokens/s being processed in prefill
Inflight Requests: Requests sent to decode workers awaiting completion
KV Cache Utilization: Memory pressure indicator
Queued Requests: Backpressure indicator

Decode Node Metrics:

Running Requests: Active generation requests
Generation Throughput: Output tokens/s
KV Cache Utilization: Memory pressure
Queued Requests: Decode capacity indicator

Disaggregation Metrics (Stacked or Separate views):

Prealloc Queue: Requests waiting for memory allocation
Transfer Queue: Requests waiting for KV cache transfer
Running: Requests actively generating

4. Rate Match Tab

Purpose: Verify prefill/decode capacity balance.

Interpretation:

Lines align: The system is balanced
Decode consistently below prefill: Decode is the bottleneck; add decode nodes
Decode above prefill: Prefill is the bottleneck and decode is underutilized

Toggle: Convert from tokens/s to requests/s using ISL/OSL for clearer comparison.

Note: This tab only applies to disaggregated runs (prefill/decode split). Aggregated runs are skipped.
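
The tokens/s to requests/s conversion can be sketched as a division by the sequence length: prefill token rates by ISL, decode token rates by OSL. This is an assumption about how the toggle works, inferred from the description above, and the prefill rate below is a made-up value.

```python
def tokens_to_requests(tokens_per_second, tokens_per_request):
    """Convert a token rate to a request rate for a fixed sequence length."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return tokens_per_second / tokens_per_request


isl, osl = 1024, 1024
prefill_input_tps = 16384.0  # made-up prefill input throughput (tokens/s)
decode_output_tps = 15234.5  # output_throughput from the sample result file

prefill_rps = tokens_to_requests(prefill_input_tps, isl)
decode_rps = tokens_to_requests(decode_output_tps, osl)
print(f"prefill: {prefill_rps:.2f} req/s, decode: {decode_rps:.2f} req/s")
```

Comparing in requests/s matters most when ISL and OSL differ (for example 2048/512), because the raw token rates of the two stages are then not directly comparable.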

5. Configuration Tab

Purpose: Review the exact configuration of each run.

Information Displayed:

Overview: Node count, GPU type, ISL/OSL, profiler type
Topology: Physical node assignments, service distribution
Node Config: Command-line arguments for each worker
Environment: Environment variables by category (NCCL, SGLANG, CUDA, etc.)

Managing Tags

Tags help organize and filter your experiments:

Adding Tags: Expand a run in the sidebar Tags section, type a tag name, click "Add"
Removing Tags: Click the "x" button next to any existing tag
Filtering by Tags: Use the Tags filter in the Filters section

Tags are stored in the run's {jobid}.json file and persist across sessions.
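
Because tags live in the metadata file, they can in principle also be added from a script. The sketch below assumes `tags` is a top-level list, as in the sample metadata shown earlier; `add_tag` is a hypothetical helper rather than a toolkit function, and the demo operates on a temporary file.

```python
import json
import tempfile
from pathlib import Path


def add_tag(metadata_path, tag):
    """Append a tag to a run's metadata file if it is not already present."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text())
    tags = metadata.setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
        path.write_text(json.dumps(metadata, indent=2) + "\n")
    return tags


# Demo on a temporary file standing in for logs/{run_dir}/{jobid}.json.
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp, "3667.json")
    metadata_path.write_text(json.dumps({"tags": ["baseline", "h100"]}))
    add_tag(metadata_path, "optimized")
    add_tag(metadata_path, "optimized")  # second call changes nothing
    tags_on_disk = json.loads(metadata_path.read_text())["tags"]

print(tags_on_disk)
```

If the dashboard is running while a file is edited this way, a reload may be needed before the new tag shows up.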

Command-Line Analysis

Accessing Raw JSON Results

Browse directly to the profiler results:

# List all runs
ls logs/

# View a specific run's metadata
jq . logs/3667_1P_4D_20251110_192145/3667.json

# List benchmark results
ls logs/3667_1P_4D_20251110_192145/sa-bench_isl_1024_osl_1024/

# View results for specific concurrency
jq . logs/3667_1P_4D_20251110_192145/sa-bench_isl_1024_osl_1024/concurrency_64.json

Using jq for Analysis

Extract specific metrics across concurrency levels:

# Get throughput for all concurrency levels in a run
for f in logs/3667_*/sa-bench_*/concurrency_*.json; do
  echo "$(basename "$f"): $(jq '.output_throughput' "$f") TPS"
done

# Extract mean TTFT across runs
jq -r '[.max_concurrency, .mean_ttft_ms] | @tsv' logs/*/sa-bench_*/concurrency_*.json

# Find the best throughput across all runs
find logs -name "concurrency_*.json" -exec jq -r \
  '[input_filename, .output_throughput] | @tsv' {} \; | \
  sort -t$'\t' -k2 -nr | head -10

Python API

For programmatic analysis, use the RunLoader class:

from analysis.srtlog import RunLoader, NodeAnalyzer

# Load all runs from a directory
loader = RunLoader("logs")
runs = loader.load_all()

# Filter runs
h100_runs = [r for r in runs if "H100" in (r.metadata.gpu_type or "")]

# Access metadata
for run in runs:
    print(f"Job {run.job_id}: {run.metadata.topology_label}")
    print(f"  GPU: {run.metadata.gpu_type}")
    print(f"  Complete: {run.is_complete}")

    # Access benchmark results
    for i, concurrency in enumerate(run.profiler.concurrency_values):
        tps = run.profiler.output_tps[i]
        ttft = run.profiler.mean_ttft_ms[i]
        print(f"  C={concurrency}: {tps:.0f} TPS, TTFT={ttft:.1f}ms")

# Convert to DataFrame for analysis
df = loader.to_dataframe(runs)
print(df.describe())

# Analyze node-level metrics
analyzer = NodeAnalyzer()
nodes = analyzer.parse_run_logs("logs/3667_1P_4D_20251110_192145")
prefill_nodes = analyzer.get_prefill_nodes(nodes)
decode_nodes = analyzer.get_decode_nodes(nodes)

Pandas Analysis Examples

import pandas as pd
from analysis.srtlog import RunLoader

loader = RunLoader("logs")
df = loader.to_dataframe()

# Best throughput per topology
best_by_topology = df.groupby(["Prefill Workers", "Decode Workers"])["Output TPS"].max()
print(best_by_topology)

# Average latency by concurrency
avg_latency = df.groupby("Concurrency")[["Mean TTFT (ms)", "Mean TPOT (ms)"]].mean()
print(avg_latency)

# Find optimal concurrency (best TPS/GPU while meeting latency target)
target_ttft = 500  # ms
valid_points = df[df["Mean TTFT (ms)"] <= target_ttft]
if valid_points.empty:
    # idxmax() raises ValueError on an empty selection
    print(f"No configuration meets TTFT <= {target_ttft} ms")
else:
    best = valid_points.loc[valid_points["Output TPS/GPU"].idxmax()]
    print(f"Optimal: Concurrency={best['Concurrency']}, TPS/GPU={best['Output TPS/GPU']:.1f}")

Metrics Deep Dive

Throughput Metrics

| Metric | Description | Unit |
| --- | --- | --- |
| Output TPS | Total output tokens generated per second across all users | tokens/s |
| Total TPS | Total tokens processed (input + output) per second | tokens/s |
| Request Throughput | Number of requests completed per second | requests/s |
| Request Goodput | Successful requests per second (excludes errors) | requests/s |
| Output TPS/GPU | Output TPS divided by total GPU count | tokens/s/GPU |
| Output TPS/User | Per-user generation rate (1000/TPOT) | tokens/s |

Latency Metrics

| Metric | Description | What It Tells You |
| --- | --- | --- |
| TTFT | Time to First Token | User-perceived responsiveness |
| TPOT | Time Per Output Token | Streaming speed during generation |
| ITL | Inter-Token Latency | Token spacing (similar to TPOT) |
| E2EL | End-to-End Latency | Total request duration |

Understanding Percentiles

Mean: Average across all requests (sensitive to outliers)
Median (p50): Middle value (50% of requests faster, 50% slower)
p90: 90% of requests complete faster than this
p99: 99% of requests complete faster than this (tail latency)
Standard Deviation: Spread around the mean

Best Practices:

Use p99 for SLA commitments
Use median for typical user experience
Large gap between median and p99 indicates scheduling issues or resource contention
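
A small example shows why the mean and the percentiles tell different stories. The latencies below are made up, and the nearest-rank method used here is only one of several percentile definitions; the benchmark tool may interpolate differently.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Made-up TTFT samples (ms): nine typical requests and one slow outlier.
ttft_ms = [180, 190, 195, 200, 205, 210, 220, 240, 260, 900]

mean_ttft = statistics.mean(ttft_ms)
median_ttft = statistics.median(ttft_ms)
p90_ttft = percentile(ttft_ms, 90)
p99_ttft = percentile(ttft_ms, 99)

print(f"mean={mean_ttft} median={median_ttft} p90={p90_ttft} p99={p99_ttft}")
```

The single 900 ms request pulls the mean to 280 ms, well above the 207.5 ms median, and it alone determines p99. This is the kind of median-to-p99 gap described above.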

What "Good" Metrics Look Like

These are general guidelines; actual targets depend on your use case:

| Metric | Good | Acceptable | Concerning |
| --- | --- | --- | --- |
| TTFT (p99) | < 500ms | 500-1000ms | > 1000ms |
| TPOT (mean) | < 30ms | 30-50ms | > 50ms |
| Output TPS/GPU | > 200 | 100-200 | < 100 |
| KV Cache Utilization | 40-80% | 20-90% | > 95% or < 10% |
| Queue Depth | 0-10 | 10-50 | > 50 (growing) |

Note: These vary significantly by:

Model size (larger models = slower)
Hardware (H100 vs A100 vs L40S)
Sequence lengths (longer = slower)
Batch sizes and concurrency
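
If you want to apply such guidelines in a script, the latency rows of the table can be encoded as thresholds. The sketch below uses the table's example values, which you would replace with targets suited to your model and hardware; `rate_latency` and the dictionary keys are hypothetical names.

```python
# (good_below_ms, acceptable_up_to_ms) per metric, copied from the guideline table.
LATENCY_THRESHOLDS_MS = {
    "ttft_p99": (500, 1000),
    "tpot_mean": (30, 50),
}


def rate_latency(metric, value_ms):
    """Classify a latency value as good, acceptable, or concerning (lower is better)."""
    good_below, acceptable_up_to = LATENCY_THRESHOLDS_MS[metric]
    if value_ms < good_below:
        return "good"
    if value_ms <= acceptable_up_to:
        return "acceptable"
    return "concerning"


# Values from the sample result file shown earlier.
ttft_rating = rate_latency("ttft_p99", 892.1)
tpot_rating = rate_latency("tpot_mean", 32.1)
print(ttft_rating, tpot_rating)
```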

Comparing Experiments

Using Tags for Organization

Establish a tagging convention for your team:

# Example tags
baseline-v1          # First baseline measurement
optimized-chunked    # With chunked prefill
production-20251115  # Production configuration snapshot
regression-test      # Automated regression tests

A/B Comparison Patterns

Compare two configurations:

1. Run both configurations with identical:
   - ISL/OSL settings
   - Concurrency levels
   - Hardware (if possible)
2. Tag the runs appropriately (e.g., configA, configB)
3. In the dashboard:
   - Filter to show only your tagged runs
   - Select both runs for side-by-side comparison
   - Use the Pareto graph to see efficiency differences

Identify regressions:

from analysis.srtlog import RunLoader

loader = RunLoader("logs")
runs = loader.load_all()

# Get baseline and current runs by tag
baseline = [r for r in runs if "baseline" in r.tags]
current = [r for r in runs if "current" in r.tags]

# Compare at same concurrency.
# Note: zip() pairs runs in load order and stops at the shorter list, so make
# sure the two lists line up (for example, one run per tag).
for b, c in zip(baseline, current):
    for i, conc in enumerate(b.profiler.concurrency_values):
        if conc in c.profiler.concurrency_values:
            j = c.profiler.concurrency_values.index(conc)
            b_tps = b.profiler.output_tps[i]
            c_tps = c.profiler.output_tps[j]
            if not b_tps:
                continue  # skip a missing or zero baseline to avoid dividing by zero
            diff = (c_tps - b_tps) / b_tps * 100
            print(f"Concurrency {conc}: {diff:+.1f}% change")

Filtering by Parameters

In the dashboard sidebar:

Use Topology filter to compare same worker ratios
Use ISL/OSL filter to compare same workload profiles
Use Container filter to compare software versions
Use GPU Type filter to compare hardware

Exporting Data

CSV Export

From the dashboard Pareto tab, click "Download Data as CSV" to export:

All selected runs
All concurrency levels
All computed metrics (TPS, TPS/GPU, TPS/User, latencies)

JSON Export

Raw JSON is already available in the logs directory. To consolidate:

# Export all results to a single JSON file
python -c "
import json
from analysis.srtlog import RunLoader

loader = RunLoader('logs')
df = loader.to_dataframe()
print(df.to_json(orient='records', indent=2))
" > all_results.json

Parquet Export (Cached Data)

The analysis system automatically caches parsed data as Parquet files:

import pandas as pd

# Read cached benchmark results
df = pd.read_parquet("logs/3667_1P_4D_20251110_192145/.cache/benchmark_results.parquet")

# Read cached node metrics
nodes_df = pd.read_parquet("logs/3667_1P_4D_20251110_192145/.cache/node_metrics.parquet")

Integration with Other Tools

Grafana/InfluxDB:

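import pandas as pd
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS
from analysis.srtlog import RunLoader

loader = RunLoader("logs")
df = loader.to_dataframe()

# The client uses the DataFrame index as the point timestamp, so give each row
# a distinct UTC timestamp (here: export time plus a per-row offset).
df.index = pd.Timestamp.now(tz="UTC") + pd.to_timedelta(range(len(df)), unit="ms")

# "my-org", "benchmarks", and "benchmark_results" are placeholders.
with InfluxDBClient(url="http://localhost:8086", token="...", org="my-org") as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    write_api.write(bucket="benchmarks", record=df,
                    data_frame_measurement_name="benchmark_results")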

Jupyter Notebooks:

# In a Jupyter cell
from analysis.srtlog import RunLoader
import matplotlib.pyplot as plt

loader = RunLoader("logs")
df = loader.to_dataframe()

# Create custom visualizations
df.groupby("Concurrency")["Output TPS"].mean().plot(kind="bar")
plt.title("Average Throughput by Concurrency")
plt.show()

Troubleshooting Analysis

Dashboard Won't Load

Symptoms: Dashboard shows spinner indefinitely or errors on startup

Solutions:

Check logs directory exists: `ls -la logs/`
Verify at least one run has {jobid}.json: `ls logs/*/*.json`
Check for Python errors: `uv run streamlit run analysis/dashboard/app.py 2>&1`
Clear Streamlit cache: `rm -rf ~/.streamlit/cache`

Missing Runs in Dashboard

Symptoms: Some runs don't appear in the run selector

Causes and Solutions:

No metadata file: Each run must have {jobid}.json

```
# Check if metadata exists
ls logs/3667_*/3667.json
```

No benchmark results: Runs without profiler output are skipped

```
# Check for profiler results
ls logs/3667_*/sa-bench_*/
```

Profiling jobs: torch-profiler type runs are intentionally skipped

```
# Check profiler type
jq '.profiler_metadata.type' logs/3667_*/3667.json
```

Cache invalidation: Force reload by clicking "Sync Now" or restarting dashboard

Incomplete Run Warning

Symptoms: Dashboard shows "Job X is incomplete - Missing concurrencies: [128, 256]"

Causes:

Benchmark timed out before completing all concurrency levels
Job was cancelled mid-run
Profiler crashed at higher concurrencies

Solutions:

Check SLURM logs for timeout or OOM errors
Re-run with longer timeout
Reduce max concurrency for resource-constrained setups
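
To check from a script which concurrency levels are missing, you can compare the `concurrencies` string in the metadata with the result files that actually exist. This sketch assumes the `x`-separated format and the `concurrency_<N>.json` naming shown earlier; `missing_concurrencies` is a hypothetical helper, not the dashboard's own check.

```python
import re


def missing_concurrencies(expected_spec, result_filenames):
    """Return expected concurrency levels that have no matching result file."""
    expected = [int(c) for c in expected_spec.split("x")]
    found = set()
    for name in result_filenames:
        match = re.fullmatch(r"concurrency_(\d+)\.json", name)
        if match:
            found.add(int(match.group(1)))
    return [c for c in expected if c not in found]


# Expected levels from the sample metadata; files as they might look after a timeout.
missing = missing_concurrencies(
    "16x32x64x128x256",
    ["concurrency_16.json", "concurrency_32.json", "concurrency_64.json"],
)
print(f"Missing concurrencies: {missing}")
```

In practice the filename list would come from listing the profiler results directory, for example with `os.listdir`.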

No Node Metrics Found

Symptoms: Node Metrics tab shows "No log files found"

Causes:

Log files don't match expected pattern
Logs were not captured (stderr redirect issue)

Solutions:

Verify log file naming: `ls logs/3667_*/*_prefill_*.err`
Check file contents: `head logs/3667_*/*_prefill_*.err`
Verify log format matches expected patterns (see Log Files section)

Slow Dashboard Loading

Symptoms: Dashboard takes a long time to load or refresh

Causes:

Many runs to parse
Cache invalidation
Large log files

Solutions:

Parquet caching speeds up subsequent loads automatically
Delete old runs you no longer need
Use filters to reduce the number of selected runs
Increase `_cache_version` in components.py only when the parser changes, since unnecessary bumps trigger cache invalidation

Incorrect Metrics

Symptoms: Metrics don't match expected values or show as "N/A"

Causes:

Benchmark output format changed
Incomplete benchmark run
Parse error in result files

Solutions:

Verify raw JSON is valid:

```
jq . logs/3667_*/sa-bench_*/concurrency_64.json
```

Check for required fields:

```
jq 'keys' logs/3667_*/sa-bench_*/concurrency_64.json
```

Clear parquet cache and reload:

```
rm -rf logs/3667_*/.cache/
```

Quick Reference

Launch Dashboard

make dashboard
# or
uv run streamlit run analysis/dashboard/app.py

Python API Quick Start

from analysis.srtlog import RunLoader, NodeAnalyzer

# Load runs
loader = RunLoader("logs")
runs = loader.load_all()

# Get DataFrame
df = loader.to_dataframe()

# Parse node logs
analyzer = NodeAnalyzer()
nodes = analyzer.parse_run_logs("logs/3667_1P_4D_20251110_192145")

Key File Locations

Run metadata: logs/{run_dir}/{jobid}.json
Benchmark results: logs/{run_dir}/{profiler}_isl_{isl}_osl_{osl}/concurrency_*.json
Worker logs: logs/{run_dir}/*_{prefill|decode}_*.err
Node configs: logs/{run_dir}/*_config.json
Cache files: logs/{run_dir}/.cache/*.parquet
