This guide covers performance benchmarking for LLM-D deployments using GuideLLM and synthetic test data generators designed to demonstrate LLM-D's intelligent routing benefits.
Benchmarking helps you:
- Validate deployment performance
- Compare LLM-D intelligent routing vs. vanilla vLLM (round-robin)
- Demonstrate prefix caching effectiveness
- Test heterogeneous workload handling
- Establish performance baselines for SLAs
GuideLLM is the recommended benchmarking tool for LLM inference. It supports:
- Concurrent request testing
- Request-per-second (RPS) rate limiting
- Custom data files for realistic workloads
- JSON output for analysis
Deploy Prometheus and Grafana for real-time metrics visualization.
All artifacts are included in this playbook. Run commands from the playbook directory.
# From the playbook directory
cd "llm-d playbook"
# Deploy monitoring (using the included monitoring stack)
oc apply -k monitoring
# Wait for Grafana
oc wait --for=condition=ready pod -l app=grafana -n llm-d-monitoring --timeout=300s
# Get Grafana URL
export GRAFANA_URL=$(oc get route grafana-secure -n llm-d-monitoring -o jsonpath='{.spec.host}')
echo "Grafana: https://$GRAFANA_URL"Access Grafana with default credentials: admin / admin
| Metric | What to Look For |
|---|---|
| KV Cache Hit Rate | Higher is better - LLM-D should show 90%+ vs ~25% for round-robin |
| Time to First Token (TTFT) | Lower P95/P99 indicates better tail latency |
| Requests per Second | Overall throughput |
| GPU Utilization | Balanced utilization across replicas |
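If you prefer querying the numbers directly instead of reading dashboards, the same data is available from the Prometheus HTTP API. A minimal sketch, assuming the in-cluster Prometheus address below and vLLM's `vllm:time_to_first_token_seconds` histogram (metric names vary by vLLM version, so confirm against a replica's `/metrics` endpoint):

```python
# Sketch: query TTFT tail latency from Prometheus instead of Grafana.
# The service URL and metric name are assumptions; verify them in your cluster.
import requests

PROM = "http://prometheus.llm-d-monitoring.svc.cluster.local:9090"

def query(promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# P95 time-to-first-token over the last 5 minutes, broken out per replica
for series in query(
    "histogram_quantile(0.95, "
    "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (pod, le))"
):
    print(series["metric"].get("pod", "?"), f'{float(series["value"][1]):.3f}s')
```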
LLM-D includes synthetic test data generators specifically designed to demonstrate the benefits of intelligent routing.
cd gitops/instance/guidellm/llm-d-test-data-generator
pip install -r requirements.txt

The prefix cache generator creates prompt pairs with shared prefixes to simulate multi-turn conversations. This demonstrates how LLM-D's prefix-aware routing improves cache hit rates.
How it works:
- Generates pairs of prompts where the second prompt contains the first as a prefix
- Simulates multi-turn conversations with shared context
- Interleaves prefix-only and full prompts to test cache reuse (sketched below)
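The core pattern can be sketched in a few lines. This is an illustration of the interleaving, not the actual generator code, and the CSV column name is an assumption -- check the files the generator actually emits:

```python
# Illustrative sketch of the interleaving pattern (not the real generator):
# each shared prefix is sent once on its own and once with a continuation,
# so a prefix-aware router can serve the second request from a warm KV cache.
import csv

pairs = [
    ("Summarize this contract: <long shared context A>", "Now list its termination clauses."),
    ("Summarize this contract: <long shared context B>", "Now list its payment terms."),
]

with open("prompts-example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])  # column name is an assumption; match your generator's output
    for prefix, continuation in pairs:
        writer.writerow([prefix])                      # first turn: prefix only
        writer.writerow([f"{prefix} {continuation}"])  # second turn: full prompt
```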
Generate test data:
cd prefix
# Quick test (10 concurrent users)
python prefix-cache-generator.py \
--target-prefix-words 5000 \
--target-continuation-words 1000 \
--num-pairs 100 \
--chunk-size 20 \
--output-prefix-csv "pairs-10.csv" \
--output-guidellm-csv "prompts-10.csv"
# Generate data sets for various concurrency levels
./generate-all.sh

Output files:

- `prefix-pairs.csv` - Side-by-side view of prefix and full prompts
- `prefix-prompts.csv` - GuideLLM-ready format with interleaved prompts
The heterogeneous generator creates mixed workloads with different request sizes. This is useful for testing P/D disaggregation scenarios.
Generate test data:
cd heterogeneous
# Generate mixed workload (90% short, 10% long)
python heterogeneous-workload-generator.py \
--workload-n-words 500 \
--workload-m-words 10000 \
--total-prompts 10000 \
--ratio-n-to-m 9 \
--output-tokens 250 \
--output-csv "heterogeneous-prompts.csv"

Parameters:

- `--workload-n-words`: Word count for "small" requests (default: 500)
- `--workload-m-words`: Word count for "large" requests (default: 10000)
- `--ratio-n-to-m`: Ratio of small to large requests (e.g., 9 means 9:1; see the worked example below)
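As a sanity check on the mix, the counts implied by the command above (how the real generator rounds may differ):

```python
# Worked example: counts implied by --total-prompts 10000 and --ratio-n-to-m 9.
total_prompts = 10_000
ratio_n_to_m = 9  # 9 small requests for every 1 large request

large = total_prompts // (ratio_n_to_m + 1)  # 1000 prompts of ~10000 words each
small = total_prompts - large                # 9000 prompts of ~500 words each
print(f"{small} small + {large} large = {total_prompts} total "
      f"({small / total_prompts:.0%} short, {large / total_prompts:.0%} long)")
```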
Install GuideLLM:

pip install guidellm[recommended]==0.3.1

# Get Gateway URL
export INFERENCE_URL=$(oc -n openshift-ingress get gateway openshift-ai-inference \
-o jsonpath='{.status.addresses[0].value}')
# Set target endpoint
export TARGET="http://${INFERENCE_URL}/<namespace>/<llm-d-instance>"
export MODEL="Qwen/Qwen3-4B" # Match your deployed model
echo "Target: $TARGET"guidellm benchmark run \
--target $TARGET \
--model $MODEL \
--data prompts-10.csv \
--rate-type concurrent \
--rate 10 \
--max-seconds 120 \
--output-path results-10.json

Use the provided script to run benchmarks at multiple concurrency levels:
#!/bin/bash
# bench-all.sh
TARGET=http://<gateway-hostname>/<namespace>/<llm-d-instance>
MODEL=Qwen/Qwen3-4B
SCENARIO_NAME="llm-d-intelligent-inference-x2"
MAX_SECONDS=120
# Benchmark configurations: rate and data file
BENCHMARKS=(
"500 prompts-500.csv"
"250 prompts-250.csv"
"100 prompts-100.csv"
"50 prompts-50.csv"
"25 prompts-25.csv"
"10 prompts-10.csv"
)
for benchmark in "${BENCHMARKS[@]}"; do
RATE=$(echo $benchmark | awk '{print $1}')
DATA=$(echo $benchmark | awk '{print $2}')
echo "Running benchmark with rate=$RATE and data=$DATA"
guidellm benchmark run --target $TARGET \
--model $MODEL \
--data $DATA \
--rate-type concurrent \
--rate $RATE \
--max-seconds $MAX_SECONDS \
--output-path $SCENARIO_NAME-$RATE.json
done
# Archive results
tar -cf $SCENARIO_NAME.tar $SCENARIO_NAME-*.json
echo "Archive created: $SCENARIO_NAME.tar"For production benchmarks, run GuideLLM as a Kubernetes Job:
kind: Job
apiVersion: batch/v1
metadata:
  name: guidellm-benchmark-job
  namespace: demo-llm
spec:
  backoffLimit: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: guidellm
          image: 'ghcr.io/vllm-project/guidellm@sha256:f7123f5a4b9283e721a9b43bc99e8b2a1d9eac1c1e1ecba47b5368998c341ff3'
          command:
            - guidellm
          args:
            - benchmark
            - '--target'
            - 'http://openshift-ai-inference-openshift-default.openshift-ingress.svc.cluster.local/<namespace>/<model>'
            - '--model'
            - 'Qwen/Qwen3-4B'
            - '--processor'
            - 'Qwen/Qwen3-4B'
            - '--data'
            - '{"prompt_tokens":1000,"output_tokens":1000}'
            - '--rate-type'
            - concurrent
            - '--max-seconds'
            - '300'
            - '--rate'
            - '1,2,4,8,16'
            - '--output-path'
            - /results/output.json
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: hf_token
            - name: HF_HOME
              value: /tmp/huggingface_cache
          volumeMounts:
            - name: results-volume
              mountPath: /results
      volumes:
        - name: results-volume
          persistentVolumeClaim:
            claimName: benchmark-results-pvc

Create a ConfigMap with your generated test data:
# Create ConfigMap from test data
oc create configmap benchmark-data -n demo-llm \
--from-file=prompts-10.csv \
--from-file=prompts-50.csv \
--from-file=prompts-100.csv

Mount in the Job:
spec:
  template:
    spec:
      containers:
        - name: guidellm
          volumeMounts:
            - name: data-volume
              mountPath: /data
          args:
            - benchmark
            - '--data'
            - '/data/prompts-100.csv'
            # ... other args
      volumes:
        - name: data-volume
          configMap:
            name: benchmark-data

The playbook includes pre-configured vLLM and LLM-D deployments for comparison benchmarks.
First, deploy vanilla vLLM to establish a baseline:
# Deploy vLLM (round-robin load balancing) - included in playbook
oc apply -k vllm
# Wait for pods
oc wait --for=condition=ready pod -l serving.kserve.io/inferenceservice=qwen-vllm \
-n demo-llm --timeout=300s

Run the baseline benchmark against the vLLM service:

export VLLM_TARGET="http://qwen-vllm-lb.demo-llm.svc.cluster.local:8000"
guidellm benchmark run \
--target $VLLM_TARGET \
--model $MODEL \
--data prompts-100.csv \
--rate-type concurrent \
--rate 100 \
--max-seconds 300 \
--output-path vllm-baseline.json

# Clean up vLLM
oc delete -k vllm
# Reset Prometheus
oc delete pod -l app=prometheus -n llm-d-monitoring
oc wait --for=condition=ready pod -l app=prometheus -n llm-d-monitoring --timeout=120s
# Deploy LLM-D - included in playbook
oc apply -k llm-d
# Wait for pods
oc wait --for=condition=ready pod -l app.kubernetes.io/name=qwen \
-n demo-llm --timeout=300s

Run the same benchmark against LLM-D:

export LLMD_TARGET="http://openshift-ai-inference-openshift-default.openshift-ingress.svc.cluster.local/demo-llm/qwen"
guidellm benchmark run \
--target $LLMD_TARGET/v1 \
--model $MODEL \
--data prompts-100.csv \
--rate-type concurrent \
--rate 100 \
--max-seconds 300 \
--output-path llm-d-results.json

vLLM (round-robin) baseline results:

Time to First Token (TTFT):
P50: 123.22 ms
P95: 744.71 ms <-- High tail latency (frustrated users)
P99: 840.95 ms
First Turn vs Subsequent Turns (Prefix Caching):
First turn avg: 351.64 ms
Later turns avg: 196.29 ms
Speedup ratio: 1.79x <-- Suboptimal cache reuse
LLM-D (intelligent routing) results:

Time to First Token (TTFT):
P50: 92.09 ms
P95: 271.60 ms <-- Significantly lower tail latency
P99: 674.21 ms
First Turn vs Subsequent Turns (Prefix Caching):
First turn avg: 361.79 ms
Later turns avg: 94.22 ms
Speedup ratio: 3.84x <-- Excellent cache reuse
| Metric | vLLM | LLM-D | Improvement |
|---|---|---|---|
| P50 TTFT | 123 ms | 92 ms | 25% faster |
| P95 TTFT | 745 ms | 272 ms | 63% faster |
| P99 TTFT | 841 ms | 674 ms | 20% faster |
| Cache Speedup | 1.79x | 3.84x | 2.1x better |
| Cache Hit Rate | ~25% | ~90%+ | 3.6x better |
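The improvement column is plain arithmetic over the two runs; for example:

```python
# How the improvement figures above follow from the raw numbers.
vllm_p95, llmd_p95 = 744.71, 271.60  # ms, from the two runs above
print(f"P95 TTFT: {(vllm_p95 - llmd_p95) / vllm_p95:.1%} faster")   # 63.5%, ~63% in the table

vllm_speedup, llmd_speedup = 1.79, 3.84  # first-turn vs later-turn ratios
print(f"Cache speedup: {llmd_speedup / vllm_speedup:.1f}x better")  # ~2.1x
```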
| Feature | vLLM (Round-Robin) | LLM-D (Intelligent Routing) |
|---|---|---|
| Routing Strategy | Random/Round-robin | Prefix-aware scoring |
| Cache Hits | ~25% (1 in 4 replicas) | ~90%+ (routes to cached replica) |
| P95 Latency | High variance | Consistent, lower |
| GPU Utilization | Imbalanced | Balanced via KV-cache scoring |
| Parameter | Description | Example |
|---|---|---|
| `--target` | Inference endpoint URL | `http://gateway/ns/model` |
| `--model` | Model name for tokenizer | `Qwen/Qwen3-4B` |
| `--processor` | Processor name (optional) | `Qwen/Qwen3-4B` |
| `--data` | Data file or inline JSON | `prompts.csv` or `{"prompt_tokens":1000}` |
| `--rate-type` | `concurrent`, `constant`, or `sweep` | `concurrent` |
| `--rate` | Concurrency or RPS (comma-separated for multiple) | `1,2,4,8,16` |
| `--max-seconds` | Benchmark duration in seconds | `300` |
| `--max-requests` | Maximum number of requests | `1000` |
| `--output-path` | Results file path | `results.json` |
| Rate Type | Behavior | Use Case |
|---|---|---|
| `concurrent` | Fixed number of concurrent requests | Controlled A/B comparisons |
| `constant` | Fixed requests per second | Testing specific throughput targets |
| `sweep` | Automatically discovers throughput limits | Capacity planning |
Warning: Sweep mode is designed to find the saturation limit of a model deployment. It will automatically probe to discover the upper throughput boundary, then run benchmarks to characterize performance.
How Sweep Mode Works:
- GuideLLM probes the system to identify its maximum sustainable throughput
- It then runs benchmarks at various points up to that discovered limit
- The discovered limit will likely differ between systems (e.g., vLLM vs LLM-D)
Why This Can Be Misleading for A/B Comparisons:
Because sweep mode discovers each system's limit independently, comparing sweep results between vLLM and LLM-D may not be an apples-to-apples comparison - each system will be tested at different effective concurrency levels based on their respective saturation points.
Recommendations:

- For A/B comparisons (vLLM vs LLM-D): Use `--rate-type concurrent` with explicit, identical rate values:

  # Same fixed concurrency for both systems
  --rate-type concurrent --rate '1,2,4,8,16'

- For capacity planning: Sweep mode is appropriate when you want to understand a single system's limits:

  --rate-type sweep

- For production baselines: Test at your expected production load, not at saturation.
GuideLLM saves results as JSON by default. You can re-export to other formats using `benchmark from-file`:
# Re-export JSON results to console, CSV, and HTML
podman run -it --rm \
--user 0:0 \
--volume ./results:/results:z \
ghcr.io/vllm-project/guidellm:v0.5.2 \
benchmark from-file \
--output-path /results \
--output-formats console \
--output-formats output.csv \
--output-formats output.html \
/results/output.json

Available Output Formats:

| Format | Description | Use Case |
|---|---|---|
| `console` | Text summary to stdout | Quick troubleshooting, logs |
| `*.csv` | Comma-separated values | Spreadsheet analysis, data processing |
| `*.html` | Interactive HTML report | Sharing results, documentation |
| `*.json` | Full JSON output | Programmatic analysis, archival |
Tip: The console format is usually sufficient for troubleshooting. Use CSV/HTML when you need to share results or do deeper analysis.
For benchmarking in disconnected environments:
- Copy tokenizer files:

  oc cp tokenizer_config.json guidellm:/config
  oc cp tokenizer.json guidellm:/config

- Use local tokenizer:

  guidellm benchmark run \
    --target $TARGET \
    --model /config \
    --processor /config \
    # ... other args

- Watch benchmark logs:

  oc logs -f <guidellm-pod>

What good results look like:

- P95 TTFT < 500ms: Good tail latency (see the checker sketch after this list)
- Cache Hit Rate > 80%: Effective prefix caching (LLM-D)
- No waiting requests: Not saturated
- Consistent ITL: Stable generation speed
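GuideLLM's JSON layout varies between versions, so rather than assume a schema, here is a small checker that takes per-request TTFT samples (in ms), however you extract them, and tests them against the targets above:

```python
# Sketch: check per-request TTFT samples (ms) against the targets above.
import statistics

def check_ttft(ttft_ms: list[float]) -> None:
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    q = statistics.quantiles(ttft_ms, n=100)
    p50, p95, p99 = q[49], q[94], q[98]
    print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
    print("Tail latency:", "OK" if p95 < 500 else "investigate (see symptom table below)")
    # A P95 far above P50 is the request-queueing symptom from the table below
    print(f"P95/P50 ratio: {p95 / p50:.1f}x")

# Hypothetical sample values for illustration
check_ttft([120, 95, 110, 480, 105, 90, 130, 700, 100, 115] * 20)
```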
| Symptom | Likely Cause | Action |
|---|---|---|
| P95 >> P50 | Request queueing | Add replicas or GPUs |
| Low cache hit rate | Routing not working | Check scheduler logs |
| High TTFT, low ITL | Prefill bottleneck | Consider P/D disaggregation |
| Increasing ITL | Batch saturation | Reduce concurrency or add replicas |
- See Performance Debugging if results don't meet expectations
- Review Advanced Deployment for optimization options like P/D disaggregation