This guide covers performance benchmarking for LLM-D deployments using GuideLLM and synthetic test data generators designed to demonstrate LLM-D's intelligent routing benefits.
Benchmarking helps you:
- Validate deployment performance
- Compare LLM-D intelligent routing vs. vanilla vLLM (round-robin)
- Demonstrate prefix caching effectiveness
- Test heterogeneous workload handling
- Establish performance baselines for SLAs
GuideLLM is the recommended benchmarking tool for LLM inference. It supports:
- Concurrent request testing
- Request-per-second (RPS) rate limiting
- Custom data files for realistic workloads
- JSON output for analysis
Deploy Prometheus and Grafana for real-time metrics visualization.
All artifacts are included in this playbook. Run commands from the playbook directory.
# From the playbook directory
cd "llm-d playbook"
# Deploy monitoring (using the included monitoring stack)
oc apply -k monitoring
# Wait for Grafana
oc wait --for=condition=ready pod -l app=grafana -n llm-d-monitoring --timeout=300s
# Get Grafana URL
export GRAFANA_URL=$(oc get route grafana-secure -n llm-d-monitoring -o jsonpath='{.spec.host}')
echo "Grafana: https://$GRAFANA_URL"Access Grafana with default credentials: admin / admin
| Metric | What to Look For |
|---|---|
| KV Cache Hit Rate | Higher is better - LLM-D should show 90%+ vs ~25% for round-robin |
| Time to First Token (TTFT) | Lower P95/P99 indicates better tail latency |
| Requests per Second | Overall throughput |
| GPU Utilization | Balanced utilization across replicas |
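If you prefer querying the numbers directly instead of reading dashboards, the same data is available from the Prometheus HTTP API. A minimal sketch, assuming the in-cluster Prometheus address below and vLLM's `vllm:time_to_first_token_seconds` histogram (metric names vary by vLLM version, so confirm against a replica's `/metrics` endpoint):

```python
# Sketch: query TTFT tail latency from Prometheus instead of Grafana.
# The service URL and metric name are assumptions; verify them in your cluster.
import requests

PROM = "http://prometheus.llm-d-monitoring.svc.cluster.local:9090"

def query(promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# P95 time-to-first-token over the last 5 minutes, broken out per replica
for series in query(
    "histogram_quantile(0.95, "
    "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (pod, le))"
):
    print(series["metric"].get("pod", "?"), f'{float(series["value"][1]):.3f}s')
```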
LLM-D includes synthetic test data generators specifically designed to demonstrate the benefits of intelligent routing.
cd gitops/instance/guidellm/llm-d-test-data-generator
pip install -r requirements.txt

The prefix cache generator creates prompt pairs with shared prefixes to simulate multi-turn conversations. This demonstrates how LLM-D's prefix-aware routing improves cache hit rates.
How it works:
- Generates pairs of prompts where the second prompt contains the first as a prefix
- Simulates multi-turn conversations with shared context
- Interleaves prefix-only and full prompts to test cache reuse (sketched below)
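The core pattern can be sketched in a few lines. This is an illustration of the interleaving, not the actual generator code, and the CSV column name is an assumption -- check the files the generator actually emits:

```python
# Illustrative sketch of the interleaving pattern (not the real generator):
# each shared prefix is sent once on its own and once with a continuation,
# so a prefix-aware router can serve the second request from a warm KV cache.
import csv

pairs = [
    ("Summarize this contract: <long shared context A>", "Now list its termination clauses."),
    ("Summarize this contract: <long shared context B>", "Now list its payment terms."),
]

with open("prompts-example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])  # column name is an assumption; match your generator's output
    for prefix, continuation in pairs:
        writer.writerow([prefix])                      # first turn: prefix only
        writer.writerow([f"{prefix} {continuation}"])  # second turn: full prompt
```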
Generate test data:
cd prefix
# Quick test (10 concurrent users)
python prefix-cache-generator.py \
--target-prefix-words 5000 \
--target-continuation-words 1000 \
--num-pairs 100 \
--chunk-size 20 \
--output-prefix-csv "pairs-10.csv" \
--output-guidellm-csv "prompts-10.csv"
# Generate data sets for various concurrency levels
./generate-all.sh

Output files:

- `prefix-pairs.csv` - Side-by-side view of prefix and full prompts
- `prefix-prompts.csv` - GuideLLM-ready format with interleaved prompts
The heterogeneous generator creates mixed workloads with different request sizes. This is useful for testing P/D disaggregation scenarios.
Generate test data:
cd heterogeneous
# Generate mixed workload (90% short, 10% long)
python heterogeneous-workload-generator.py \
--workload-n-words 500 \
--workload-m-words 10000 \
--total-prompts 10000 \
--ratio-n-to-m 9 \
--output-tokens 250 \
--output-csv "heterogeneous-prompts.csv"

Parameters:

- `--workload-n-words`: Word count for "small" requests (default: 500)
- `--workload-m-words`: Word count for "large" requests (default: 10000)
- `--ratio-n-to-m`: Ratio of small to large requests (e.g., 9 means 9:1; see the worked example below)
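As a sanity check on the mix, the counts implied by the command above (how the real generator rounds may differ):

```python
# Worked example: counts implied by --total-prompts 10000 and --ratio-n-to-m 9.
total_prompts = 10_000
ratio_n_to_m = 9  # 9 small requests for every 1 large request

large = total_prompts // (ratio_n_to_m + 1)  # 1000 prompts of ~10000 words each
small = total_prompts - large                # 9000 prompts of ~500 words each
print(f"{small} small + {large} large = {total_prompts} total "
      f"({small / total_prompts:.0%} short, {large / total_prompts:.0%} long)")
```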
Install GuideLLM:

pip install guidellm[recommended]==0.3.1

# Get Gateway URL
export INFERENCE_URL=$(oc -n openshift-ingress get gateway openshift-ai-inference \
-o jsonpath='{.status.addresses[0].value}')
# Set target endpoint
export TARGET="http://${INFERENCE_URL}/<namespace>/<llm-d-instance>"
export MODEL="Qwen/Qwen3-4B" # Match your deployed model
echo "Target: $TARGET"guidellm benchmark run \
--target $TARGET \
--model $MODEL \
--data prompts-10.csv \
--rate-type concurrent \
--rate 10 \
--max-seconds 120 \
--output-path results-10.json

Use the provided script to run benchmarks at multiple concurrency levels:
#!/bin/bash
# bench-all.sh
TARGET=http://<gateway-hostname>/<namespace>/<llm-d-instance>
MODEL=Qwen/Qwen3-4B
SCENARIO_NAME="llm-d-intelligent-inference-x2"
MAX_SECONDS=120
# Benchmark configurations: rate and data file
BENCHMARKS=(
"500 prompts-500.csv"
"250 prompts-250.csv"
"100 prompts-100.csv"
"50 prompts-50.csv"
"25 prompts-25.csv"
"10 prompts-10.csv"
)
for benchmark in "${BENCHMARKS[@]}"; do
RATE=$(echo $benchmark | awk '{print $1}')
DATA=$(echo $benchmark | awk '{print $2}')
echo "Running benchmark with rate=$RATE and data=$DATA"
guidellm benchmark run --target $TARGET \
--model $MODEL \
--data $DATA \
--rate-type concurrent \
--rate $RATE \
--max-seconds $MAX_SECONDS \
--output-path $SCENARIO_NAME-$RATE.json
done
# Archive results
tar -cf $SCENARIO_NAME.tar $SCENARIO_NAME-*.json
echo "Archive created: $SCENARIO_NAME.tar"For production benchmarks, run GuideLLM as a Kubernetes Job:
kind: Job
apiVersion: batch/v1
metadata:
  name: guidellm-benchmark-job
  namespace: demo-llm
spec:
  backoffLimit: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: guidellm
          image: 'ghcr.io/vllm-project/guidellm@sha256:f7123f5a4b9283e721a9b43bc99e8b2a1d9eac1c1e1ecba47b5368998c341ff3'
          command:
            - guidellm
          args:
            - benchmark
            - '--target'
            - 'http://openshift-ai-inference-openshift-default.openshift-ingress.svc.cluster.local/<namespace>/<model>'
            - '--model'
            - 'Qwen/Qwen3-4B'
            - '--processor'
            - 'Qwen/Qwen3-4B'
            - '--data'
            - '{"prompt_tokens":1000,"output_tokens":1000}'
            - '--rate-type'
            - concurrent
            - '--max-seconds'
            - '300'
            - '--rate'
            - '1,2,4,8,16'
            - '--output-path'
            - /results/output.json
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: hf_token
            - name: HF_HOME
              value: /tmp/huggingface_cache
          volumeMounts:
            - name: results-volume
              mountPath: /results
      volumes:
        - name: results-volume
          persistentVolumeClaim:
            claimName: benchmark-results-pvc

Create a ConfigMap with your generated test data:
# Create ConfigMap from test data
oc create configmap benchmark-data -n demo-llm \
--from-file=prompts-10.csv \
--from-file=prompts-50.csv \
--from-file=prompts-100.csv

Mount in the Job:
spec:
  template:
    spec:
      containers:
        - name: guidellm
          volumeMounts:
            - name: data-volume
              mountPath: /data
          args:
            - benchmark
            - '--data'
            - '/data/prompts-100.csv'
            # ... other args
      volumes:
        - name: data-volume
          configMap:
            name: benchmark-data

The playbook includes pre-configured vLLM and LLM-D deployments for comparison benchmarks.
First, deploy vanilla vLLM to establish a baseline:
# Deploy vLLM (round-robin load balancing) - included in playbook
oc apply -k vllm
# Wait for pods
oc wait --for=condition=ready pod -l serving.kserve.io/inferenceservice=qwen-vllm \
-n demo-llm --timeout=300s

Run the baseline benchmark against the vLLM service:

export VLLM_TARGET="http://qwen-vllm-lb.demo-llm.svc.cluster.local:8000"
guidellm benchmark run \
--target $VLLM_TARGET \
--model $MODEL \
--data prompts-100.csv \
--rate-type concurrent \
--rate 100 \
--max-seconds 300 \
--output-path vllm-baseline.json

# Clean up vLLM
oc delete -k vllm
# Reset Prometheus
oc delete pod -l app=prometheus -n llm-d-monitoring
oc wait --for=condition=ready pod -l app=prometheus -n llm-d-monitoring --timeout=120s
# Deploy LLM-D - included in playbook
oc apply -k llm-d
# Wait for pods
oc wait --for=condition=ready pod -l app.kubernetes.io/name=qwen \
-n demo-llm --timeout=300s

Run the same benchmark against LLM-D:

export LLMD_TARGET="http://openshift-ai-inference-openshift-default.openshift-ingress.svc.cluster.local/demo-llm/qwen"
guidellm benchmark run \
--target $LLMD_TARGET/v1 \
--model $MODEL \
--data prompts-100.csv \
--rate-type concurrent \
--rate 100 \
--max-seconds 300 \
--output-path llm-d-results.json

vLLM (round-robin) baseline results:

Time to First Token (TTFT):
P50: 123.22 ms
P95: 744.71 ms <-- High tail latency (frustrated users)
P99: 840.95 ms
First Turn vs Subsequent Turns (Prefix Caching):
First turn avg: 351.64 ms
Later turns avg: 196.29 ms
Speedup ratio: 1.79x <-- Suboptimal cache reuse
LLM-D (intelligent routing) results:

Time to First Token (TTFT):
P50: 92.09 ms
P95: 271.60 ms <-- Significantly lower tail latency
P99: 674.21 ms
First Turn vs Subsequent Turns (Prefix Caching):
First turn avg: 361.79 ms
Later turns avg: 94.22 ms
Speedup ratio: 3.84x <-- Excellent cache reuse
| Metric | vLLM | LLM-D | Improvement |
|---|---|---|---|
| P50 TTFT | 123 ms | 92 ms | 25% faster |
| P95 TTFT | 745 ms | 272 ms | 63% faster |
| P99 TTFT | 841 ms | 674 ms | 20% faster |
| Cache Speedup | 1.79x | 3.84x | 2.1x better |
| Cache Hit Rate | ~25% | ~90%+ | 3.6x better |
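The improvement column is plain arithmetic over the two runs; for example:

```python
# How the improvement figures above follow from the raw numbers.
vllm_p95, llmd_p95 = 744.71, 271.60  # ms, from the two runs above
print(f"P95 TTFT: {(vllm_p95 - llmd_p95) / vllm_p95:.1%} faster")   # 63.5%, ~63% in the table

vllm_speedup, llmd_speedup = 1.79, 3.84  # first-turn vs later-turn ratios
print(f"Cache speedup: {llmd_speedup / vllm_speedup:.1f}x better")  # ~2.1x
```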
| Feature | vLLM (Round-Robin) | LLM-D (Intelligent Routing) |
|---|---|---|
| Routing Strategy | Random/Round-robin | Prefix-aware scoring |
| Cache Hits | ~25% (1 in 4 replicas) | ~90%+ (routes to cached replica) |
| P95 Latency | High variance | Consistent, lower |
| GPU Utilization | Imbalanced | Balanced via KV-cache scoring |
| Parameter | Description | Example |
|---|---|---|
| `--target` | Inference endpoint URL | `http://gateway/ns/model` |
| `--model` | Model name for tokenizer | `Qwen/Qwen3-4B` |
| `--processor` | Processor name (optional) | `Qwen/Qwen3-4B` |
| `--data` | Data file or inline JSON | `prompts.csv` or `{"prompt_tokens":1000}` |
| `--rate-type` | `concurrent`, `constant`, or `sweep` | `concurrent` |
| `--rate` | Concurrency or RPS (comma-separated for multiple) | `1,2,4,8,16` |
| `--max-seconds` | Benchmark duration in seconds | `300` |
| `--max-requests` | Maximum number of requests | `1000` |
| `--output-path` | Results file path | `results.json` |
| Rate Type | Behavior | Use Case |
|---|---|---|
| `concurrent` | Fixed number of concurrent requests | Controlled A/B comparisons |
| `constant` | Fixed requests per second | Testing specific throughput targets |
| `sweep` | Automatically discovers throughput limits | Capacity planning |
Warning: Sweep mode is designed to find the saturation limit of a model deployment. It will automatically probe to discover the upper throughput boundary, then run benchmarks to characterize performance.
How Sweep Mode Works:
- GuideLLM probes the system to identify its maximum sustainable throughput
- It then runs benchmarks at various points up to that discovered limit
- The discovered limit will likely differ between systems (e.g., vLLM vs LLM-D)
Why This Can Be Misleading for A/B Comparisons:
Because sweep mode discovers each system's limit independently, comparing sweep results between vLLM and LLM-D may not be an apples-to-apples comparison - each system will be tested at different effective concurrency levels based on their respective saturation points.
Recommendations:

- For A/B comparisons (vLLM vs LLM-D): Use `--rate-type concurrent` with explicit, identical rate values:

  # Same fixed concurrency for both systems
  --rate-type concurrent --rate '1,2,4,8,16'

- For capacity planning: Sweep mode is appropriate when you want to understand a single system's limits:

  --rate-type sweep

- For production baselines: Test at your expected production load, not at saturation.
GuideLLM saves results as JSON by default. You can re-export to other formats using `benchmark from-file`:
# Re-export JSON results to console, CSV, and HTML
podman run -it --rm \
--user 0:0 \
--volume ./results:/results:z \
ghcr.io/vllm-project/guidellm:v0.5.2 \
benchmark from-file \
--output-path /results \
--output-formats console \
--output-formats output.csv \
--output-formats output.html \
/results/output.json

Available Output Formats:

| Format | Description | Use Case |
|---|---|---|
| `console` | Text summary to stdout | Quick troubleshooting, logs |
| `*.csv` | Comma-separated values | Spreadsheet analysis, data processing |
| `*.html` | Interactive HTML report | Sharing results, documentation |
| `*.json` | Full JSON output | Programmatic analysis, archival |
Tip: The console format is usually sufficient for troubleshooting. Use CSV/HTML when you need to share results or do deeper analysis.
For benchmarking in disconnected environments:
- Copy tokenizer files:

  oc cp tokenizer_config.json guidellm:/config
  oc cp tokenizer.json guidellm:/config

- Use local tokenizer:

  guidellm benchmark run \
    --target $TARGET \
    --model /config \
    --processor /config \
    # ... other args

- Watch benchmark logs:

  oc logs -f <guidellm-pod>

What good results look like:

- P95 TTFT < 500ms: Good tail latency (see the checker sketch after this list)
- Cache Hit Rate > 80%: Effective prefix caching (LLM-D)
- No waiting requests: Not saturated
- Consistent ITL: Stable generation speed
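GuideLLM's JSON layout varies between versions, so rather than assume a schema, here is a small checker that takes per-request TTFT samples (in ms), however you extract them, and tests them against the targets above:

```python
# Sketch: check per-request TTFT samples (ms) against the targets above.
import statistics

def check_ttft(ttft_ms: list[float]) -> None:
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    q = statistics.quantiles(ttft_ms, n=100)
    p50, p95, p99 = q[49], q[94], q[98]
    print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
    print("Tail latency:", "OK" if p95 < 500 else "investigate (see symptom table below)")
    # A P95 far above P50 is the request-queueing symptom from the table below
    print(f"P95/P50 ratio: {p95 / p50:.1f}x")

# Hypothetical sample values for illustration
check_ttft([120, 95, 110, 480, 105, 90, 130, 700, 100, 115] * 20)
```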
| Symptom | Likely Cause | Action |
|---|---|---|
| P95 >> P50 | Request queueing | Add replicas or GPUs |
| Low cache hit rate | Routing not working | Check scheduler logs |
| High TTFT, low ITL | Prefill bottleneck | Consider P/D disaggregation |
| Increasing ITL | Batch saturation | Reduce concurrency or add replicas |
- See Performance Debugging if results don't meet expectations
- Review Advanced Deployment for optimization options like P/D disaggregation