Run phase of the benchmark lifecycle. Executes benchmark workloads against deployed model-serving infrastructure, collects results, and optionally runs local analysis.
After standup, run a benchmark with a specific harness and workload:

```bash
# Sanity check with inference-perf
llmdbenchmark --spec guides/pd-disaggregation run -p <NS> \
  -l inference-perf -w sanity_random.yaml

# Run with monitoring (metrics scraping + pod log capture)
llmdbenchmark --spec guides/pd-disaggregation run -p <NS> \
  -l inference-perf -w sanity_random.yaml -f

# Run with a different harness
llmdbenchmark --spec guides/pd-disaggregation run -p <NS> \
  -l vllm-benchmark -w random_concurrent.yaml
```

Point directly at an existing endpoint -- no `--spec` or prior standup required:
```bash
# Against a known service URL
llmdbenchmark run -p <NS> \
  -U http://my-model-service.<NS>.svc.cluster.local:80 \
  -l inference-perf -w sanity_random.yaml -m Qwen/Qwen3-32B

# Against an external URL
llmdbenchmark run -p <NS> \
  -U https://my-model.example.com/v1 \
  -l inference-perf -w chatbot_synthetic.yaml -m Qwen/Qwen3-32B
```

When -U is provided, the run skips endpoint auto-detection (step 02) and model verification (step 03), and goes straight to profile rendering and harness deployment.
If a stack is already deployed but you don't know the endpoint URL:

```bash
# Modelservice -- get the gateway service URL
oc get svc -n <NS> -l app.kubernetes.io/name=llm-d-infra -o jsonpath='{.items[0].metadata.name}'
# Typically: http://infra-llmdbench-inference-gateway-istio.<NS>.svc.cluster.local:80

# Standalone -- get the standalone service URL
oc get svc -n <NS> -l app.kubernetes.io/managed-by=llm-d-benchmark -o jsonpath='{.items[0].metadata.name}'
# Typically: http://vllm-standalone-<model-id>.<NS>.svc.cluster.local:8000

# OpenShift route (external access)
oc get route -n <NS> -o jsonpath='{.items[0].spec.host}'
# Use: http://<route-host>

# Verify the endpoint is serving
oc exec -n <NS> $(oc get pod -n <NS> -l role=llm-d-benchmark-data-access -o jsonpath='{.items[0].metadata.name}') \
  -- curl -s http://infra-llmdbench-inference-gateway-istio.<NS>.svc.cluster.local:80/v1/models
```

Generate a config YAML from current settings:
```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml --generate-config

# Use the generated config for subsequent runs
llmdbenchmark run -c /path/to/run-config.yaml
```

Start the harness pod with `sleep infinity` instead of running the benchmark.
Useful for exec-ing into the pod to debug issues:

```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml -d

# Then exec into the pod:
oc exec -it -n <NS> $(oc get pod -n <NS> -l app=llmdbench-harness-launcher -o name) -- bash
```

If a previous run left results on the PVC, collect them without re-running:
```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -z
```

| Flag | Env Var | Description |
|---|---|---|
| `-l HARNESS` | `LLMDBENCH_HARNESS` | Harness name: `inference-perf`, `guidellm`, `vllm-benchmark`, `inferencemax`, `nop` |
| `-w WORKLOAD` | `LLMDBENCH_WORKLOAD` | Workload profile name (e.g., `sanity_random.yaml`, `chatbot_synthetic.yaml`) |
| `-p NS` | `LLMDBENCH_NAMESPACE` | Namespace(s) -- `deploy_ns,harness_ns`, or a single namespace for both |
| `-m MODEL` | `LLMDBENCH_MODEL` | Model name override (e.g., `Qwen/Qwen3-32B`) |
| `-t METHODS` | `LLMDBENCH_METHODS` | Deploy method used during standup (`standalone`, `modelservice`) |
| `-U URL` | `LLMDBENCH_ENDPOINT_URL` | Explicit endpoint URL -- enables run-only mode, skips auto-detection |
| `-c FILE` | | Run config YAML file -- enables run-only mode |
| `--generate-config` | | Generate a run config YAML from current settings and exit |
| `-f` | `LLMDBENCH_MONITORING` | Enable vLLM metrics scraping and pod log capture |
| `-q SA` | `LLMDBENCH_SERVICE_ACCOUNT` | Service account for harness pods |
| `-g VARS` | `LLMDBENCH_HARNESS_ENVVARS_TO_YAML` | Comma-separated env var names to propagate into the harness pod |
| `-e FILE` | `LLMDBENCH_EXPERIMENTS` | Experiment treatments YAML for parameter sweeping |
| `-o OVERRIDES` | `LLMDBENCH_OVERRIDES` | Workload parameter overrides (`param=value,...`) |
| `-j N` | `LLMDBENCH_PARALLELISM` | Number of parallel harness pods (default: 1) |
| `-r DEST` | `LLMDBENCH_OUTPUT` | Results destination: local path, `gs://bucket`, or `s3://bucket` |
| `-x DATASET` | `LLMDBENCH_DATASET` | Dataset URL for harness replay |
| `--wait-timeout N` | `LLMDBENCH_WAIT_TIMEOUT` | Seconds to wait for harness completion (default: 3600) |
| `-z` | `LLMDBENCH_SKIP` | Skip execution, only collect existing results from PVC |
| `-d` | `LLMDBENCH_DEBUG` | Debug mode -- start harness with `sleep infinity` |
| `--analyze` | | Run local analysis on collected results |
| `-s STEPS` | | Step filter (e.g., `0,1,6` or `2-8`) |
| `-k FILE` | `LLMDBENCH_KUBECONFIG` | Kubeconfig path |
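The `-s` step filter accepts both comma lists and inclusive ranges. A minimal sketch of how such a filter could be expanded (the `parse_step_filter` helper is hypothetical, not the tool's actual implementation):

```python
def parse_step_filter(spec: str) -> set[int]:
    """Expand a step filter like "0,1,6" or "2-8" into a set of step numbers.

    Hypothetical helper for illustration; the real CLI may parse differently.
    """
    steps: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            steps.update(range(int(lo), int(hi) + 1))  # range is inclusive
        else:
            steps.add(int(part))
    return steps

print(sorted(parse_step_filter("0,1,6")))  # [0, 1, 6]
print(sorted(parse_step_filter("2-8")))    # [2, 3, 4, 5, 6, 7, 8]
```

Ranges and lists can also be mixed (e.g., `0,2-4`), since each comma-separated part is handled independently.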
Steps are registered in `steps/__init__.py` via `get_run_steps()`:
| Step | Name | Description |
|---|---|---|
| 00 | RunPreflightStep | Validate cluster connectivity, harness namespace, output destination |
| 01 | RunCleanupPreviousStep | Delete leftover harness pods/configmaps from previous runs |
| 02 | HarnessNamespaceStep | Prepare harness namespace (PVC, data access pod) |
| 03 | DetectEndpointStep | Auto-detect model-serving endpoint (standalone service, gateway, or `-U` override) |
| 04 | VerifyModelStep | Verify model is served at endpoint via `/v1/models` |
| 05 | RenderProfilesStep | Render workload profile templates with runtime values; handle experiment treatments |
| 06 | CreateProfileConfigmapStep | Create ConfigMaps for workload profiles and harness scripts |
| 07 | DeployHarnessStep | Deploy harness pod(s), wait for completion, collect results, capture logs |
| 08 | WaitCompletionStep | Wait for harness pods (used when step 07 does not inline waiting) |
| 09 | CollectResultsStep | Collect results from PVC to local workspace |
| 12 | AnalyzeResultsStep | Run local analysis on results (before upload so artifacts are included) |
| 10 | UploadResultsStep | Upload results to cloud storage (GCS/S3) |
| 11 | RunCleanupPostStep | Delete harness pods and ConfigMaps |
Note: Step 12 (analyze) runs before step 10 (upload) so analysis artifacts are included in the upload.
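One way to picture this is an explicitly ordered registry: execution follows list position, not the numeric step label. The sketch below is an assumption for illustration only (the real `get_run_steps()` in `steps/__init__.py` may be shaped differently):

```python
# Hypothetical sketch of a step registry where analyze runs before upload.
# Execution order is the list order; the numeric labels are just names.
RUN_STEPS = [
    ("07", "DeployHarnessStep"),
    ("08", "WaitCompletionStep"),
    ("09", "CollectResultsStep"),
    ("12", "AnalyzeResultsStep"),  # analysis first, so its artifacts...
    ("10", "UploadResultsStep"),   # ...exist when upload runs
    ("11", "RunCleanupPostStep"),
]

order = [name for _, name in RUN_STEPS]
print(order.index("AnalyzeResultsStep") < order.index("UploadResultsStep"))  # True
```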
The -l flag overrides the scenario's default harness. The harness name determines which scripts run inside the harness pod and which profiles are available:

```bash
# inference-perf (default for most well-lit paths)
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -l inference-perf -w sanity_random.yaml

# vllm-benchmark (built-in vLLM benchmarking)
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -l vllm-benchmark -w random_concurrent.yaml

# guidellm
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -l guidellm -w chatbot_synthetic.yaml

# nop (no-op -- measures model load time only)
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -l nop -w nop.yaml
```

Override individual workload profile parameters without editing the profile YAML:
```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml \
  -o "concurrency=32,duration=300,max_tokens=512"
```

Execute a matrix of parameter combinations automatically:
```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml \
  -e experiments/concurrency_sweep.yaml
```

The experiment YAML defines factors and levels:
```yaml
run:
  factors:
    - name: concurrency
      levels: [1, 8, 32, 64]
    - name: max_tokens
      levels: [128, 512]
```

Each combination becomes a treatment. Step 06 runs them sequentially: deploy pod, wait, collect, clean, then next treatment.
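The factor/level matrix expands to the Cartesian product of the levels, so the example above yields 4 × 2 = 8 treatments. A self-contained sketch of that expansion (the treatment dict shape is an assumption, not the tool's internal representation):

```python
import itertools

# Factors/levels mirroring the experiment YAML above
factors = [
    {"name": "concurrency", "levels": [1, 8, 32, 64]},
    {"name": "max_tokens", "levels": [128, 512]},
]

names = [f["name"] for f in factors]
# One treatment per combination of levels across all factors
treatments = [
    dict(zip(names, combo))
    for combo in itertools.product(*(f["levels"] for f in factors))
]

print(len(treatments))  # 8
print(treatments[0])    # {'concurrency': 1, 'max_tokens': 128}
```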
Deploy multiple harness pods per treatment for higher aggregate load:

```bash
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml -j 4
```

Each pod gets a unique experiment ID suffix (`_1`, `_2`, etc.) and writes results to a separate subdirectory on the PVC.
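The per-pod layout can be pictured as follows. The helper and path layout are illustrative assumptions that only mirror the `_1`, `_2` suffix convention described above:

```python
def parallel_result_dirs(experiment_id: str, parallelism: int) -> list[str]:
    """Hypothetical sketch: one results subdirectory per parallel harness pod."""
    return [
        f"results/{experiment_id}_{idx}"
        for idx in range(1, parallelism + 1)  # suffixes are 1-based
    ]

print(parallel_result_dirs("sanity_random", 4))
# ['results/sanity_random_1', 'results/sanity_random_2',
#  'results/sanity_random_3', 'results/sanity_random_4']
```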
```bash
llmdbenchmark --spec guides/pd-disaggregation run -p <NS> \
  -l inference-perf -w sanity_random.yaml -f
```

With -f, the run:
- Sets `LLMDBENCH_VLLM_COMMON_METRICS_SCRAPE_ENABLED=true` on the harness pod -- the harness entrypoint scrapes vLLM `/metrics` before and after each benchmark
- After each treatment, captures logs from:
  - Harness pods
  - EPP (Endpoint Picker) pods
  - IGW (Inference Gateway) pods
  - Model-serving (decode/prefill) pods
- Runs `process_epp_logs.py` on EPP logs to extract scheduling metrics
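Because `/metrics` is scraped both before and after each benchmark, counter deltas over the benchmark window can be computed from the two snapshots. A minimal sketch of that diffing over Prometheus text-format output (deliberately simplified: comments are skipped and the name+labels string is kept whole as the key; metric names are made up for the example):

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format lines into {metric: value}."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

before = parse_metrics("vllm:num_requests_total 100\nvllm:gpu_cache_usage 0.4")
after = parse_metrics("vllm:num_requests_total 612\nvllm:gpu_cache_usage 0.7")

# Delta of each metric over the benchmark window
delta = {k: after[k] - before[k] for k in after if k in before}
print(delta["vllm:num_requests_total"])  # 512.0
```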
Results appear in the workspace under:

```
results/
  {experiment_id}/
    metrics/raw/            -- raw Prometheus-format metrics per pod
    metrics/processed/      -- aggregated metrics_summary.json
    logs/
      epp_pods.log          -- EPP pod logs
      igw_pods.log          -- Gateway pod logs
      modelserving_pods.log -- Decode/prefill pod logs
      pod_status.txt        -- Pod status snapshot
    epp_metrics/            -- EPP analysis output (if available)
```
```bash
# Google Cloud Storage
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml -r gs://my-bucket/results/

# Amazon S3
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml -r s3://my-bucket/results/
```

Run only a subset of steps with the -s filter:

```bash
# Only deploy and wait (skip cleanup, analysis, upload)
llmdbenchmark --spec guides/inference-scheduling run -p <NS> \
  -l inference-perf -w sanity_random.yaml -s 0-8

# Only collect existing results and analyze
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -s 8,11

# Only clean up leftover pods
llmdbenchmark --spec guides/inference-scheduling run -p <NS> -s 10
```

The run supports three modes of treatment generation:
- Single treatment (default) -- one harness pod runs the workload profile as-is
- Override treatments -- `-o` modifies profile parameters for a single treatment
- Experiment treatments -- `-e` generates a matrix of treatments from factor/level combinations
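In the override mode, the `-o` string reads naturally as a flat `param=value` list layered on top of the rendered profile. A sketch under that assumption (`apply_overrides` is a hypothetical helper, not the tool's code):

```python
def apply_overrides(profile: dict, overrides: str) -> dict:
    """Hypothetical sketch: apply "param=value,..." on top of a workload profile."""
    result = dict(profile)  # leave the original profile untouched
    for pair in overrides.split(","):
        key, _, raw = pair.partition("=")
        key, raw = key.strip(), raw.strip()
        try:
            value = int(raw)  # keep numeric overrides numeric
        except ValueError:
            value = raw
        result[key] = value
    return result

profile = {"concurrency": 8, "duration": 60, "max_tokens": 128}
print(apply_overrides(profile, "concurrency=32,duration=300,max_tokens=512"))
# {'concurrency': 32, 'duration': 300, 'max_tokens': 512}
```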
Step 04 handles the treatment rendering. Step 06 executes them sequentially: for each treatment, it deploys harness pod(s), waits for completion, collects results and logs, then cleans up before the next treatment.
Results are collected to two locations:

- Local workspace -- Step 08 copies results from the harness PVC to the local workspace under `results/`. Each treatment gets its own subdirectory named `{experiment_id}_{parallel_idx}`.
- PVC -- Results persist on the harness PVC (`workload-pvc`) until teardown.
The run summary at the end shows both locations:

```
BENCHMARK RUN SUMMARY
Local results: /Users/user/data/pd-disaggregation/vezio-20260321-211419-773/results
PVC results:   oc exec -n ns $(oc get pod -n ns -l role=llm-d-benchmark-data-access ...) -- ls /requests/
```
In dry-run mode (`--dry-run`):
- Steps 00-05 log what they would do without modifying the cluster
- Step 06 logs the harness pod spec without deploying
- Steps 08-11 skip file operations and cloud uploads, logging what would happen
- All logged commands show the exact kubectl/oc command that would execute
```
run/
├── __init__.py -- Package marker
└── steps/
    ├── __init__.py -- Step registry (get_run_steps)
    ├── step_00_preflight.py
    ├── step_01_cleanup_previous.py
    ├── step_02_detect_endpoint.py
    ├── step_03_verify_model.py
    ├── step_04_render_profiles.py
    ├── step_05_create_profile_configmap.py
    ├── step_06_deploy_harness.py
    ├── step_07_wait_completion.py
    ├── step_08_collect_results.py
    ├── step_09_upload_results.py
    ├── step_10_cleanup_post.py
    └── step_11_analyze_results.py
```