vLLM A/B Benchmark

Reusable harness for comparing vLLM branches side-by-side. Define runs in YAML — the tool handles cloning, building, server lifecycle, and result collection.

Usage

# Full pipeline: build + benchmark
vllm-bench run configs/mla_quant_fusion/b200_fp8.yaml

# Individual steps
vllm-bench build configs/mla_quant_fusion/b200_fp8.yaml
vllm-bench bench configs/mla_quant_fusion/b200_fp8.yaml

# Filter to specific runs
vllm-bench run configs/mla_quant_fusion/b200_fp8.yaml --run main_baseline --run feature_fuse_on

# Override build/server settings
vllm-bench run configs/my_test.yaml --port 9000 --max-jobs 8

# Clean caches (flashinfer, torch compile, triton, venvs)
vllm-bench clean configs/my_test.yaml
vllm-bench clean --all  # also remove huggingface model cache

Force rebuild via environment variable:

FORCE_BUILD=1 vllm-bench run configs/my_test.yaml

Config

Runs are organized under branches. Build config is set at the global or branch level. Server and bench config can be overridden at global, branch, or run level.

project:
  repo: https://github.com/vllm-project/vllm.git
  model: RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8

build:
  max_jobs: 0.8              # <=1: fraction of CPU cores, >1: absolute

server:
  tp: 1
  max_model_len: 4096

bench:
  num_prompts: 1000
  input_len: 128
  output_len: 128
  request_rate: 50

branches:
  main:
    runs:
      - label: main_warmup
      - label: main_baseline

  my-feature-branch:
    build:
      use_precompiled: false   # branch-level build override
    runs:
      - label: feature_off
        server:                # per-run server override
          compilation_config:
            pass_config:
              fuse_attn_quant: false

      - label: feature_on
        server:
          compilation_config:
            pass_config:
              fuse_attn_quant: true

Config hierarchy

Level                         build  server  bench  test  eval
Global (top-level)            yes    yes     yes    yes   yes
Branch (branches.<name>)      yes    yes     yes    yes   yes
Run (branches.<name>.runs[])  no     yes     yes    yes   yes

Effective config is merged in order: global -> branch -> run for server, bench, test, and eval; build merges only global -> branch.
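
For example, with the config above, the feature_on run resolves to the effective settings below. This is an illustrative sketch of the merge result, not tool output:

# effective config for the feature_on run (illustrative)
build:                        # global -> branch; runs cannot override build
  max_jobs: 0.8               # from global
  use_precompiled: false      # from branch my-feature-branch
server:                       # global -> branch -> run
  tp: 1                       # from global
  max_model_len: 4096         # from global
  compilation_config:         # from the run
    pass_config:
      fuse_attn_quant: true
bench:                        # only global values are set in this example
  num_prompts: 1000
  input_len: 128
  output_len: 128
  request_rate: 50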

Build options

Field               Default  Description
use_precompiled     true     Use precompiled vllm wheel (main branch only)
cuda_arch           auto     CUDA architecture (e.g. "9.0", "12.1")
max_jobs            1.0      Build parallelism (<=1: fraction of cores)
install_flash_attn  false    Install flash-attn from source
torch_index         cu130    PyTorch wheel index URL
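
As a sketch, a branch-level build block exercising these fields might look like the following (values are illustrative, not recommendations):

build:
  use_precompiled: false   # precompiled wheels apply to the main branch only
  cuda_arch: "9.0"         # override auto-detection
  max_jobs: 16             # >1: absolute job count
  install_flash_attn: true
  torch_index: cu130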

Server options

Field                   Default  Description
tp                      1        Tensor parallel size
max_model_len           4096     Maximum sequence length
enforce_eager           false    Skip CUDA graph capture
gpu_memory_utilization  none     GPU memory fraction
port                    8000     Server port
wait_timeout            600      Seconds to wait for server startup
compilation_config      none     vllm compilation config (JSON)
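
These fields map onto standard vllm serve flags. The exact command the harness generates is an assumption here, but a run using the defaults above plus the example compilation_config corresponds roughly to:

vllm serve RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --port 8000 \
  --compilation-config '{"pass_config": {"fuse_attn_quant": true}}'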

Bench options

Field           Default  Description
num_prompts     1000     Number of benchmark requests
input_len       128      Random input length
output_len      128      Random output length
request_rate    inf      Requests per second
warmup_prompts  3        Warmup requests before benchmark
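
Because bench settings merge at the run level, a request-rate sweep on one branch is just multiple runs with per-run overrides (illustrative labels):

branches:
  main:
    runs:
      - label: main_qps10
        bench:
          request_rate: 10
      - label: main_qps50
        bench:
          request_rate: 50
      - label: main_unthrottled   # no override: request_rate defaults to inf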

Test options

Field   Default  Description
script  none     Pytest target relative to repo root (e.g. tests/models/test_mla.py)
args    none     Additional pytest CLI args (e.g. -x -v --timeout=600)

Tests run directly in the built venv — no server is started.
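
A minimal test block using the fields above (whether args is a single string or a list is an assumption; a string is shown):

test:
  script: tests/models/test_mla.py
  args: -x -v --timeout=600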

Subcommands

Command  Description
run      Build + benchmark (full pipeline)
build    Clone and build all branches
test     Run pytest tests (builds must already exist)
bench    Benchmark only (builds must already exist)
compile  Pre-compile CUDA graphs (start/check/stop)
clean    Remove caches (flashinfer, torch compile, triton, venvs)
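
A typical iteration loop, assuming every subcommand takes the same config path as run does:

# build once, then iterate without rebuilding
vllm-bench build configs/my_test.yaml
vllm-bench test configs/my_test.yaml
vllm-bench bench configs/my_test.yaml
vllm-bench clean configs/my_test.yaml   # drop caches when finished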

Build caching

Builds are cached per branch and skipped when the commit and build config haven't changed. Sequential execution allows later builds to reuse uv's package cache from earlier builds.

After install, the tool verifies that torch's CUDA version matches the system CUDA version. Precompiled vllm wheels may resolve torch with CUDA 12 on CUDA 13 systems, causing cuBLAS initialization failures on SM 100+ GPUs.
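
The same check can be reproduced by hand with the python from the built venv (the venv location is managed by the tool):

# torch's bundled CUDA version should match the system toolkit
python -c "import torch; print(torch.version.cuda)"   # e.g. 13.0
nvcc --version | grep release                         # e.g. release 13.0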

Environment Variables

Variable     Description
FORCE_BUILD  Set to 1 to rebuild even if cached

Requirements

  • Python 3.12+
  • uv
  • CUDA toolkit
