Tools for benchmarking vLLM performance with RBLN backend support.
- `running_benchmarks.py`: The main automation script. It parses the YAML configuration and sequentially executes benchmarks by invoking `benchmark.sh` for each scenario.
- `benchmark.sh`: The execution script that handles the lifecycle of a single benchmark run. It:
  - Starts the vLLM inference server (`torch_compile` or `optimum`).
  - Verifies server health.
  - Executes the `guidellm` benchmark command.
  - Performs cleanup (process termination) after completion.
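The health-verification step can be sketched as a simple polling loop. This is an illustrative sketch only: the endpoint path, poll interval, and timeout below are assumptions, not necessarily what `benchmark.sh` actually uses.

```shell
# Hypothetical health-check helper (sketch; benchmark.sh's real logic may differ).
# Polls the server's /health endpoint until it responds or a timeout elapses.
wait_for_server() {
  local url="$1" timeout="${2:-300}" interval="${3:-5}" elapsed=0
  until curl -sf "${url}/health" > /dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "server at ${url} not healthy after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```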
- `benchmark_guide.yml`: A YAML file used to define multiple benchmark scenarios. Key parameters include:
  - `model`: Hugging Face model ID.
  - `platform`: Backend to use (`torch_compile` or `optimum`).
  - `max_num_sequences`: List of concurrency levels (batch sizes) to test.
  - `tp_size`: Tensor parallelism size.
  - `max_seq_len`: Maximum sequence length.
  - `block_size`: Block size for PagedAttention.
  - `len`: Input/output token length.
  - `duration`: Duration of the benchmark run.
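A minimal scenario file might look like the following. The exact schema (key nesting, list shapes) is an assumption based on the parameter list above, and the model ID is only a placeholder:

```yaml
# Hypothetical benchmark_guide.yml sketch; field nesting is assumed, not verified.
benchmarks:
  - model: meta-llama/Llama-3.1-8B-Instruct  # Hugging Face model ID (placeholder)
    platform: torch_compile                  # or: optimum
    max_num_sequences: [1, 4, 16]            # concurrency levels to sweep
    tp_size: 1                               # tensor parallelism size
    max_seq_len: 4096
    block_size: 16                           # PagedAttention block size
    len: 1024                                # input/output token length
    duration: 120                            # seconds per benchmark run
```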
- Configure your scenarios in `benchmark_guide.yml`.
- Execute the runner script:

  ```bash
  python running_benchmarks.py

  # Options:
  # python running_benchmarks.py --guide-file custom_config.yml --benchmark-script ./custom_benchmark.sh
  ```

Use `extract_metrics.py` to parse the generated CSV output and display a summary of key performance metrics (Throughput, TTFT, TPOT, ITL).
```bash
python extract_metrics.py path/to/results/benchmarks.csv
```

Example output:
```text
Output Throughput (Mean) , TTFT (Median) , TTFT (Mean) , TTFT (Max) , TPOT (Median) , TPOT (Mean) , TPOT (Max) , ITL (Median) , ITL (Mean) , ITL (Max)
115.973 , 238.557 , 239.008 , 255.759 , 8.622 , 8.626 , 8.692 , 8.511 , 8.513 , 8.579
```
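A summary in this two-line layout can also be consumed programmatically. The sketch below assumes only the comma-separated header/value format shown above; it is not `extract_metrics.py`'s actual implementation, and `parse_metrics_summary` is a hypothetical helper name.

```python
import csv
import io

def parse_metrics_summary(text: str) -> dict[str, float]:
    """Parse a two-line 'header row, value row' summary into a name -> value dict.

    Sketch only: assumes the comma-separated layout shown in the example
    output; the real extract_metrics.py CSV structure may differ.
    """
    reader = csv.reader(io.StringIO(text.strip()))
    header = [name.strip() for name in next(reader)]
    values = [float(value) for value in next(reader)]
    return dict(zip(header, values))

summary = """Output Throughput (Mean) , TTFT (Median) , TTFT (Mean) , TTFT (Max)
115.973 , 238.557 , 239.008 , 255.759"""

metrics = parse_metrics_summary(summary)
# e.g. metrics["TTFT (Median)"] == 238.557
```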