A simple solution for benchmarking vLLM, SGLang, and TensorRT-LLM on Modal with guidellm. ⏱️
Install the dependencies:

```bash
pip install -r requirements.txt
```
To run a single benchmark, use the `run_benchmark` command, which saves the results to a local file. For example, to run a synchronous-rate benchmark with vLLM and save the results to `results.json`:

```bash
MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
OUTPUT_PATH=results.json

modal run -w $OUTPUT_PATH -m cli.run_benchmark --model $MODEL --llm-server-type vllm
```
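The layout of the saved file depends on the guidellm version in use, so here is a minimal, schema-agnostic sketch for inspecting whatever `results.json` contains:

```python
import json

# Load the benchmark output written by run_benchmark (the path set in OUTPUT_PATH above).
with open("results.json") as f:
    results = json.load(f)

# Print the top-level structure without assuming a particular schema.
if isinstance(results, dict):
    for key, value in results.items():
        print(f"{key}: {type(value).__name__}")
else:
    print(f"Loaded a {type(results).__name__} with {len(results)} entries")
```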
Or, to run a fixed-rate multi-GPU benchmark with SGLang:

```bash
GPU_COUNT=4
MODEL=meta-llama/Llama-3.3-70B-Instruct
REQUESTS_PER_SECOND=5

modal run -w $OUTPUT_PATH -m cli.run_benchmark --gpu "H100:$GPU_COUNT" --model $MODEL --llm-server-type sglang --rate-type constant --rate $REQUESTS_PER_SECOND --llm-server-config "{\"extra_args\": [\"--tp-size\", \"$GPU_COUNT\"]}"
```
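The value passed to `--llm-server-config` is a JSON string, which is easy to get wrong with shell escaping. One way to sidestep that is to build the string with `json.dumps` and paste the output into the command (a small sketch, reusing the `extra_args` key from the example above):

```python
import json

gpu_count = 4

# Build the --llm-server-config value programmatically to avoid escaping mistakes.
server_config = json.dumps({"extra_args": ["--tp-size", str(gpu_count)]})
print(server_config)  # {"extra_args": ["--tp-size", "4"]}
```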
Or, to run a throughput test with TensorRT-LLM:

```bash
modal run -w $OUTPUT_PATH -m cli.run_benchmark --model $MODEL --llm-server-type tensorrt-llm --rate-type throughput
```
To run multiple benchmarks at once, first deploy the Datasette UI, which lets you easily view the results later:

```bash
modal deploy -m stopwatch
```
Then, start a benchmark suite from a configuration file:

```bash
modal run -m cli.run_benchmark_suite --config-path configs/llama3.yaml
```
Once the suite has finished, you will be given a URL to a UI where you can view the results, along with a command to download them as a JSONL file.
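Each line of the downloaded JSONL file is a standalone JSON object; the exact fields depend on the suite configuration, so this is only a sketch for loading it (the `results.jsonl` filename is a placeholder):

```python
import json

# Collect one parsed result per non-empty line.
rows = []
with open("results.jsonl") as f:  # placeholder filename
    for line in f:
        line = line.strip()
        if line:
            rows.append(json.loads(line))

print(f"Loaded {len(rows)} benchmark results")
if rows and isinstance(rows[0], dict):
    print("Fields in the first result:", sorted(rows[0].keys()))
```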
To profile vLLM with the PyTorch profiler, use the following command:

```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
OUTPUT_PATH=trace.json.gz

modal run -w $OUTPUT_PATH -m cli.run_profiler --model $MODEL --num-requests 10
```
Once the profiling is done, the trace will be saved to `trace.json.gz`, which you can open and visualize at https://ui.perfetto.dev.
Keep in mind that generated traces can get very large, so it is recommended to only send a few requests while profiling.
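The file is a gzipped trace in the Chrome trace event format that the PyTorch profiler emits, so you can get a rough sense of its size before uploading it to Perfetto (a sketch; the `traceEvents` key is part of that format, not specific to Stopwatch):

```python
import gzip
import json

# Open the gzipped trace written by the profiler.
with gzip.open("trace.json.gz", "rt") as f:
    trace = json.load(f)

# "traceEvents" holds the event list in the Chrome trace event format;
# fall back to treating the file as a bare event list if it is absent.
events = trace.get("traceEvents", trace) if isinstance(trace, dict) else trace
print(f"Trace contains {len(events)} events")
```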
Stopwatch is available under the MIT license. See the LICENSE file for more details.