The configuration explorer is a library that helps find cost-effective, optimal configurations for serving models on llm-d, based on hardware specifications, workload characteristics, and SLO requirements. A CLI and a web-app front end are available to use the library immediately.
Features include:
- Capacity planning:
  - Get per-GPU memory requirements to load and serve a model, and compare parallelism strategies.
  - Determine KV cache memory requirements based on workload characteristics.
  - Estimate peak activation memory, CUDA graph overhead, and non-torch memory for accurate capacity planning (see the empirical results for intermediate memory).
- GPU recommendation:
  - Recommend GPU configurations using BentoML's llm-optimizer roofline algorithm.
  - Analyze throughput, latency (TTFT, ITL, E2E), and concurrency trade-offs across different GPU types.
  - Export recommendations in JSON format for integration with other tools.
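To make the KV cache sizing concrete, the standard per-token formula (two tensors, K and V, per layer) can be computed directly. This is a minimal sketch of the arithmetic involved; the model-shape numbers below are illustrative assumptions for a small GQA model, not values read from config_explorer:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * dtype_bytes * seq_len * batch_size)

# Assumed shape for a ~3B GQA model: 36 layers, 2 KV heads, head_dim 128,
# fp16/bf16 (2 bytes per element).
gib = kv_cache_bytes(36, 2, 128, seq_len=16000, batch_size=1) / 2**30
print(f"KV cache for one 16k-token sequence: {gib:.2f} GiB")
```

Multiplying by the target concurrency gives the total KV cache budget, which is the quantity the capacity planner balances against GPU memory left over after weights and activations.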
Core functionality is currently a Python module within llm-d-benchmark. In the future, we may consider shipping it as a separate package depending on community interest.

Requires Python 3.11+.

- (optional) Set up a Python virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Clone the llm-d-benchmark repository and install the `config_explorer` Python module:

  ```bash
  git clone https://github.com/llm-d/llm-d-benchmark.git
  cd llm-d-benchmark
  pip install -e ./config_explorer
  ```
After installation, the `config-explorer` command becomes available:
```bash
# Run capacity planning
config-explorer plan --model Qwen/Qwen2.5-3B --gpu-memory 80 --max-model-len 16000

# Run GPU recommendation and performance estimation (BentoML's roofline model)
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --max-gpus 8

# Human-readable output
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --pretty

# Override GPU costs with custom pricing
config-explorer estimate --model Qwen/Qwen2.5-3B \
    --input-len 512 --output-len 128 \
    --custom-gpu-cost H100:30.50 \
    --custom-gpu-cost A100:22 \
    --custom-gpu-cost L40:25.00 \
    --pretty

# Start the Streamlit web app
pip install -r requirements-streamlit.txt  # one-time installation (run from config_explorer/ dir)
config-explorer start

# Get help
config-explorer --help
```

A Streamlit front end is provided to showcase the capabilities of the Configuration Explorer in a more intuitive way. Before using this front end, additional requirements must be installed.
After installing the Streamlit requirements (`pip install -r requirements-streamlit.txt`), the web app may then be started with:

```bash
cd config_explorer  # must run from within the config_explorer directory
config-explorer start
```

The Streamlit frontend includes the following pages:
- Capacity Planner - Analyze GPU memory requirements and capacity planning for LLM models
- GPU Recommender - Get optimal GPU recommendations based on model and workload requirements
The GPU Recommender page helps you find the optimal GPU for running LLM inference. To use it:
- Configure Model: Enter a HuggingFace model ID (e.g., meta-llama/Llama-2-7b-hf)
- Set Workload Parameters:
  - Input sequence length (tokens)
  - Output sequence length (tokens)
  - Maximum number of GPUs
- Define Constraints (Optional):
  - Maximum Time to First Token (TTFT) in milliseconds
  - Maximum Inter-Token Latency (ITL) in milliseconds
  - Maximum End-to-End Latency in seconds
- Run Analysis: Click the "Run Analysis" button to evaluate all available GPUs
- Review Results:
  - Compare GPUs through interactive visualizations
  - Examine throughput, latency metrics, and optimal concurrency
  - View detailed analysis for each GPU
- Export: Download results as JSON or CSV for further analysis
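Conceptually, the optional constraints act as an SLO filter over the per-GPU estimates: any GPU whose predicted TTFT, ITL, or end-to-end latency exceeds a limit is excluded. A minimal sketch of that filtering logic, using hypothetical field names and made-up numbers (not the tool's actual result schema):

```python
# Hypothetical per-GPU estimates; field names and values are illustrative only.
results = [
    {"gpu": "H100", "ttft_ms": 45.0,  "itl_ms": 8.0,  "e2e_s": 1.1},
    {"gpu": "A100", "ttft_ms": 120.0, "itl_ms": 15.0, "e2e_s": 2.0},
    {"gpu": "L40",  "ttft_ms": 260.0, "itl_ms": 22.0, "e2e_s": 3.1},
]

def meets_slo(row, max_ttft_ms=None, max_itl_ms=None, max_e2e_s=None):
    """A constraint left as None is not enforced."""
    return ((max_ttft_ms is None or row["ttft_ms"] <= max_ttft_ms)
            and (max_itl_ms is None or row["itl_ms"] <= max_itl_ms)
            and (max_e2e_s is None or row["e2e_s"] <= max_e2e_s))

ok = [r["gpu"] for r in results if meets_slo(r, max_ttft_ms=150, max_itl_ms=20)]
print(ok)  # → ['H100', 'A100']
```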
The GPU Recommender uses BentoML's llm-optimizer roofline algorithm to provide synthetic performance estimates across different GPU types, helping you make informed decisions about hardware selection.
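The roofline idea itself is simple: each phase of inference is bounded either by compute throughput or by memory bandwidth, and the estimated latency is the larger of the two times. The sketch below illustrates this for a single decode step; the peak numbers are nominal spec-sheet figures for an H100-class GPU and the 2-FLOPs-per-parameter approximation is a standard simplification, not llm-optimizer's actual model:

```python
def roofline_latency_ms(flops, bytes_moved, peak_flops, mem_bw_bytes):
    """A kernel is limited by whichever resource it exhausts first."""
    return max(flops / peak_flops, bytes_moved / mem_bw_bytes) * 1000

# Assumed hardware numbers (approximate H100 spec-sheet values):
peak_flops = 989e12   # fp16 dense tensor-core FLOP/s
mem_bw = 3.35e12      # HBM bandwidth, bytes/s

# One decode token for a ~3B-param fp16 model:
# ~2 FLOPs per parameter, and the weights (2 bytes/param) are read once.
P = 3e9
itl_ms = roofline_latency_ms(2 * P, 2 * P, peak_flops, mem_bw)
print(f"Estimated inter-token latency: {itl_ms:.2f} ms")  # memory-bound here
```

Decode lands on the memory-bandwidth side of the roofline at low batch sizes, which is why increasing concurrency raises throughput with little effect on ITL until the compute bound is reached.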
Note: You'll need a HuggingFace token set as the HF_TOKEN environment variable to access gated models.
The GPU Recommender displays cost information to help you find cost-effective GPU configurations:
- Default GPU Costs: Built-in reference costs for common GPUs (H200, H100, A100, L40, etc.)
- Custom Cost Override: Specify your own GPU costs using any numbers you prefer (e.g., your actual $/hour or $/token pricing)
- Cost-Based Sorting: Sort results by cost to find the most economical option
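Cost-based comparison typically normalizes hourly GPU price against estimated throughput to get a per-token figure. A hedged sketch of that arithmetic, reusing the hourly prices from the CLI example above but with made-up throughput numbers (not output of the tool):

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Normalize an hourly GPU price to a $/1M-tokens figure."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# (hourly price, assumed tokens/s) — throughputs are illustrative only.
candidates = {"H100": (30.50, 2400.0), "A100": (22.00, 1100.0), "L40": (25.00, 700.0)}

ranked = sorted(candidates, key=lambda g: cost_per_million_tokens(*candidates[g]))
print(ranked)  # → ['H100', 'A100', 'L40']
```

Note that the most expensive GPU per hour can still be the cheapest per token once throughput is factored in, which is exactly the trade-off cost-based sorting surfaces.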
For GPU recommender API usage see ./examples/gpu_recommender_example.py.