Closed
Changes from all commits
581 commits
f8a20f6
[Standup] Convert step 5 to python (#480)
maugustosilva Oct 29, 2025
0f65966
Initial Integration of WVA (#481)
Vezio Oct 29, 2025
ef9eb94
GuideLLM v0.4.0 Enablement (#479)
sjmonson Oct 29, 2025
7fabfa1
Fix: Quote annotations (#483)
sjmonson Oct 30, 2025
f9b521e
fix version of modelservice (#486)
kalantar Oct 30, 2025
2fd4a3f
Allow import and conversion of all runs in GuideLLM results (#487)
namasl Oct 31, 2025
3b833c6
Add ability to set bounds in config explorer library (#484)
namasl Oct 31, 2025
dcd65fe
fix api handling (#488)
Vezio Oct 31, 2025
8845a44
Add more GuideLLM parameters to DataFrame (#490)
namasl Nov 2, 2025
ec19466
Enable bounds in i/o length in config UI (#491)
jgchn Nov 3, 2025
d60d8a5
Remove ISL/OSL binning hack now that config explorer UI supports boun…
namasl Nov 3, 2025
86f346d
Allow symbolic link recursion for Python versions that support it (#493)
namasl Nov 3, 2025
7e754f2
[Run] Removed `fmperf` as a supported harness (#498)
maugustosilva Nov 6, 2025
2ac07a5
Add warmup to workload profile, disable prefix caching where not eval…
namasl Nov 7, 2025
014c9b3
Revert unsupported warmup, move no-enable-prefix-caching to correct s…
namasl Nov 7, 2025
7ae4141
[Standup] Fix for issue 495 (#501)
maugustosilva Nov 7, 2025
ade556e
Initializing KubeCon tutorial (#497)
jgchn Nov 10, 2025
1652027
[Standup] bug fix for failed k8s context in setup/functions.py (#503)
mengmeiye Nov 10, 2025
526b9c0
[KubeCon NA 2025] Add a basic getting-started tutoral (#504)
sjmonson Nov 11, 2025
bb7e27e
[Admin] Update owners file (#505)
maugustosilva Nov 11, 2025
6e1cd0c
Restore `get_image` in `functions.sh` (#507)
maugustosilva Nov 11, 2025
2c66ae0
Fill in PD tutorial (#508)
jgchn Nov 11, 2025
70e6404
Add more details to PD tutorial (#509)
jgchn Nov 11, 2025
c7ce070
Kubecon tutorial (#510)
jjk-g Nov 11, 2025
9b56c90
Re-add Kubernetes as a dependency on pod (#512)
maugustosilva Nov 11, 2025
15a2c68
[fix] Use -p/--namespace when kubeconfig context has no namespace set…
petecheslock Nov 12, 2025
7b7c595
Allow configuration explorer to import benchmark reports with limited…
namasl Nov 12, 2025
06dc3b8
Benchmark tweaks (#515)
namasl Nov 12, 2025
64f0dac
[Standup] Added an example of profiling (with nsys) for standalone (#…
maugustosilva Nov 13, 2025
061a0c3
Add LLMDBENCH_VLLM_STANDALONE_ENABLE_SLEEP_MODE env. to make sleep/wa…
manoelmarques Nov 13, 2025
a572a21
Add GPU persistence mode to benchmark report (#520)
manoelmarques Nov 14, 2025
9f2e030
use newer modelservice (#521)
kalantar Nov 17, 2025
88613e3
[Standup/Run] Significant (python) code refactor and bugfixes (#519)
maugustosilva Nov 18, 2025
f06d0ac
minor grep change for hf secret (#523)
effi-ofer Nov 18, 2025
7595c7d
[Standup] Add a "standalone" test for pre-merge CI/CD (Kind) (#525)
maugustosilva Nov 19, 2025
a5f1f94
Add convert to benchmark report for InferenceMAX (#528)
namasl Nov 20, 2025
d31b305
Updates in preparation for release 0.4 (#527)
maugustosilva Nov 20, 2025
0be84a4
Add Inferencemax (#529)
mengmeiye Nov 21, 2025
c76e78a
[Standup] Add feature for early detection of pod crash (#526)
mengmeiye Nov 21, 2025
e91d285
Added "tiered-prefix-cache" well-lit path (#530)
maugustosilva Nov 21, 2025
b5265a8
Update conversion to match GuideLLM's updated schema (#533)
namasl Nov 21, 2025
14dab49
[Run] Better support for "interactive run" (#532)
maugustosilva Nov 21, 2025
47e4d85
Enable Deploying One or More Harness Pods (#531)
Vezio Nov 21, 2025
7d2f026
Preparing release 0.4 (#535)
maugustosilva Nov 21, 2025
3b66bd8
Add load parallelism to benchmark report (#534)
namasl Nov 21, 2025
8cc07b0
Fix capacity planner bug on unknown inference kv cache type for quant…
jgchn Nov 24, 2025
ed4262a
Fix new command line option -j/--parallelism (#536)
maugustosilva Nov 24, 2025
bf8fb01
[Standup] use the llm-d-benchmark image to on the job to download mod…
maugustosilva Nov 24, 2025
16cd8ff
Add detection of init:CrashLoopBackOff for pods (#539)
mengmeiye Nov 24, 2025
04802fa
Bugfix pod crash detection (#541)
mengmeiye Nov 25, 2025
9ea412d
move load_kube_config out of wait_for_pods_created_running_ready() (#…
mengmeiye Nov 25, 2025
0d4351a
spyre fixes (#542)
kalantar Nov 25, 2025
0094f6b
[Standup] Additional fixes to allow standup to use non-gpu accelerato…
maugustosilva Nov 26, 2025
e9ffcdd
[Standup] Added a new cpu-only (NOT simulated) example (#545)
maugustosilva Nov 26, 2025
beeb141
[Standup] In case *.gateway.networking.k8s.io CRDs are found, do not …
maugustosilva Dec 2, 2025
cea7f28
handle deploy methods when vllm name contains modelservice or standal…
dmitripikus Dec 2, 2025
0a59726
Add experiment ID to benchmark report (#550)
namasl Dec 2, 2025
1abeb83
fix port selection when deploy method is a vLLM pod (#548)
deanlorenz Dec 2, 2025
87e54bb
standup.sh & teardown.sh for non-cluster-level-admin users (#546)
NaomiEisen Dec 3, 2025
07c7e36
[Standup] Functional deployments with both `istio` and `kgateway` (#551)
maugustosilva Dec 3, 2025
9f2600a
parallelism options (#554)
kalantar Dec 7, 2025
53c8110
[Standup] Serve models from read-only pvcs (#553)
maugustosilva Dec 8, 2025
bb26e74
accelerator update (#558)
kalantar Dec 8, 2025
6091db9
Small fix to have lists of models behave correctly in setup/functions…
galmasi Dec 8, 2025
6754f61
Fix virtual environment detection in install_deps.sh (#557)
sagearc Dec 9, 2025
151c0b1
[Standup] Improvements for deployments with NIXL (and Wide-EP) (#562)
maugustosilva Dec 10, 2025
21677d7
Add Prerequisites to README (#563)
NaomiEisen Dec 10, 2025
7b63180
[Standup] small fixes for spyre and adding `set_llmdbench_environment…
maugustosilva Dec 11, 2025
dac6929
delete app=llm-d-benchmark-harness pods (#560)
effi-ofer Dec 11, 2025
1309792
[Standup] Compatibility with the new `istio` (1.28.1) (#567)
maugustosilva Dec 13, 2025
4f7f6d0
Prepare yaml configuration for run_only.sh (#565)
deanlorenz Dec 15, 2025
e221583
[Enhancement] Upgrade WVA to v0.4.1 Release (#571)
Vezio Dec 15, 2025
534dd31
[Bug Fix] Issue in Prometheus Installation (#572)
Vezio Dec 16, 2025
d91c01b
run in sep. ns (#574)
Vezio Dec 17, 2025
895598a
fix comment (#575)
Vezio Dec 17, 2025
2ecbef6
[Standup] Restore the ability to dump all generate yamls (#577)
maugustosilva Dec 18, 2025
cd81ed6
wva docs (#576)
Vezio Dec 18, 2025
0ceb2aa
gaie config for wide-ep (#570)
kalantar Dec 18, 2025
9fdb095
Final touches for release v0.4 (#578)
maugustosilva Dec 19, 2025
75e4e98
Last minute fixes in convert.py (#579)
maugustosilva Dec 19, 2025
33f1108
Set all tags for release v0.4 (#580)
maugustosilva Dec 19, 2025
bba0d0d
add multi turn chat workload template for inference-perf (#584)
rshavitt Jan 3, 2026
aa00864
Add precise-prefix epp config (#582)
NaomiEisen Jan 3, 2026
1209f6b
Benchmark runner for an existing stack (#566)
dmitripikus Jan 5, 2026
9487086
GPU recommender using llm-optimizer's roofline analysis tool (#583)
jgchn Jan 5, 2026
dd994c2
Add standalone inference server launcher benchmark (#573)
manoelmarques Jan 5, 2026
f9c051d
[Run] fix broken vllm-benchmark (#585)
maugustosilva Jan 5, 2026
cf3aae1
CLI for config explorer (#586)
jgchn Jan 7, 2026
18150d0
Clarify warning message in analysis notebook (#591)
namasl Jan 7, 2026
cfd206c
CLI subcommand for gpu recommender engine (#592)
jgchn Jan 8, 2026
a42a26a
Reorganize Benchmark Report code, add Benchmark Report v0.2 (#593)
namasl Jan 8, 2026
5b3c7f4
Fix benchmark report imports inside nop-analyze_results.py (#594)
manoelmarques Jan 8, 2026
49e0f6e
Add GPURecommender example to config explorer (#595)
jgchn Jan 9, 2026
4ae748e
Fix up component validation (#596)
namasl Jan 9, 2026
9a7de28
Clarify authentication error message in GPU recommender UI (#598)
jgchn Jan 9, 2026
3c87816
Tighten up config explorer README (#599)
namasl Jan 9, 2026
95536ee
Fix typo in README (#600)
namasl Jan 9, 2026
8c82f0b
This is large but well tested - **all** `examples` and most `guides` …
maugustosilva Jan 9, 2026
3dd365a
Fix launcher arguments when deploying vllm standalone (#603)
manoelmarques Jan 12, 2026
0ce16cc
[Standup] Remove `v0` as the fixed version when deploying as non-admi…
maugustosilva Jan 12, 2026
de42f2f
Add envars for expt times in ISO-8601 (#605)
namasl Jan 13, 2026
1bd775d
Add UID envar to harness pod (#606)
namasl Jan 13, 2026
d8605a0
[Run] Add support for pull secret in pods created by modelservice (#607)
maugustosilva Jan 13, 2026
ce931c8
Preserve parent process env. variables on call to benchmark-report (#…
manoelmarques Jan 13, 2026
c3416bc
[WVA] Additional Documentation on Running Experiments (#609)
Vezio Jan 14, 2026
0f121a8
Benchmark report code fixes and cleanup (#610)
namasl Jan 15, 2026
bfabbf7
Add ev argument to get_random_node_port (#612)
manoelmarques Jan 16, 2026
73b3c2e
[Standup] Automatically detected the presence of rdma (roce_gdr/ib) (…
maugustosilva Jan 19, 2026
4cc57ad
[Standup] Smoketest is now retriable. (#615)
maugustosilva Jan 19, 2026
9b70a72
Fix handling of potential null values in configuration explorer impor…
namasl Jan 20, 2026
e6a3715
Include creation of benchmark report v0.2 in harness pod. (#613)
namasl Jan 22, 2026
08fd6c2
Add support for latest benchmarking from vllm (#619)
namasl Jan 22, 2026
7e65b8a
[Standup] Improvements for smoketest (#618)
maugustosilva Jan 22, 2026
25f9872
[Standup] Improvements for deployment of Spyre accelerators (#620)
maugustosilva Jan 23, 2026
13f32a9
Add cost configuration to GPU recommender (#617)
jgchn Jan 23, 2026
52fb9fa
Update GKE CI (#623)
jjk-g Jan 26, 2026
0282dcf
Estimate activation and intermediate memory in capacity planner (#622)
jgchn Jan 27, 2026
c7b4c63
small fix to detect conda virtual env (#624)
mengmeiye Jan 27, 2026
e4f1448
Add component health to benchmark report (#625)
namasl Jan 27, 2026
b24ce7a
Add inference scheduler config to v0.2 benchmark report (#626)
namasl Jan 28, 2026
c29a523
Loosen required package versions (#631)
namasl Jan 29, 2026
cfc21a0
[Standup] Add all rendered environment variables as a configmap (#630)
maugustosilva Jan 30, 2026
f94c2cb
Fix inference-perf std -> std_dev (#633)
jjk-g Jan 30, 2026
598f11f
Check for OSL of `None` or `<1` in workload when creating benchark re…
namasl Feb 2, 2026
2efad69
add random concurrent workload template for inference-perf (#635)
huaxig Feb 2, 2026
39da67d
Feat(run_only): Add cloud storage support and remove PVC dependency (…
huaxig Feb 3, 2026
7a17717
Use parameters ConfigMap for benchmark report stack details (#634)
namasl Feb 3, 2026
0d181c8
Run_only handle disconnects (#636)
deanlorenz Feb 3, 2026
9b7c801
[Run] Capture the logs for all pods (from llm-d stack) at the end (#638)
maugustosilva Feb 3, 2026
f5f2d9a
[Run] Add the ability to upload results to object storage bucket (#639)
maugustosilva Feb 4, 2026
69394e3
Adding namespace in pods lookup kubectl command (#640)
gushob21 Feb 4, 2026
c33358d
Minor fix in config-explorer UI (#644)
jgchn Feb 6, 2026
91c8dbe
add log verbosity for gaie epp deployment (#643)
mengmeiye Feb 7, 2026
4b1afb7
Update inference-perf (#645)
jjk-g Feb 7, 2026
79b113f
[Standup] Enable the use of gateway provided by RHOAI (#646)
maugustosilva Feb 7, 2026
128b6fb
Fix: Quote path variables in shell scripts to handle space characters…
vknaik Feb 10, 2026
93646d4
[Run] Move the benchmark report generation to analysis. (#647)
maugustosilva Feb 10, 2026
70534bd
[Standup] Automatically add several VLLM-specific environment variabl…
maugustosilva Feb 11, 2026
927b4dc
Simplify usage of JSON schema generation (#651)
namasl Feb 12, 2026
ec67b89
Add Benchmark Report v0.2 JSON schema (#652)
namasl Feb 12, 2026
a2cd9d5
🌱 Standardize governance workflows via llm-d-infra (#650)
clubanderson Feb 12, 2026
d2b4fc4
🌱 Add typos config and Dependabot for automated dependency updates (#…
clubanderson Feb 12, 2026
f6ab588
🌱 Remove redundant auto-assign workflow (#659)
clubanderson Feb 12, 2026
26f653d
deps(actions): bump google-github-actions/auth from 2.1.12 to 3.0.0 (…
dependabot[bot] Feb 12, 2026
9e81e20
deps(actions): bump google-github-actions/setup-gcloud (#657)
dependabot[bot] Feb 12, 2026
040fea7
deps(actions): bump actions/upload-artifact from 4 to 6 (#658)
dependabot[bot] Feb 12, 2026
380cf49
deps(docker): bump python in /build (#660)
dependabot[bot] Feb 12, 2026
85b4a61
deps(actions): bump actions/setup-python from 5 to 6 (#655)
dependabot[bot] Feb 12, 2026
7b1b1de
support inference scheduling flags (#662)
kalantar Feb 12, 2026
20367d0
Lazy load kubernetes in benchmark report convert script (#663)
namasl Feb 13, 2026
e9662a7
Update benchmark report README (#664)
namasl Feb 13, 2026
65a9743
🐛 Fix broken reusable workflow references (#667)
clubanderson Feb 13, 2026
02ab48e
add custom dataset profile for vllm-benchmark and add request_timeout…
mengmeiye Feb 13, 2026
429dc38
🐛 Add failure diagnostics to nightly benchmark workflows (#670)
clubanderson Feb 13, 2026
8405a32
🐛 Allow accelerator count=0 in benchmark report schema for simulator …
clubanderson Feb 13, 2026
ef2a8b9
fix: build nightly image from main instead of using stale release tag…
clubanderson Feb 13, 2026
fbcaec5
🐛 Split image build into separate job on ubuntu-latest (#676)
clubanderson Feb 13, 2026
60d91f8
🐛 Revert Python to 3.13 — vllm requires <3.14 (#677)
clubanderson Feb 13, 2026
afca635
🐛 Build nightly image for amd64 only (fix arm64 scikit_build_core) (#…
clubanderson Feb 13, 2026
270a4ab
🐛 Fix ZeroDivisionError when vllm-benchmark has 0 completions (#679)
clubanderson Feb 14, 2026
ecd6184
🐛 Add connectivity wait to vllm-benchmark harness (#680)
clubanderson Feb 14, 2026
18ea0f9
🌱 Remove legacy typo and link checker workflows (#681)
clubanderson Feb 14, 2026
812fcf0
✨ Add GitHub Agentic Workflows for typo, link, and upstream checks (#…
clubanderson Feb 16, 2026
9a3cc8b
Signed-off-by: Dean H Lorenz <dean@il.ibm.com> (#649)
deanlorenz Feb 16, 2026
95645ac
fix: Quote path variables and fix config_explorer directory path reso…
vknaik Feb 16, 2026
9c27b4e
Downgrade the main image back to python 3.12.9 (#688)
maugustosilva Feb 16, 2026
edde2a3
add NUM_CPU_BLOCKS to vllm command (#687)
shashwatj07 Feb 16, 2026
1331bbe
Run only token (#686)
deanlorenz Feb 16, 2026
4370f33
Workload Variant Autoscaler Version Upgrade (#689)
Vezio Feb 16, 2026
d8a3834
Fix WVA NS Override (#690)
Vezio Feb 16, 2026
b70cf76
✨ Add CKS nightly benchmark workflow (#694)
clubanderson Feb 17, 2026
0259867
[Standup] Functional wide-ep-lws standup (on Openshift) (#692)
maugustosilva Feb 17, 2026
454ad3c
Remove readarray to support MAC (#696)
deanlorenz Feb 17, 2026
c1aaadd
fix redundant volume and volume mounts in the standalone yaml file (#…
mengmeiye Feb 17, 2026
516d91b
Allow for all fails reports from Inference Perf in Benchmark Report v…
namasl Feb 17, 2026
4a3a6de
🐛 Account for preemptible GPUs in CKS benchmark simulator fallback (#…
clubanderson Feb 18, 2026
2f27ece
WVA Version RC Bump (#703)
Vezio Feb 18, 2026
bcc969a
[Standup] Additional fixes for end-to-end wide-ep-lws deployment (#704)
maugustosilva Feb 19, 2026
86886a2
deps(actions): bump actions/download-artifact from 6.0.0 to 7.0.0 (#705)
dependabot[bot] Feb 20, 2026
2e5939b
deps(docker): bump python in /build (#706)
dependabot[bot] Feb 20, 2026
92041f2
deps(actions): bump actions/checkout from 4 to 6 (#707)
dependabot[bot] Feb 20, 2026
311db72
deps(actions): bump github/gh-aw from 0.45.0 to 0.46.2 (#708)
dependabot[bot] Feb 20, 2026
a1d0ce3
[Standup] Allow per-pod VLLM cli values. (#710)
maugustosilva Feb 20, 2026
a2aa0f8
Update config explorer tests (#711)
jgchn Feb 20, 2026
f3a92eb
allow routing sidecar to be disabled (#709)
kalantar Feb 20, 2026
77bf659
Add missing percentiles in vllm bench conversion (#712)
namasl Feb 20, 2026
58b18df
change dockerfile base image to 3.13 (#720)
mengmeiye Feb 23, 2026
1bfd9e4
[Standup] Allow preprocess to automatically tag model serving pods (#…
maugustosilva Feb 23, 2026
356b8e0
Update safetensor metadata retrieval (#723)
jgchn Feb 24, 2026
e724935
[Standup] Allow variables defined by `LLMDBENCH_VLLM_COMMON_ENVVARS_T…
maugustosilva Feb 25, 2026
0e1ca83
[Standup] Standardize the use non-default service accounts all steps …
maugustosilva Feb 25, 2026
53c3721
deps(actions): bump actions/checkout from 5.0.1 to 6.0.2 (#729)
dependabot[bot] Feb 26, 2026
30e0a12
deps(actions): bump github/gh-aw from 0.46.2 to 0.50.4 (#730)
dependabot[bot] Feb 26, 2026
2204fe8
Fill in stack details from ev.yaml if ConfigMap unavailable (#727)
namasl Feb 26, 2026
98d3414
Fix stray parenthesis (#731)
namasl Feb 26, 2026
87ab01b
Add architecture-aware activation memory estimation to capacity plann…
jgchn Feb 26, 2026
e98db5a
[Run] Remove a few environment variables from harness pods (#733)
maugustosilva Feb 26, 2026
1845a3c
Fix path handling for directories with spaces in run.sh and functions…
vknaik Feb 27, 2026
6639ce1
fixes rayon thread issue (#735)
Vezio Feb 27, 2026
a32079f
Priority Class Name Implementation (#737)
Vezio Feb 27, 2026
42c1ad8
Provide blank filler data when envar undefined (#738)
namasl Feb 27, 2026
32a73aa
fix default val (#739)
Vezio Feb 27, 2026
b9fcc8c
Bump GAIE chart version to v1.3.0 (#740)
Vezio Feb 27, 2026
b86879a
add pod monitor support and collect metrics data (#734)
mengmeiye Feb 27, 2026
8a4177c
✨ Add upstream auto-fix agentic workflow (#745)
clubanderson Mar 2, 2026
85ada2a
✨ Replace agentic workflow with Copilot SWE agent assignment (#747)
clubanderson Mar 2, 2026
ddaed37
🐛 Restore permissions needed for Copilot SWE agent assignment (#750)
clubanderson Mar 2, 2026
dd32ea3
🐛 Use COPILOT_PAT secret for Copilot SWE agent assignment (#753)
clubanderson Mar 2, 2026
f418aaa
📖 Populate upstream dependency version tracking (#744)
clubanderson Mar 2, 2026
bc0f04d
[Standup] fixes for pd-disaggregation (#756)
maugustosilva Mar 2, 2026
509755a
Remove model-storage volume mount from script (#758)
maugustosilva Mar 3, 2026
7a3fb26
reorg admin level deps (#743)
Vezio Mar 3, 2026
7f2b707
deps(actions): bump actions/upload-artifact from 6.0.0 to 7.0.0 (#765)
dependabot[bot] Mar 5, 2026
e89c2df
deps(actions): bump actions/download-artifact from 7.0.0 to 8.0.0 (#766)
dependabot[bot] Mar 5, 2026
c6cf622
deps(actions): bump github/gh-aw from 0.50.4 to 0.53.2 (#767)
dependabot[bot] Mar 5, 2026
7593b1a
Auto calculate max-model-len (#774)
jgchn Mar 6, 2026
dcadf39
Sync gh-aw workflows from llm-d-infra (a566d16) (#770)
clubanderson Mar 6, 2026
135f61a
Bump version 0.5.0 (#777)
maugustosilva Mar 6, 2026
111c6ff
🌱 Remove per-repo gh-aw typo/link/upstream workflows (#778)
clubanderson Mar 6, 2026
015ff59
⬆️ Bump yq from v4.45.4 to v4.45.5 (#748)
github-actions[bot] Mar 9, 2026
d2e2335
Fix logs for new vllm on nop harness (#781)
manoelmarques Mar 10, 2026
adf7d03
Add memory and cache metrics #2 (#742)
DolevAdas Mar 11, 2026
43aa522
[Experimental] Add a new production trace replay for real-world multi…
achandrasekar Mar 11, 2026
b9e84d4
Update GAIE InferencePool v1.3.0 to v1.3.1 (#830)
diegocastanibm Mar 11, 2026
08dee11
fix the bug where the metrics data failed to collect sometimes (#834)
mengmeiye Mar 11, 2026
d369a2e
update istio (#840)
diegocastanibm Mar 11, 2026
77c7ef8
update vllm (#837)
diegocastanibm Mar 11, 2026
aac309a
update yq (#836)
diegocastanibm Mar 11, 2026
90504a6
update inferecemax (#835)
diegocastanibm Mar 11, 2026
310f66c
update kgateway (#839)
diegocastanibm Mar 12, 2026
8bf152b
update helmfile to v1.4.1 (#832)
diegocastanibm Mar 12, 2026
9780bfb
update wva (#838)
diegocastanibm Mar 12, 2026
2e3c6e2
update inference-perf (#833)
diegocastanibm Mar 12, 2026
b503119
v0.5.3 tagged release (#831)
diegocastanibm Mar 12, 2026
de271e7
[Standup] Add the ability to use initContainers. (#851)
maugustosilva Mar 17, 2026
bc303fe
[Standup] Additional fixes (accelerator automatic selection) (#852)
maugustosilva Mar 18, 2026
5b6423c
🌱 Add missing governance files per CNCF audit (#783)
clubanderson Mar 18, 2026
2515dbe
Feat/small cluster config (#853)
michael-desmond Mar 20, 2026
7f11460
[Standup] Consolidate all sim scenarios (with small gateway pod) (#856)
maugustosilva Mar 20, 2026
c9a86bf
Fix metrics scrape (#854)
mengmeiye Mar 23, 2026
8ec5178
Fix standalone preprocess env. variable (#860)
manoelmarques Mar 23, 2026
3d83e02
Epp log scrape (#855)
mengmeiye Mar 23, 2026
05d7ed5
[Run] Add --repeat flag to repeat experiments N times with aggregatio…
jia-gao Mar 24, 2026
bb42822
remove accessLogging for helm chart schema validation error (#861)
mengmeiye Mar 25, 2026
30fe5a8
Add 'src/config_explorer/' from commit 'bb4282221d3e6a8623530a5420a03…
namasl Mar 26, 2026
d45edb4
Remove refs to benchmark report in config explorer
jgchn Mar 27, 2026
ecd96db
Merge pull request #122 from jgchn/conf-exp
namasl Mar 27, 2026
39b1a5d
Remove stale git/GitHub files
namasl Mar 27, 2026
2c647cf
Merge remote-tracking branch 'upstream/main' into HEAD
jgchn Mar 27, 2026
f02e937
Temporarily exclude src/config_explorer/ from ruff
namasl Mar 27, 2026
1 change: 1 addition & 0 deletions pyproject.toml
@@ -64,6 +64,7 @@ exclude = [
"dist",
"generated_configs",
"logs",
"src/config_explorer",
]

[tool.ruff.lint]
806 changes: 806 additions & 0 deletions src/config_explorer/Capacity_Planner.py

Large diffs are not rendered by default.

121 changes: 121 additions & 0 deletions src/config_explorer/README.md
@@ -0,0 +1,121 @@
# Configuration Explorer

The configuration explorer is a library that helps find the most cost-effective configuration for serving models on llm-d, based on hardware specifications, workload characteristics, and SLO requirements. A CLI and a web-app front end make the library usable out of the box.

Features include:

- **Capacity planning**:
- Get per-GPU memory requirements to load and serve a model, and compare parallelism strategies.
- Determine KV cache memory requirements based on workload characteristics.
  - Estimate peak activation memory, CUDA graph overhead, and non-torch memory for accurate capacity planning (see [empirical results for intermediate memory](./empirical-vllm-memory-results.md)).
- **GPU recommendation**:
- Recommend GPU configurations using BentoML's llm-optimizer roofline algorithm.
- Analyze throughput, latency (TTFT, ITL, E2E), and concurrency trade-offs across different GPU types.
- Export recommendations in JSON format for integration with other tools.

Core functionality currently ships as a Python module within `llm-d-benchmark`; depending on community interest, it may later be published as a separate package.

## Installation

**Requires Python 3.11+**

1. (optional) Set up a Python virtual environment

```bash
python -m venv .venv
source .venv/bin/activate
```

2. Install the `config_explorer` Python module after cloning the `llm-d-benchmark` repository.

```bash
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
pip install -e ./config_explorer
```

## Usage

## CLI

After installation, the `config-explorer` command will become available:

```bash
# Run capacity planning
config-explorer plan --model Qwen/Qwen2.5-3B --gpu-memory 80 --max-model-len 16000

# Run GPU recommendation and performance estimation (BentoML's roofline model)
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --max-gpus 8

# Human-readable output
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --pretty

# Override GPU costs with custom pricing
config-explorer estimate --model Qwen/Qwen2.5-3B \
--input-len 512 --output-len 128 \
--custom-gpu-cost H100:30.50 \
--custom-gpu-cost A100:22 \
--custom-gpu-cost L40:25.00 \
--pretty

# Start the Streamlit web app
pip install -r requirements-streamlit.txt # one-time installation (run from config_explorer/ dir)
config-explorer start

# Get help
config-explorer --help
```

## Web Application

A Streamlit front end showcases the capabilities of the Configuration Explorer in a more intuitive way. It has additional requirements that must be installed first.

After installing them (`pip install -r requirements-streamlit.txt`), start the web app with:
```bash
cd config_explorer # must run from within the config_explorer directory
config-explorer start
```

### Pages

The Streamlit frontend includes the following pages:

1. **Capacity Planner** - Analyze GPU memory requirements and capacity planning for LLM models
2. **GPU Recommender** - Get optimal GPU recommendations based on model and workload requirements

### Using the GPU Recommender

The GPU Recommender page helps you find the optimal GPU for running LLM inference. To use it:

1. **Configure Model**: Enter a HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`)
2. **Set Workload Parameters**:
- Input sequence length (tokens)
- Output sequence length (tokens)
- Maximum number of GPUs
3. **Define Constraints (Optional)**:
- Maximum Time to First Token (TTFT) in milliseconds
- Maximum Inter-Token Latency (ITL) in milliseconds
- Maximum End-to-End Latency in seconds
4. **Run Analysis**: Click the "Run Analysis" button to evaluate all available GPUs
5. **Review Results**:
- Compare GPUs through interactive visualizations
- Examine throughput, latency metrics, and optimal concurrency
- View detailed analysis for each GPU
6. **Export**: Download results as JSON or CSV for further analysis

The GPU Recommender uses BentoML's llm-optimizer roofline algorithm to provide synthetic performance estimates across different GPU types, helping you make informed decisions about hardware selection.
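The roofline idea itself is simple to sketch: each phase of inference is bounded either by compute (FLOPs over peak FLOP/s) or by memory traffic (bytes moved over bandwidth), and the larger of the two bounds wins. The sketch below is illustrative only, not llm-optimizer's implementation; the model size and hardware peaks are assumed numbers.

```python
def roofline_time_s(flops: float, bytes_moved: float,
                    peak_flops: float, mem_bw: float) -> float:
    """Time lower bound: the slower of the compute-bound and memory-bound estimates."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

# Single decode step for a hypothetical 8B-parameter fp16 model on an
# H100-class GPU (assumed peaks: ~1e15 FLOP/s, ~3.35e12 B/s HBM bandwidth).
flops = 2 * 8e9        # ~2 FLOPs per parameter per generated token
bytes_moved = 2 * 8e9  # weights read once per token at 2 bytes/param
t = roofline_time_s(flops, bytes_moved, 1e15, 3.35e12)
print(f"{t * 1e3:.2f} ms/token")  # memory-bound: dominated by weight traffic
```

Decode is memory-bound here (the bandwidth term dominates the compute term by two orders of magnitude), which is why the recommender weighs memory bandwidth heavily for latency-sensitive workloads.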

**Note**: You'll need a HuggingFace token set as the `HF_TOKEN` environment variable to access gated models.

### Cost Information

The GPU Recommender displays cost information to help you find cost-effective GPU configurations:

- **Default GPU Costs**: Built-in reference costs for common GPUs (H200, H100, A100, L40, etc.)
- **Custom Cost Override**: Specify your own GPU costs using any numbers you prefer (e.g., your actual $/hour or $/token pricing)
- **Cost-Based Sorting**: Sort results by cost to find the most economical option

**⚠️ IMPORTANT**: Default costs are **reference values for relative comparison only**. They do **NOT** represent actual pricing from any provider. Lower values indicate better value. Use custom costs that reflect your actual infrastructure pricing.
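The `NAME:VALUE` format accepted by `--custom-gpu-cost` is simple to parse; the sketch below is illustrative, not the CLI's actual implementation:

```python
def parse_gpu_cost(arg: str) -> tuple[str, float]:
    """Parse a NAME:VALUE pair such as 'H100:30.50' into (name, cost)."""
    name, _, value = arg.partition(":")
    if not name or not value:
        raise ValueError(f"expected NAME:VALUE, got {arg!r}")
    return name, float(value)

# Collect overrides from repeated --custom-gpu-cost flags
overrides = dict(parse_gpu_cost(a) for a in ["H100:30.50", "A100:22", "L40:25.00"])
print(overrides)
```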

## Library

For GPU recommender API usage see [./examples/gpu_recommender_example.py](./examples/gpu_recommender_example.py).
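The capacity planner's core arithmetic is easy to illustrate. The sketch below is not the module's actual API (the function name and signature are invented for illustration); it shows the kind of estimate involved: per-GPU weight memory is parameter count times bytes per parameter, split across tensor-parallel ranks.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int, tp: int = 1) -> float:
    """Per-GPU weight memory: parameters x precision, divided across TP ranks."""
    return num_params * bytes_per_param / 2**30 / tp

# Llama-3.1-8B (~8.03B parameters) in BF16 (2 bytes/param) on a single GPU:
print(round(weight_memory_gib(8.03e9, 2), 2))        # ~14.96 GiB
# The same model split across two tensor-parallel ranks:
print(round(weight_memory_gib(8.03e9, 2, tp=2), 2))  # ~7.48 GiB per GPU
```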
Empty file added src/config_explorer/__init__.py
Empty file.
50 changes: 50 additions & 0 deletions src/config_explorer/db.json
@@ -0,0 +1,50 @@
{
"AMD_INSTINCT_MI300X": {
"memory": 192,
"prefix": "MI300X"
},
"NVIDIA-H100-80GB-HBM3": {
"memory": 80,
"prefix": "H100"
},
"NVIDIA-A100-40GB": {
"memory": 40,
"prefix": "A100"
},
"NVIDIA-A100-80GB": {
"memory": 80,
"prefix": "A100"
},
"NVIDIA-H100-80GB": {
"memory": 80,
"prefix": "H100"
},
"NVIDIA-L40-40GB": {
"memory": 40,
"prefix": "L40"
},
"NVIDIA-RTX-4090": {
"memory": 24,
"prefix": "RTX4090"
},
"NVIDIA-RTX-5090": {
"memory": 32,
"prefix": "RTX5090"
},
"NVIDIA-RTX-6000": {
"memory": 48,
"prefix": "RTX6000"
},
"NVIDIA-A6000": {
"memory": 48,
"prefix": "A6000"
},
"NVIDIA-A4000": {
"memory": 16,
"prefix": "A4000"
},
"NVIDIA-T4": {
"memory": 16,
"prefix": "T4"
}
}
11 changes: 11 additions & 0 deletions src/config_explorer/db.py
@@ -0,0 +1,11 @@
"""
Mocks DB storing info about common accelerators used for LLM serving and inference
"""
import json
import os

gpu_specs = {}

_dir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(_dir, "db.json")) as f:
gpu_specs = json.load(f)
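For a flavor of how this mapping might be consumed, here is a hedged sketch (the helper is illustrative, not part of the module; the two inlined entries mirror `db.json` so the snippet stands alone, whereas real callers would import `gpu_specs` from `config_explorer.db`):

```python
# Two entries copied from db.json so the sketch is self-contained
gpu_specs = {
    "NVIDIA-H100-80GB-HBM3": {"memory": 80, "prefix": "H100"},
    "NVIDIA-T4": {"memory": 16, "prefix": "T4"},
}

def gpus_with_at_least(min_gib: int) -> list[str]:
    """Return accelerator names with at least `min_gib` GiB of memory."""
    return [name for name, spec in gpu_specs.items() if spec["memory"] >= min_gib]

print(gpus_with_at_least(40))  # only the H100 entry qualifies
```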
179 changes: 179 additions & 0 deletions src/config_explorer/empirical-vllm-memory-results.md
@@ -0,0 +1,179 @@
# vLLM Empirical Memory Profiling Results

Test environment: H100 GPU (79.18 GiB), vLLM with FlashAttention, `VLLM_LOGGING_LEVEL=DEBUG`.

All tests use `--enable-prefix-caching --block-size=128`. Default `--gpu-memory-utilization=0.9` unless noted.

## Summary

| Model | Weights | Activation | Non-torch | CUDA Graph | KV Cache | TP | Util | max-model-len |
| ----- | ------- | ---------- | --------- | ---------- | -------- | -- | ---- | ------------- |
| gpt-oss-20b (MoE) | 13.47 | 7.38 | 0.13 | 0.39 | 50.28 | 1 | 0.9 | 16000 |
| gpt-oss-120b (MoE) | 64.38 | 7.38 | 0.13 | 1.03 | 3.33 | 1 | 0.9 | 16000 |
| Llama-3.3-70B-FP8 | 33.88 | 4.84 | 0.55 | -0.42 | 32.00 | 2 | 0.9 | 16000 |
| Llama-3.1-8B | 14.99 | 4.76 | 0.13 | -0.45 | 51.38 | 1 | 0.9 | 16000 |
| Qwen3-0.6B | 1.12 | 5.56 | 0.13 | 0.10 | 64.45 | 1 | 0.9 | 16000 |
| Qwen3-32B | 61.03 | 5.64 | 0.14 | -0.88 | 4.45 | 1 | 0.9 | 16000 |
| Qwen3-32B | 30.59 | 5.64 | 0.54 | -0.33 | 34.49 | 2 | 0.9 | 16000 |
| Mistral-Small-3.2-24B | 44.76 | 2.12 | 0.14 | -0.76 | 28.20 | 1 | 0.95 | 16000 |

All values in GiB. "Activation" = torch peak memory increase. "CUDA Graph" = memory change during graph capture (negative = freed).
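The KV cache column follows from the others: vLLM caps usable memory at total memory times `--gpu-memory-utilization`, and whatever the weights, activation, non-torch, and CUDA graph memory leave behind becomes KV cache. A quick check against two rows of the table, using the 79.18 GiB H100 from the test environment (small residuals are expected from rounding):

```python
def kv_cache_gib(total_gib, util, weights, activation, non_torch, cuda_graph):
    """KV cache budget: utilization-capped memory minus all other consumers."""
    return total_gib * util - weights - activation - non_torch - cuda_graph

# gpt-oss-20b row (measured: 50.28 GiB)
print(round(kv_cache_gib(79.18, 0.9, 13.47, 7.38, 0.13, 0.39), 2))  # ~49.89
# Qwen3-0.6B row (measured: 64.45 GiB)
print(round(kv_cache_gib(79.18, 0.9, 1.12, 5.56, 0.13, 0.10), 2))   # ~64.35
```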

### Failed Configurations

| Model | TP | Failure | Root Cause |
| ----- | -- | ------- | ---------- |
| Deepseek-R1 (FP8) | 1 | OOM during load | Weights exceeded single GPU; needs TP |
| Llama-3.3-70B-FP8 | 1 | No KV cache room | 67.72 GiB weights, -1.44 GiB remaining; use TP=2 |
| Qwen3-32B | 1 | No KV cache room | 61.03 GiB weights at max-model-len=32000; use TP=2 or reduce context |

## Key Patterns

**Activation memory is constant per model type** (independent of max-model-len and batch-size):
- Multimodal: ~2.1 GiB (vision encoder skips CUDA graph capture)
- Dense text-only: ~4.8-5.6 GiB
- MoE: ~7.4 GiB

**Non-torch memory** scales with TP: ~0.13 GiB (TP=1), ~0.55 GiB (TP=2).

**CUDA graph memory** ranges from -0.88 to +1.03 GiB. Negative values (memory freed) are common for large dense models.

**Activation is constant across context lengths**: Qwen3-0.6B at max-model-len=16000 and max-model-len=32000 both measured 5.56 GiB activation and 64.45 GiB KV cache.

## Per-Model Notes

### gpt-oss-20b / gpt-oss-120b (MoE)

- **Model:** openai/gpt-oss-20b, openai/gpt-oss-120b
- MoE models have the highest activation memory (~7.38 GiB) due to expert routing overhead
- gpt-oss-120b barely fits on a single H100 (64.38 GiB weights, only 3.33 GiB for KV cache)

### Llama-3.3-70B-FP8

- **Model:** RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
- Requires TP=2 (67.72 GiB weights at TP=1 leaves no room for KV cache)
- At TP=2: 33.88 GiB weights per GPU, 32.0 GiB KV cache available

### Llama-3.1-8B

- **Model:** meta-llama/Llama-3.1-8B-Instruct
- Small footprint (14.99 GiB), generous KV cache (51.38 GiB)

### Qwen3-0.6B / Qwen3-32B

- **Models:** Qwen/Qwen3-0.6B, Qwen/Qwen3-32B
- Qwen3-0.6B: smallest model tested, 64.45 GiB KV cache available
- Qwen3-32B at TP=1: only 4.45 GiB KV cache (tight); TP=2 gives 34.49 GiB

### Mistral-Small-3.2-24B

- **Model:** mistralai/Mistral-Small-3.2-24B-Instruct-2506
- **Architecture:** Mistral3ForConditionalGeneration (multimodal / vision-language)
- **vLLM:** v0.11.0 (V1 engine), `--gpu-memory-utilization=0.95`, `--tokenizer-mode=mistral --config-format=mistral --load-format=mistral`
- **Notable:** Lowest activation memory measured (2.12 GiB), likely because vision encoder does not participate in CUDA graph capture

**Model architecture:** GQA, 40 layers, 32 attention heads, 8 KV heads, head_dim=128, hidden_size=5120

**KV cache validation** -- per-token formula matches vLLM exactly:

```
Per-token KV = num_layers x 2 x head_dim x num_kv_heads x dtype_bytes
= 40 x 2 x 128 x 8 x 2 = 163,840 bytes (160 KB/token)

vLLM empirical: 28.20 GiB / 184,832 tokens = 163,840 bytes/token (exact match)
```
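The arithmetic above can be replayed directly; all numbers come from this section's logs and tables:

```python
# Per-token KV bytes: layers x 2 (K and V) x head_dim x kv_heads x fp16 bytes
per_token = 40 * 2 * 128 * 8 * 2
print(per_token)                      # 163,840 bytes = 160 KiB/token

# vLLM reported 184,832 total KV tokens for this model
kv_gib = 184_832 * per_token / 2**30
print(round(kv_gib, 2))               # ~28.20 GiB, matching the log

# Maximum concurrency at the configured context length
max_concurrency = 184_832 / 16_000
print(round(max_concurrency, 2))      # ~11.55x
```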

**Live request validation** (15,049 tokens, measured via Prometheus /metrics):

| Metric | Measured | Expected |
| ------ | -------- | -------- |
| KV cache usage | 8.18% | 8.17% (118 blocks / 1,444 total) |
| Blocks allocated | 118 | ceil(15,049 / 128) = 118 |
| Prompt throughput | ~1,481 tok/s | -- |
| Prefix cache hit rate | 30% | -- |
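The block accounting is likewise easy to verify:

```python
import math

# 15,049 prompt tokens at --block-size=128
blocks = math.ceil(15_049 / 128)
print(blocks)                    # 118 blocks allocated

# Usage against the 1,444 total KV cache blocks
usage = blocks / 1_444
print(round(usage * 100, 2))     # ~8.17%, matching the measured 8.18%
```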

**Capacity planner accuracy** (before/after adding validated activation profiles):

| Metric | Before | After | vLLM Actual |
| ------ | ------ | ----- | ----------- |
| Activation estimate | 5.5 GiB | 2.5 GiB | 2.12 GiB |
| Available KV cache | 24.82 GiB | 27.82 GiB | 28.20 GiB |
| Error | -3.38 GiB | **-0.38 GiB** | -- |
| Max concurrent @16K | 10.2x | **11.4x** | 11.55x |

## How to Replicate

### Setup

Requirements: Kubernetes cluster with H100 GPU nodes, HuggingFace token secret.

Deploy a vLLM pod with `VLLM_LOGGING_LEVEL=DEBUG`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-profiling
spec:
  restartPolicy: Never
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.11.0
    command: ["vllm", "serve"]
    args:
    - <model-name>                  # e.g. Qwen/Qwen3-32B
    - --tensor-parallel-size=<tp>   # 1 or 2
    - --gpu-memory-utilization=0.90
    - --max-model-len=16000
    - --block-size=128
    - --enable-prefix-caching
    - --host=0.0.0.0
    - --port=8000
    resources:
      requests:
        nvidia.com/gpu: "<tp>"      # must match tensor-parallel-size
      limits:
        nvidia.com/gpu: "<tp>"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef: { name: llm-d-hf-token, key: HF_TOKEN }
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: HF_HOME
      value: /tmp/cache
    volumeMounts:
    - { name: cache, mountPath: /tmp/cache }
  volumes:
  - { name: cache, emptyDir: {} }
```

Wait for "Application startup complete" in logs.

### Extract Metrics

Search the pod logs for these strings:

| Log substring | What it gives you |
| ------------- | ----------------- |
| `"Model loading took"` | Weight memory (GiB) and load time |
| `"torch peak memory increase"` | Activation memory (GiB) |
| `"non-torch forward increase memory"` | Non-torch memory (GiB) |
| `"Available KV cache memory"` | KV cache allocation (GiB) |
| `"Free memory on device"` | Total/free GPU memory at startup |
| `"GPU KV cache size"` | Total KV cache tokens and block count |
| `"Maximum concurrency for"` | Max concurrent requests at max-model-len |

### Validate KV Cache at Runtime

```bash
# Port-forward to the pod
kubectl port-forward pod/<name> -n <ns> 8000:8000 &

# Send a request and check metrics
curl -X POST localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"<long prompt>"}],"max_tokens":10}'

# Check KV cache usage
curl -s localhost:8000/metrics | grep kv_cache_usage_perc
```