Extensive validation of capacity planner memory and KV cache calculations against real vLLM
Background
The capacity planner computes GPU memory requirements and KV cache allocations using formulas and
constants in src/planner/capacity_planner.py. The existing constants were derived from a limited
a wider variety of model architectures using the latest vLLM release, and recalibrates any constants that have drifted.
The results should be published as a detailed technical report consumable by the open-source
community, documenting methodology, raw measurements, and any constant changes made.
What to Validate
For each (model, GPU, TP) combination, compare planner predictions against vLLM actuals for the
following five quantities:
1. Model weight memory (model_memory_req, line 553)
Planner reads weight memory from safetensors tensor metadata and sums num_params × dtype_bytes
per dtype. Compare against vLLM's reported torch allocated at startup, before KV cache
allocation begins (captured from the vLLM startup log line "weights loaded").
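For reference, a minimal sketch of that per-dtype summation, reading only the safetensors shard headers. The function name and dtype table below are illustrative, not the planner's actual code:

```python
import json
import struct
from pathlib import Path

# Bytes per element for common safetensors dtype tags (assumed subset; extend as needed).
DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "F8_E4M3": 1, "I8": 1, "U8": 1, "I32": 4, "I64": 8}

def weight_bytes_by_dtype(model_dir: str) -> dict[str, int]:
    """Sum tensor sizes per dtype from safetensors headers, without loading any weights."""
    totals: dict[str, int] = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]  # first 8 bytes: header length
            header = json.loads(f.read(header_len))
        for name, meta in header.items():
            if name == "__metadata__":
                continue
            n_elems = 1
            for dim in meta["shape"]:
                n_elems *= dim
            totals[meta["dtype"]] = totals.get(meta["dtype"], 0) + n_elems * DTYPE_BYTES[meta["dtype"]]
    return totals

# Total weight GiB to compare against the "weights loaded" torch-allocated figure:
# sum(weight_bytes_by_dtype("/path/to/model").values()) / 1024**3
```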
2. Activation memory (estimate_vllm_activation_memory, line 380)
Planner uses hardcoded constants per architecture class:
| Architecture class | Constant (GiB) | Status |
|---|---|---|
| `LlamaForCausalLM` | 4.8 | Validated profile |
| `Qwen2ForCausalLM` | 5.6 | Validated profile |
| `Qwen3ForCausalLM` | 5.6 | Validated profile (assumed same as Qwen2) |
| `GemmaForCausalLM` | 5.5 | Dense fallback |
| `PhiForCausalLM` | 5.5 | Dense fallback |
| `GraniteForCausalLM` | 5.5 | Dense fallback |
| `QWenLMHeadModel` | 5.5 | Dense fallback |
| `Qwen3MoeForCausalLM` | 8.0 | MoE fallback |
| `MistralForCausalLM` / `MixtralForCausalLM` | 8.0 | MoE fallback |
| `PixtralForConditionalGeneration` | 2.5 | Validated multimodal profile |
vLLM performs a dummy forward pass during engine warmup and logs the peak activation memory
directly. Read activation_memory_used_gb from the vLLM startup log (look for the memory
profiling summary line emitted after the dummy run). No inference call or derived formula needed.
3. KV cache per-token memory (KVCacheDetail, line 173)
Standard attention (MHA/GQA/MQA):
per_token_bytes = num_hidden_layers × 2 × num_key_value_heads × head_dim × precision_bytes
Multi-head Latent Attention (MLA):
per_token_bytes = num_hidden_layers × (kv_lora_rank + qk_rope_head_dim) × precision_bytes
Validate purely from startup data — no inference call required. vLLM logs the allocated block
count at startup (the `# GPU blocks: N` line). The actual KV pool size is therefore:
actual_kv_gib = N × block_size × per_token_bytes / 1024³ (block_size = 16 tokens by default)
Compare against the planner's allocatable_kv_cache_memory() output directly.
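A small sketch of both per-token formulas and the block-count conversion, assuming standard Hugging Face config field names (the helper names are illustrative):

```python
def per_token_kv_bytes(cfg: dict, precision_bytes: int = 2, mla: bool = False) -> int:
    """Per-token KV cache bytes derived from an HF-style config dict."""
    layers = cfg["num_hidden_layers"]
    if mla:
        # MLA stores a compressed KV latent plus the rotary key component per layer.
        return layers * (cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]) * precision_bytes
    kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    head_dim = cfg.get("head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
    return layers * 2 * kv_heads * head_dim * precision_bytes

def actual_kv_gib(num_gpu_blocks: int, per_token_bytes: int, block_size: int = 16) -> float:
    """KV pool size implied by vLLM's logged GPU block count."""
    return num_gpu_blocks * block_size * per_token_bytes / 1024**3
```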
4. Non-torch memory overhead (estimate_vllm_non_torch_memory, line 29)
Constants: 0.15 GiB for TP=1, 0.6 GiB for TP≥2. vLLM's startup memory profiling summary also
logs non_torch_memory_used_gb — read it directly from the same log line as activation memory.
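The constant under test here is effectively a two-value lookup; a sketch of the current behavior described above (not the planner source):

```python
def estimate_vllm_non_torch_memory_gib(tp_size: int) -> float:
    """Current planner assumption: flat per-GPU overhead, switching once at TP >= 2."""
    return 0.15 if tp_size == 1 else 0.6
```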
5. Total allocatable KV cache (allocatable_kv_cache_memory, line 855)
The master formula the planner uses:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
- (model_weights_per_gpu × DP)
- (activation_memory × DP)
- non_torch_overhead
DP is data parallelism. All runs in this campaign use standard single-process vLLM with TP-only
parallelism, so DP=1 throughout and the formula reduces to:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs) - model_weights_per_gpu - activation_memory - non_torch_overhead
Compute end-to-end error against actual KV memory derived from block count:
actual_kv_gib = (vllm_kv_blocks × 16 × per_token_bytes_from_model_config) / 1024³
error% = (planner_allocatable - actual_kv_gib) / actual_kv_gib × 100
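A sketch of the end-to-end comparison matching the master formula above; the names are illustrative and the planner's real entry point is `allocatable_kv_cache_memory()`:

```python
def planner_allocatable_kv_gib(gpu_memory_gib: float, gpu_util: float, num_gpus: int,
                               weights_per_gpu_gib: float, activation_gib: float,
                               non_torch_gib: float, dp: int = 1) -> float:
    """Planner master formula; with dp=1 it reduces to the TP-only form used in this campaign."""
    return (gpu_memory_gib * gpu_util * num_gpus
            - weights_per_gpu_gib * dp
            - activation_gib * dp
            - non_torch_gib)

def kv_error_pct(planner_allocatable_gib: float, actual_kv_gib: float) -> float:
    """Signed end-to-end error of the planner prediction against the measured KV pool."""
    return (planner_allocatable_gib - actual_kv_gib) / actual_kv_gib * 100
```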
Model Matrix
Models are grouped by family. Each family section notes whether it has an existing validated
profile or falls through to a generic constant, and what architectural question the family answers.
Meta / Llama family
Dense and MoE variants from the same base architecture, spanning small to large and covering the
validated LlamaForCausalLM profile plus the Llama-based MoE class.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `LlamaForCausalLM` | Dense | Has validated profile (4.8 GiB) — primary baseline |
| `meta-llama/Llama-3.3-70B-Instruct` | `LlamaForCausalLM` | Dense | Tests validated constant at 70B scale |
| `meta-llama/Llama-4-Scout-17B-16E-Instruct` | `LlamaForCausalLM` (MoE variant) | MoE | Llama-based MoE — architecturally distinct from Mixtral; uses MoE fallback |
| `redhatai/Llama-3.3-70B-Instruct-quantized.w8a8` | `LlamaForCausalLM` | Dense quantized | Tests quantized weight memory path |
Qwen family (full generational sweep)
Each Qwen generation introduces a new architecture class. Testing all generations reveals whether
constants need per-generation tuning or whether a shared constant suffices.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `Qwen/Qwen-7B-Chat` | `QWenLMHeadModel` | Dense | Qwen1; no validated profile, uses dense fallback |
| `Qwen/Qwen2.5-7B-Instruct` | `Qwen2ForCausalLM` | Dense | Has validated profile (5.6 GiB) |
| `Qwen/Qwen2.5-72B-Instruct` | `Qwen2ForCausalLM` | Dense | Tests validated constant at 72B scale |
| `Qwen/Qwen3-8B` | `Qwen3ForCausalLM` | Dense | Validated profile assumed equal to Qwen2 — confirm |
| `Qwen/Qwen3-30B-A3B` | `Qwen3MoeForCausalLM` | MoE | 128 experts, 8 active; uses MoE fallback constant |
| Qwen-next (latest instruct at time of benchmarking) | TBD | TBD | Add validated profile if a new architecture class is introduced |
Mistral family
Sparse MoE with a different expert structure and routing than Llama MoE or Qwen3 MoE; the
canonical reference model for the MoE fallback constant.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `mistralai/Mixtral-8x7B-Instruct-v0.1` | `MixtralForCausalLM` | MoE | 8 experts, 2 active; primary MoE fallback baseline |
| `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | `MistralForCausalLM` | Dense | Tests dense fallback for Mistral architecture at 24B |
Microsoft / Phi family
Small dense models using the dense fallback constant. Worth validating because Phi models are
memory-efficient, so the activation overhead ratio relative to model size is higher than for larger models.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `microsoft/phi-4` | `Phi3ForCausalLM` | Dense | Uses dense fallback (5.5 GiB) |
Google / Gemma family
Dense GQA models using the dense fallback constant. Testing small and large variants confirms whether
the constant holds across a 7× parameter range within the same family.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `google/gemma-3-4b-it` | `Gemma3ForCausalLM` | Dense | Dense fallback, small |
| `google/gemma-3-27b-it` | `Gemma3ForCausalLM` | Dense | Dense fallback, large — tests constant under memory pressure |
IBM / Granite family
Granite spans multiple generations (3.1, 3.3, 4.x), a small/large size axis, and a multimodal
variant — all under GraniteForCausalLM for text models and LlavaNextForConditionalGeneration
for vision. This makes it well-suited for checking whether the dense fallback constant holds
across generations and scales, and whether the multimodal constant (currently only validated on
Pixtral) generalizes to a second architecture.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `ibm-granite/granite-3.1-2b-instruct` | `GraniteForCausalLM` | Dense | Smallest variant; activation overhead ratio is highest relative to model size here |
| `ibm-granite/granite-3.1-8b-instruct` | `GraniteForCausalLM` | Dense | Dense fallback baseline |
| `ibm-granite/granite-3.3-8b-instruct` | `GraniteForCausalLM` | Dense | Newer generation — confirms constant did not change across 3.1 → 3.3 |
| `ibm-granite/granite-4.0-8b-instruct` (or latest 4.x at benchmarking time) | TBD | Dense | If 4.x introduces a new architecture class, add a validated profile; otherwise confirms 3.x constant carries forward |
| `ibm-granite/granite-vision-3.3-2b` | `LlavaNextForConditionalGeneration` | Multimodal | Multimodal fallback (2.5 GiB); second architecture in this class after Pixtral — tests generalizability of the constant |
MLA coverage (DeepSeek)
The planner has a dedicated KV cache formula for Multi-head Latent Attention (MLA), but no MLA
model currently appears in the matrix — meaning the formula has never been validated end-to-end.
DeepSeek-V2-Lite (16B total, 2.4B active) is the only practical single-node MLA option; larger
DeepSeek models require multi-node and are out of scope.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `deepseek-ai/DeepSeek-V2-Lite-Chat` | `DeepseekV2ForCausalLM` | MoE + MLA | Only single-node MLA model available; validates the MLA KV formula; uses MoE fallback for activation |
OpenAI OSS family
Novel MoE architecture (gpt_oss) not yet in VALIDATED_ACTIVATION_PROFILES; falls through to
the MoE fallback. Both models are already in the benchmark database. Testing at two scales
(20B and 120B total params) shows whether the fallback holds across sizes within this family and
may motivate a dedicated validated profile.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `openai/gpt-oss-20b` | `GptOssForCausalLM` | MoE | 32 experts, 4 active; 21B total / 3.6B active |
| `openai/gpt-oss-120b` | `GptOssForCausalLM` | MoE | 117B total / 5.1B active |
Tensor Parallelism Matrix
For every model, iterate over all TP values where the model fits in GPU memory. The non-torch
overhead constant changes at TP≥2 — this transition point is explicitly worth verifying.
| Model size | TP values to test |
|---|---|
| ≤8B dense | TP=1, TP=2, TP=4 |
| 14B–32B dense | TP=1, TP=2, TP=4 |
| 70B–72B dense | TP=2, TP=4, TP=8 |
| MoE (30B–70B active) | TP=1, TP=2, TP=4 |
| Quantized 70B | TP=1, TP=2, TP=4 |
For each TP step, record the non-torch overhead separately to confirm whether the TP=1 / TP≥2
binary constant is sufficient or whether a per-TP-value table is more accurate.
vLLM Argument Sensitivity
This section isolates the effect of individual vLLM launch arguments on each memory component.
All runs use meta-llama/Llama-3.1-8B-Instruct (primary) and Qwen/Qwen2.5-7B-Instruct
(secondary) on H100 80GB, holding all other arguments at the standard values from the
vLLM Configuration section and varying exactly one argument at a time.
--max-model-len
| Sweep values | Models |
|---|---|
| 2048, 4096, 8192, 16384, 32768 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |
KV block count scales linearly with max_model_len; weight memory and activation
memory are unaffected. Record any deviation — non-linear KV scaling or changes to activation memory
would indicate vLLM internal behavior (e.g., CUDA graph capture bucketing at different context
sizes) that the planner must account for.
--pipeline-parallel-size (PP)
| Sweep values | Models | Hardware requirement |
|---|---|---|
| PP=1, 2, 4 | Llama-3.1-8B-Instruct (TP=1) | 1, 2, 4× H100 80GB |
With PP=N, each GPU hosts approximately 1/N of the model layers, so weight memory
per GPU ≈ total / N and activation memory per GPU ≈ baseline / N. Non-torch overhead is expected
to increase per GPU due to inter-stage communication buffers. The current planner formula does not
model PP — quantify the per-GPU error across PP values and determine whether a PP correction factor
is needed.
Measure all five quantities (weight, activation, non-torch, KV blocks, total allocatable KV) per
GPU at each PP value and record the per-stage breakdown where the vLLM log exposes it.
--data-parallel-size (DP)
| Sweep values | Models | Hardware requirement |
|---|---|---|
| DP=1, 2 | Llama-3.1-8B-Instruct (TP=1) | 1, 2× H100 80GB |
Each DP replica is an independent process owning a full model copy, so per-GPU
memory usage for a DP=2 instance should be identical to a DP=1 instance. The planner formula
multiplies weight and activation memory by DP:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
- (model_weights_per_gpu × DP)
- (activation_memory × DP)
- non_torch_overhead
If per-GPU measurements are unchanged across DP values, the DP multiplier in the formula models
total cluster memory consumed (across all replicas), not per-GPU allocation — document which
interpretation is correct and whether the formula needs clarification or correction.
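To make the two readings concrete, a small worked example under the cluster-total interpretation, using illustrative (not measured) numbers for an 8B-class model on 2× H100 80GB with DP=2, TP=1:

```python
# Illustrative numbers only, not measurements.
gpu_mem_gib, gpu_util, num_gpus, dp = 80.0, 0.95, 2, 2
weights_per_gpu_gib, activation_gib, non_torch_gib = 15.0, 4.8, 0.15

allocatable_kv = (gpu_mem_gib * gpu_util * num_gpus
                  - weights_per_gpu_gib * dp
                  - activation_gib * dp
                  - non_torch_gib)
# 152.0 - 30.0 - 9.6 - 0.15 = 112.25 GiB total across both replicas,
# i.e. roughly the same KV budget per replica as a standalone DP=1 instance
# (76.0 - 15.0 - 4.8 - 0.15 = 56.05 GiB); no single GPU is charged twice for weights.
```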
Interaction summary
The three arguments below (`--dtype`, `--quantization`, `--kv-cache-dtype`) control distinct memory regions but are coupled in two ways that matter
for experimental design:

- `kv_cache_dtype=auto` inherits from `dtype`: when `--kv-cache-dtype auto` (the default), vLLM sets the KV cache precision equal to the model's activation dtype. Sweeping `--dtype` without pinning `--kv-cache-dtype` therefore changes both weight/activation memory and KV cache precision simultaneously — confounding the measurement. To isolate each effect, pin the other argument explicitly.
- Quantized weights de-quantize to `dtype` for compute: weights stored in INT4 (AWQ) or FP8 are expanded back to the `dtype` precision (e.g., BF16) during the forward pass. Activation memory therefore depends on `dtype`, not on quantization format. A change in activation memory observed between quantized and unquantized runs indicates a problem with the de-quant path, not an expected quantization effect.
Sweep design consequence: all three sweeps below pin the arguments they are not varying to
float16 to avoid cross-contamination.
--dtype
Controls the precision used for model weights and activations during computation. auto matches
the dtype the model was originally saved in (typically BF16 for modern models).
Hold fixed: --kv-cache-dtype float16 (pin explicitly to prevent KV cache precision from
also changing when dtype changes — see interaction note above).
| Sweep values | Models |
|---|---|
| float16 (baseline), bfloat16 | Llama-3.1-8B-Instruct (TP=1) |
BF16 ≈ FP16 in weight memory (both are 2 bytes/parameter — same bit-width,
different exponent range). Activation memory should also be equal. The planner's weight memory
formula reads dtype from safetensors metadata; verify it correctly handles a model loaded in a
dtype different from the one it was saved in. KV cache block count must be identical across
dtype values because --kv-cache-dtype float16 is pinned.
--quantization
Specifies the compression scheme applied to model weights at load time. Quantized weights are
stored in low precision (INT4/INT8/FP8) but de-quantized back to the dtype precision during
actual math operations — so activation memory depends on dtype, not on quantization format.
Hold fixed: --dtype float16, --kv-cache-dtype float16.
| Sweep values | Models | Notes |
|---|---|---|
| none (baseline) | `meta-llama/Llama-3.1-8B-Instruct` (TP=1) | FP16 weights |
| fp8 | `redhatai/Llama-3.1-8B-Instruct-quantized.w8a8` (TP=1) | FP8 weights; ~2× compression expected |
| awq | `redhatai/Llama-3.1-8B-Instruct-quantized.w4a16` (TP=1) | INT4 weights; ~4× compression expected |
Weight memory scales with quantization bit-width — AWQ INT4 ≈ 25% of FP16,
FP8 ≈ 50%. Activation memory should be equal across all three because de-quantization
restores the dtype=float16 precision for compute; any divergence in activation memory is a
finding to investigate. KV cache block count should be identical (kv_cache_dtype pinned).
The planner reads raw safetensors tensor dtypes; verify it sums quantized tensor sizes correctly
rather than treating all parameters as the model's nominal dtype.
Note: the primary matrix already includes redhatai/Llama-3.3-70B-Instruct-quantized.w8a8
for large-scale quantization coverage. This sweep isolates quantization format effects on a
single controlled model.
--kv-cache-dtype
Controls the precision used for the KV cache independently of weight and activation dtype.
auto inherits from the model's dtype — so with the defaults, KV cache is FP16 for FP16
models and BF16 for BF16 models.
Hold fixed: --dtype float16 (ensures auto baseline resolves to FP16, making the
comparison clean: auto=FP16 vs explicit fp8).
| Sweep values | Models |
|---|---|
| auto (=FP16 with dtype=float16), fp8 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |
Switching from FP16 to FP8 halves the per-token KV cache byte cost, so GPU KV
block count should approximately double at the same memory budget:
per_token_bytes (FP16) = num_hidden_layers × 2 × num_kv_heads × head_dim × 2
per_token_bytes (FP8) = num_hidden_layers × 2 × num_kv_heads × head_dim × 1
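As a worked example (assuming Llama-3.1-8B's published config values of 32 layers, 8 KV heads, and head_dim 128), per_token_bytes is 32 × 2 × 8 × 128 × 2 = 131,072 bytes at FP16 versus 65,536 bytes at FP8, so roughly twice as many 16-token blocks should fit in the same KV budget.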
Weight memory and activation memory must be identical across both runs (dtype is pinned).
The planner's KVCacheDetail formula currently derives precision_bytes from the model's
weight dtype — verify it accepts kv_cache_dtype as an explicit override. This is the most
operationally important test in this section: FP8 KV cache is a common production optimization
and a planner that ignores kv_cache_dtype will systematically underestimate available KV
memory for any deployment using it.
Hardware
- Primary: H100 80GB SXM (or PCIe)
- Secondary: A100 80GB, if available
- gpu_memory_utilization = 0.95 for all runs
vLLM Configuration
Use a consistent, minimal vLLM configuration to isolate baseline memory behavior:
```bash
vllm serve <model> \
  --tensor-parallel-size <TP> \
  --gpu-memory-utilization 0.95 \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --max-model-len 8192   # fixed context window to control KV pool size
```
Record the exact vLLM version tag and commit SHA in all output files. Re-run if a new minor
version is released before the benchmarking campaign completes.
Measurement Procedure
For each (model, GPU, vllm_args) run:
- Launch vLLM and capture the full startup log to a file
- Confirm clean startup — abort and flag the run if the log contains any of: `offloading`, `cudaMalloc failed`, `insufficient memory`, `retrying allocation`
- Extract from the startup log:
  - Weight memory: `torch allocated` reported after weights load
  - Activation memory: `activation_memory_used_gb` from the memory profiling summary (logged after the dummy warmup forward pass)
  - Non-torch overhead: `non_torch_memory_used_gb` from the same summary line
  - KV block count: `# GPU blocks: N`
- Compute actual KV pool size: `actual_kv_gib = N × 16 × per_token_bytes_from_model_config / 1024³`
- Run the planner's `allocatable_kv_cache_memory()` with matching parameters
- Record all values plus `log_path` and timestamp in a structured JSON output file per run
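A sketch of the extraction step, using the log strings quoted above. The exact wording of vLLM's startup lines varies between versions, so these patterns are assumptions to be checked against the pinned release:

```python
import re

# Patterns are assumptions based on the field names above; verify against the pinned vLLM version.
PATTERNS = {
    "activation_memory_gib": re.compile(r"activation_memory_used_gb[=:]\s*([0-9.]+)"),
    "non_torch_memory_gib": re.compile(r"non_torch_memory_used_gb[=:]\s*([0-9.]+)"),
    "kv_cache_blocks": re.compile(r"#\s*GPU blocks:\s*(\d+)"),
}
ABORT_MARKERS = ("offloading", "cudaMalloc failed", "insufficient memory", "retrying allocation")

def parse_startup_log(log_path: str) -> dict:
    """Extract measured values from a captured vLLM startup log, flagging unclean runs."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    if any(marker in text for marker in ABORT_MARKERS):
        raise RuntimeError(f"unclean startup, flag and discard run: {log_path}")
    measured = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        measured[key] = float(match.group(1)) if match else None  # None -> inspect the log manually
    return measured
```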
PP runs: vLLM assigns contiguous layer ranges to each pipeline stage; each stage emits its own
memory profiling summary. Collect per-stage logs and record weight/activation/non-torch for each
stage separately before summing to cross-check against the single-node total.
DP runs: each DP replica is an independent process — collect one log per replica. All replicas
should produce identical measurements; flag any divergence.
Reporting
Per-run output (JSON)
```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "gpu": "H100-80GB",
  "vllm_args": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 1,
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.95,
    "dtype": "auto",
    "quantization": null,
    "kv_cache_dtype": "auto"
  },
  "vllm_version": "0.x.y",
  "vllm_commit": "abc123",
  "timestamp": "2025-04-17T10:23:00Z",
  "duration_seconds": 142,
  "log_path": "logs/qwen2.5-7b-h100-tp1.log",
  "measured": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.3,
    "non_torch_memory_gib": 0.14,
    "kv_cache_blocks": 12480,
    "kv_cache_gib": 58.1
  },
  "planner_predicted": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.6,
    "non_torch_memory_gib": 0.15,
    "kv_cache_gib": 57.6
  },
  "error_pct": {
    "weight_memory": 0.0,
    "activation_memory": 5.7,
    "non_torch_memory": 7.1,
    "kv_cache": -0.9
  }
}
```
Summary report (Markdown)
Publish as docs/benchmarks/memory-validation-report.md containing:
- Per-component error table: rows = models, columns = weight / activation / non-torch / KV cache allocatable, cells = error%
- Per-architecture error table: grouped by architecture class, showing mean and max absolute error per component
- TP sensitivity table: for each model, show how activation and non-torch overhead vary with TP
- Argument sensitivity tables (one per argument from the vLLM Argument Sensitivity section):
  - max_model_len: KV blocks vs. context length — confirm linearity or document deviation
  - PP: per-GPU weight / activation / non-torch / KV at each PP value; note planner formula gap
  - DP: per-replica measurements across DP values; document whether DP multiplier is per-GPU or cluster-total
  - dtype: weight and activation memory vs. precision; verify planner dtype path
  - quantization: weight memory compression ratio vs. format (AWQ, FP8); verify safetensors metadata path
  - kv_cache_dtype: KV block count vs. cache precision (FP16 vs FP8); confirm ~2× block count for FP8
- Outliers: any model where any component exceeds ±10% error — root cause analysis required
- Calibration decisions: document each constant changed, old value → new value, evidence
Deliverables
- Validation scripts in `scripts/validate_memory/`
- Results for all `(model, GPU, TP)` runs and argument sensitivity sweeps in `data/benchmarks/memory/`
- Summary report at `docs/benchmarks/memory-validation-report.md`
- Updated constants in `capacity_planner.py` and `VALIDATED_ACTIVATION_PROFILES` where any component exceeds ±10% error
- `@pytest.mark.unit` tests asserting the KV cache block count formula for at least one dense, one MoE, and one MLA model config (using mocked HF config, no GPU required)
Out of Scope
- Latency or throughput accuracy
- Multi-node / multi-host tensor parallelism
- Prefix caching effects
- Speculative decoding
- Roofline estimation accuracy
Related
src/planner/capacity_planner.py — all memory and KV cache formulas
src/planner/recommendation/estimator.py:209 — uses check_model_fits_gpu for memory feasibility
src/planner/recommendation/config_finder.py — calls estimator for configs without benchmarks
References
- vllm_config_estimator — GPU memory and
KV cache estimation tool for vLLM deployments; useful comparison point for formula validation
- GPU Calculator — interactive GPU memory calculator for
LLM inference; reference for cross-checking capacity planning outputs