
Extensive validation of capacity planner memory and KV cache calculations against real vLLM #194

@jgchn

Description

Background

The capacity planner computes GPU memory requirements and KV cache allocations using formulas and
constants in src/planner/capacity_planner.py. The existing constants were derived from a limited
set of models and hardware configurations. This issue tracks an extensive validation campaign across
a wider variety of model architectures using the latest vLLM release, recalibrating any constants that have drifted.

The results should be published as a detailed technical report consumable by the open-source
community, documenting methodology, raw measurements, and any constant changes made.

What to Validate

For each (model, GPU, TP) combination, compare planner predictions against vLLM actuals for the
following five quantities:

1. Model weight memory (model_memory_req, line 553)

Planner reads weight memory from safetensors tensor metadata and sums num_params × dtype_bytes
per dtype. Compare against vLLM's reported torch allocated at startup, before KV cache
allocation begins (captured from the vLLM startup log line "weights loaded").
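As a concrete cross-check, the per-dtype summation can be reproduced straight from a safetensors header without loading any weights. This is a minimal sketch (the planner's actual reader in capacity_planner.py may differ); it relies only on the documented safetensors layout of an 8-byte little-endian header length followed by a JSON tensor index:

```python
import json
import struct
from math import prod

# Bytes per element for common safetensors dtype strings (subset; assumes the
# planner keeps a similar lookup table).
DTYPE_BYTES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2,
               "I64": 8, "I32": 4, "I8": 1, "U8": 1}

def weight_bytes_from_header(raw: bytes) -> int:
    """Sum num_params x dtype_bytes per tensor from a safetensors header blob."""
    (header_len,) = struct.unpack("<Q", raw[:8])  # 8-byte LE header length
    header = json.loads(raw[8 : 8 + header_len])
    total = 0
    for name, meta in header.items():
        if name == "__metadata__":  # optional file-level metadata, not a tensor
            continue
        total += prod(meta["shape"]) * DTYPE_BYTES[meta["dtype"]]
    return total
```

For sharded checkpoints, the same sum runs over every `*.safetensors` file; only the first `8 + header_len` bytes of each shard need to be read.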

2. Activation memory (estimate_vllm_activation_memory, line 380)

Planner uses hardcoded constants per architecture class:

| Architecture class | Constant (GiB) | Status |
| --- | --- | --- |
| LlamaForCausalLM | 4.8 | Validated profile |
| Qwen2ForCausalLM | 5.6 | Validated profile |
| Qwen3ForCausalLM | 5.6 | Validated profile (assumed same as Qwen2) |
| GemmaForCausalLM | 5.5 | Dense fallback |
| PhiForCausalLM | 5.5 | Dense fallback |
| GraniteForCausalLM | 5.5 | Dense fallback |
| QWenLMHeadModel | 5.5 | Dense fallback |
| Qwen3MoeForCausalLM | 8.0 | MoE fallback |
| MistralForCausalLM / MixtralForCausalLM | 8.0 | MoE fallback |
| PixtralForConditionalGeneration | 2.5 | Validated multimodal profile |

vLLM performs a dummy forward pass during engine warmup and logs the peak activation memory
directly. Read activation_memory_used_gb from the vLLM startup log (look for the memory
profiling summary line emitted after the dummy run). No inference call or derived formula needed.
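The lookup the table above describes can be sketched as follows; the names and the `is_moe` flag are assumptions for illustration, not the planner's actual code:

```python
# Validated per-architecture activation constants from the table above (GiB).
VALIDATED_ACTIVATION_PROFILES = {
    "LlamaForCausalLM": 4.8,
    "Qwen2ForCausalLM": 5.6,
    "Qwen3ForCausalLM": 5.6,
    "PixtralForConditionalGeneration": 2.5,
}
DENSE_FALLBACK_GIB = 5.5
MOE_FALLBACK_GIB = 8.0

def activation_constant_gib(arch_class: str, is_moe: bool) -> float:
    """Resolve the activation constant: validated profile first, else fallback."""
    if arch_class in VALIDATED_ACTIVATION_PROFILES:
        return VALIDATED_ACTIVATION_PROFILES[arch_class]
    return MOE_FALLBACK_GIB if is_moe else DENSE_FALLBACK_GIB
```

Any architecture class not in the validated dict (Gemma, Phi, Granite, Qwen1, the MoE families) falls through to the generic constant, which is exactly what this campaign stress-tests.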

3. KV cache per-token memory (KVCacheDetail, line 173)

Standard attention (MHA/GQA/MQA):

per_token_bytes = num_hidden_layers × 2 × num_key_value_heads × head_dim × precision_bytes

Multi-head Latent Attention (MLA):

per_token_bytes = num_hidden_layers × (kv_lora_rank + qk_rope_head_dim) × precision_bytes

Validate purely from startup data — no inference call required. vLLM logs the allocated block
count at startup:

# GPU blocks: <N>

The actual KV pool size is therefore:

actual_kv_gib = N × block_size × per_token_bytes / 1024³    (block_size = 16 tokens by default)

Compare against the planner's allocatable_kv_cache_memory() output directly.
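The two per-token formulas and the block-count conversion above can be sketched as (function names are illustrative, not the planner's actual API):

```python
def per_token_kv_bytes(num_hidden_layers: int, precision_bytes: int,
                       num_key_value_heads: int = 0, head_dim: int = 0,
                       kv_lora_rank: int = 0, qk_rope_head_dim: int = 0,
                       mla: bool = False) -> int:
    """KV cache bytes per token: standard attention (MHA/GQA/MQA) or MLA."""
    if mla:
        # MLA stores one compressed latent per layer, not per-head K and V.
        return num_hidden_layers * (kv_lora_rank + qk_rope_head_dim) * precision_bytes
    return num_hidden_layers * 2 * num_key_value_heads * head_dim * precision_bytes

def actual_kv_gib(num_gpu_blocks: int, per_token_bytes: int,
                  block_size: int = 16) -> float:
    """KV pool size implied by vLLM's logged '# GPU blocks' count."""
    return num_gpu_blocks * block_size * per_token_bytes / 1024**3
```

For a Llama-3.1-8B-like config (32 layers, 8 KV heads, head_dim 128, FP16) this gives 131072 bytes/token, so 30000 logged GPU blocks implies a ~58.6 GiB pool.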

4. Non-torch memory overhead (estimate_vllm_non_torch_memory, line 29)

Constants: 0.15 GiB for TP=1, 0.6 GiB for TP≥2. vLLM's startup memory profiling summary also
logs non_torch_memory_used_gb — read it directly from the same log line as activation memory.

5. Total allocatable KV cache (allocatable_kv_cache_memory, line 855)

The master formula the planner uses:

allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
               - (model_weights_per_gpu × DP)
               - (activation_memory × DP)
               - non_torch_overhead

DP is data parallelism. All runs in this campaign use standard single-process vLLM with TP-only
parallelism, so DP=1 throughout and the formula reduces to:

allocatable_kv = (gpu_memory × gpu_util × num_GPUs) - model_weights_per_gpu - activation_memory - non_torch_overhead

Compute end-to-end error against actual KV memory derived from block count:

actual_kv_gib = (vllm_kv_blocks × 16 × per_token_bytes_from_model_config) / 1024³
error% = (planner_allocatable - actual_kv_gib) / actual_kv_gib × 100
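Pulling the pieces together, the DP=1 master formula and the error metric reduce to a few lines; constants match the planner values quoted above, while the function names are illustrative:

```python
def estimate_non_torch_gib(tp: int) -> float:
    """Planner's binary non-torch overhead constant: 0.15 GiB at TP=1, 0.6 at TP>=2."""
    return 0.15 if tp == 1 else 0.6

def planner_allocatable_kv_gib(gpu_memory_gib: float, gpu_util: float,
                               num_gpus: int, weights_per_gpu_gib: float,
                               activation_gib: float, tp: int, dp: int = 1) -> float:
    """Master formula; with dp=1 this is the reduced form used in this campaign."""
    return (gpu_memory_gib * gpu_util * num_gpus
            - weights_per_gpu_gib * dp
            - activation_gib * dp
            - estimate_non_torch_gib(tp))

def kv_error_pct(planner_gib: float, actual_gib: float) -> float:
    """Signed end-to-end error of the planner prediction vs. measured KV pool."""
    return (planner_gib - actual_gib) / actual_gib * 100
```

For example, one H100 80GB at 0.95 utilization, 15.0 GiB of weights, and the 4.8 GiB Llama activation constant leaves roughly 56.05 GiB predicted allocatable KV.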

Model Matrix

Models are grouped by family. Each family section notes whether it has an existing validated
profile or falls through to a generic constant, and what architectural question the family answers.

Meta / Llama family

Dense and MoE variants from the same base architecture, spanning small to large and covering the
validated LlamaForCausalLM profile plus the Llama-based MoE class.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | LlamaForCausalLM | Dense | Has validated profile (4.8 GiB) — primary baseline |
| meta-llama/Llama-3.3-70B-Instruct | LlamaForCausalLM | Dense | Tests validated constant at 70B scale |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | LlamaForCausalLM (MoE variant) | MoE | Llama-based MoE — architecturally distinct from Mixtral; uses MoE fallback |
| redhatai/Llama-3.3-70B-Instruct-quantized.w8a8 | LlamaForCausalLM | Dense quantized | Tests quantized weight memory path |

Qwen family (full generational sweep)

Each Qwen generation introduces a new architecture class. Testing all generations reveals whether
constants need per-generation tuning or whether a shared constant suffices.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen-7B-Chat | QWenLMHeadModel | Dense | Qwen1; no validated profile, uses dense fallback |
| Qwen/Qwen2.5-7B-Instruct | Qwen2ForCausalLM | Dense | Has validated profile (5.6 GiB) |
| Qwen/Qwen2.5-72B-Instruct | Qwen2ForCausalLM | Dense | Tests validated constant at 72B scale |
| Qwen/Qwen3-8B | Qwen3ForCausalLM | Dense | Validated profile assumed equal to Qwen2 — confirm |
| Qwen/Qwen3-30B-A3B | Qwen3MoeForCausalLM | MoE | 128 experts, 8 active; uses MoE fallback constant |
| Qwen-next (latest instruct at time of benchmarking) | TBD | TBD | Add validated profile if a new architecture class is introduced |

Mistral family

Sparse MoE with a different expert structure and routing than Llama MoE or Qwen3 MoE; the
canonical reference model for the MoE fallback constant.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | MixtralForCausalLM | MoE | 8 experts, 2 active; primary MoE fallback baseline |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | MistralForCausalLM | Dense | Tests dense fallback for Mistral architecture at 24B |

Microsoft / Phi family

Small dense models; uses dense fallback constant. Worth validating because Phi models are
memory-efficient, so the activation overhead ratio relative to model size is higher than for larger models.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| microsoft/phi-4 | Phi3ForCausalLM | Dense | Uses dense fallback (5.5 GiB) |

Google / Gemma family

Dense GQA models; uses dense fallback constant. Testing small and large variants confirms whether
the constant holds across a 7× parameter range within the same family.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| google/gemma-3-4b-it | Gemma3ForCausalLM | Dense | Dense fallback, small |
| google/gemma-3-27b-it | Gemma3ForCausalLM | Dense | Dense fallback, large — tests constant under memory pressure |

IBM / Granite family

Granite spans multiple generations (3.1, 3.3, 4.x), a small/large size axis, and a multimodal
variant — all under GraniteForCausalLM for text models and LlavaNextForConditionalGeneration
for vision. This makes it well-suited for checking whether the dense fallback constant holds
across generations and scales, and whether the multimodal constant (currently only validated on
Pixtral) generalizes to a second architecture.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| ibm-granite/granite-3.1-2b-instruct | GraniteForCausalLM | Dense | Smallest variant; activation overhead ratio is highest relative to model size here |
| ibm-granite/granite-3.1-8b-instruct | GraniteForCausalLM | Dense | Dense fallback baseline |
| ibm-granite/granite-3.3-8b-instruct | GraniteForCausalLM | Dense | Newer generation — confirms constant did not change across 3.1 → 3.3 |
| ibm-granite/granite-4.0-8b-instruct (or latest 4.x at benchmarking time) | TBD | Dense | If 4.x introduces a new architecture class, add a validated profile; otherwise confirms 3.x constant carries forward |
| ibm-granite/granite-vision-3.3-2b | LlavaNextForConditionalGeneration | Multimodal | Multimodal fallback (2.5 GiB); second architecture in this class after Pixtral — tests generalizability of the constant |

MLA coverage (DeepSeek)

The planner has a dedicated KV cache formula for Multi-head Latent Attention (MLA), but no MLA
model currently appears in the matrix — meaning the formula has never been validated end-to-end.
DeepSeek-V2-Lite (16B total, 2.4B active) is the only practical single-node MLA option; larger
DeepSeek models require multi-node and are out of scope.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V2-Lite-Chat | DeepseekV2ForCausalLM | MoE + MLA | Only single-node MLA model available; validates the MLA KV formula; uses MoE fallback for activation |

OpenAI OSS family

Novel MoE architecture (gpt_oss) not yet in VALIDATED_ACTIVATION_PROFILES; falls through to
the MoE fallback. Both models are already in the benchmark database. Testing at two scales
(20B and 120B total params) shows whether the fallback holds across sizes within this family and
may motivate a dedicated validated profile.

| Model | Architecture class | Type | Notes |
| --- | --- | --- | --- |
| openai/gpt-oss-20b | GptOssForCausalLM | MoE | 32 experts, 4 active; 21B total / 3.6B active |
| openai/gpt-oss-120b | GptOssForCausalLM | MoE | 117B total / 5.1B active |

Tensor Parallelism Matrix

For every model, iterate over all TP values where the model fits in GPU memory. The non-torch
overhead constant changes at TP≥2 — this transition point is explicitly worth verifying.

| Model size | TP values to test |
| --- | --- |
| ≤8B dense | TP=1, TP=2, TP=4 |
| 14B–32B dense | TP=1, TP=2, TP=4 |
| 70B–72B dense | TP=2, TP=4, TP=8 |
| MoE (30B–70B active) | TP=1, TP=2, TP=4 |
| Quantized 70B | TP=1, TP=2, TP=4 |

For each TP step, record the non-torch overhead separately to confirm whether the TP=1 / TP≥2
binary constant is sufficient or whether a per-TP-value table is more accurate.


vLLM Argument Sensitivity

This section isolates the effect of individual vLLM launch arguments on each memory component.
All runs use meta-llama/Llama-3.1-8B-Instruct (primary) and Qwen/Qwen2.5-7B-Instruct
(secondary) on H100 80GB, holding all other arguments at the standard values from the
vLLM Configuration section and varying exactly one argument at a time.

--max-model-len

| Sweep values | Models |
| --- | --- |
| 2048, 4096, 8192, 16384, 32768 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |

The hypothesis under test: KV block count scales linearly with max_model_len, while weight memory
and activation memory are unaffected. Record any deviation — non-linear KV scaling or changes to
activation memory would indicate vLLM internal behavior (e.g., CUDA graph capture bucketing at
different context sizes) that the planner must account for.

--pipeline-parallel-size (PP)

| Sweep values | Models | Hardware requirement |
| --- | --- | --- |
| PP=1, 2, 4 | Llama-3.1-8B-Instruct (TP=1) | 1, 2, 4× H100 80GB |

With PP=N, each GPU hosts approximately 1/N of the model layers, so weight memory
per GPU ≈ total / N and activation memory per GPU ≈ baseline / N. Non-torch overhead is expected
to increase per GPU due to inter-stage communication buffers. The current planner formula does not
model PP — quantify the per-GPU error across PP values and determine whether a PP correction factor
is needed.

Measure all five quantities (weight, activation, non-torch, KV blocks, total allocatable KV) per
GPU at each PP value and record the per-stage breakdown where the vLLM log exposes it.

--data-parallel-size (DP)

| Sweep values | Models | Hardware requirement |
| --- | --- | --- |
| DP=1, 2 | Llama-3.1-8B-Instruct (TP=1) | 1, 2× H100 80GB |

Each DP replica is an independent process owning a full model copy, so per-GPU
memory usage for a DP=2 instance should be identical to a DP=1 instance. The planner formula
multiplies weight and activation memory by DP:

allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
               - (model_weights_per_gpu × DP)
               - (activation_memory × DP)
               - non_torch_overhead

If per-GPU measurements are unchanged across DP values, the DP multiplier in the formula models
total cluster memory consumed (across all replicas), not per-GPU allocation — document which
interpretation is correct and whether the formula needs clarification or correction.

Interaction summary

The three arguments below — --dtype, --quantization, and --kv-cache-dtype — control distinct
memory regions but are coupled in two ways that matter for experimental design:

  1. kv_cache_dtype=auto inherits from dtype: when --kv-cache-dtype auto (the default),
    vLLM sets the KV cache precision equal to the model's activation dtype. Sweeping --dtype
    without pinning --kv-cache-dtype therefore changes both weight/activation memory and KV
    cache precision simultaneously — confounding the measurement. To isolate each effect, pin the
    other argument explicitly.

  2. Quantized weights de-quantize to dtype for compute: weights stored in INT4 (AWQ) or FP8
    are expanded back to the dtype precision (e.g., BF16) during the forward pass. Activation
    memory therefore depends on dtype, not on quantization format. A change in activation
    memory observed between quantized and unquantized runs indicates a problem with the de-quant
    path, not an expected quantization effect.

Sweep design consequence: all three sweeps below pin the arguments they are not varying to
float16 to avoid cross-contamination.

--dtype

Controls the precision used for model weights and activations during computation. auto matches
the dtype the model was originally saved in (typically BF16 for modern models).

Hold fixed: --kv-cache-dtype float16 (pin explicitly to prevent KV cache precision from
also changing when dtype changes — see interaction note above).

| Sweep values | Models |
| --- | --- |
| float16 (baseline), bfloat16 | Llama-3.1-8B-Instruct (TP=1) |

BF16 ≈ FP16 in weight memory (both are 2 bytes/parameter — same bit-width,
different exponent range). Activation memory should also be equal. The planner's weight memory
formula reads dtype from safetensors metadata; verify it correctly handles a model loaded in a
dtype different from the one it was saved in. KV cache block count must be identical across
dtype values because --kv-cache-dtype float16 is pinned.

--quantization

Specifies the compression scheme applied to model weights at load time. Quantized weights are
stored in low precision (INT4/INT8/FP8) but de-quantized back to the dtype precision during
actual math operations — so activation memory depends on dtype, not on quantization format.

Hold fixed: --dtype float16, --kv-cache-dtype float16.

| Sweep values | Models | Notes |
| --- | --- | --- |
| none (baseline) | meta-llama/Llama-3.1-8B-Instruct (TP=1) | FP16 weights |
| fp8 | redhatai/Llama-3.1-8B-Instruct-quantized.w8a8 (TP=1) | FP8 weights; ~2× compression expected |
| awq | redhatai/Llama-3.1-8B-Instruct-quantized.w4a16 (TP=1) | INT4 weights; ~4× compression expected |

Weight memory scales with quantization bit-width — AWQ INT4 ≈ 25% of FP16,
FP8 ≈ 50%. Activation memory should be equal across all three because de-quantization
restores the dtype=float16 precision for compute; any divergence in activation memory is a
finding to investigate. KV cache block count should be identical (kv_cache_dtype pinned).
The planner reads raw safetensors tensor dtypes; verify it sums quantized tensor sizes correctly
rather than treating all parameters as the model's nominal dtype.

Note: the primary matrix already includes redhatai/Llama-3.3-70B-Instruct-quantized.w8a8
for large-scale quantization coverage. This sweep isolates quantization format effects on a
single controlled model.

--kv-cache-dtype

Controls the precision used for the KV cache independently of weight and activation dtype.
auto inherits from the model's dtype — so with the defaults, KV cache is FP16 for FP16
models and BF16 for BF16 models.

Hold fixed: --dtype float16 (ensures auto baseline resolves to FP16, making the
comparison clean: auto=FP16 vs explicit fp8).

| Sweep values | Models |
| --- | --- |
| auto (=FP16 with dtype=float16), fp8 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |

Switching from FP16 to FP8 halves the per-token KV cache byte cost, so GPU KV
block count should approximately double at the same memory budget:

per_token_bytes (FP16) = num_hidden_layers × 2 × num_kv_heads × head_dim × 2
per_token_bytes (FP8)  = num_hidden_layers × 2 × num_kv_heads × head_dim × 1

Weight memory and activation memory must be identical across both runs (dtype is pinned).
The planner's KVCacheDetail formula currently derives precision_bytes from the model's
weight dtype — verify it accepts kv_cache_dtype as an explicit override. This is the most
operationally important test in this section: FP8 KV cache is a common production optimization
and a planner that ignores kv_cache_dtype will systematically underestimate available KV
memory for any deployment using it.
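The override the planner needs to accept amounts to a small resolution step before the per-token formula. A sketch under the assumption that dtypes arrive as lowercase strings (the name and dtype table are illustrative):

```python
# Bytes per KV cache element for common dtype strings; 'auto' defers to the
# model dtype, mirroring vLLM's --kv-cache-dtype default behavior.
KV_DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "fp8": 1, "fp8_e4m3": 1, "fp8_e5m2": 1}

def kv_precision_bytes(model_dtype: str, kv_cache_dtype: str = "auto") -> int:
    """Resolve KV cache precision, honoring an explicit kv_cache_dtype override."""
    dtype = model_dtype if kv_cache_dtype == "auto" else kv_cache_dtype
    return KV_DTYPE_BYTES[dtype]
```

With this in place, the FP8 sweep simply feeds `precision_bytes = kv_precision_bytes("float16", "fp8") = 1` into the per-token formula, halving the byte cost and roughly doubling the expected block count.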


Hardware

  • Primary: H100 80GB SXM (or PCIe)
  • Secondary: A100 80GB, if available
  • gpu_memory_utilization = 0.95 for all runs (pinned explicitly; note the vLLM default is 0.9)

vLLM Configuration

Use a consistent, minimal vLLM configuration to isolate baseline memory behavior:

vllm serve <model> \
  --tensor-parallel-size <TP> \
  --gpu-memory-utilization 0.95 \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --max-model-len 8192   # fixed context window to control KV pool size

Record the exact vLLM version tag and commit SHA in all output files. Re-run if a new minor
version is released before the benchmarking campaign completes.


Measurement Procedure

For each (model, GPU, vllm_args) run:

  1. Launch vLLM and capture the full startup log to a file
  2. Confirm clean startup — abort and flag the run if the log contains any of:
    offloading, cudaMalloc failed, insufficient memory, retrying allocation
  3. Extract from the startup log:
    • Weight memory: torch allocated reported after weights load
    • Activation memory: activation_memory_used_gb from the memory profiling summary
      (logged after the dummy warmup forward pass)
    • Non-torch overhead: non_torch_memory_used_gb from the same summary line
    • KV block count: # GPU blocks: N
  4. Compute actual KV pool size:
    actual_kv_gib = N × 16 × per_token_bytes_from_model_config / 1024³
  5. Run planner's allocatable_kv_cache_memory() with matching parameters
  6. Record all values plus log_path and timestamp in a structured JSON output file per run

PP runs: vLLM assigns contiguous layer ranges to each pipeline stage; each stage emits its own
memory profiling summary. Collect per-stage logs and record weight/activation/non-torch for each
stage separately before summing to cross-check against the single-node total.

DP runs: each DP replica is an independent process — collect one log per replica. All replicas
should produce identical measurements; flag any divergence.
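Steps 2–3 of the procedure can be sketched as a small log parser. The `# GPU blocks:` pattern is quoted from the procedure above; the exact wording of the memory profiling summary varies across vLLM versions, so the other two patterns are assumptions to adapt against captured logs:

```python
import re

# Assumed log-line patterns; verify against the actual vLLM version used.
PATTERNS = {
    "kv_cache_blocks": re.compile(r"#\s*GPU blocks:\s*(\d+)"),
    "activation_memory_gib": re.compile(r"activation_memory_used_gb[=:\s]+([\d.]+)"),
    "non_torch_memory_gib": re.compile(r"non_torch_memory_used_gb[=:\s]+([\d.]+)"),
}
# Markers from step 2: any of these means the run must be aborted and flagged.
FAILURE_MARKERS = ("offloading", "cudaMalloc failed",
                   "insufficient memory", "retrying allocation")

def parse_startup_log(text: str) -> dict:
    """Extract measured quantities from a captured vLLM startup log."""
    if any(marker in text for marker in FAILURE_MARKERS):
        raise RuntimeError("unclean startup; flag this run")
    out = {}
    for key, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            value = m.group(1)
            out[key] = float(value) if "." in value else int(value)
    return out
```

The returned dict feeds steps 4–6 directly (KV pool computation, planner comparison, and the per-run JSON record).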


Reporting

Per-run output (JSON)

{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "gpu": "H100-80GB",
  "vllm_args": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 1,
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.95,
    "dtype": "auto",
    "quantization": null,
    "kv_cache_dtype": "auto"
  },
  "vllm_version": "0.x.y",
  "vllm_commit": "abc123",
  "timestamp": "2025-04-17T10:23:00Z",
  "duration_seconds": 142,
  "log_path": "logs/qwen2.5-7b-h100-tp1.log",
  "measured": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.3,
    "non_torch_memory_gib": 0.14,
    "kv_cache_blocks": 12480,
    "kv_cache_gib": 58.1
  },
  "planner_predicted": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.6,
    "non_torch_memory_gib": 0.15,
    "kv_cache_gib": 57.6
  },
  "error_pct": {
    "weight_memory": 0.0,
    "activation_memory": 5.7,
    "non_torch_memory": 7.1,
    "kv_cache": -0.9
  }
}

Summary report (Markdown)

Publish as docs/benchmarks/memory-validation-report.md containing:

  • Per-component error table: rows = models, columns = weight / activation / non-torch / KV cache allocatable, cells = error%
  • Per-architecture error table: grouped by architecture class, showing mean and max absolute error per component
  • TP sensitivity table: for each model, show how activation and non-torch overhead vary with TP
  • Argument sensitivity tables (one per argument from the vLLM Argument Sensitivity section):
    • max_model_len: KV blocks vs. context length — confirm linearity or document deviation
    • PP: per-GPU weight / activation / non-torch / KV at each PP value; note planner formula gap
    • DP: per-replica measurements across DP values; document whether DP multiplier is per-GPU or cluster-total
    • dtype: weight and activation memory vs. precision; verify planner dtype path
    • quantization: weight memory compression ratio vs. format (AWQ, FP8); verify safetensors metadata path
    • kv_cache_dtype: KV block count vs. cache precision (FP16 vs FP8); confirm ~2× block count for FP8
  • Outliers: any model where any component exceeds ±10% error — root cause analysis required
  • Calibration decisions: document each constant changed, old value → new value, evidence

Deliverables

  • Measurement scripts added to scripts/validate_memory/
  • Raw JSON results for all primary (model, GPU, TP) runs and argument sensitivity sweeps in data/benchmarks/memory/
  • Published summary report at docs/benchmarks/memory-validation-report.md
  • Calibration PR: update constants in capacity_planner.py and VALIDATED_ACTIVATION_PROFILES
    where any component exceeds ±10% error
  • Unit tests: parameterized @pytest.mark.unit tests asserting KV cache block count formula
    for at least one dense, one MoE, and one MLA model config (using mocked HF config, no GPU required)
  • Publish an accuracy blog article for the community
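The unit-test deliverable could be structured along these lines; the config values are illustrative stand-ins for mocked HF configs (Llama-3.1-8B-like, Mixtral-like, DeepSeek-V2-Lite-like), and the case table maps directly onto @pytest.mark.parametrize with @pytest.mark.unit in the real suite:

```python
def per_token_kv_bytes(layers, precision_bytes, kv_heads=0, head_dim=0,
                       mla_rank=None, rope_dim=None):
    """Formula under test: standard attention vs. MLA per-token KV bytes."""
    if mla_rank is not None:
        return layers * (mla_rank + rope_dim) * precision_bytes
    return layers * 2 * kv_heads * head_dim * precision_bytes

# (description, kwargs from mocked HF config, expected bytes/token)
CASES = [
    ("dense GQA, Llama-3.1-8B-like",
     dict(layers=32, precision_bytes=2, kv_heads=8, head_dim=128), 131072),
    ("MoE, Mixtral-like (KV depends only on attention config, not experts)",
     dict(layers=32, precision_bytes=2, kv_heads=8, head_dim=128), 131072),
    ("MLA, DeepSeek-V2-Lite-like",
     dict(layers=27, precision_bytes=2, mla_rank=512, rope_dim=64), 31104),
]

for desc, kwargs, expected in CASES:
    assert per_token_kv_bytes(**kwargs) == expected, desc
```

No GPU is required: each case only exercises the arithmetic against a hand-computed expected value, which is the point of the deliverable.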

Out of Scope

  • Latency or throughput accuracy
  • Multi-node / multi-host tensor parallelism
  • Prefix caching effects
  • Speculative decoding
  • Roofline estimation accuracy

Related

  • src/planner/capacity_planner.py — all memory and KV cache formulas
  • src/planner/recommendation/estimator.py:209 — uses check_model_fits_gpu for memory feasibility
  • src/planner/recommendation/config_finder.py — calls estimator for configs without benchmarks

References

  • vllm_config_estimator — GPU memory and
    KV cache estimation tool for vLLM deployments; useful comparison point for formula validation
  • GPU Calculator — interactive GPU memory calculator for
    LLM inference; reference for cross-checking capacity planning outputs
