Extensive validation of capacity planner memory and KV cache calculations against real vLLM
Background
The capacity planner computes GPU memory requirements and KV cache allocations using formulas and
constants in src/planner/capacity_planner.py. The existing constants were derived from a limited
a wider variety of model architectures using the latest vLLM release, and recalibrates any constants that have drifted.
The results should be published as a detailed technical report consumable by the open-source
community, documenting methodology, raw measurements, and any constant changes made.
What to Validate
For each (model, GPU, TP) combination, compare planner predictions against vLLM actuals for the
following five quantities:
1. Model weight memory (model_memory_req, line 553)
Planner reads weight memory from safetensors tensor metadata and sums num_params × dtype_bytes
per dtype. Compare against vLLM's reported torch allocated at startup, before KV cache
allocation begins (captured from the vLLM startup log line "weights loaded").
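For reference, a minimal sketch of that per-dtype summation, reading only the safetensors shard headers. The function name and dtype table below are illustrative, not the planner's actual code:

```python
import json
import struct
from pathlib import Path

# Bytes per element for common safetensors dtype tags (assumed subset; extend as needed).
DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "F8_E4M3": 1, "I8": 1, "U8": 1, "I32": 4, "I64": 8}

def weight_bytes_by_dtype(model_dir: str) -> dict[str, int]:
    """Sum tensor sizes per dtype from safetensors headers, without loading any weights."""
    totals: dict[str, int] = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]  # first 8 bytes: header length
            header = json.loads(f.read(header_len))
        for name, meta in header.items():
            if name == "__metadata__":
                continue
            n_elems = 1
            for dim in meta["shape"]:
                n_elems *= dim
            totals[meta["dtype"]] = totals.get(meta["dtype"], 0) + n_elems * DTYPE_BYTES[meta["dtype"]]
    return totals

# Total weight GiB to compare against the "weights loaded" torch-allocated figure:
# sum(weight_bytes_by_dtype("/path/to/model").values()) / 1024**3
```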
2. Activation memory (estimate_vllm_activation_memory, line 380)
Planner uses hardcoded constants per architecture class:
| Architecture class | Constant (GiB) | Status |
|---|---|---|
| `LlamaForCausalLM` | 4.8 | Validated profile |
| `Qwen2ForCausalLM` | 5.6 | Validated profile |
| `Qwen3ForCausalLM` | 5.6 | Validated profile (assumed same as Qwen2) |
| `GemmaForCausalLM` | 5.5 | Dense fallback |
| `PhiForCausalLM` | 5.5 | Dense fallback |
| `GraniteForCausalLM` | 5.5 | Dense fallback |
| `QWenLMHeadModel` | 5.5 | Dense fallback |
| `Qwen3MoeForCausalLM` | 8.0 | MoE fallback |
| `MistralForCausalLM` / `MixtralForCausalLM` | 8.0 | MoE fallback |
| `PixtralForConditionalGeneration` | 2.5 | Validated multimodal profile |
vLLM performs a dummy forward pass during engine warmup and logs the peak activation memory
directly. Read activation_memory_used_gb from the vLLM startup log (look for the memory
profiling summary line emitted after the dummy run). No inference call or derived formula needed.
3. KV cache per-token memory (KVCacheDetail, line 173)
Standard attention (MHA/GQA/MQA):
per_token_bytes = num_hidden_layers × 2 × num_key_value_heads × head_dim × precision_bytes
Multi-head Latent Attention (MLA):
per_token_bytes = num_hidden_layers × (kv_lora_rank + qk_rope_head_dim) × precision_bytes
Validate purely from startup data — no inference call required. vLLM logs the allocated block
count at startup (the `# GPU blocks: N` line). The actual KV pool size is therefore:
actual_kv_gib = N × block_size × per_token_bytes / 1024³ (block_size = 16 tokens by default)
Compare against the planner's allocatable_kv_cache_memory() output directly.
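A small sketch of both per-token formulas and the block-count conversion, assuming standard Hugging Face config field names (the helper names are illustrative):

```python
def per_token_kv_bytes(cfg: dict, precision_bytes: int = 2, mla: bool = False) -> int:
    """Per-token KV cache bytes derived from an HF-style config dict."""
    layers = cfg["num_hidden_layers"]
    if mla:
        # MLA stores a compressed KV latent plus the rotary key component per layer.
        return layers * (cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]) * precision_bytes
    kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    head_dim = cfg.get("head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
    return layers * 2 * kv_heads * head_dim * precision_bytes

def actual_kv_gib(num_gpu_blocks: int, per_token_bytes: int, block_size: int = 16) -> float:
    """KV pool size implied by vLLM's logged GPU block count."""
    return num_gpu_blocks * block_size * per_token_bytes / 1024**3
```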
4. Non-torch memory overhead (estimate_vllm_non_torch_memory, line 29)
Constants: 0.15 GiB for TP=1, 0.6 GiB for TP≥2. vLLM's startup memory profiling summary also
logs non_torch_memory_used_gb — read it directly from the same log line as activation memory.
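The constant under test here is effectively a two-value lookup; a sketch of the current behavior described above (not the planner source):

```python
def estimate_vllm_non_torch_memory_gib(tp_size: int) -> float:
    """Current planner assumption: flat per-GPU overhead, switching once at TP >= 2."""
    return 0.15 if tp_size == 1 else 0.6
```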
5. Total allocatable KV cache (allocatable_kv_cache_memory, line 855)
The master formula the planner uses:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
- (model_weights_per_gpu × DP)
- (activation_memory × DP)
- non_torch_overhead
DP is data parallelism. All runs in this campaign use standard single-process vLLM with TP-only
parallelism, so DP=1 throughout and the formula reduces to:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs) - model_weights_per_gpu - activation_memory - non_torch_overhead
Compute end-to-end error against actual KV memory derived from block count:
actual_kv_gib = (vllm_kv_blocks × 16 × per_token_bytes_from_model_config) / 1024³
error% = (planner_allocatable - actual_kv_gib) / actual_kv_gib × 100
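A sketch of the end-to-end comparison matching the master formula above; the names are illustrative and the planner's real entry point is `allocatable_kv_cache_memory()`:

```python
def planner_allocatable_kv_gib(gpu_memory_gib: float, gpu_util: float, num_gpus: int,
                               weights_per_gpu_gib: float, activation_gib: float,
                               non_torch_gib: float, dp: int = 1) -> float:
    """Planner master formula; with dp=1 it reduces to the TP-only form used in this campaign."""
    return (gpu_memory_gib * gpu_util * num_gpus
            - weights_per_gpu_gib * dp
            - activation_gib * dp
            - non_torch_gib)

def kv_error_pct(planner_allocatable_gib: float, actual_kv_gib: float) -> float:
    """Signed end-to-end error of the planner prediction against the measured KV pool."""
    return (planner_allocatable_gib - actual_kv_gib) / actual_kv_gib * 100
```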
Model Matrix
Models are grouped by family. Each family section notes whether it has an existing validated
profile or falls through to a generic constant, and what architectural question the family answers.
Meta / Llama family
Dense and MoE variants from the same base architecture, spanning small to large and covering the
validated LlamaForCausalLM profile plus the Llama-based MoE class.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `LlamaForCausalLM` | Dense | Has validated profile (4.8 GiB) — primary baseline |
| `meta-llama/Llama-3.3-70B-Instruct` | `LlamaForCausalLM` | Dense | Tests validated constant at 70B scale |
| `meta-llama/Llama-4-Scout-17B-16E-Instruct` | `LlamaForCausalLM` (MoE variant) | MoE | Llama-based MoE — architecturally distinct from Mixtral; uses MoE fallback |
| `redhatai/Llama-3.3-70B-Instruct-quantized.w8a8` | `LlamaForCausalLM` | Dense quantized | Tests quantized weight memory path |
Qwen family (full generational sweep)
Each Qwen generation introduces a new architecture class. Testing all generations reveals whether
constants need per-generation tuning or whether a shared constant suffices.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `Qwen/Qwen-7B-Chat` | `QWenLMHeadModel` | Dense | Qwen1; no validated profile, uses dense fallback |
| `Qwen/Qwen2.5-7B-Instruct` | `Qwen2ForCausalLM` | Dense | Has validated profile (5.6 GiB) |
| `Qwen/Qwen2.5-72B-Instruct` | `Qwen2ForCausalLM` | Dense | Tests validated constant at 72B scale |
| `Qwen/Qwen3-8B` | `Qwen3ForCausalLM` | Dense | Validated profile assumed equal to Qwen2 — confirm |
| `Qwen/Qwen3-30B-A3B` | `Qwen3MoeForCausalLM` | MoE | 128 experts, 8 active; uses MoE fallback constant |
| Qwen-next (latest instruct at time of benchmarking) | TBD | TBD | Add validated profile if a new architecture class is introduced |
Mistral family
Sparse MoE with a different expert structure and routing than Llama MoE or Qwen3 MoE; the
canonical reference model for the MoE fallback constant.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `mistralai/Mixtral-8x7B-Instruct-v0.1` | `MixtralForCausalLM` | MoE | 8 experts, 2 active; primary MoE fallback baseline |
| `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | `MistralForCausalLM` | Dense | Tests dense fallback for Mistral architecture at 24B |
Microsoft / Phi family
Small dense models using the dense fallback constant. Worth validating because Phi models are
memory-efficient, so the activation overhead ratio relative to model size is higher than for larger models.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `microsoft/phi-4` | `Phi3ForCausalLM` | Dense | Uses dense fallback (5.5 GiB) |
Google / Gemma family
Dense GQA models using the dense fallback constant. Testing small and large variants confirms whether
the constant holds across a 7× parameter range within the same family.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `google/gemma-3-4b-it` | `Gemma3ForCausalLM` | Dense | Dense fallback, small |
| `google/gemma-3-27b-it` | `Gemma3ForCausalLM` | Dense | Dense fallback, large — tests constant under memory pressure |
IBM / Granite family
Granite spans multiple generations (3.1, 3.3, 4.x), a small/large size axis, and a multimodal
variant — all under GraniteForCausalLM for text models and LlavaNextForConditionalGeneration
for vision. This makes it well-suited for checking whether the dense fallback constant holds
across generations and scales, and whether the multimodal constant (currently only validated on
Pixtral) generalizes to a second architecture.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `ibm-granite/granite-3.1-2b-instruct` | `GraniteForCausalLM` | Dense | Smallest variant; activation overhead ratio is highest relative to model size here |
| `ibm-granite/granite-3.1-8b-instruct` | `GraniteForCausalLM` | Dense | Dense fallback baseline |
| `ibm-granite/granite-3.3-8b-instruct` | `GraniteForCausalLM` | Dense | Newer generation — confirms constant did not change across 3.1 → 3.3 |
| `ibm-granite/granite-4.0-8b-instruct` (or latest 4.x at benchmarking time) | TBD | Dense | If 4.x introduces a new architecture class, add a validated profile; otherwise confirms 3.x constant carries forward |
| `ibm-granite/granite-vision-3.3-2b` | `LlavaNextForConditionalGeneration` | Multimodal | Multimodal fallback (2.5 GiB); second architecture in this class after Pixtral — tests generalizability of the constant |
MLA coverage (DeepSeek)
The planner has a dedicated KV cache formula for Multi-head Latent Attention (MLA), but no MLA
model currently appears in the matrix — meaning the formula has never been validated end-to-end.
DeepSeek-V2-Lite (16B total, 2.4B active) is the only practical single-node MLA option; larger
DeepSeek models require multi-node and are out of scope.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `deepseek-ai/DeepSeek-V2-Lite-Chat` | `DeepseekV2ForCausalLM` | MoE + MLA | Only single-node MLA model available; validates the MLA KV formula; uses MoE fallback for activation |
OpenAI OSS family
Novel MoE architecture (gpt_oss) not yet in VALIDATED_ACTIVATION_PROFILES; falls through to
the MoE fallback. Both models are already in the benchmark database. Testing at two scales
(20B and 120B total params) shows whether the fallback holds across sizes within this family and
may motivate a dedicated validated profile.
| Model | Architecture class | Type | Notes |
|---|---|---|---|
| `openai/gpt-oss-20b` | `GptOssForCausalLM` | MoE | 32 experts, 4 active; 21B total / 3.6B active |
| `openai/gpt-oss-120b` | `GptOssForCausalLM` | MoE | 117B total / 5.1B active |
Tensor Parallelism Matrix
For every model, iterate over all TP values where the model fits in GPU memory. The non-torch
overhead constant changes at TP≥2 — this transition point is explicitly worth verifying.
| Model size | TP values to test |
|---|---|
| ≤8B dense | TP=1, TP=2, TP=4 |
| 14B–32B dense | TP=1, TP=2, TP=4 |
| 70B–72B dense | TP=2, TP=4, TP=8 |
| MoE (30B–70B active) | TP=1, TP=2, TP=4 |
| Quantized 70B | TP=1, TP=2, TP=4 |
For each TP step, record the non-torch overhead separately to confirm whether the TP=1 / TP≥2
binary constant is sufficient or whether a per-TP-value table is more accurate.
vLLM Argument Sensitivity
This section isolates the effect of individual vLLM launch arguments on each memory component.
All runs use meta-llama/Llama-3.1-8B-Instruct (primary) and Qwen/Qwen2.5-7B-Instruct
(secondary) on H100 80GB, holding all other arguments at the standard values from the
vLLM Configuration section and varying exactly one argument at a time.
--max-model-len
| Sweep values | Models |
|---|---|
| 2048, 4096, 8192, 16384, 32768 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |
KV block count scales linearly with max_model_len; weight memory and activation
memory are unaffected. Record any deviation — non-linear KV scaling or changes to activation memory
would indicate vLLM internal behavior (e.g., CUDA graph capture bucketing at different context
sizes) that the planner must account for.
--pipeline-parallel-size (PP)
| Sweep values | Models | Hardware requirement |
|---|---|---|
| PP=1, 2, 4 | Llama-3.1-8B-Instruct (TP=1) | 1, 2, 4× H100 80GB |
With PP=N, each GPU hosts approximately 1/N of the model layers, so weight memory
per GPU ≈ total / N and activation memory per GPU ≈ baseline / N. Non-torch overhead is expected
to increase per GPU due to inter-stage communication buffers. The current planner formula does not
model PP — quantify the per-GPU error across PP values and determine whether a PP correction factor
is needed.
Measure all five quantities (weight, activation, non-torch, KV blocks, total allocatable KV) per
GPU at each PP value and record the per-stage breakdown where the vLLM log exposes it.
--data-parallel-size (DP)
| Sweep values | Models | Hardware requirement |
|---|---|---|
| DP=1, 2 | Llama-3.1-8B-Instruct (TP=1) | 1, 2× H100 80GB |
Each DP replica is an independent process owning a full model copy, so per-GPU
memory usage for a DP=2 instance should be identical to a DP=1 instance. The planner formula
multiplies weight and activation memory by DP:
allocatable_kv = (gpu_memory × gpu_util × num_GPUs)
- (model_weights_per_gpu × DP)
- (activation_memory × DP)
- non_torch_overhead
If per-GPU measurements are unchanged across DP values, the DP multiplier in the formula models
total cluster memory consumed (across all replicas), not per-GPU allocation — document which
interpretation is correct and whether the formula needs clarification or correction.
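To make the two readings concrete, a small worked example under the cluster-total interpretation, using illustrative (not measured) numbers for an 8B-class model on 2× H100 80GB with DP=2, TP=1:

```python
# Illustrative numbers only, not measurements.
gpu_mem_gib, gpu_util, num_gpus, dp = 80.0, 0.95, 2, 2
weights_per_gpu_gib, activation_gib, non_torch_gib = 15.0, 4.8, 0.15

allocatable_kv = (gpu_mem_gib * gpu_util * num_gpus
                  - weights_per_gpu_gib * dp
                  - activation_gib * dp
                  - non_torch_gib)
# 152.0 - 30.0 - 9.6 - 0.15 = 112.25 GiB total across both replicas,
# i.e. roughly the same KV budget per replica as a standalone DP=1 instance
# (76.0 - 15.0 - 4.8 - 0.15 = 56.05 GiB); no single GPU is charged twice for weights.
```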
Interaction summary
The three arguments below (`--dtype`, `--quantization`, `--kv-cache-dtype`) control distinct memory regions but are coupled in two ways that matter
for experimental design:

- `kv_cache_dtype=auto` inherits from `dtype`: when `--kv-cache-dtype auto` (the default), vLLM sets the KV cache precision equal to the model's activation dtype. Sweeping `--dtype` without pinning `--kv-cache-dtype` therefore changes both weight/activation memory and KV cache precision simultaneously — confounding the measurement. To isolate each effect, pin the other argument explicitly.
- Quantized weights de-quantize to `dtype` for compute: weights stored in INT4 (AWQ) or FP8 are expanded back to the `dtype` precision (e.g., BF16) during the forward pass. Activation memory therefore depends on `dtype`, not on quantization format. A change in activation memory observed between quantized and unquantized runs indicates a problem with the de-quant path, not an expected quantization effect.
Sweep design consequence: all three sweeps below pin the arguments they are not varying to
float16 to avoid cross-contamination.
--dtype
Controls the precision used for model weights and activations during computation. auto matches
the dtype the model was originally saved in (typically BF16 for modern models).
Hold fixed: --kv-cache-dtype float16 (pin explicitly to prevent KV cache precision from
also changing when dtype changes — see interaction note above).
| Sweep values | Models |
|---|---|
| float16 (baseline), bfloat16 | Llama-3.1-8B-Instruct (TP=1) |
BF16 ≈ FP16 in weight memory (both are 2 bytes/parameter — same bit-width,
different exponent range). Activation memory should also be equal. The planner's weight memory
formula reads dtype from safetensors metadata; verify it correctly handles a model loaded in a
dtype different from the one it was saved in. KV cache block count must be identical across
dtype values because --kv-cache-dtype float16 is pinned.
--quantization
Specifies the compression scheme applied to model weights at load time. Quantized weights are
stored in low precision (INT4/INT8/FP8) but de-quantized back to the dtype precision during
actual math operations — so activation memory depends on dtype, not on quantization format.
Hold fixed: --dtype float16, --kv-cache-dtype float16.
| Sweep values | Models | Notes |
|---|---|---|
| none (baseline) | `meta-llama/Llama-3.1-8B-Instruct` (TP=1) | FP16 weights |
| fp8 | `redhatai/Llama-3.1-8B-Instruct-quantized.w8a8` (TP=1) | FP8 weights; ~2× compression expected |
| awq | `redhatai/Llama-3.1-8B-Instruct-quantized.w4a16` (TP=1) | INT4 weights; ~4× compression expected |
Weight memory scales with quantization bit-width — AWQ INT4 ≈ 25% of FP16,
FP8 ≈ 50%. Activation memory should be equal across all three because de-quantization
restores the dtype=float16 precision for compute; any divergence in activation memory is a
finding to investigate. KV cache block count should be identical (kv_cache_dtype pinned).
The planner reads raw safetensors tensor dtypes; verify it sums quantized tensor sizes correctly
rather than treating all parameters as the model's nominal dtype.
Note: the primary matrix already includes redhatai/Llama-3.3-70B-Instruct-quantized.w8a8
for large-scale quantization coverage. This sweep isolates quantization format effects on a
single controlled model.
--kv-cache-dtype
Controls the precision used for the KV cache independently of weight and activation dtype.
auto inherits from the model's dtype — so with the defaults, KV cache is FP16 for FP16
models and BF16 for BF16 models.
Hold fixed: --dtype float16 (ensures auto baseline resolves to FP16, making the
comparison clean: auto=FP16 vs explicit fp8).
| Sweep values | Models |
|---|---|
| auto (=FP16 with dtype=float16), fp8 | Llama-3.1-8B-Instruct (TP=1), Qwen2.5-7B-Instruct (TP=1) |
Switching from FP16 to FP8 halves the per-token KV cache byte cost, so GPU KV
block count should approximately double at the same memory budget:
per_token_bytes (FP16) = num_hidden_layers × 2 × num_kv_heads × head_dim × 2
per_token_bytes (FP8) = num_hidden_layers × 2 × num_kv_heads × head_dim × 1
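As a worked example (assuming Llama-3.1-8B's published config values of 32 layers, 8 KV heads, and head_dim 128), per_token_bytes is 32 × 2 × 8 × 128 × 2 = 131,072 bytes at FP16 versus 65,536 bytes at FP8, so roughly twice as many 16-token blocks should fit in the same KV budget.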
Weight memory and activation memory must be identical across both runs (dtype is pinned).
The planner's KVCacheDetail formula currently derives precision_bytes from the model's
weight dtype — verify it accepts kv_cache_dtype as an explicit override. This is the most
operationally important test in this section: FP8 KV cache is a common production optimization
and a planner that ignores kv_cache_dtype will systematically underestimate available KV
memory for any deployment using it.
Hardware
- Primary: H100 80GB SXM (or PCIe)
- Secondary: A100 80GB, if available
- gpu_memory_utilization = 0.95 for all runs
vLLM Configuration
Use a consistent, minimal vLLM configuration to isolate baseline memory behavior:
```bash
vllm serve <model> \
  --tensor-parallel-size <TP> \
  --gpu-memory-utilization 0.95 \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --max-model-len 8192   # fixed context window to control KV pool size
```
Record the exact vLLM version tag and commit SHA in all output files. Re-run if a new minor
version is released before the benchmarking campaign completes.
Measurement Procedure
For each (model, GPU, vllm_args) run:
- Launch vLLM and capture the full startup log to a file
- Confirm clean startup — abort and flag the run if the log contains any of: `offloading`, `cudaMalloc failed`, `insufficient memory`, `retrying allocation`
- Extract from the startup log:
  - Weight memory: `torch allocated` reported after weights load
  - Activation memory: `activation_memory_used_gb` from the memory profiling summary (logged after the dummy warmup forward pass)
  - Non-torch overhead: `non_torch_memory_used_gb` from the same summary line
  - KV block count: `# GPU blocks: N`
- Compute actual KV pool size: `actual_kv_gib = N × 16 × per_token_bytes_from_model_config / 1024³`
- Run the planner's `allocatable_kv_cache_memory()` with matching parameters
- Record all values plus `log_path` and timestamp in a structured JSON output file per run
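A sketch of the extraction step, using the log strings quoted above. The exact wording of vLLM's startup lines varies between versions, so these patterns are assumptions to be checked against the pinned release:

```python
import re

# Patterns are assumptions based on the field names above; verify against the pinned vLLM version.
PATTERNS = {
    "activation_memory_gib": re.compile(r"activation_memory_used_gb[=:]\s*([0-9.]+)"),
    "non_torch_memory_gib": re.compile(r"non_torch_memory_used_gb[=:]\s*([0-9.]+)"),
    "kv_cache_blocks": re.compile(r"#\s*GPU blocks:\s*(\d+)"),
}
ABORT_MARKERS = ("offloading", "cudaMalloc failed", "insufficient memory", "retrying allocation")

def parse_startup_log(log_path: str) -> dict:
    """Extract measured values from a captured vLLM startup log, flagging unclean runs."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    if any(marker in text for marker in ABORT_MARKERS):
        raise RuntimeError(f"unclean startup, flag and discard run: {log_path}")
    measured = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        measured[key] = float(match.group(1)) if match else None  # None -> inspect the log manually
    return measured
```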
PP runs: vLLM assigns contiguous layer ranges to each pipeline stage; each stage emits its own
memory profiling summary. Collect per-stage logs and record weight/activation/non-torch for each
stage separately before summing to cross-check against the single-node total.
DP runs: each DP replica is an independent process — collect one log per replica. All replicas
should produce identical measurements; flag any divergence.
Reporting
Per-run output (JSON)
```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "gpu": "H100-80GB",
  "vllm_args": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 1,
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.95,
    "dtype": "auto",
    "quantization": null,
    "kv_cache_dtype": "auto"
  },
  "vllm_version": "0.x.y",
  "vllm_commit": "abc123",
  "timestamp": "2025-04-17T10:23:00Z",
  "duration_seconds": 142,
  "log_path": "logs/qwen2.5-7b-h100-tp1.log",
  "measured": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.3,
    "non_torch_memory_gib": 0.14,
    "kv_cache_blocks": 12480,
    "kv_cache_gib": 58.1
  },
  "planner_predicted": {
    "weight_memory_gib": 14.2,
    "activation_memory_gib": 5.6,
    "non_torch_memory_gib": 0.15,
    "kv_cache_gib": 57.6
  },
  "error_pct": {
    "weight_memory": 0.0,
    "activation_memory": 5.7,
    "non_torch_memory": 7.1,
    "kv_cache": -0.9
  }
}
```
Summary report (Markdown)
Publish as docs/benchmarks/memory-validation-report.md containing:
- Per-component error table: rows = models, columns = weight / activation / non-torch / KV cache allocatable, cells = error%
- Per-architecture error table: grouped by architecture class, showing mean and max absolute error per component
- TP sensitivity table: for each model, show how activation and non-torch overhead vary with TP
- Argument sensitivity tables (one per argument from the vLLM Argument Sensitivity section):
  - max_model_len: KV blocks vs. context length — confirm linearity or document deviation
  - PP: per-GPU weight / activation / non-torch / KV at each PP value; note planner formula gap
  - DP: per-replica measurements across DP values; document whether DP multiplier is per-GPU or cluster-total
  - dtype: weight and activation memory vs. precision; verify planner dtype path
  - quantization: weight memory compression ratio vs. format (AWQ, FP8); verify safetensors metadata path
  - kv_cache_dtype: KV block count vs. cache precision (FP16 vs FP8); confirm ~2× block count for FP8
- Outliers: any model where any component exceeds ±10% error — root cause analysis required
- Calibration decisions: document each constant changed, old value → new value, evidence
Deliverables
- Validation scripts in `scripts/validate_memory/`
- Results for all `(model, GPU, TP)` runs and argument sensitivity sweeps in `data/benchmarks/memory/`
- Summary report at `docs/benchmarks/memory-validation-report.md`
- Updated constants in `capacity_planner.py` and `VALIDATED_ACTIVATION_PROFILES` where any component exceeds ±10% error
- `@pytest.mark.unit` tests asserting the KV cache block count formula for at least one dense, one MoE, and one MLA model config (using mocked HF config, no GPU required)
Out of Scope
- Latency or throughput accuracy
- Multi-node / multi-host tensor parallelism
- Prefix caching effects
- Speculative decoding
- Roofline estimation accuracy
Related
src/planner/capacity_planner.py — all memory and KV cache formulas
src/planner/recommendation/estimator.py:209 — uses check_model_fits_gpu for memory feasibility
src/planner/recommendation/config_finder.py — calls estimator for configs without benchmarks
References
- vllm_config_estimator — GPU memory and
KV cache estimation tool for vLLM deployments; useful comparison point for formula validation
- GPU Calculator — interactive GPU memory calculator for
LLM inference; reference for cross-checking capacity planning outputs