`Logos`: Capacity planner / vLLM VRAM accounting mismatch on multi-GPU workers #570

## What

On a multi-GPU worker (`deioma`: 3× RTX PRO 6000 Blackwell Max-Q, ~96 GiB each), a `tp=2` lane for `google/gemma-3-27b-it` fails to start with:

```
ValueError: Free memory on device cuda:0 (79.12/94.97 GiB) on startup is less
than desired GPU memory utilization (0.95, 90.22 GiB).
```

…even though GPUs 1 and 2 were both nearly empty (~96 GiB free each) and could have hosted gemma on their own. The planner instead placed gemma on GPUs (0,1), where GPU 0 already had ~16 GiB in use by a `deepseek-ai/DeepSeek-OCR-2` lane.
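The arithmetic behind the error can be reproduced with a small sketch. This only mirrors the comparison stated in the error message (free memory versus `gpu_memory_utilization × total`); it is not vLLM's actual code, and `vllm_startup_check` is a hypothetical helper name.

```python
def vllm_startup_check(free_gib: float, total_gib: float,
                       gpu_memory_utilization: float) -> tuple[bool, float]:
    """Mirror the check described by the error message above.

    Returns (passes, desired_gib), where desired_gib is the amount of
    memory the engine wants to have available on the device at startup.
    """
    desired_gib = gpu_memory_utilization * total_gib
    return free_gib >= desired_gib, desired_gib


# Numbers taken from the ValueError for cuda:0.
ok, desired = vllm_startup_check(free_gib=79.12, total_gib=94.97,
                                 gpu_memory_utilization=0.95)
print(f"cuda:0 passes={ok}, desired={desired:.2f} GiB")

# Illustrative only: an otherwise empty GPU with an assumed 94.5 GiB free.
ok_empty, _ = vllm_startup_check(free_gib=94.5, total_gib=94.97,
                                 gpu_memory_utilization=0.95)
print(f"empty GPU passes={ok_empty}")
```

With ~16 GiB already in use, GPU 0 cannot satisfy a 0.95 utilization target no matter how small the model itself is.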

## Why

Two interacting pieces in `logos-workernode/logos_worker_node/lane_manager.py`:

1. **`_pick_best_gpu_subset` (line 1219)** does *best-fit / tight-pack* selection: it scores each candidate subset by `leftover = Σ(free − required)` and sorts ascending (line 1240), so the smallest leftover wins. This is intentional: it keeps wide free slots for future big models.

2. **`_estimate_lane_vram_mb`** returns `base_residency_mb + kv_cache_memory_bytes` (~45 GiB/GPU for gemma). But the actual vLLM reservation is governed by `gpu_memory_utilization × per_gpu_total` (0.95 × 94.97 GiB ≈ 90 GiB, per the error above). The planner thinks gemma needs 45 GiB/GPU; vLLM tries to grab 90 GiB/GPU.

Concrete numbers for the failing call (DeepSeek-OCR-2 already on cuda:0):

| combo | per-GPU free (GiB) | leftover after 45 GiB est. (GiB) | picked by planner |
|---|---|---|---|
| (0, 1) | 81 / 96 | 36 + 51 = **87** | ✅ (smallest leftover wins) |
| (0, 2) | 81 / 97 | 36 + 52 = 88 | |
| (1, 2) | 96 / 97 | 51 + 52 = 103 | |

Best-fit picks (0,1) correctly given the 45 GiB estimate, but vLLM then needs ~90 GiB/GPU and GPU 0 has only ~81 GiB free, so startup fails. Under the real 90 GiB/GPU requirement, (1,2) is the only feasible pair.
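The selection can be replayed with a minimal sketch of tight-pack scoring. This is not the code in `lane_manager.py`: `pick_best_fit` is a hypothetical stand-in, and the feasibility filter (every GPU in the subset must have at least `required_gib` free) is an assumption about how the real function behaves.

```python
from itertools import combinations
from typing import Optional


def pick_best_fit(free_gib: dict[int, float], required_gib: float,
                  tp: int) -> Optional[tuple[int, ...]]:
    """Pick the tp-sized GPU subset with the smallest total leftover.

    Subsets where any GPU has less than required_gib free are skipped
    (assumed behavior, not confirmed against the real planner).
    """
    candidates = []
    for combo in combinations(sorted(free_gib), tp):
        if all(free_gib[g] >= required_gib for g in combo):
            leftover = sum(free_gib[g] - required_gib for g in combo)
            candidates.append((leftover, combo))
    # Smallest leftover wins; ties fall back to the lowest GPU indices.
    return min(candidates)[1] if candidates else None


# Free memory per GPU in GiB, as in the table above.
free = {0: 81, 1: 96, 2: 97}

planner_choice = pick_best_fit(free, required_gib=45, tp=2)
vllm_reality = pick_best_fit(free, required_gib=90, tp=2)
print("with the 45 GiB estimate:", planner_choice)
print("with the 90 GiB reservation:", vllm_reality)
```

The scoring itself is unchanged between the two calls; only the per-GPU requirement differs, which is why fixing the estimate is enough to steer the planner to a pair that can actually start.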

## Fix options

**A — reconcile estimate with vLLM's actual reservation.** In `_estimate_lane_vram_mb`, when `kv_cache_memory_bytes` is unset, fall back to `gpu_memory_utilization × per_gpu_total_mb` rather than just `base_residency_mb`. Conservative, matches what vLLM actually grabs.
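A minimal sketch of what option A could look like, assuming a per-GPU estimate in MiB. The function name, signature, and parameter names here are illustrative and do not reflect the real `_estimate_lane_vram_mb`; the 20 GiB / 25 GiB split in the usage example is made up to sum to the ~45 GiB figure above.

```python
from typing import Optional

BYTES_PER_MIB = 1024 * 1024


def estimate_lane_vram_mb(base_residency_mb: float,
                          per_gpu_total_mb: float,
                          gpu_memory_utilization: float,
                          kv_cache_memory_bytes: Optional[int] = None) -> float:
    """Per-GPU VRAM estimate for a lane, in MiB (hypothetical signature)."""
    if kv_cache_memory_bytes is not None:
        # Explicit KV budget: the estimate is weights/runtime plus KV cache.
        return base_residency_mb + kv_cache_memory_bytes / BYTES_PER_MIB
    # No explicit KV budget: assume the engine reserves up to its
    # utilization target, so plan for that instead of base residency alone.
    return gpu_memory_utilization * per_gpu_total_mb


total_mb = 94.97 * 1024  # per-GPU total from the error message

# Unset kv_cache_memory_bytes: falls back to the utilization-based figure.
fallback_mb = estimate_lane_vram_mb(20 * 1024, total_mb, 0.95)
print(f"fallback estimate: {fallback_mb / 1024:.2f} GiB")

# Explicit KV budget (illustrative 20 GiB base + 25 GiB KV).
pinned_mb = estimate_lane_vram_mb(20 * 1024, total_mb, 0.95,
                                  kv_cache_memory_bytes=25 * 1024**3)
print(f"pinned estimate: {pinned_mb / 1024:.2f} GiB")
```

With the fallback in place, the planner would see ~90 GiB/GPU for gemma and could no longer select a subset containing GPU 0.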

**B — drop `gpu_memory_utilization` from configs.** Always set `kv_cache_memory_bytes` explicitly in `engines.vllm.model_overrides` (already recommended in `config.example.yml:220-233`). Then planner estimate ≡ vLLM reservation. Less invasive, but relies on every operator following the convention.
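Because option B depends on operators following a convention, a small config check could flag violations early. The sketch below assumes `engines.vllm.model_overrides` is a mapping from model name to an options dict; that shape, the helper name `find_unpinned_models`, and the example values are assumptions, not taken from `config.example.yml`.

```python
def find_unpinned_models(config: dict) -> list[str]:
    """List override entries that do not follow the option-B convention."""
    overrides = (config.get("engines", {})
                       .get("vllm", {})
                       .get("model_overrides", {}))
    problems = []
    for model, opts in overrides.items():
        if "kv_cache_memory_bytes" not in opts:
            problems.append(f"{model}: kv_cache_memory_bytes is not set")
        if "gpu_memory_utilization" in opts:
            problems.append(f"{model}: gpu_memory_utilization is still set")
    return problems


# Hypothetical worker config, already parsed into a dict.
example_config = {
    "engines": {
        "vllm": {
            "model_overrides": {
                "google/gemma-3-27b-it": {"gpu_memory_utilization": 0.95},
                "deepseek-ai/DeepSeek-OCR-2": {
                    "kv_cache_memory_bytes": 4 * 1024**3,  # illustrative value
                },
            }
        }
    }
}

problems = find_unpinned_models(example_config)
for line in problems:
    print(line)
```

Such a check could run at worker startup or in CI, so that a missing `kv_cache_memory_bytes` shows up as a config warning rather than as a failed lane.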

Probably do both — A as a defensive fix in the planner, B as a config cleanup on existing worker configs.

## Related — minor

While debugging this, also noticed `sync_logosnode_capabilities` in `dbutils/dbmanager.py:1715-1727` INSERTs into `profile_model_permissions`, a table that was dropped in migration 031/032. The transaction silently rolls back, so worker-announced models never auto-land in `models`. Worth a separate small fix (just delete the dead block). Logged as a footnote here in case it lands in the same area of the code.

## Repro env

- Worker: `ghcr.io/ls1intum/edutelligence/logos-workernode-vllm:latest` (commit d111b290)
- Host: deioma, 3× RTX PRO 6000 Blackwell Max-Q, driver 580.142
- vLLM 0.20.0, CUDA 13.1
