What
On a multi-GPU worker (deioma: 3× RTX PRO 6000 Blackwell Max-Q, ~96 GiB each), a tp=2 lane for google/gemma-3-27b-it fails to start with:
ValueError: Free memory on device cuda:0 (79.12/94.97 GiB) on startup is less
than desired GPU memory utilization (0.95, 90.22 GiB).
…even though GPUs 1 and 2 were both ~empty (~96 GiB free each) and would have hosted gemma fine. The planner placed gemma on GPUs (0,1) where GPU 0 already had ~16 GiB used by a deepseek-ai/DeepSeek-OCR-2 lane.
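The condition in the error message can be restated as a small check. This is only a sketch of the arithmetic the message describes, not vLLM's actual code; the function names are hypothetical.

```python
def startup_budget_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Memory the engine wants to reserve on one GPU: utilization x total."""
    return total_gib * gpu_memory_utilization


def fits_at_startup(free_gib: float, total_gib: float,
                    gpu_memory_utilization: float) -> bool:
    """True if the GPU currently has at least the requested budget free."""
    return free_gib >= startup_budget_gib(total_gib, gpu_memory_utilization)


# Numbers taken from the ValueError above (cuda:0).
budget = startup_budget_gib(94.97, 0.95)
ok = fits_at_startup(79.12, 94.97, 0.95)
print(f"budget={budget:.2f} GiB, free=79.12 GiB, fits={ok}")
```

With the reported values the budget comes out at 90.22 GiB against 79.12 GiB free, so the check fails on cuda:0 regardless of how much room the other GPU in the pair has.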
Why
Two interacting pieces in logos-workernode/logos_worker_node/lane_manager.py:
- _pick_best_gpu_subset (line 1219) does best-fit / tight-pack by leftover = Σ(free − required), with scores sorted ascending (line 1240). This is intentional: it keeps wide free slots for future big models.
- _estimate_lane_vram_mb returns base_residency_mb + kv_cache_memory_bytes (~45 GiB/GPU for gemma). But the actual vLLM reservation is governed by gpu_memory_utilization × per_gpu_total (0.95 × 94.97 GiB ≈ 90 GiB, per the error above). The planner thinks gemma needs 45 GiB/GPU; vLLM tries to grab 90 GiB/GPU.
Concrete numbers for the failing call (DeepSeek-OCR-2 already on cuda:0):
| combo | per-GPU free (GiB) | leftover after 45 GiB est. (GiB) | picked by planner |
|---|---|---|---|
| (0, 1) | 81 / 96 | 36 + 51 = 87 | ✅ (smallest leftover wins) |
| (0, 2) | 81 / 97 | 36 + 52 = 88 | |
| (1, 2) | 96 / 97 | 51 + 52 = 103 | |
Best-fit picks (0,1), which is correct given the 45 GiB estimate. But vLLM then needs ~90 GiB/GPU and GPU 0 has only 81 GiB free, so the lane fails at startup.
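The selection can be reproduced with a few lines of Python. This is a minimal sketch of the best-fit rule as described in this issue (leftover = Σ(free − required), smallest wins), not the real _pick_best_gpu_subset; the per-GPU feasibility filter and the function signature are assumptions made for illustration.

```python
from itertools import combinations


def pick_best_gpu_subset(free_gib, required_gib, tp):
    """Best-fit sketch: smallest total leftover among feasible GPU subsets.

    free_gib maps GPU index -> free memory in GiB. The feasibility filter
    (every GPU in the subset must fit the requirement) is an assumption.
    """
    candidates = []
    for combo in combinations(sorted(free_gib), tp):
        if all(free_gib[g] >= required_gib for g in combo):
            leftover = sum(free_gib[g] - required_gib for g in combo)
            candidates.append((leftover, combo))
    if not candidates:
        return None
    # Ascending by leftover; ties fall back to the lower GPU indices.
    return min(candidates)[1]


# Free memory per GPU from the table (GPU 0 already hosts DeepSeek-OCR-2).
free = {0: 81, 1: 96, 2: 97}

with_estimate = pick_best_gpu_subset(free, required_gib=45, tp=2)
with_reservation = pick_best_gpu_subset(free, required_gib=90, tp=2)
print(with_estimate, with_reservation)
```

Fed the planner's 45 GiB estimate, the sketch picks (0, 1) with a leftover of 87. Fed the ~90 GiB that vLLM actually reserves, GPU 0 drops out and the only feasible pair is (1, 2).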
Fix options
A — reconcile estimate with vLLM's actual reservation. In _estimate_lane_vram_mb, when kv_cache_memory_bytes is unset, fall back to gpu_memory_utilization × per_gpu_total_mb rather than just base_residency_mb. Conservative, matches what vLLM actually grabs.
B — drop gpu_memory_utilization from configs. Always set kv_cache_memory_bytes explicitly in engines.vllm.model_overrides (already recommended in config.example.yml:220-233). Then planner estimate ≡ vLLM reservation. Less invasive, but relies on every operator following the convention.
Probably do both — A as a defensive fix in the planner, B as a config cleanup on existing worker configs.
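Option A could look roughly like the following. This is a sketch under assumptions, not the existing _estimate_lane_vram_mb: the signature, the bytes-to-MB conversion, and the use of None for "unset" are all hypothetical.

```python
from typing import Optional

BYTES_PER_MB = 1024 * 1024


def estimate_lane_vram_mb(
    base_residency_mb: float,
    kv_cache_memory_bytes: Optional[int],
    gpu_memory_utilization: float,
    per_gpu_total_mb: float,
) -> float:
    """Per-GPU VRAM estimate that follows what vLLM will reserve.

    - kv_cache_memory_bytes set: weights plus the explicit KV cache size.
    - kv_cache_memory_bytes unset: vLLM sizes itself from
      gpu_memory_utilization, so use that fraction of the whole GPU.
    """
    if kv_cache_memory_bytes is not None:
        return base_residency_mb + kv_cache_memory_bytes / BYTES_PER_MB
    return gpu_memory_utilization * per_gpu_total_mb


# Unset KV cache size on a 97 GiB GPU with utilization 0.95.
fallback_mb = estimate_lane_vram_mb(
    base_residency_mb=45 * 1024,
    kv_cache_memory_bytes=None,
    gpu_memory_utilization=0.95,
    per_gpu_total_mb=97 * 1024,
)
gpu0_free_mb = 81 * 1024
print(f"estimate={fallback_mb:.0f} MB, GPU 0 free={gpu0_free_mb} MB")
```

With the fallback, the estimate exceeds the 81 GiB free on GPU 0, so a planner that requires each GPU to fit the estimate would no longer consider any pair containing GPU 0. When kv_cache_memory_bytes is set explicitly (option B), the first branch applies and the estimate stays tight.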
Related — minor
While debugging this, also noticed sync_logosnode_capabilities in dbutils/dbmanager.py:1715-1727 INSERTs into profile_model_permissions, a table that was dropped in migration 031/032. The transaction silently rolls back, so worker-announced models never auto-land in models. Worth a separate small fix (just delete the dead block). Logged as a footnote here in case it lands in the same area of the code.
Repro env
- Worker: ghcr.io/ls1intum/edutelligence/logos-workernode-vllm:latest (commit d111b29)
- Host: deioma, 3× RTX PRO 6000 Blackwell Max-Q, driver 580.142
- vLLM 0.20.0, CUDA 13.1