Logos: Capacity planner / vLLM VRAM accounting mismatch on multi-GPU workers #570

@wasnertobias

What

On a multi-GPU worker (deioma: 3× RTX PRO 6000 Blackwell Max-Q, ~96 GiB each), a tp=2 lane for google/gemma-3-27b-it fails to start with:

ValueError: Free memory on device cuda:0 (79.12/94.97 GiB) on startup is less
than desired GPU memory utilization (0.95, 90.22 GiB).

…even though GPUs 1 and 2 were both ~empty (~96 GiB free each) and would have hosted gemma fine. The planner placed gemma on GPUs (0,1) where GPU 0 already had ~16 GiB used by a deepseek-ai/DeepSeek-OCR-2 lane.

Why

Two interacting pieces in logos-workernode/logos_worker_node/lane_manager.py:

  1. _pick_best_gpu_subset (line 1219) does best-fit / tight-pack by leftover = Σ(free − required), score sorted ascending (line 1240). Intentional — keeps wide free slots for future big models.

  2. _estimate_lane_vram_mb returns base_residency_mb + kv_cache_memory_bytes (~45 GiB/GPU for gemma). But the actual vLLM reservation is governed by gpu_memory_utilization × per_gpu_total (0.95 × 94.97 GiB ≈ 90.22 GiB, as reported in the error above). The planner thinks gemma needs 45 GiB/GPU; vLLM tries to grab ~90 GiB/GPU.
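The gap between the two numbers can be sketched with the figures from the error message. This is only an illustration of the arithmetic: the function names below are hypothetical, and the check mirrors what the ValueError text implies rather than vLLM's actual source.

```python
def vllm_requested_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Memory vLLM wants per GPU, as implied by the error message:
    gpu_memory_utilization × total device memory."""
    return total_gib * gpu_memory_utilization


def vllm_startup_ok(free_gib: float, total_gib: float,
                    gpu_memory_utilization: float) -> bool:
    """Hypothetical mirror of the startup check: free memory must cover
    the requested share of the device."""
    return free_gib >= vllm_requested_gib(total_gib, gpu_memory_utilization)


def planner_thinks_it_fits(free_gib: float, estimate_gib: float) -> bool:
    """The planner compares free memory against its own estimate instead."""
    return free_gib >= estimate_gib


# Figures for cuda:0 taken from the error message and the ~45 GiB estimate.
FREE_GIB, TOTAL_GIB, UTILIZATION, ESTIMATE_GIB = 79.12, 94.97, 0.95, 45.0

requested = vllm_requested_gib(TOTAL_GIB, UTILIZATION)
print(f"vLLM requests {requested:.2f} GiB, planner budgets {ESTIMATE_GIB:.2f} GiB")
print("planner fits:", planner_thinks_it_fits(FREE_GIB, ESTIMATE_GIB))
print("vLLM starts: ", vllm_startup_ok(FREE_GIB, TOTAL_GIB, UTILIZATION))
```

The planner's check passes while the vLLM-side check fails for the same GPU, which is exactly the mismatch described above.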

Concrete numbers for the failing call (DeepSeek-OCR-2 already on cuda:0):

| combo | per-GPU free (GiB) | leftover after 45 GiB est. (GiB) | picked by planner |
|---|---|---|---|
| (0, 1) | 81 / 96 | 36 + 51 = 87 | ✅ (smallest leftover wins) |
| (0, 2) | 81 / 97 | 36 + 52 = 88 | |
| (1, 2) | 96 / 97 | 51 + 52 = 103 | |

Given the 45 GiB estimate, best-fit correctly picks (0,1). But vLLM then needs ~90 GiB/GPU, and GPU 0 only has ~81 GiB free (79.12 GiB by vLLM's own measurement at startup), so the lane fails to start.
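A minimal sketch of the best-fit selection, using the free-memory figures from the table, shows how the same scoring picks a different combo once the per-GPU requirement reflects what vLLM actually reserves. The function below is a hypothetical stand-in, not the real _pick_best_gpu_subset, and the 90 GiB requirement is the rounded figure from the error message.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def pick_best_gpu_subset(free_gib: Dict[int, float], required_gib: float,
                         tp: int) -> Optional[Tuple[int, ...]]:
    """Best-fit / tight-pack: among combos where every GPU has enough free
    memory, choose the one with the smallest total leftover."""
    candidates = []
    for combo in combinations(sorted(free_gib), tp):
        if all(free_gib[gpu] >= required_gib for gpu in combo):
            leftover = sum(free_gib[gpu] - required_gib for gpu in combo)
            candidates.append((leftover, combo))
    if not candidates:
        return None  # no combo can host the lane
    # Smallest leftover wins; ties fall back to the lower GPU indices.
    return min(candidates)[1]


# Free memory per GPU (GiB) as seen by the planner in the failing call.
free = {0: 81, 1: 96, 2: 97}

print(pick_best_gpu_subset(free, required_gib=45, tp=2))  # planner's estimate
print(pick_best_gpu_subset(free, required_gib=90, tp=2))  # vLLM's reservation
```

With the 45 GiB estimate the leftovers are 87, 88 and 103, so (0, 1) wins. With 90 GiB, GPU 0 is filtered out and only (1, 2) remains, which is the placement that would have worked.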

Fix options

A — reconcile estimate with vLLM's actual reservation. In _estimate_lane_vram_mb, when kv_cache_memory_bytes is unset, fall back to gpu_memory_utilization × per_gpu_total_mb rather than just base_residency_mb. Conservative, matches what vLLM actually grabs.
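Option A could look roughly like the sketch below. It assumes the estimate is computed from four inputs with these names. The real _estimate_lane_vram_mb signature is not shown in this issue, so the parameter list is hypothetical, and the example figures are made up for illustration.

```python
from typing import Optional

MIB = 1024 ** 2


def estimate_lane_vram_mb(base_residency_mb: float,
                          per_gpu_total_mb: float,
                          gpu_memory_utilization: float,
                          kv_cache_memory_bytes: Optional[int]) -> float:
    """Per-GPU VRAM estimate (MiB) that tracks what vLLM will reserve.

    Hypothetical signature; only the fallback rule comes from the issue.
    """
    if kv_cache_memory_bytes is not None:
        # KV cache is pinned explicitly: weights/overhead plus the cache.
        return base_residency_mb + kv_cache_memory_bytes / MIB
    # KV cache is unset: vLLM sizes itself from gpu_memory_utilization,
    # so budget that share of the whole device instead of the base alone.
    return gpu_memory_utilization * per_gpu_total_mb


# Illustrative values only (not measured on deioma).
pinned = estimate_lane_vram_mb(10 * 1024, 100_000, 0.95, 35 * 1024 ** 3)
unpinned = estimate_lane_vram_mb(10 * 1024, 100_000, 0.95, None)
print(f"pinned: {pinned:.0f} MiB, unpinned: {unpinned:.0f} MiB")
```

The fallback is deliberately conservative: it may over-reserve for small models, but it never lets the planner place a lane on a GPU where vLLM's startup check would fail.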

B — drop gpu_memory_utilization from configs. Always set kv_cache_memory_bytes explicitly in engines.vllm.model_overrides (already recommended in config.example.yml:220-233). Then planner estimate ≡ vLLM reservation. Less invasive, but relies on every operator following the convention.
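Since option B depends on every operator following the convention, a small config check could flag overrides that break it. The sketch below assumes engines.vllm.model_overrides has been loaded into a dict mapping model name to an override dict; that shape and the function name are assumptions, not taken from the repository.

```python
from typing import Any, Dict, List


def find_unpinned_overrides(model_overrides: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return one message per override that leaves vLLM's reservation
    outside the planner's control."""
    problems = []
    for model, override in model_overrides.items():
        if override.get("kv_cache_memory_bytes") is None:
            problems.append(f"{model}: kv_cache_memory_bytes is not set")
        if "gpu_memory_utilization" in override:
            problems.append(f"{model}: gpu_memory_utilization should be dropped")
    return problems


# Hypothetical overrides; the byte value is illustrative.
overrides = {
    "google/gemma-3-27b-it": {"gpu_memory_utilization": 0.95},
    "deepseek-ai/DeepSeek-OCR-2": {"kv_cache_memory_bytes": 8 * 1024 ** 3},
}

for problem in find_unpinned_overrides(overrides):
    print(problem)
```

Running a check like this at worker startup, or in CI against the deployed configs, would surface violations before a lane fails to start.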

Probably do both — A as a defensive fix in the planner, B as a config cleanup on existing worker configs.

Related — minor

While debugging this, also noticed sync_logosnode_capabilities in dbutils/dbmanager.py:1715-1727 INSERTs into profile_model_permissions, a table that was dropped in migration 031/032. The transaction silently rolls back, so worker-announced models never auto-land in models. Worth a separate small fix (just delete the dead block). Logged as a footnote here in case it lands in the same area of the code.

Repro env

  • Worker: ghcr.io/ls1intum/edutelligence/logos-workernode-vllm:latest (commit d111b29)
  • Host: deioma, 3× RTX PRO 6000 Blackwell Max-Q, driver 580.142
  • vLLM 0.20.0, CUDA 13.1
