Logos: Use remaining GPUs during calibration on heterogeneous nodes #592

@wasnertobias

Problem

Calibration today acquires the whole worker node for its run, gated by the orchestrator's _provider_has_active_requests check. On nodes with a non-power-of-two GPU count (e.g. 3 GPUs), the tensor-parallel size is rounded down to the largest power of two (tp=2), so the leftover GPU(s) sit idle for the entire calibration window — often several minutes per model.
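To make the arithmetic concrete, here is a minimal sketch of the rounding described above. The helper names are hypothetical and not taken from the planner; the only assumption is that tp is the largest power of two not exceeding the node's GPU count.

```python
def largest_power_of_two_at_most(n: int) -> int:
    """Return the largest power of two that is <= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("GPU count must be at least 1")
    return 1 << (n.bit_length() - 1)


def idle_gpus_during_calibration(gpu_count: int) -> int:
    """GPUs left unused when calibration runs at the rounded-down tp."""
    return gpu_count - largest_power_of_two_at_most(gpu_count)


# Print the split for a few node sizes, including the 3-GPU case.
for gpus in (2, 3, 4, 6):
    tp = largest_power_of_two_at_most(gpus)
    print(f"{gpus} GPUs: tp={tp}, idle={idle_gpus_during_calibration(gpus)}")
```

On a 3-GPU node this gives tp=2 with one idle GPU, which is the case motivating this issue.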

Background discussion: this came up while implementing the model-level unsupported-list and node-health features. The trigger was the realization that 1 of 3 cards on some workers contributes zero throughput during the nightly maintenance window.

Proposed change

Allow production lanes to keep serving requests on GPUs the calibration isn't using:

  1. Pin calibration to a specific gpu_devices subset at plan time (today most plans leave it as "all", which makes the worker target every GPU). Pick the largest power-of-two slice — typically GPUs 0..tp-1.
  2. Teach the lane manager to refuse new lanes that would touch the calibration's GPU subset, but allow lanes on the leftover GPU(s).
  3. Tighten VRAM measurement to only sum over the calibration's GPUs. sample_vram_mb(gpu_indices) already respects this; the gap is making sure the calibration plan passes a concrete gpu_devices value (not blank/all) so the measurement set is well-defined. Verify and add a regression test.
  4. Relax _provider_has_active_requests so it only counts requests on the calibration's GPU subset — requests on the leftover GPUs no longer block scheduling.
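Steps 1, 2 and 4 can be sketched together as below. This is an illustration under stated assumptions, not the orchestrator's actual code: the Lane class and all three function names are hypothetical, and the real _provider_has_active_requests may take different arguments.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Lane:
    """Hypothetical stand-in for a production lane."""
    gpu_devices: frozenset[int]
    active_requests: int = 0


def pin_calibration_gpus(gpu_count: int) -> frozenset[int]:
    """Step 1: pick GPUs 0..tp-1, with tp the largest power of two <= gpu_count."""
    if gpu_count < 1:
        raise ValueError("GPU count must be at least 1")
    tp = 1 << (gpu_count.bit_length() - 1)
    return frozenset(range(tp))


def lane_allowed(lane_gpus: Iterable[int], calibration_gpus: frozenset[int]) -> bool:
    """Step 2: refuse a new lane that overlaps the calibration's GPU subset."""
    return calibration_gpus.isdisjoint(lane_gpus)


def has_active_requests_on(lanes: Iterable[Lane], calibration_gpus: frozenset[int]) -> bool:
    """Step 4: only lanes touching the calibration subset count as busy."""
    return any(
        lane.active_requests > 0 and not calibration_gpus.isdisjoint(lane.gpu_devices)
        for lane in lanes
    )


calibration_gpus = pin_calibration_gpus(3)  # GPUs 0 and 1 on a 3-GPU node
leftover_lane = Lane(frozenset({2}), active_requests=5)
print(lane_allowed([2], calibration_gpus))
print(has_active_requests_on([leftover_lane], calibration_gpus))
```

Representing the GPU subset as a set keeps both checks down to a disjointness test, so a lane spanning GPUs 1 and 2 is refused while a lane on GPU 2 alone is admitted and does not block calibration.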

Open design choices

  • TP fallback during calibration: today the planner has a tp-escalation/fallback path. If calibration is using GPUs 0–1 and a request needs Qwen3-Embedding-8B (whose calibrated profile is tp=2), do we let it spawn at tp=1 on GPU 2 as a fallback, or just queue/reject?
    • Recommended: leave the planner's existing tp-fallback alone. Only models whose smallest workable tp fits on the leftover GPU(s) get served during calibration. Everything else queues.
  • NCCL topology: workers without NVLink already set NCCL_P2P_DISABLE=1 (visible in calibration logs). A concurrent vLLM instance on the leftover GPU shouldn't interact with the calibration's NCCL ring, but this is worth verifying on a real 3-GPU host before enabling the feature by default.
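The recommended answer to the TP-fallback question above reduces to a single comparison. The sketch below assumes the planner can report a model's smallest workable tp; Decision, decide_during_calibration and min_workable_tp are hypothetical names, and the values in the usage lines are illustrative rather than real model profiles.

```python
from enum import Enum


class Decision(Enum):
    SERVE = "serve"
    QUEUE = "queue"


def decide_during_calibration(min_workable_tp: int, leftover_gpu_count: int) -> Decision:
    """Serve only if the model's smallest workable tp fits on the leftover GPUs."""
    if 1 <= min_workable_tp <= leftover_gpu_count:
        return Decision.SERVE
    return Decision.QUEUE


# One leftover GPU: a model that can run at tp=1 is served, one needing tp=2 queues.
print(decide_during_calibration(min_workable_tp=1, leftover_gpu_count=1))
print(decide_during_calibration(min_workable_tp=2, leftover_gpu_count=1))
```

Whether a given model's smallest workable tp is lower than its calibrated tp is left to the existing fallback logic; this check only consumes the result.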

Risks

  • VRAM baseline contamination: if the calibration plan doesn't pin gpu_devices, query_gpu_vram(None) sums all GPUs and the production lane's footprint corrupts the calibration's measurement (silently). The pinning step (1) is load-bearing.
  • Power draw / thermals: running calibration and production simultaneously on the same chassis pushes closer to PSU and thermal limits than calibration alone in the nightly window. Worth monitoring when first enabled.

Out of scope

  • Cross-node coordination (still one calibration per worker at a time).
  • Reordering calibration to prefer the largest workable tp first vs. last (separate optimization).

Acceptance criteria

  • A 3-GPU node calibrating at tp=2 can simultaneously serve requests on the remaining GPU.
  • Calibration's base_residency_mb measurement is unaffected by what runs on the leftover GPU(s) — regression test that fakes a concurrent lane.
  • _provider_has_active_requests no longer treats lanes on leftover GPUs as "busy" for the purpose of calibration eligibility.
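A rough shape for the regression test in the second criterion, assuming the VRAM sampler sums per-GPU usage over the indices it is given and sums every GPU when given None. The function sum_vram_mb is a hypothetical stand-in, not the real sample_vram_mb or query_gpu_vram, and the megabyte figures are made up for the test.

```python
from typing import Mapping, Optional, Sequence


def sum_vram_mb(used_mb_by_gpu: Mapping[int, int],
                gpu_indices: Optional[Sequence[int]]) -> int:
    """Hypothetical stand-in for the VRAM sampler: None means all GPUs."""
    if gpu_indices is None:
        return sum(used_mb_by_gpu.values())
    return sum(used_mb_by_gpu[i] for i in gpu_indices)


def test_baseline_ignores_leftover_gpu_lane() -> None:
    calibration_gpus = [0, 1]
    idle_node = {0: 500, 1: 500, 2: 500}
    # Fake a concurrent production lane occupying the leftover GPU 2.
    with_lane = {**idle_node, 2: 18_000}

    # Pinned measurement: unchanged by the concurrent lane.
    assert sum_vram_mb(with_lane, calibration_gpus) == sum_vram_mb(idle_node, calibration_gpus)
    # Unpinned measurement: contaminated by the lane's footprint.
    assert sum_vram_mb(with_lane, None) > sum_vram_mb(idle_node, None)


test_baseline_ignores_leftover_gpu_lane()
```

The second assertion documents the failure mode from the Risks section: without a pinned gpu_devices value, the leftover lane's footprint leaks into the baseline.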

🤖 Filed via Claude Code based on a discussion during feature-#3 implementation. Original session context: deioma 2026-06-04 incident + heterogeneous-GPU optimization brainstorm.
