Problem
Calibration today acquires the whole worker node for its run, gated by the orchestrator's _provider_has_active_requests check. On nodes with a non-power-of-two GPU count (e.g. 3 GPUs), the tensor-parallel size is rounded down to the largest power of two (tp=2), so the leftover GPU(s) sit idle for the entire calibration window — often several minutes per model.
Background discussion: this came up while implementing the model-level unsupported-list and node-health features. The trigger was the realization that 1 of 3 cards on some workers contributes zero throughput during the nightly maintenance window.
Proposed change
Allow production lanes to keep serving requests on GPUs the calibration isn't using:
- Pin calibration to a specific gpu_devices subset at plan time (today most plans leave it as "all", which makes the worker target every GPU). Pick the largest power-of-two slice, typically GPUs 0..tp-1.
- Teach the lane manager to refuse new lanes that would touch the calibration's GPU subset, but allow lanes on the leftover GPU(s).
- Tighten VRAM measurement to only sum over the calibration's GPUs. sample_vram_mb(gpu_indices) already respects this; the gap is making sure the calibration plan passes a concrete gpu_devices value (not blank/all) so the measurement set is well-defined. Verify and add a regression test.
- Relax _provider_has_active_requests so it only counts requests on the calibration's GPU subset, so that requests on the leftover GPUs no longer block scheduling.
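The four steps above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: the function names (largest_pow2_tp, calibration_gpu_subset, lane_allowed, has_blocking_requests) and the representation of lanes and requests as tuples of GPU indices are assumptions made for the example.

```python
def largest_pow2_tp(gpu_count: int) -> int:
    """Largest power of two that is <= gpu_count (e.g. 3 -> 2)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be >= 1")
    return 1 << (gpu_count.bit_length() - 1)


def calibration_gpu_subset(gpu_count: int) -> tuple:
    """Concrete gpu_devices for the plan: GPUs 0..tp-1 instead of "all"."""
    return tuple(range(largest_pow2_tp(gpu_count)))


def lane_allowed(lane_gpus, calibration_gpus) -> bool:
    """Hypothetical lane-manager check: refuse lanes touching calibration GPUs."""
    return set(lane_gpus).isdisjoint(calibration_gpus)


def has_blocking_requests(active_request_gpus, calibration_gpus) -> bool:
    """Hypothetical relaxed busy check.

    active_request_gpus holds one tuple of GPU indices per in-flight request.
    Only requests that overlap the calibration subset count as blocking.
    """
    cal = set(calibration_gpus)
    return any(not cal.isdisjoint(gpus) for gpus in active_request_gpus)


subset = calibration_gpu_subset(3)          # (0, 1) on a 3-GPU node
print(subset)
print(lane_allowed((2,), subset))           # lane on the leftover GPU
print(lane_allowed((1, 2), subset))         # lane overlapping calibration
print(has_blocking_requests([(2,)], subset))
```

On a 3-GPU node this pins calibration to GPUs 0 and 1, admits a lane on GPU 2, rejects a lane spanning GPUs 1 and 2, and does not treat a request running on GPU 2 as blocking.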
Open design choices
- TP fallback during calibration: today the planner has a tp-escalation/fallback path. If calibration is using GPUs 0–1 and a request needs Qwen3-Embedding-8B (whose calibrated profile is tp=2), do we let it spawn at tp=1 on GPU 2 as a fallback, or just queue/reject?
- Recommended: leave the planner's existing tp-fallback alone. Only models whose smallest workable tp fits on the leftover GPU(s) get served during calibration. Everything else queues.
- NCCL topology: workers without NVLink already set NCCL_P2P_DISABLE=1 (visible in calibration logs). Concurrent vLLM on the leftover GPU shouldn't interact with the calibration's NCCL ring, but this is worth verifying on a real 3-GPU host before flipping the default on.
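The recommended answer to the TP-fallback question amounts to a small admission rule. The sketch below assumes a hypothetical helper, tp_for_leftover_gpus, and assumes the planner can supply the list of tp sizes it considers workable for a model; neither is an existing API.

```python
def tp_for_leftover_gpus(workable_tps, leftover_gpu_count):
    """Return the tp to spawn with during calibration, or None to queue.

    workable_tps: tp sizes the planner already considers workable for the
    model (assumed input; the planner's own fallback logic is untouched).
    leftover_gpu_count: GPUs not claimed by the running calibration.
    """
    if not workable_tps:
        return None
    smallest = min(workable_tps)
    return smallest if smallest <= leftover_gpu_count else None


# One leftover GPU while calibration holds GPUs 0-1:
print(tp_for_leftover_gpus([1, 2], 1))  # model that also works at tp=1
print(tp_for_leftover_gpus([2], 1))     # model that needs tp=2
```

A model that is workable at tp=1 is served on the leftover GPU; a model whose smallest workable tp is 2 gets None and queues until calibration finishes.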
Risks
- VRAM baseline contamination: if the calibration plan doesn't pin gpu_devices, query_gpu_vram(None) sums all GPUs and the production lane's footprint silently corrupts the calibration's measurement. The pinning step (the first item under Proposed change) is load-bearing.
- Power draw / thermals: running calibration + production simultaneously on the same chassis hits PSU/thermal limits harder than nightly-window-only calibration. Worth a watch when first enabled.
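The contamination risk is easy to see with made-up numbers. In the sketch below, sum_used_vram_mb is a hypothetical stand-in for the real VRAM query, taking a per-GPU usage mapping instead of calling the driver; the figures are illustrative only. A regression test could follow the same shape by faking a concurrent lane on the leftover GPU.

```python
def sum_used_vram_mb(per_gpu_used_mb, gpu_indices=None):
    """Sum used VRAM over gpu_indices, or over every GPU when None."""
    if gpu_indices is None:
        return sum(per_gpu_used_mb.values())
    return sum(per_gpu_used_mb[i] for i in gpu_indices)


# Illustrative per-GPU usage in MB: GPU 2 is the leftover GPU.
idle_node = {0: 1200, 1: 1180, 2: 0}
with_lane = {0: 1200, 1: 1180, 2: 9000}  # faked production lane on GPU 2

calibration_gpus = (0, 1)

# Pinned measurement: identical with or without the concurrent lane.
print(sum_used_vram_mb(idle_node, calibration_gpus))
print(sum_used_vram_mb(with_lane, calibration_gpus))

# Unpinned measurement: the lane's footprint leaks into the baseline.
print(sum_used_vram_mb(with_lane, None))
```

With the subset pinned, both measurements come out the same; with None, the lane's 9000 MB is folded into the baseline.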
Out of scope
- Cross-node coordination (still one calibration per worker at a time).
- Reordering calibration to prefer the largest workable tp first vs. last (separate optimization).
Acceptance criteria
- A 3-GPU node calibrating at tp=2 can simultaneously serve requests on the remaining GPU.
- Calibration's base_residency_mb measurement is unaffected by what runs on the leftover GPU(s), verified by a regression test that fakes a concurrent lane.
- _provider_has_active_requests no longer treats lanes on leftover GPUs as "busy" for the purpose of calibration eligibility.
🤖 Filed via Claude Code based on a discussion during feature-#3 implementation. Original session context: deioma 2026-06-04 incident + heterogeneous-GPU optimization brainstorm.