`Logos`: Use remaining GPUs during calibration on heterogeneous nodes

## Problem

Calibration today acquires the whole worker node for its run, gated by the orchestrator's `_provider_has_active_requests` check. On nodes with a non-power-of-two GPU count (e.g. 3 GPUs), the tensor-parallel size is rounded down to the largest power of two (tp=2), so the leftover GPU(s) sit idle for the entire calibration window — often several minutes per model.
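To make the rounding concrete, here is a minimal sketch of how a node's GPUs split into a calibration slice and a leftover set. `split_gpus` is a hypothetical helper for illustration, not an existing Logos function, and it assumes the calibration slice always starts at GPU 0.

```python
def split_gpus(gpu_count: int) -> tuple[list[int], list[int]]:
    """Return (calibration_gpus, leftover_gpus) for a node with gpu_count GPUs.

    Hypothetical helper: assumes tp is the largest power of two that fits
    and that calibration takes GPUs 0..tp-1.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be >= 1")
    # Largest power of two <= gpu_count, e.g. 3 -> 2, 6 -> 4, 8 -> 8.
    tp = 1 << (gpu_count.bit_length() - 1)
    return list(range(tp)), list(range(tp, gpu_count))


calibration_gpus, leftover_gpus = split_gpus(3)
print(calibration_gpus, leftover_gpus)
```

On a 3-GPU node this yields GPUs 0 and 1 for calibration and GPU 2 as the idle card the issue is about.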

Background discussion: this came up while implementing the model-level unsupported-list and node-health features. The trigger was the realization that 1 of 3 cards on some workers contributes zero throughput during the nightly maintenance window.

## Proposed change

Allow production lanes to keep serving requests on GPUs the calibration isn't using:

1. **Pin calibration to a specific `gpu_devices` subset** at plan time (today most plans leave it as `"all"`, which makes the worker target every GPU). Pick the largest power-of-two slice — typically GPUs 0..tp-1.
2. **Teach the lane manager** to refuse new lanes that would touch the calibration's GPU subset, but allow lanes on the leftover GPU(s).
3. **Tighten VRAM measurement** to only sum over the calibration's GPUs. `sample_vram_mb(gpu_indices)` already respects this; the gap is making sure the calibration plan passes a concrete `gpu_devices` value (not blank/all) so the measurement set is well-defined. Verify and add a regression test.
4. **Relax `_provider_has_active_requests`** so it counts only requests on the calibration's GPU subset. Requests on the leftover GPUs no longer block calibration from being scheduled.
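Steps 2 and 4 both reduce to a set-overlap test against the calibration's pinned GPUs. The sketch below illustrates that idea only; `Lane`, `can_admit_lane`, and `has_blocking_requests` are hypothetical stand-ins, and the real lane manager and `_provider_has_active_requests` may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Lane:
    """Hypothetical view of a production lane (not the real Logos type)."""
    model: str
    gpu_devices: frozenset
    active_requests: int = 0


def can_admit_lane(lane_gpus: frozenset, calibration_gpus: Optional[frozenset]) -> bool:
    """Step 2: refuse lanes that would touch the calibration's GPU subset."""
    if calibration_gpus is None:  # no calibration running on this worker
        return True
    return lane_gpus.isdisjoint(calibration_gpus)


def has_blocking_requests(lanes: Iterable[Lane], calibration_gpus: frozenset) -> bool:
    """Step 4: only requests on the calibration's GPUs count as 'busy'."""
    return any(
        lane.active_requests > 0 and not lane.gpu_devices.isdisjoint(calibration_gpus)
        for lane in lanes
    )


calib = frozenset({0, 1})
leftover_lane = Lane("small-model", frozenset({2}), active_requests=3)
print(can_admit_lane(frozenset({2}), calib))        # lane on the leftover GPU
print(can_admit_lane(frozenset({1, 2}), calib))     # lane overlapping calibration
print(has_blocking_requests([leftover_lane], calib))
```

Keeping both checks keyed on the same `calibration_gpus` value is one reason the pinning in step 1 has to produce a concrete device list rather than `"all"`.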

## Open design choices

- **TP fallback during calibration**: today the planner has a tp-escalation/fallback path. If calibration is using GPUs 0–1 and a request needs Qwen3-Embedding-8B (whose calibrated profile is tp=2), do we let it spawn at tp=1 on GPU 2 as a fallback, or just queue/reject?
  - Recommended: leave the planner's existing tp-fallback alone. Only models whose smallest workable tp fits on the leftover GPU(s) get served during calibration. Everything else queues.
- **NCCL topology**: workers without NVLink already set `NCCL_P2P_DISABLE=1` (visible in calibration logs). A concurrent vLLM instance on the leftover GPU shouldn't interact with the calibration's NCCL ring, but this is worth verifying on a real 3-GPU host before enabling the behavior by default.
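The recommended TP-fallback behavior can be sketched as a simple placement rule: serve a model during calibration only if its smallest workable tp fits on the leftover GPUs, and queue it otherwise. `leftover_placement` and the `min_workable_tp` parameter are hypothetical names for illustration; how the planner actually determines a model's smallest workable tp is not specified here.

```python
from typing import Optional


def leftover_placement(min_workable_tp: int, leftover_gpus: list[int]) -> Optional[list[int]]:
    """Return the GPUs to serve on during calibration, or None to queue.

    Hypothetical rule: no new fallback logic, just a fit check against
    the GPUs the calibration is not using.
    """
    if min_workable_tp < 1:
        raise ValueError("min_workable_tp must be >= 1")
    if min_workable_tp > len(leftover_gpus):
        return None  # does not fit on the leftover GPUs: queue until calibration ends
    return leftover_gpus[:min_workable_tp]


# 3-GPU node calibrating on GPUs 0-1, so only GPU 2 is left over.
print(leftover_placement(1, [2]))  # a tp=1 model can be served on GPU 2
print(leftover_placement(2, [2]))  # a model needing tp=2 queues
```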

## Risks

- **VRAM baseline contamination**: if the calibration plan doesn't pin `gpu_devices`, `query_gpu_vram(None)` sums all GPUs and the production lane's footprint silently corrupts the calibration's measurement. The pinning step (1) is load-bearing.
- **Power draw / thermals**: running calibration + production simultaneously on the same chassis hits PSU/thermal limits harder than nightly-window-only calibration. Worth watching when first enabled.
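The VRAM contamination risk is easy to see with made-up numbers. The sketch below assumes the measurement is a before/after difference of summed per-GPU usage; `sum_vram_mb` is a hypothetical stand-in for the real sampling helpers, and every figure is illustrative rather than measured.

```python
from typing import Mapping, Optional, Sequence


def sum_vram_mb(used_mb_by_gpu: Mapping[int, int],
                gpu_indices: Optional[Sequence[int]]) -> int:
    """Sum used VRAM over gpu_indices, or over every GPU when None."""
    if gpu_indices is None:
        return sum(used_mb_by_gpu.values())
    return sum(used_mb_by_gpu[i] for i in gpu_indices)


# Illustrative samples: calibration loads a model on GPUs 0-1 while a
# production lane happens to start on GPU 2 during the same window.
before = {0: 500, 1: 500, 2: 500}
after = {0: 20500, 1: 20500, 2: 14500}

pinned_delta = sum_vram_mb(after, [0, 1]) - sum_vram_mb(before, [0, 1])
unpinned_delta = sum_vram_mb(after, None) - sum_vram_mb(before, None)
print(pinned_delta)    # 40000: only the calibration's own footprint
print(unpinned_delta)  # 54000: includes 14000 MB from the production lane
```

With a pinned device list the production lane's 14000 MB never enters the sum, which is the property the regression test in the acceptance criteria would need to lock in.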

## Out of scope

- Cross-node coordination (still one calibration per worker at a time).
- Reordering calibration to prefer the largest workable tp first vs. last (separate optimization).

## Acceptance criteria

- A 3-GPU node calibrating at tp=2 can simultaneously serve requests on the remaining GPU.
- Calibration's `base_residency_mb` measurement is unaffected by what runs on the leftover GPU(s), verified by a regression test that fakes a concurrent lane.
- `_provider_has_active_requests` no longer treats lanes on leftover GPUs as "busy" for the purpose of calibration eligibility.

---

🤖 Filed via Claude Code based on a discussion during feature-#3 implementation. Original session context: deioma 2026-06-04 incident + heterogeneous-GPU optimization brainstorm.
