
[CB] Sum GPU memory across devices for multi-GPU KV cache validation#3496

Open
ambeckley wants to merge 1 commit into openvinotoolkit:master from ambeckley:fix/multi-gpu-kv-cache-memory-sum

Conversation

@ambeckley

Summary

When using HETERO pipeline-parallel with multiple GPUs (e.g., HETERO:GPU.0,GPU.1), the KV cache memory validation in pipeline_impl.cpp queries only the first execution device via get_available_gpu_memory(execution_devices[0]). As a result, cache_size validation sees a single GPU's memory instead of the combined total across all GPUs.

This fix uses the all_gpu_device variable (introduced in PR #2227 for pipeline-parallel support) to detect multi-GPU configurations and sum available memory across all execution devices.

Changes

  • src/cpp/src/continuous_batching/pipeline_impl.cpp: When all_gpu_device && execution_devices.size() > 1, loop over all execution devices and sum get_available_gpu_memory() results into total_mem_size. Single-GPU and CPU paths are unchanged.

Example

With 2x Intel Arc Pro B50 (16GB each) and cache_size=7 (GB):

  • Before: checks only GPU.0 → 16GB available (works by luck for small cache sizes, but cache_size=14 would fail despite 32GB total)
  • After: sums GPU.0 + GPU.1 → 32GB available (correct)

Test environment

  • Windows 10 Pro, 2x Intel Arc Pro B50 (16GB VRAM each)
  • Driver: 32.0.101.8314
  • OVMS 2026.0 with --target_device "HETERO:GPU.1,GPU.0" --model_distribution_policy "PIPELINE_PARALLEL" --cache_size 7
  • Model: Qwen2.5-7B-Instruct INT8 symmetric

Commit message

When using HETERO pipeline-parallel with multiple GPUs (e.g.,
HETERO:GPU.0,GPU.1), the KV cache memory validation only queries
the first execution device via get_available_gpu_memory(). This
causes cache_size validation to undercount available memory, since
the model's KV cache is distributed across all GPUs.

Fix: When all_gpu_device is true and multiple execution devices
are present, loop over all devices and sum their available memory.
This uses the all_gpu_device variable introduced in PR openvinotoolkit#2227
(pipeline-parallel support) which was not previously used for
memory aggregation.

Example with 2x Intel Arc Pro B50 (16GB each), cache_size=7:
  Before: checks only GPU.0 (16GB available) — works by luck
  Before: with cache_size=14 — fails validation despite 32GB total
  After:  sums GPU.0 + GPU.1 (32GB available) — correct

Single-GPU and CPU paths are unchanged.

Tested on Windows 10 with 2x Intel Arc Pro B50 GPUs (driver
32.0.101.8314) using OVMS 2026.0 with HETERO:GPU.1,GPU.0
and PIPELINE_PARALLEL model distribution.

Signed-off-by: Aaron Beckley <ambeckley@users.noreply.github.com>
@ambeckley ambeckley requested a review from popovaan as a code owner March 14, 2026 20:11
@github-actions github-actions bot added the category: continuous batching Continuous batching label Mar 14, 2026
