[CB] Sum GPU memory across devices for multi-GPU KV cache validation #3496
Open
ambeckley wants to merge 1 commit into openvinotoolkit:master from
Conversation
When using HETERO pipeline-parallel with multiple GPUs (e.g., HETERO:GPU.0,GPU.1), the KV cache memory validation only queries the first execution device via get_available_gpu_memory(). This causes cache_size validation to undercount available memory, since the model's KV cache is distributed across all GPUs.

Fix: when all_gpu_device is true and multiple execution devices are present, loop over all devices and sum their available memory. This uses the all_gpu_device variable introduced in PR openvinotoolkit#2227 (pipeline-parallel support), which was not previously used for memory aggregation.

Example with 2x Intel Arc Pro B50 (16GB each), cache_size=7:

- Before: checks only GPU.0 (16GB available), which works by luck
- Before: with cache_size=14, validation fails despite 32GB total
- After: sums GPU.0 + GPU.1 (32GB available), which is correct

Single-GPU and CPU paths are unchanged. Tested on Windows 10 with 2x Intel Arc Pro B50 GPUs (driver 32.0.101.8314) using OVMS 2026.0 with HETERO:GPU.1,GPU.0 and PIPELINE_PARALLEL model distribution.

Signed-off-by: Aaron Beckley <ambeckley@users.noreply.github.com>
Summary
When using HETERO pipeline-parallel with multiple GPUs (e.g., `HETERO:GPU.0,GPU.1`), the KV cache memory validation in `pipeline_impl.cpp` only queries the first execution device via `get_available_gpu_memory(execution_devices[0])`. This causes `cache_size` validation to see only one GPU's memory instead of the combined total across all GPUs.

This fix uses the `all_gpu_device` variable (introduced in PR #2227 for pipeline-parallel support) to detect multi-GPU configurations and sum available memory across all execution devices.

Changes
`src/cpp/src/continuous_batching/pipeline_impl.cpp`: when `all_gpu_device && execution_devices.size() > 1`, loop over all execution devices and sum the `get_available_gpu_memory()` results into `total_mem_size`. Single-GPU and CPU paths are unchanged.

Example
With 2x Intel Arc Pro B50 (16GB each) and `cache_size=7`:

- Before: only `GPU.0` is queried, 16GB available (works by luck for small cache sizes, but `cache_size=14` would fail despite 32GB total)
- After: `GPU.0 + GPU.1` are summed, 32GB available (correct)

Test environment
`--target_device "HETERO:GPU.1,GPU.0" --model_distribution_policy "PIPELINE_PARALLEL" --cache_size 7`

Related
- `all_gpu_device` detection and pipeline-parallel support (PR #2227)
- `get_available_gpu_memory()` utility function