fix(visualizer): match Vulkan physical device to CUDA device on multi-GPU systems by iliesaya · Pull Request #1315 · MrNeRF/LichtFeld-Studio

iliesaya · 2026-06-17T13:34:33Z

Summary

On multi-GPU systems, starting GUI training crashes immediately with a misleading
"CUDA out of memory" error, even with tens of GB of VRAM free. The root cause is
that the Vulkan viewer and the CUDA trainer can end up on different physical GPUs,
which breaks the CUDA↔Vulkan zero-copy interop.

Root cause

VulkanContext::pickPhysicalDevice() selects the first discrete GPU in Vulkan's
enumeration order and never reconciles it with the GPU that CUDA uses (device 0). On
machines with two GPUs, especially two identical cards, Vulkan's enumeration order
does not necessarily match CUDA's, so the viewer can initialize on one card and the
trainer on the other.

When training starts, the exportable-interop allocator exports a CUDA VMM block on the
CUDA device and tries to import it into Vulkan on the other card:

exportable_storage.cpp Exportable CUDA block: device_ptr=0x... committed=1184 MiB
vulkan_context.cpp Vulkan: vkAllocateMemory(import) failed: VK_ERROR_OUT_OF_DEVICE_MEMORY
training_manager.cpp Exportable-interop allocator failed (...); falling back to legacy Vulkan-external allocator
pinned_memory_allocator.cpp cudaEventQuery failed: an illegal memory access was encountered
tensor.cpp cudaErrorIllegalAddress: an illegal memory access was encountered
training_manager.cpp Failed to initialize SplatData: CUDA out of memory: failed to allocate 21033300 bytes (0.02 GB).

The import fails with VK_ERROR_OUT_OF_DEVICE_MEMORY; the legacy fallback then performs
cross-device CUDA work that raises cudaErrorIllegalAddress, which poisons the CUDA
context so every subsequent allocation fails. The allocator surfaces those failures as
"out of memory" — hence the misleading message on a nearly empty GPU.

Notably, verifyCudaMatchesVulkanDevice() in cuda_vulkan_interop.cpp already detects
this exact mismatch and suggests setting CUDA_VISIBLE_DEVICES, confirming the two APIs
can diverge — this PR makes the selection align automatically.
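
When diagnosing such a mismatch by hand, it can help to print the Vulkan and CUDA UUIDs in one textual form and compare them with the identifiers listed by `nvidia-smi -L`. The following is a minimal sketch, not code from this PR: `formatGpuUuid` is a hypothetical helper that renders 16 raw UUID bytes in the 8-4-4-4-12 hex layout with a `GPU-` prefix. It assumes the bytes come from `VkPhysicalDeviceIDProperties::deviceUUID` or `cudaDeviceProp::uuid.bytes`, and that the raw byte order corresponds to the order `nvidia-smi` prints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the PR): formats 16 raw UUID bytes as
// "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" using lowercase hex.
std::string formatGpuUuid(const unsigned char* bytes) {
    assert(bytes != nullptr);
    std::string out = "GPU-";
    char hex[3]; // two hex digits plus the terminating NUL
    for (std::size_t i = 0; i < 16; ++i) {
        // Group boundaries after 4, 6, 8 and 10 bytes give the 8-4-4-4-12 layout.
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out += '-';
        }
        std::snprintf(hex, sizeof(hex), "%02x", static_cast<unsigned>(bytes[i]));
        out += hex;
    }
    return out;
}
```

For the CUDA side, the `char` array would need a cast such as `reinterpret_cast<const unsigned char*>(props.uuid.bytes)` before being passed in; two devices are the same physical GPU when the two formatted strings are identical.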

Fix

In pickPhysicalDevice(), prefer the discrete GPU whose UUID matches CUDA device 0
(via VkPhysicalDeviceIDProperties::deviceUUID vs cudaDeviceProp::uuid). If no match
is found, fall back to the previous "first discrete GPU" behavior, so single-GPU
systems are completely unaffected.
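
The selection policy described above can be sketched without any GPU APIs. This is an illustration under stated assumptions, not the code in the PR: `GpuInfo`, `Uuid`, and `pickDeviceIndex` are hypothetical stand-ins, the CUDA UUID is passed in as an optional value (empty when the CUDA query fails), and what happens when no discrete GPU exists at all is left to the caller because the description does not cover it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for the real Vulkan/CUDA types.
using Uuid = std::array<std::uint8_t, 16>; // 16 bytes, like VK_UUID_SIZE

struct GpuInfo {
    Uuid uuid{};
    bool is_discrete = false;
};

// Returns the index of the device to use:
//   1. the discrete GPU whose UUID equals the CUDA device's UUID, if any;
//   2. otherwise the first discrete GPU in enumeration order;
//   3. otherwise nothing (the caller decides what to do).
std::optional<std::size_t> pickDeviceIndex(const std::vector<GpuInfo>& gpus,
                                           const std::optional<Uuid>& cuda_uuid) {
    std::optional<std::size_t> first_discrete;
    std::optional<std::size_t> cuda_match;
    for (std::size_t i = 0; i < gpus.size(); ++i) {
        if (!gpus[i].is_discrete) {
            continue; // only discrete GPUs are candidates
        }
        if (!first_discrete) {
            first_discrete = i; // remember the fallback choice
        }
        if (cuda_uuid && gpus[i].uuid == *cuda_uuid) {
            cuda_match = i;
            break; // a UUID match wins immediately
        }
    }
    const std::optional<std::size_t> choice = cuda_match ? cuda_match : first_discrete;
    assert(!choice || gpus[*choice].is_discrete); // never returns a non-discrete device
    return choice;
}
```

With a single discrete GPU, both branches yield the same index, which is why the fallback leaves single-GPU systems unchanged.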

Reproduction

Hardware: 2× identical NVIDIA GPUs (reproduced on 2× RTX 4090, driver 572.61, CUDA 12.8).
Load any COLMAP/transforms dataset and start training in the GUI → instant
"CUDA out of memory" crash.
Workaround without this patch: set CUDA_VISIBLE_DEVICES=<index matching the Vulkan card>.

Testing

Before: GUI training crashes at model init with the log above (GPU ~0.5 GB used of 24 GB).
After: Vulkan selects the CUDA-matched 4090, interop succeeds
(Training tensors share one CUDA-exportable VMM block imported into Vulkan — zero-copy viewer interop), and training runs normally with no env var.
Single-GPU path unchanged (fallback preserves prior behavior).

…-GPU

pickPhysicalDevice() selected the first discrete GPU in Vulkan's enumeration order without regard to which GPU CUDA uses. On multi-GPU systems (especially two identical cards), Vulkan's order can differ from CUDA's, so the viewer initializes on a different physical GPU than the trainer.

The CUDA<->Vulkan zero-copy interop then exports a memory block on the CUDA device and tries to import it into Vulkan on the other card. The import fails with VK_ERROR_OUT_OF_DEVICE_MEMORY, the legacy fallback path performs cross-device CUDA work that raises cudaErrorIllegalAddress, and the poisoned CUDA context makes every subsequent allocation fail. The user-visible result is an immediate, misleading "CUDA out of memory" crash on a GPU with tens of GB free.

Prefer the discrete GPU whose UUID matches CUDA device 0, falling back to the previous "first discrete GPU" behavior when no match is found so single-GPU systems are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates Vulkan physical device selection to prefer the discrete GPU that matches the CUDA device UUID, improving CUDA↔Vulkan external-memory interop reliability on multi-GPU systems.

Changes:

Added a UUID-based matcher between Vulkan physical devices and a CUDA device.
Updated GPU selection to prefer the CUDA-matched discrete GPU while preserving legacy fallback behavior.


+                if (vulkanDeviceMatchesCudaDevice(device, 0)) {
+                    physical_device_ = device;
+                    break;


+        [[nodiscard]] bool vulkanDeviceMatchesCudaDevice(const VkPhysicalDevice device, const int cuda_device) {
+            cudaDeviceProp cuda_props{};
+            if (cudaGetDeviceProperties(&cuda_props, cuda_device) != cudaSuccess) {
+                return false;
+            }
+            VkPhysicalDeviceIDProperties vk_id{};
+            vk_id.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;
+            VkPhysicalDeviceProperties2 vk_props2{};
+            vk_props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
+            vk_props2.pNext = &vk_id;
+            vkGetPhysicalDeviceProperties2(device, &vk_props2);
+            static_assert(sizeof(cuda_props.uuid.bytes) == VK_UUID_SIZE);
+            return std::memcmp(cuda_props.uuid.bytes, vk_id.deviceUUID, VK_UUID_SIZE) == 0;
+        }
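
The byte-wise comparison at the end of the function above can be tried in isolation. The sketch below uses hypothetical stand-in buffers rather than the real CUDA and Vulkan structs, and assumes the CUDA side exposes its UUID as a `char` array while Vulkan uses `uint8_t`. Under that assumption, `std::memcmp` is a convenient choice because it compares the raw bytes as unsigned values, whereas an element-wise `==` between `char` and `uint8_t` could disagree for bytes of 0x80 and above on platforms where `char` is signed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kUuidSize = 16; // assumed to equal VK_UUID_SIZE

// Hypothetical helper: true when both buffers hold the same 16 UUID bytes.
// cuda_bytes stands in for cudaDeviceProp::uuid.bytes,
// vk_bytes stands in for VkPhysicalDeviceIDProperties::deviceUUID.
bool uuidBytesEqual(const char* cuda_bytes, const std::uint8_t* vk_bytes) {
    assert(cuda_bytes != nullptr && vk_bytes != nullptr);
    // memcmp interprets each byte as unsigned char, so the differing
    // element types of the two arrays do not affect the result.
    return std::memcmp(cuda_bytes, vk_bytes, kUuidSize) == 0;
}
```

The `static_assert` in the PR's version additionally guards against the two arrays ever having different sizes, which this pointer-based sketch cannot check.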


+        // cards, Vulkan's enumeration order can differ from CUDA's; matching by
+        // UUID lets pickPhysicalDevice keep the viewer on the same card as the
+        // trainer so CUDA<->Vulkan external-memory interop can import the block.
+        [[nodiscard]] bool vulkanDeviceMatchesCudaDevice(const VkPhysicalDevice device, const int cuda_device) {


MrNeRF · 2026-06-18T09:52:45Z

Thx!

Copilot AI review requested due to automatic review settings June 17, 2026 13:34

Copilot AI reviewed Jun 17, 2026


shadygm self-requested a review June 17, 2026 18:32

MrNeRF merged commit e2182b4 into MrNeRF:master Jun 18, 2026
5 checks passed

MrNeRF merged 1 commit into MrNeRF:master from iliesaya:fix/multi-gpu-cuda-vulkan-device-match
