Name and Version
version: b9453
Operating systems
Windows, Linux
Which llama.cpp modules do you know to be affected?
CUDA/HIP module
Command line
N/A
Problem description & steps to reproduce
Problem
PR #16308 hardcoded info.devices[id].integrated = false for all CUDA/HIP devices to work around corrupted output on Nvidia Jetson Orin Nano (#15034). This is correct for CUDA builds, but it has two unintended side effects for HIP/ROCm builds:
1. supports_buft() reads a stale false for AMD APU iGPUs (systems that also have a dGPU)
ggml_backend_cuda_device_supports_buft() (line 5437) reads from ggml_cuda_info().devices[dev_ctx->device].integrated, which is always false due to the hardcode from PR #16308. This means AMD APU iGPUs never report host buffer support, forcing the discrete-GPU allocation path even on UMA hardware.
PR #23007 fixed get_type() by querying prop.integrated directly from hipGetDeviceProperties(). This bypassed the cached field, so device classification is now correct. But supports_buft() was not updated and still reads false. This leaves two sources of truth in the same file that can contradict each other for the same device.
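The disagreement between the two readers can be modeled in isolation. The following is a minimal sketch in plain C++, not the actual ggml code: every `mock_*` name is a hypothetical stand-in, and it assumes (as described above) that host buffers are only accepted when the cached flag is set.

```cpp
// Hypothetical stand-ins; these are NOT the real ggml or HIP types.
struct mock_device_prop { int integrated; };   // models hipDeviceProp_t::integrated
struct mock_cached_info { bool integrated; };  // models info.devices[id].integrated

enum class mock_device_type { GPU, IGPU };

// Models the init path after PR #16308: the cached flag ignores the property.
inline mock_cached_info mock_init_cache(const mock_device_prop & /*prop*/) {
    return mock_cached_info{false};
}

// Models get_type() after PR #23007: reads the property directly.
inline mock_device_type mock_get_type(const mock_device_prop & prop) {
    return prop.integrated ? mock_device_type::IGPU : mock_device_type::GPU;
}

// Models the host-buffer check in supports_buft(): reads the cached flag.
inline bool mock_supports_host_buft(const mock_cached_info & cache) {
    return cache.integrated;
}

// True when both readers agree on whether the device is integrated.
inline bool mock_readers_agree(const mock_device_prop & prop) {
    const mock_cached_info cache = mock_init_cache(prop);
    const bool type_says_igpu = mock_get_type(prop) == mock_device_type::IGPU;
    return type_says_igpu == mock_supports_host_buft(cache);
}
```

In this model, `mock_readers_agree(mock_device_prop{1})` is false for an APU iGPU (classified as IGPU, but no host buffer support), while a discrete GPU with `integrated == 0` stays consistent.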
2. Impact on APU-only systems (no dGPU)
On a system with only an AMD APU (no discrete GPU), the iGPU is the only compute device. After PR #23007 it gets correctly classified as GGML_BACKEND_DEVICE_TYPE_IGPU and added to the device list. But supports_buft() still returns false for host buffers, forcing an incorrect allocation strategy on a UMA device.
Root Cause & Proposed Change
hipDeviceProp_t has an integrated field (int integrated; ///< APU vs dGPU) that is correctly set to 1 for AMD APU iGPUs. The Jetson Orin Nano corruption (#15034) is CUDA-specific: it stems from a bug in the UMA host-buffer allocation path on that device, and that path is guarded by the integrated flag. I am not aware of any evidence that the same corruption affects HIP/ROCm builds.
Restore prop.integrated for HIP builds only:
// ggml-cuda.cu line 249
#if defined(GGML_USE_HIP)
info.devices[id].integrated = prop.integrated;
#else
info.devices[id].integrated = false; // Temporarily disabled due to issues with corrupted output (e.g. #15034)
#endif
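The intended behavior of this change can be written down as a small truth table. This is a sketch under stated assumptions, not the real initialization code: `sketch_prop` and `sketch_cached_integrated` are hypothetical names, and the `is_hip_build` template parameter stands in for the `GGML_USE_HIP` preprocessor check so that both build modes can be compared in one translation unit.

```cpp
// Hypothetical stand-in for the device properties struct (not the HIP/CUDA type).
struct sketch_prop { int integrated; };

// Models the value stored in the cached integrated flag under the proposed change.
// is_hip_build == true  -> trust the runtime-reported property (HIP/ROCm builds)
// is_hip_build == false -> keep the hardcoded false (CUDA builds, see #15034)
template <bool is_hip_build>
constexpr bool sketch_cached_integrated(const sketch_prop & prop) {
    return is_hip_build ? (prop.integrated != 0) : false;
}
```

With this model, only the HIP case with `integrated == 1` yields true; CUDA builds keep the current workaround regardless of what the device reports.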
If more information or clarification is needed, or if this change would not work, please let me know. For reference, I built llama.cpp from source with the proposed change above and it fixed the problem.
First Bad Commit
PR #16308
Relevant log output
The crash manifests as a segfault during warmup on a system with an AMD APU and a discrete GPU (iGPU + dGPU): both devices get classified as discrete GPUs, llama.cpp splits the KV cache across them via pipeline parallelism, and then crashes. The symptoms are described in the linked issues (lemonade-sdk/llamacpp-rocm#96 and ROCm/ROCm#6227).