Misc. bug: ggml-cuda: restore prop.integrated for HIP builds — #16308 hardcode breaks iGPU classification and supports_buft for AMD APUs #23977

@mapatel-amd

Name and Version

version: b9453

Operating systems

Windows, Linux

Which llama.cpp modules do you know to be affected?

CUDA/HIP module

Command line

N/A

Problem description & steps to reproduce

Problem

PR #16308 hardcoded info.devices[id].integrated = false for all CUDA/HIP devices to work around corrupted output on Nvidia Jetson Orin Nano (#15034). This is correct for CUDA builds, but it has two unintended side effects for HIP/ROCm builds:

1. supports_buft() reads a stale false for AMD APU iGPUs (systems that also have a dGPU)

ggml_backend_cuda_device_supports_buft() (line 5437) reads from ggml_cuda_info().devices[dev_ctx->device].integrated, which is always false due to the hardcode from PR #16308. This means AMD APU iGPUs never report host buffer support, forcing the discrete-GPU allocation path even on UMA hardware.

PR #23007 fixed get_type() by querying prop.integrated directly from hipGetDeviceProperties(). This bypassed the cached field, so device classification is now correct. But supports_buft() was not updated and still reads false. This leaves two sources of truth in the same file that can contradict each other for the same device.
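The contradiction can be illustrated with a small standalone model. This is only a sketch: `DeviceProp`, `CachedDeviceInfo`, `init_device_info`, `classified_as_igpu`, and `reports_host_buffer_support` are hypothetical stand-ins for the real structures and functions, and the model covers only the host-buffer part of the check.

```cpp
// Simplified model only. These types and functions are hypothetical
// stand-ins and are not the actual ggml-cuda code.

struct DeviceProp {          // stands in for hipDeviceProp_t
    int integrated;          // 1 for an APU iGPU, 0 for a dGPU
};

struct CachedDeviceInfo {    // stands in for an entry of ggml_cuda_info().devices
    bool integrated;
};

// Models the init-time hardcode from PR #16308: the property is ignored.
inline CachedDeviceInfo init_device_info(const DeviceProp & /*prop*/) {
    return CachedDeviceInfo{false};
}

// Models get_type() after PR #23007: queries the property directly.
inline bool classified_as_igpu(const DeviceProp & prop) {
    return prop.integrated != 0;
}

// Models the host-buffer part of supports_buft(): reads the cached field.
inline bool reports_host_buffer_support(const CachedDeviceInfo & info) {
    return info.integrated;
}
```

For a device with `integrated == 1`, `classified_as_igpu` returns true while `reports_host_buffer_support` returns false, which is the mismatch described above.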

2. Impact on APU-only systems (no dGPU)

On a system with only an AMD APU (no discrete GPU), the iGPU is the only compute device. After PR #23007 it gets correctly classified as GGML_BACKEND_DEVICE_TYPE_IGPU and added to the device list. But supports_buft() still returns false for host buffers, forcing an incorrect allocation strategy on a UMA device.

Root Cause & Proposed Change

hipDeviceProp_t has an integrated field (int integrated; ///< APU vs dGPU) that is correctly set to 1 for AMD APU iGPUs. The Jetson Orin Nano corruption (#15034) is CUDA-specific: it stems from a bug in the UMA host-buffer allocation path on that device, and that path is guarded by the integrated flag. I am not aware of any evidence that the same corruption affects HIP/ROCm builds.

Restore prop.integrated for HIP builds only:

// ggml-cuda.cu line 249
#if defined(GGML_USE_HIP)
        info.devices[id].integrated = prop.integrated;
#else
        info.devices[id].integrated = false; // Temporarily disabled due to issues with corrupted output (e.g. #15034)
#endif
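The intended behavior of the guard can be checked in isolation with a small model. This is a sketch under assumptions: `cached_integrated` is a hypothetical helper, and the `kHipBuild` template parameter stands in for `defined(GGML_USE_HIP)`, which the real code tests with the preprocessor.

```cpp
// Hypothetical model of the proposed guard, not the actual ggml-cuda code.
// kHipBuild stands in for defined(GGML_USE_HIP).
template <bool kHipBuild>
constexpr bool cached_integrated(int prop_integrated) {
    // HIP: trust the flag reported by the runtime.
    // CUDA: keep the existing workaround for #15034 and always store false.
    return kHipBuild ? (prop_integrated != 0) : false;
}

// Compile-time checks of the three cases that matter.
static_assert(cached_integrated<true>(1),   "HIP APU iGPU should be cached as integrated");
static_assert(!cached_integrated<true>(0),  "HIP dGPU should not be cached as integrated");
static_assert(!cached_integrated<false>(1), "CUDA builds should keep the hardcoded false");
```

With this shape, the cached field matches prop.integrated on HIP builds, so supports_buft() and get_type() would agree, while CUDA builds keep the current behavior.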

If more information or clarification is needed, or if this change would not work, please let me know. For reference, I built llama.cpp from source with the proposed change above and it fixed the problem.

First Bad Commit

PR #16308

Relevant log output

The crash would manifest as a segfault during warmup on an AMD APU system (iGPU + dGPU): both devices get classified as discrete GPUs, llama.cpp splits the KV cache across them via pipeline parallelism, and then crashes. You can reference the symptoms from the linked issues (lemonade-sdk/llamacpp-rocm#96 and ROCm/ROCm#6227).
