Description
I have ROCm compiled with support for both the discrete GPU and the iGPU, but with `HIP_VISIBLE_DEVICES` set to `0` to ensure only the discrete GPU is considered (the iGPU is just for experimenting; it's far too slow to use meaningfully). But because `rocminfo` and `rocm-smi` list both GPUs, setting `gpulayers` to `-1` to trigger the automatic calculation uses the reported iGPU memory. This is wrong/suboptimal in two respects:
- The "reported memory" for the iGPU takes only the reserved GPU memory into account, but compute memory actually ends up as GTT memory, meaning all of system memory is available rather than the few piddly MB for video graphics (from Linux 6.10 onwards at least, but I've also observed this happening on 6.9). To get any kind of speed out of this you need to compile GGML with
GGML_HIP_UMA
, though, which tanks perf on discrete GPUs -- but I digress. - With
HIP_VISIBLE_DEVICES
set, the iGPU is actually not used for inference at all, yet its lower memory is used, causing the calculation to always offload 0 layers.
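For what it's worth, the mismatch in the first point is easy to see from the amdgpu driver's sysfs counters. A minimal sketch, assuming a Linux box with amdgpu loaded (the `card0` index is a placeholder for whichever DRM node is the iGPU):

```python
# Compare an iGPU's reserved VRAM carve-out with its GTT pool (Linux, amdgpu).
from pathlib import Path

def read_mib(counter: Path) -> float:
    # amdgpu reports these counters in bytes.
    return int(counter.read_text()) / (1024 * 1024)

dev = Path("/sys/class/drm/card0/device")  # placeholder: pick the iGPU's node
vram = read_mib(dev / "mem_info_vram_total")  # the "few piddly MB" carve-out
gtt = read_mib(dev / "mem_info_gtt_total")    # backed by ordinary system RAM

print(f"VRAM: {vram:.0f} MiB, GTT: {gtt:.0f} MiB")
# On an iGPU, GTT typically dwarfs the VRAM carve-out, which is why sizing
# offload by the reported GPU memory undershoots so badly.
```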
If I ugly-hack my local koboldcpp.py to simply ignore any devices beyond the first, the auto-layer calculation does its job correctly. I'm too lazy to write a real patch fixing the problem; I just wanted to mention it here. Taking `HIP_VISIBLE_DEVICES` (and/or `CUDA_VISIBLE_DEVICES`, which AMD supports for compatibility) into account should not be particularly hard -- see the sketch below. Taking the iGPU memory-reporting quirk into account is probably too complicated (AFAIK there isn't even a reliable way to detect that a device is an iGPU; things like the marketing name being "AMD Radeon Graphics" and the UUID being "GPU-XX" are certainly suggestive, but hardly convincing). If you're using an iGPU (or, God forbid, a combination) you probably don't want to rely on the auto-layer feature anyway.
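For reference, here's roughly what such a filter could look like. This is only a sketch: `filter_visible_devices` and the shape of the device list are hypothetical, not koboldcpp's actual internals; only the comma-separated-indices semantics of the environment variables come from HIP/CUDA.

```python
import os

def filter_visible_devices(devices: list) -> list:
    """Drop devices hidden by HIP_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES.

    `devices` is a hypothetical list of device records in the order
    rocminfo/rocm-smi enumerate them. The env var holds comma-separated
    device indices; AMD also honors the CUDA spelling for compatibility.
    """
    mask = os.environ.get("HIP_VISIBLE_DEVICES")
    if mask is None:
        mask = os.environ.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return devices  # variable unset: every enumerated device is visible
    try:
        # The mask also defines device order, mirroring how HIP remaps IDs.
        indices = [int(tok) for tok in mask.split(",") if tok.strip()]
    except ValueError:
        # UUID-style entries and other exotica: punt rather than misfilter.
        return devices
    return [devices[i] for i in indices if 0 <= i < len(devices)]
```

With `HIP_VISIBLE_DEVICES=0` this collapses the list to just the discrete GPU, and the auto-layer calculation would then see the correct memory figure.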
I don't know if Nvidia would have similar issues with `CUDA_VISIBLE_DEVICES` active.