fix(server): restore ROCm gfx family normalization for dGPU detection #2324
ianbmacdonald wants to merge 1 commit into
Force-pushed from 20e9cbf to a71385e
Amended (force-pushed) after a closer look at the package metadata: the helper now collapses the full `gfx1030
identify_rocm_arch_from_name() returned the specific gfx arch (e.g. gfx1100, gfx1201) from the gfx-regex and KFD numeric-ISA detection paths instead of the ROCm family download target (gfx110X, gfx120X) the backend support set expects. Commit 2a7aa18 (lemonade-sdk#2093) removed ROCM_ARCH_MAPPING, which regressed ROCm availability for every RDNA2/3/4 dGPU detected via those paths: the server reported "Unsupported GPU: gfx1100" for e.g. an RX 7900 XT.

Restore the specific->family normalization in a small header-only helper (lemon/rocm_arch.h) applied to both regressed paths:

  gfx1030-gfx1036 -> gfx103X
  gfx1100-gfx1103 -> gfx110X
  gfx1200/gfx1201 -> gfx120X

The RDNA2 range mirrors backend_versions.json url_mapping (all gfx1030-gfx1036 map to the published gfx103X-all archive). gfx115x iGPUs and CDNA targets pass through as exact package IDs.

Add a standalone unit test (CTest RocmArchTest).

Fixes lemonade-sdk#2319

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: GLM-5.2 <noreply@zhipuai.cn>
Co-Authored-By: GPT-5.5 <noreply@openai.com>
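For readers following along, the specific-to-family mapping listed in the commit message could look roughly like the sketch below. It is an illustrative reconstruction, not the contents of `lemon/rocm_arch.h`: the function name `normalize_rocm_arch_sketch` is hypothetical, and the ID lists are simply expanded from the ranges stated in the commit message.

```cpp
// Hypothetical sketch of a specific -> family normalization helper.
// Not the actual lemon/rocm_arch.h; names here are assumptions.
#include <string>
#include <string_view>

inline std::string normalize_rocm_arch_sketch(std::string_view arch) {
    // Ranges expanded from the commit message:
    //   gfx1030-gfx1036 -> gfx103X
    //   gfx1100-gfx1103 -> gfx110X
    //   gfx1200/gfx1201 -> gfx120X
    static constexpr std::string_view rdna2[] = {
        "gfx1030", "gfx1031", "gfx1032", "gfx1033",
        "gfx1034", "gfx1035", "gfx1036"};
    static constexpr std::string_view rdna3[] = {
        "gfx1100", "gfx1101", "gfx1102", "gfx1103"};
    static constexpr std::string_view rdna4[] = {"gfx1200", "gfx1201"};

    for (std::string_view id : rdna2) {
        if (arch == id) return "gfx103X";
    }
    for (std::string_view id : rdna3) {
        if (arch == id) return "gfx110X";
    }
    for (std::string_view id : rdna4) {
        if (arch == id) return "gfx120X";
    }

    // Everything else (gfx115x iGPUs, CDNA targets, and strings that are
    // already family targets such as "gfx110X") passes through unchanged,
    // which also makes the function idempotent.
    return std::string(arch);
}
```

With this sketch, `normalize_rocm_arch_sketch("gfx1100")` yields `gfx110X`, while `gfx1151` and an already-normalized `gfx110X` come back unchanged. An explicit list has to be extended when a new GPU joins a family, which is the maintenance trade-off raised later in this conversation.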
Force-pushed from a71385e to 07ab2ba
This looks like a more complex version of the fix here: #2295. This one will need to be updated with the addition of new GPUs in the same family. Is that a good thing? I'm not sure.
Closing in favor of #2295, which fixes the same regression (#2319) with a smaller, data-driven approach: a trailing-
One data point in case it's useful for #2295: I reproduced #2319 on an RX 7900 XT (gfx1100, ROCm 7.2.4, Ubuntu 26.04), where
Fixes #2319.
Problem
`identify_rocm_arch_from_name()` returns the specific gfx arch (e.g. `gfx1100`, `gfx1201`) from the gfx-regex and KFD numeric-ISA detection paths, instead of the ROCm family download target (`gfx110X`, `gfx120X`) that the backend support set expects. Commit 2a7aa18 (#2093) removed `ROCM_ARCH_MAPPING`, which regressed ROCm availability for every RDNA2/3/4 dGPU detected via those two paths. The marketing-name heuristics still return families, so only the gfx-string / KFD-numeric path broke.
On Linux, KFD hands the ISA through as a number (e.g. `110000`), so an RX 7900 XT is reported as `gfx1100`, which is not in the support set `{gfx103X, gfx110X, gfx120X, ...}` → `Unsupported GPU: gfx1100`.
Fix
Restore the specific→family normalization in a small header-only helper (`lemon/rocm_arch.h`) applied to both regressed return paths, mirroring the removed `ROCM_ARCH_MAPPING`:
- `gfx1030-gfx1034 → gfx103X`, `gfx1100-gfx1103 → gfx110X`, `gfx1200/gfx1201 → gfx120X`
- iGPUs (`gfx1150/1151/1152`) pass through as exact targets
- `gfx1033/1035/1036` return `""` (unsupported), matching the prior mapping
A standalone unit test (`test/cpp/test_rocm_arch.cpp`, CTest `RocmArchTest`) covers all families, the iGPU pass-through, the unsupported ISAs, and idempotency.
Validation
- Unit: `RocmArchTest`, 19/19 pass.
- Build: `lemonade-server-core` + `lemond` build clean (Ninja/GCC).
- Real hardware (RX 7900 XT, gfx1100): built `lemond` from this branch, ran it on the host, queried `/api/v1/system-info`:
  - `family`: `gfx1100` → `gfx110X`
  - backend-availability status: `Unsupported GPU: gfx1100` → `Backend is supported but not installed` (all recipes)
Test environment
- … `20e9cbf15`), not a release
- … `7.0.0-22-generic` · arch: x86_64
- … `[1002:744c]` (rev c8): gfx1100, RDNA3 (`gfx110X` family), 24 GB VRAM
- … `/opt/rocm-7.2.4`) but not exercised. This is a GPU-detection fix; validation is via the `/system-info` family and backend-availability status, with no ROCm inference run (the rocm backend is not installed on this host, which is exactly why the "supported but not installed" status is the correct post-fix result)
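As a side note on the numeric KFD form mentioned under Problem, the step from `110000` to `gfx1100` can be sketched as follows. This assumes the value is encoded as `major * 10000 + minor * 100 + stepping`, with minor and stepping rendered as hex digits in the target name; that encoding is an assumption stated here for illustration and is only confirmed by this PR for the `110000` / `gfx1100` pair. The function name `gfx_name_from_kfd_version` is hypothetical and is not the detection code in the server.

```cpp
// Hypothetical sketch: turn a KFD-style numeric ISA value into a gfx name.
// Assumed encoding: major * 10000 + minor * 100 + stepping.
#include <cstdint>
#include <cstdio>
#include <string>

inline std::string gfx_name_from_kfd_version(std::uint32_t version) {
    const unsigned major = version / 10000;        // e.g. 11 for 110000
    const unsigned minor = (version / 100) % 100;  // e.g. 0
    const unsigned stepping = version % 100;       // e.g. 0

    // Assumption: minor and stepping appear as hex digits in the name,
    // so a stepping of 10 would print as 'a'.
    char buf[32];
    std::snprintf(buf, sizeof buf, "gfx%u%x%x", major, minor, stepping);
    return std::string(buf);
}
```

Under that assumption, `gfx_name_from_kfd_version(110000)` gives `gfx1100`, which is the specific arch that then needs the family normalization before it is checked against the support set.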