fix(server): restore ROCm gfx family normalization for dGPU detection #2324

Closed
ianbmacdonald wants to merge 1 commit into lemonade-sdk:main from ianbmacdonald:fix/rocm-gfx-family-detection

Conversation

@ianbmacdonald

Collaborator

Fixes #2319.

Problem

identify_rocm_arch_from_name() returns the specific gfx arch (e.g. gfx1100, gfx1201) from the gfx-regex and KFD numeric-ISA detection paths, instead of the ROCm family download target (gfx110X, gfx120X) that the backend support set expects. Commit 2a7aa18 (#2093) removed ROCM_ARCH_MAPPING, which regressed ROCm availability for every RDNA2/3/4 dGPU detected via those two paths — the marketing-name heuristics still return families, so only the gfx-string / KFD-numeric path broke.

On Linux, KFD hands the ISA through as a number (e.g. 110000), so an RX 7900 XT is reported as gfx1100, which is not in the support set {gfx103X, gfx110X, gfx120X, ...} → Unsupported GPU: gfx1100.
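As a rough illustration of how a numeric ISA such as 110000 ends up as the string gfx1100, here is a minimal sketch. It assumes the KFD value is encoded as major * 10000 + minor * 100 + stepping and that minor and stepping are printed as hex digits; the function name `gfx_name_from_kfd_version` is hypothetical and is not taken from the lemonade source.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the lemonade codebase): decode a KFD-style
// numeric ISA version into a specific gfx name.
// Assumed encoding: major * 10000 + minor * 100 + stepping.
std::string gfx_name_from_kfd_version(std::uint32_t version) {
    const unsigned major = version / 10000;
    const unsigned minor = (version / 100) % 100;
    const unsigned stepping = version % 100;

    // Minor and stepping are single hex digits in gfx names (e.g. gfx90a).
    char buf[32];
    std::snprintf(buf, sizeof(buf), "gfx%u%x%x", major, minor, stepping);
    return std::string(buf);
}
```

Under that assumption, `gfx_name_from_kfd_version(110000)` yields `"gfx1100"`, the specific arch that then fails the lookup against a support set containing only family names such as `gfx110X`.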

Fix

Restore the specific→family normalization in a small header-only helper (lemon/rocm_arch.h) applied to both regressed return paths, mirroring the removed ROCM_ARCH_MAPPING:

  • gfx1030/1031/1032/1034 → gfx103X, gfx1100-gfx1103 → gfx110X, gfx1200/gfx1201 → gfx120X
  • gfx115x iGPUs (gfx1150/1151/1152) pass through as exact targets
  • the unconfirmed RDNA2 ISAs gfx1033/1035/1036 return "" (unsupported), matching the prior mapping

A standalone unit test (test/cpp/test_rocm_arch.cpp, CTest RocmArchTest) covers all families, the iGPU pass-through, the unsupported ISAs, and idempotency.
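The normalization described in this section could look roughly like the sketch below. It is a hypothetical stand-in, not the contents of lemon/rocm_arch.h: the function name `normalize_rocm_arch` and the table layout are assumptions, and the mapping follows the bullets in this description (gfx1033/1035/1036 treated as unsupported).

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sketch of a specific -> family normalization helper.
// Returns the family download target, the input unchanged when it is already
// a family or an exact target, or "" for ISAs listed as unsupported.
inline std::string normalize_rocm_arch(const std::string& arch) {
    static const std::unordered_map<std::string, std::string> kFamily = {
        {"gfx1030", "gfx103X"}, {"gfx1031", "gfx103X"},
        {"gfx1032", "gfx103X"}, {"gfx1034", "gfx103X"},
        {"gfx1100", "gfx110X"}, {"gfx1101", "gfx110X"},
        {"gfx1102", "gfx110X"}, {"gfx1103", "gfx110X"},
        {"gfx1200", "gfx120X"}, {"gfx1201", "gfx120X"},
    };
    // ISAs this description treats as unsupported.
    static const std::unordered_set<std::string> kUnsupported = {
        "gfx1033", "gfx1035", "gfx1036",
    };

    if (kUnsupported.count(arch) != 0) {
        return "";
    }
    const auto it = kFamily.find(arch);
    // Anything not in the table (family names, gfx115x iGPUs) passes through,
    // which also makes the function idempotent.
    return it != kFamily.end() ? it->second : arch;
}
```

Because family names are not keys in the table, calling the function on its own output returns the same value, which is the idempotency property the unit test checks.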

Validation

Unit: RocmArchTest — 19/19 pass.
Build: lemonade-server-core + lemond build clean (Ninja/GCC).
Real hardware (RX 7900 XT, gfx1100): built lemond from this branch, ran it on the host, queried /api/v1/system-info:

| | before (10.8.0) | after (this branch) |
|---|---|---|
| detected family | gfx1100 | gfx110X |
| rocm backend status | Unsupported GPU: gfx1100 | Backend is supported but not installed (all recipes) |

Test environment

  • lemonade: 10.8.0 dev build from this branch (20e9cbf15), not a release
  • OS: Ubuntu 26.04 LTS · kernel: 7.0.0-22-generic · arch: x86_64
  • GPU: AMD Navi 31 [Radeon RX 7900 XT] [1002:744c] (rev c8) — gfx1100, RDNA3 (gfx110X family), 24 GB VRAM
  • ROCm: 7.2.4 present (/opt/rocm-7.2.4) but not exercised — this is a GPU-detection fix; validation is via /system-info family + backend-availability status, no ROCm inference run (the rocm backend is not installed on this host, which is exactly why the "supported but not installed" status is the correct post-fix result)
  • backend exercised: none (detection / system-info only)

@github-actions github-actions Bot added the bug (Something isn't working) and runtime::rocm (AMD ROCm runtime) labels Jun 20, 2026
@ianbmacdonald ianbmacdonald force-pushed the fix/rocm-gfx-family-detection branch from 20e9cbf to a71385e June 20, 2026 00:52
@ianbmacdonald

Collaborator Author

Amended (force-pushed) after a closer look at the package metadata: the helper now collapses the full `gfx1030`–`gfx1036` range to `gfx103X`, not just `gfx1030/1031/1032/1034`.

backend_versions.json url_mapping publishes gfx103X-all for all of gfx1030–gfx1036, so excluding gfx1033/1035/1036 (which an earlier version did, inherited from the removed ROCM_ARCH_MAPPING) would have left those exact IDs unsupported despite a bundle existing, which is the same class of bug as this issue. The unit test now covers the full range (22 cases). Commit message + header comment updated to match.

identify_rocm_arch_from_name() returned the specific gfx arch (e.g. gfx1100,
gfx1201) from the gfx-regex and KFD numeric-ISA detection paths instead of the
ROCm family download target (gfx110X, gfx120X) the backend support set expects.
Commit 2a7aa18 (lemonade-sdk#2093) removed ROCM_ARCH_MAPPING, which regressed ROCm
availability for every RDNA2/3/4 dGPU detected via those paths: the server
reported "Unsupported GPU: gfx1100" for e.g. an RX 7900 XT.

Restore the specific->family normalization in a small header-only helper
(lemon/rocm_arch.h) applied to both regressed paths:
  gfx1030-gfx1036 -> gfx103X, gfx1100-gfx1103 -> gfx110X, gfx1200/gfx1201 -> gfx120X
The RDNA2 range mirrors backend_versions.json url_mapping (all gfx1030-gfx1036
map to the published gfx103X-all archive). gfx115x iGPUs and CDNA targets pass
through as exact package IDs. Add a standalone unit test (CTest RocmArchTest).

Fixes lemonade-sdk#2319

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: GLM-5.2 <noreply@zhipuai.cn>
Co-Authored-By: GPT-5.5 <noreply@openai.com>
@ianbmacdonald ianbmacdonald force-pushed the fix/rocm-gfx-family-detection branch from a71385e to 07ab2ba June 20, 2026 02:10
@jtlayton


This looks like a more complex version of the fix here: #2295

This one will need to be updated with the addition of new GPUs in the same family. Is that a good thing? I'm not sure.

@ianbmacdonald

Collaborator Author

Closing in favor of #2295, which fixes the same regression (#2319) with a smaller, data-driven approach: a trailing-X wildcard in device_matches_constraint so gfx110X matches gfx1100–gfx1103 without enumerating each arch. That's the more maintainable path (no code change when a new same-family GPU appears), and @jtlayton rightly pointed out here that the explicit normalization in this PR would need updating per new GPU in a family.
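For comparison, a trailing-X wildcard match can be sketched as follows. This is not the code from #2295: the function name `matches_arch_constraint` is hypothetical, and the rule that the wildcard stands for exactly one character is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of a trailing-X wildcard comparison.
// A constraint ending in 'X' (e.g. "gfx110X") matches any device arch of the
// same length that shares the prefix before the 'X'. Any other constraint
// must match the device arch exactly.
bool matches_arch_constraint(const std::string& device,
                             const std::string& constraint) {
    if (!constraint.empty() && constraint.back() == 'X') {
        const std::size_t prefix_len = constraint.size() - 1;
        return device.size() == constraint.size() &&
               device.compare(0, prefix_len, constraint, 0, prefix_len) == 0;
    }
    return device == constraint;
}
```

With this shape, `gfx110X` accepts `gfx1100` through `gfx1103` as well as any later arch with the same prefix, while `gfx1150` is rejected. That is the trade-off discussed in this thread: nothing to update when a new same-family GPU appears, at the cost of matching arches nobody has explicitly listed.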

One data point in case it's useful for #2295: I reproduced #2319 on an RX 7900 XT (gfx1100, ROCm 7.2.4, Ubuntu 26.04), where lemonade backends reported Unsupported GPU: gfx1100, and confirmed that restoring family matching brings ROCm detection back end-to-end. The download target also resolves, since url_mapping in backend_versions.json carries the specific gfx1100–gfx1103 keys alongside the gfx110X family key. Thanks @jtlayton.

Labels

bug (Something isn't working)
runtime::rocm (AMD ROCm runtime)

Development

Successfully merging this pull request may close these issues.

ROCm unavailable for gfx103X, gfx110X and gfx120X

2 participants