rocm-doctor skill [Phase 0] #37

Merged: danielholanda merged 5 commits into main from dholanda/rocm on May 25, 2026

@danielholanda

Collaborator

Important

This skill still needs to go through tests and design iterations before it is published. We are currently looking for someone on the ROCm team to own it.

Note: This description has been copied from the proposal issue.

rocm-doctor skill

Skill metadata

| Field | Value |
| --- | --- |
| Proposed name | rocm-doctor |
| Already exists? | No, new |
| Catalog area | Hardware-native skills |
| Location | Path A (incubated here; may graduate to a product repo later) |

One-sentence outcome

Given a user complaint of the form "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU", identify which of a fixed list of known misconfigurations is the cause and either fix it or hand the user a precise next step.

Scope

In scope

  • Detect the user's setup: OS / distro / kernel, GPU gfx target, installed ROCm version, framework version (PyTorch / llama.cpp / etc.).
  • Identify a fixed list of known misconfigurations (see "Known misconfigurations to catch" below) and either fix them or hand back exact next steps.
  • Apply low-risk fixes on the user's behalf with consent (group membership, environment variables, PATH, reinstall a wheel from the
    right index, etc.).
  • Route the user to the correct upstream channel when the issue is owned by another project (Lemonade, LM Studio, Ollama, vLLM, ...).
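
A minimal sketch of the consent requirement for low-risk fixes, combined with the re-verification step from the flow below. The function name `apply_with_consent`, its parameters, and the returned status strings are hypothetical and not part of the skill; the `ask` callback stands in for however the agent prompts the user.

```python
import subprocess


def apply_with_consent(description, fix_cmd, verify_cmd, ask, run=subprocess.run):
    """Apply one fix only after the user agrees, then re-run the failing command.

    description: human-readable summary of the fix.
    fix_cmd / verify_cmd: argument lists, e.g. ["some-tool", "--flag"].
    ask: callable taking a prompt string and returning True or False.
    run: injectable runner (defaults to subprocess.run) so the logic is testable.
    """
    prompt = f"Apply fix: {description}? Command: {' '.join(fix_cmd)}"
    if not ask(prompt):
        # Nothing is changed without explicit consent.
        return "declined"
    # The only state-changing call; raises CalledProcessError if the fix fails.
    run(list(fix_cmd), check=True)
    # Mark the fix complete only if the originally failing command now succeeds.
    result = run(list(verify_cmd))
    return "fixed" if result.returncode == 0 else "still_failing"
```

Keeping the prompt and the runner as parameters means the same gate can wrap every fix, and a declined prompt leaves the system untouched.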

Out of scope

  • Deep-diving into internal ROCm bugs (kernel crashes, compiler ICEs, miscompiled kernels). Those belong on the relevant ROCm repo.
  • Performance tuning. That belongs to mi-tuner, omniperf-tune, and apu-memory-tuner.
  • Porting code (CUDA to HIP). That's cuda-to-hip.
  • Holding a copy of any support matrix in the skill body. Matrices go stale
    fast; always fetch live from rocm.docs.amd.com at run time.
  • Touching BIOS, flashing firmware, or auto-rebooting.
  • Installing ROCm from scratch on a clean machine. That is a setup task,
    not a doctor task; point at the official installer instead.

Device support, phased

| Phase | GPUs |
| --- | --- |
| 0 | Ryzen AI APUs (Strix Halo, Strix Point, Krackan, Phoenix, Hawk Point) |
| 1 | Instinct (latest MI GPUs) |
| 2 | Radeon dGPUs |

Framework support, phased

| Framework | Phase |
| --- | --- |
| PyTorch | 0 |
| llama.cpp | 0 |
| Lemonade | 0 |
| LM Studio | 0 |
| Ollama | 0 |
| vLLM | 1+ |
| SGLang | 1+ |

High-level flow

[ ] 1. Identify the framework (ask if unclear).
[ ] 2. Decide whether system examination is needed for that framework.
       - Lemonade / LM Studio / Ollama -> skip, route to upstream.
       - Everything else -> continue.
[ ] 3. Run system examination (read-only).
[ ] 4. Match the symptoms against the known-misconfiguration list.
[ ] 5. Propose the fix; ask before applying anything that changes state.
[ ] 6. Re-verify after the fix; mark complete only when the originally
       failing command now succeeds.

Steps 1 through 4 are read-only and cheap. Step 5 is the only step that changes the system, and it always asks first.
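
Steps 1 and 2 of the flow above could be sketched roughly as follows. This is an illustration only: `Plan`, `plan_next_step`, and the action strings are hypothetical names, and the route-upstream set simply mirrors the three frameworks named in step 2.

```python
from dataclasses import dataclass
from typing import Optional

# Frameworks that skip system examination and are routed upstream (step 2).
ROUTE_UPSTREAM = {"lemonade", "lm studio", "ollama"}


@dataclass
class Plan:
    framework: Optional[str]
    action: str  # "ask_framework", "route_upstream", or "examine_system"


def plan_next_step(framework: Optional[str]) -> Plan:
    """Pick the next action from the framework the user reported."""
    if framework is None or not framework.strip():
        # Step 1: the framework is unclear, so ask before doing anything else.
        return Plan(None, "ask_framework")
    name = framework.strip().lower()
    if name in ROUTE_UPSTREAM:
        # Step 2: skip examination and hand off to the owning project.
        return Plan(name, "route_upstream")
    # Everything else continues to the read-only examination (step 3).
    return Plan(name, "examine_system")
```

For example, `plan_next_step("Ollama")` yields the `route_upstream` action, while `plan_next_step("PyTorch")` continues to `examine_system`.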

What the system examination collects

  • OS name, distro, kernel version.
  • AMD GPU(s) present, each one's gfx target.
  • amdgpu kernel module status, /dev/kfd ownership and mode, user's
    group memberships.
  • System ROCm version (if any) and how it was installed (amdgpu-install, distro packages, pip wheel, none).
  • Framework version (PyTorch's torch.version.hip, llama.cpp build info, etc.) and which ROCm it links against.
  • Relevant environment variables: HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, ROCM_PATH, PYTORCH_ROCM_ARCH, LD_LIBRARY_PATH.

This is the minimum needed to disambiguate the 12 failure modes listed under "Known misconfigurations to catch" below. Anything
beyond that should justify its tokens.
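
A rough, Linux-only sketch of a read-only collector for part of that list, using only the Python standard library plus an optional `torch` import. The function names and the shape of the returned dictionary are assumptions, not the skill's actual script. It leaves out the gfx target, the distro name, and the ROCm install method, and an amdgpu driver built into the kernel would not appear in `/proc/modules`.

```python
import grp
import os
import platform
import stat

ENV_VARS = [
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "ROCM_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
]


def _group_name(gid):
    """Map a numeric gid to a name, falling back to the number as text."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def describe_device(path="/dev/kfd"):
    """Return the owning group and mode of a device node, or None if absent."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return {"group": _group_name(st.st_gid), "mode": oct(stat.S_IMODE(st.st_mode))}


def amdgpu_loaded(modules_path="/proc/modules"):
    """True/False if the module list is readable, None if it is not."""
    try:
        with open(modules_path) as f:
            return any(line.split()[0] == "amdgpu" for line in f if line.strip())
    except OSError:
        return None


def torch_hip_version():
    """Return torch.version.hip, or None if PyTorch is missing or broken."""
    try:
        import torch
    except Exception:  # a broken install may fail with more than ImportError
        return None
    return getattr(torch.version, "hip", None)


def examine():
    """Collect a read-only snapshot; nothing here changes system state."""
    return {
        "os": platform.system(),
        "kernel": platform.release(),
        "kfd": describe_device("/dev/kfd"),
        "groups": sorted(_group_name(g) for g in os.getgroups()),
        "amdgpu_loaded": amdgpu_loaded(),
        "torch_hip": torch_hip_version(),
        "env": {name: os.environ.get(name) for name in ENV_VARS},
    }
```

Every probe returns `None` instead of raising when the information is unavailable, so a half-broken machine still produces a usable snapshot.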

Known misconfigurations to catch

A closed list. If the user's symptom doesn't match one of these, the skill links to the right upstream tracker rather than guessing.

  1. GPU gfx target not in the framework's build arch list.
  2. HSA_OVERRIDE_GFX_VERSION set on a GPU that now has a native wheel.
  3. ROCm version, Linux distro, and kernel form an unsupported triple.
  4. User not in the render / video groups, or /dev/kfd owned by
    whichever of those groups the user is missing.
  5. amdgpu kernel module not loaded or blacklisted.
  6. ROCm binaries not on PATH after install.
  7. Stale or conflicting APT / DNF repos from prior installer runs.
  8. Framework wheel built for a different ROCm major than the system has.
  9. iGPU enumerated alongside a dGPU and crashing the runtime.
  10. Container can't see /dev/kfd or /dev/dri/renderD*.
  11. Multi-GPU hang on systems with IOMMU enabled.
  12. amdgpu-install left a broken state (e.g. --accept-eula repo
    regression, partial DKMS).
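
To illustrate how a closed list can be matched mechanically, here is a hedged sketch covering only items 4, 5, and 6. The snapshot keys (`groups`, `kfd`, `amdgpu_loaded`, `rocm_installed`) and the function name are hypothetical. Using `rocminfo` as the probe for "ROCm binaries on PATH" is an assumption about which binary to look for, and the /dev/kfd ownership test is one possible reading of item 4.

```python
import shutil


def match_known_misconfigurations(snapshot, which=shutil.which):
    """Return {list item number: finding} for items 4, 5, and 6 only.

    snapshot: dict from a prior read-only examination (hypothetical shape).
    which: injectable PATH lookup, defaulting to shutil.which.
    """
    findings = {}
    groups = set(snapshot.get("groups", []))
    kfd = snapshot.get("kfd")

    # Item 4: render / video membership and /dev/kfd group ownership.
    missing = [g for g in ("render", "video") if g not in groups]
    if missing:
        findings[4] = "user is not in group(s): " + ", ".join(missing)
    elif kfd is not None and kfd["group"] not in groups:
        findings[4] = "/dev/kfd is owned by group " + kfd["group"]

    # Item 5: amdgpu kernel module not loaded (None means unknown, so skip).
    if snapshot.get("amdgpu_loaded") is False:
        findings[5] = "amdgpu kernel module is not loaded"

    # Item 6: ROCm is installed but its binaries are not on PATH.
    if snapshot.get("rocm_installed") and which("rocminfo") is None:
        findings[6] = "rocminfo not found on PATH"

    return findings
```

An empty result does not mean the system is healthy; it only means none of the three sketched items matched, and the remaining items still need their own checks.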

Adjacent problem: ROCm facts live in hand-written tables

Most of what this skill needs (supported GPUs, kernel ranges, ROCm releases, wheel arch lists, gfx families) is scattered across hand-typed tables in docs pages, READMEs, and release notes. Everyone re-parses the same matrix, and the copies drift apart.

The real fix is bigger than this skill: ROCm needs a single, agent-friendly source of truth that feeds both the docs and skills like rocm-doctor. Until that exists, we scrape rocm.docs.amd.com at run time. It's a workaround, not the endgame.

Why a skill, not a doc?

The information needed to debug a ROCm install is mostly already on rocm.docs.amd.com and in dozens of GitHub issues. What's missing is the opinionated decision tree a senior AMD engineer runs in their head: which check to run first, which fix to propose, and which questions to skip, along with supplemental scripts that make those checks easy and effective. That tree is what this skill encodes.

@danielholanda danielholanda self-assigned this May 25, 2026
@danielholanda danielholanda merged commit 217cb8e into main May 25, 2026
2 checks passed