`rocm-doctor` skill [Phase 0] by danielholanda · Pull Request #37 · amd/skills

danielholanda · 2026-05-25T22:54:41Z

Important

This skill still needs to go through testing and design iterations before it is published. We are currently looking for someone on the ROCm team to own it.

Note: This description has been copied from the proposal issue.

`rocm-doctor` skill

Skill metadata

Field	Value
Proposed name	`rocm-doctor`
Already exists?	No, new
Catalog area	Hardware-native skills
Location	Path A (incubated here; may graduate to a product repo later)

One-sentence outcome

Given a user complaint of the form "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU", identify which of a fixed list of known misconfigurations is the cause and either fix it or hand the user a precise next step.

Scope

In scope

Detect the user's setup: OS / distro / kernel, GPU gfx target, installed ROCm version, framework version (PyTorch / llama.cpp / etc.).
Match the symptoms against a fixed list of known misconfigurations (see "Known misconfigurations to catch") and either fix them or hand back exact next steps.
Apply low-risk fixes on the user's behalf with consent (group membership, environment variables, PATH, reinstall a wheel from the right index, etc.).
Route the user to the correct upstream channel when the issue is owned by another project (Lemonade, LM Studio, Ollama, vLLM, ...).

Out of scope

Deep-diving into internal ROCm bugs (kernel crashes, compiler ICEs, miscompiled kernels). Those belong on the relevant ROCm repo.
Performance tuning. That's the job of mi-tuner, omniperf-tune, and apu-memory-tuner.
Porting code (CUDA to HIP). That's cuda-to-hip.
Holding a copy of any support matrix in the skill body. Matrices go stale fast; always fetch live from rocm.docs.amd.com at run time.
Touching BIOS, flashing firmware, or auto-rebooting.
Installing ROCm from scratch on a clean machine. That is a setup task, not a doctor task; point at the official installer instead.

Device support, phased

Phase	GPUs
0	Ryzen AI APUs (Strix Halo, Strix Point, Krackan, Phoenix, Hawk Point)
1	Instinct (Latest MI GPUs)
2	Radeon dGPUs

Framework support, phased

Framework	Phase
PyTorch	0
llama.cpp	0
Lemonade	0
LM Studio	0
Ollama	0
vLLM	1+
SGLang	1+

High-level flow

[ ] 1. Identify the framework (ask if unclear).
[ ] 2. Decide whether system examination is needed for that framework.
       - Lemonade / LM Studio / Ollama -> skip, route to upstream.
       - Everything else -> continue.
[ ] 3. Run system examination (read-only).
[ ] 4. Match the symptoms against the known-misconfiguration list.
[ ] 5. Propose the fix; ask before applying anything that changes state.
[ ] 6. Re-verify after the fix; mark complete only when the originally
       failing command now succeeds.

Steps 1 through 4 are read-only and cheap. Step 5 is the only step that changes the system, and it always asks first.
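
As a rough illustration of steps 2, 5, and 6, the routing decision and the consent gate could be sketched as below. This is a minimal sketch, not the skill's actual implementation: `Route`, `UPSTREAM_OWNED`, `decide_route`, and `apply_fix` are hypothetical names introduced here, and the `fix` and `verify` callables stand in for whatever command the skill would really run.

```python
from enum import Enum


class Route(Enum):
    UPSTREAM = "route to upstream"
    EXAMINE = "run system examination"


# Frameworks the flow above hands off without a system examination.
UPSTREAM_OWNED = {"lemonade", "lm studio", "ollama"}


def decide_route(framework: str) -> Route:
    """Step 2: decide whether a system examination is needed."""
    name = framework.strip().lower()
    if not name:
        # Step 1 has not been completed yet.
        raise ValueError("framework is unknown; ask the user first")
    return Route.UPSTREAM if name in UPSTREAM_OWNED else Route.EXAMINE


def apply_fix(description, fix, verify, confirm=input):
    """Steps 5 and 6: ask before changing state, then re-verify.

    `fix` changes the system; `verify` re-runs the originally failing
    command and returns True on success. Both are hypothetical callables.
    """
    answer = confirm(f"Apply fix: {description}? [y/N] ")
    if answer.strip().lower() != "y":
        return False  # user declined, nothing was changed
    fix()
    return bool(verify())
```

Passing `confirm` as a parameter keeps the consent prompt swappable, so the same function can be driven by an agent or by a test without touching standard input.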

What the system examination collects

OS name, distro, kernel version.
AMD GPU(s) present, each one's gfx target.
amdgpu kernel module status, /dev/kfd ownership and mode, user's group memberships.
System ROCm version (if any) and how it was installed (amdgpu-install, distro packages, pip wheel, none).
Framework version (PyTorch's torch.version.hip, llama.cpp build info, etc.) and which ROCm it links against.
Relevant environment variables: HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, ROCM_PATH, PYTORCH_ROCM_ARCH, LD_LIBRARY_PATH.

This is the minimum needed to disambiguate the 12 failure modes listed under "Known misconfigurations to catch" below. Anything beyond that should justify its tokens.
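
A read-only collector for part of this list might look like the following. It is a Linux-only sketch using only the Python standard library, and it covers just the kernel, module, /dev/kfd, group, and environment items; GPU gfx targets, ROCm version, and framework versions are left out. The function names and the shape of the returned dictionary are assumptions made for illustration.

```python
import grp
import os
import platform
import stat

# Environment variables named in the list above.
ENV_VARS = (
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "ROCM_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
)


def _group_name(gid):
    """Resolve a gid to a name, falling back to the number as text."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def describe_device(path="/dev/kfd"):
    """Return owning group and permission bits, or None if absent."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return {"group": _group_name(st.st_gid), "mode": oct(stat.S_IMODE(st.st_mode))}


def amdgpu_loaded(modules_path="/proc/modules"):
    """True/False if the module list is readable, None otherwise.

    A driver built into the kernel does not appear in /proc/modules,
    so False here is a hint rather than proof.
    """
    try:
        with open(modules_path) as f:
            return any(line.split()[0] == "amdgpu" for line in f if line.strip())
    except OSError:
        return None


def examine():
    """Collect facts without changing any system state."""
    return {
        "os": platform.system(),
        "kernel": platform.release(),
        "amdgpu_loaded": amdgpu_loaded(),
        "kfd": describe_device("/dev/kfd"),
        "groups": sorted(_group_name(g) for g in os.getgroups()),
        "env": {name: os.environ.get(name) for name in ENV_VARS},
    }
```

Every probe returns `None` when it cannot read its source, so a missing device or an unreadable file shows up as "unknown" instead of raising.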

Known misconfigurations to catch

A closed list. If the user's symptom doesn't match one of these, the skill links to the right upstream tracker rather than guessing.

GPU gfx target not in the framework's build arch list.
HSA_OVERRIDE_GFX_VERSION set on a GPU that now has a native wheel.
ROCm version, Linux distro, and kernel form an unsupported triple.
User not in render / video groups, or /dev/kfd owned by a group the user is not in.
amdgpu kernel module not loaded or blacklisted.
ROCm binaries not on PATH after install.
Stale or conflicting APT / DNF repos from prior installer runs.
Framework wheel built for a different ROCm major than the system has.
iGPU enumerated alongside a dGPU and crashing the runtime.
Container can't see /dev/kfd or /dev/dri/renderD*.
Multi-GPU hang on systems with IOMMU enabled.
amdgpu-install left a broken state (e.g. --accept-eula repo regression, partial DKMS).
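
One way to keep the list closed is to express each entry as a small pure check over collected facts. The sketch below covers three of the twelve entries, each in simplified form. The `facts` keys (`amdgpu_loaded`, `kfd`, `groups`, `wheel_rocm`, `system_rocm`) and all function names are hypothetical, and the group check is a deliberate simplification that ignores root and world-accessible device modes.

```python
def check_module(facts):
    """amdgpu kernel module not loaded or blacklisted."""
    if facts.get("amdgpu_loaded") is False:
        return "amdgpu kernel module is not loaded"
    return None


def check_groups(facts):
    """User is not in the group that owns /dev/kfd."""
    kfd = facts.get("kfd")
    if kfd is None:
        return None  # no device node seen; a different check owns that case
    if kfd["group"] not in facts.get("groups", []):
        return f"user is not in the '{kfd['group']}' group that owns /dev/kfd"
    return None


def check_rocm_major(facts):
    """Framework wheel built for a different ROCm major than the system has."""
    wheel, system = facts.get("wheel_rocm"), facts.get("system_rocm")
    if not wheel or not system:
        return None  # one side unknown; nothing to compare
    if wheel.split(".")[0] != system.split(".")[0]:
        return f"framework built for ROCm {wheel}, system has ROCm {system}"
    return None


# Closed list: only checks registered here can ever fire.
CHECKS = (check_module, check_groups, check_rocm_major)


def diagnose(facts):
    """Return the findings that match, in registration order."""
    return [msg for check in CHECKS if (msg := check(facts)) is not None]
```

An empty result from `diagnose` corresponds to the case above where the symptom matches nothing on the list and the skill links to the upstream tracker instead of guessing.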

Adjacent problem: ROCm facts live in hand-written tables

Most of what this skill needs (supported GPUs, kernel ranges, ROCm releases, wheel arch lists, gfx families) is scattered across hand-typed tables in docs pages, READMEs, and release notes. Everyone re-parses the same matrix, and they drift.

The real fix is bigger than this skill: ROCm needs a single, agent-friendly source of truth that feeds both the docs and skills like rocm-doctor. Until that exists, we scrape rocm.docs.amd.com at run time. It's a workaround, not the endgame.

Why a skill, not a doc?

The information needed to debug a ROCm install is mostly already on rocm.docs.amd.com and in dozens of GitHub issues. What's missing is the opinionated decision tree a senior AMD engineer runs in their head: which check to run first, which fix to propose, which questions to skip. That tree, along with supplemental scripts that make it easy to run, is what this skill encodes.

danielholanda added 4 commits May 25, 2026 14:55

Rocm-doctor v1

7c6bec5

Cleaner

67f04d6

Add Windows to scope

042ff80

Documentation

3a7fee6

danielholanda self-assigned this May 25, 2026

Concise description

67ab9db

danielholanda merged commit 217cb8e into main May 25, 2026
2 checks passed

danielholanda merged 5 commits into main from dholanda/rocm
