rocm-doctor skill
Skill metadata
| Field | Value |
| --- | --- |
| Proposed name | rocm-doctor |
| Already exists? | No, new |
| Catalog area | Hardware-native skills |
| Location | Path A (incubated here; may graduate to a product repo later) |
One-sentence outcome
Given a user complaint of the form "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU", identify which of a fixed list of known misconfigurations is the cause and either fix it or hand the user a precise next step.
Scope
In scope
- Detect the user's setup: OS / distro / kernel, GPU gfx target, installed ROCm version, framework version (PyTorch / llama.cpp / etc.).
- Identify a fixed list of known misconfigurations (see "Known misconfigurations to catch") and either fix them or hand back exact next steps.
- Apply low-risk fixes on the user's behalf with consent (group membership, environment variables, PATH, reinstall a wheel from the right index, etc.).
- Route the user to the correct upstream channel when the issue is owned by another project (Lemonade, LM Studio, Ollama, vLLM, ...).
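The consent rule for low-risk fixes can be sketched as a small gate: show the exact command, run it only after an explicit yes. This is a minimal sketch, not the skill's implementation. The function name `apply_with_consent` and its parameters are hypothetical, and the `usermod` invocation in the usage comment is just one example of a group-membership fix.

```python
import subprocess
from typing import Callable, Sequence


def apply_with_consent(
    description: str,
    command: Sequence[str],
    ask: Callable[[str], str] = input,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Show the exact command and run it only on an explicit 'y'.

    Returns True if the command was run, False if the user declined.
    `ask` and `run` are injectable so the gate can be exercised without
    touching the system (hypothetical design choice for this sketch).
    """
    prompt = f"{description}\n  $ {' '.join(command)}\nApply? [y/N] "
    if ask(prompt).strip().lower() != "y":
        return False  # anything other than an explicit yes changes nothing
    run(list(command), check=True)
    return True


# Example usage (not executed here): add the current user to render/video.
#   import getpass
#   apply_with_consent(
#       "Add your user to the render and video groups.",
#       ["sudo", "usermod", "-aG", "render,video", getpass.getuser()],
#   )
```

Defaulting to "no" keeps an accidental Enter from changing state, which matches the rule that every state-changing step asks first.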
Out of scope
- Deep-diving into internal ROCm bugs (kernel crashes, compiler ICEs, miscompiled kernels). Those belong on the relevant ROCm repo.
- Performance tuning. That's mi-tuner, omniperf-tune, apu-memory-tuner.
- Porting code (CUDA to HIP). That's cuda-to-hip.
- Holding a copy of any support matrix in the skill body. Matrices go stale fast; always fetch live from rocm.docs.amd.com at run time.
- Touching BIOS, flashing firmware, or auto-rebooting.
- Installing ROCm from scratch on a clean machine. That is a setup task, not a doctor task; point at the official installer instead.
Device support, phased
| Phase | GPUs |
| --- | --- |
| 0 | Ryzen AI APUs (Strix Halo, Strix Point, Krackan, Phoenix, Hawk Point) |
| 1 | Instinct (latest MI GPUs) |
| 2 | Radeon dGPUs |
Framework support, phased
| Framework | Phase |
| --- | --- |
| PyTorch | 0 |
| llama.cpp | 0 |
| Lemonade | 0 |
| LM Studio | 0 |
| Ollama | 0 |
| vLLM | 1+ |
| SGLang | 1+ |
High-level flow
[ ] 1. Identify the framework (ask if unclear).
[ ] 2. Decide whether system examination is needed for that framework.
    - Lemonade / LM Studio / Ollama -> skip, route to upstream.
    - Everything else -> continue.
[ ] 3. Run system examination (read-only).
[ ] 4. Match the symptoms against the known-misconfiguration list.
[ ] 5. Propose the fix; ask before applying anything that changes state.
[ ] 6. Re-verify after the fix; mark complete only when the originally failing command now succeeds.
Steps 1 through 4 are read-only and cheap. Step 5 is the only step that changes the system, and it always asks first.
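The branch in step 2 can be sketched as a lookup that decides which steps a run will go through. This is an illustrative sketch only. The step names and the `plan` function are hypothetical labels for the checklist items above, not an existing API.

```python
# Frameworks the flow routes upstream without examining the system (step 2).
ROUTE_UPSTREAM = {"lemonade", "lm studio", "ollama"}


def needs_system_examination(framework: str) -> bool:
    """Return False for frameworks that are routed straight to their upstream."""
    return framework.strip().lower() not in ROUTE_UPSTREAM


def plan(framework: str) -> list:
    """Return the ordered step names for one run (names are illustrative)."""
    if not needs_system_examination(framework):
        return ["identify_framework", "route_to_upstream"]
    return [
        "identify_framework",
        "examine_system",          # read-only
        "match_misconfiguration",  # read-only
        "propose_fix",             # asks before changing state
        "reverify",                # rerun the originally failing command
    ]


if __name__ == "__main__":
    for name in ("Ollama", "PyTorch"):
        print(name, "->", plan(name))
```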
What the system examination collects
- OS name, distro, kernel version.
- AMD GPU(s) present, each one's gfx target.
- amdgpu kernel module status, /dev/kfd ownership and mode, user's group memberships.
- System ROCm version (if any) and how it was installed (amdgpu-install, distro packages, pip wheel, none).
- Framework version (PyTorch's torch.version.hip, llama.cpp build info, etc.) and which ROCm it links against.
- Relevant environment variables: HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, ROCM_PATH, PYTORCH_ROCM_ARCH, LD_LIBRARY_PATH.
This is the minimum needed to disambiguate the 12 failure modes listed below. Any check beyond that should justify its tokens.
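A read-only examination along these lines could look like the sketch below, using only the Python standard library on Linux. It is a partial sketch under stated assumptions: the function names and the shape of the returned dict are hypothetical, and it does not collect the gfx target, the system ROCm version, or the framework version, since those need ROCm tooling or the framework itself.

```python
import grp
import os
import platform
import shutil
import stat
from typing import Optional

ENV_VARS = [
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "ROCM_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
]


def group_name(gid: int) -> str:
    """Resolve a gid to its name, falling back to the number if unknown."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def describe_device(path: str) -> Optional[dict]:
    """Return owning group and permission bits of a device node, or None if absent."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return {"group": group_name(st.st_gid), "mode": oct(stat.S_IMODE(st.st_mode))}


def amdgpu_loaded(modules_path: str = "/proc/modules") -> Optional[bool]:
    """True if amdgpu is listed as a loaded module; None if the list is unreadable.

    A driver built into the kernel would not appear in this list.
    """
    try:
        with open(modules_path) as fh:
            return any(line.split(None, 1)[:1] == ["amdgpu"] for line in fh)
    except OSError:
        return None


def distro() -> Optional[str]:
    """Read the distro name from os-release where available (Python 3.10+)."""
    reader = getattr(platform, "freedesktop_os_release", None)
    if reader is None:
        return None
    try:
        return reader().get("PRETTY_NAME")
    except OSError:
        return None


def examine() -> dict:
    """Collect facts without changing any system state."""
    return {
        "os": platform.system(),
        "distro": distro(),
        "kernel": platform.release(),
        "amdgpu_loaded": amdgpu_loaded(),
        "kfd": describe_device("/dev/kfd"),
        "groups": sorted(group_name(g) for g in os.getgroups()),
        "rocminfo_on_path": shutil.which("rocminfo") is not None,
        "env": {name: os.environ.get(name) for name in ENV_VARS},
    }


if __name__ == "__main__":
    import json
    print(json.dumps(examine(), indent=2))
```

Every probe returns None rather than raising when the thing it looks for is missing, so an absent /dev/kfd shows up as a fact to match on instead of a crash.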
Known misconfigurations to catch
A closed list. If the user's symptom doesn't match one of these, the skill links to the right upstream tracker rather than guessing.
- GPU gfx target not in the framework's build arch list.
- HSA_OVERRIDE_GFX_VERSION set on a GPU that now has a native wheel.
- ROCm version, Linux distro, and kernel form an unsupported triple.
- User not in render / video groups, or /dev/kfd owned by the other group.
- amdgpu kernel module not loaded or blacklisted.
- ROCm binaries not on PATH after install.
- Stale or conflicting APT / DNF repos from prior installer runs.
- Framework wheel built for a different ROCm major than the system has.
- iGPU enumerated alongside a dGPU and crashing the runtime.
- Container can't see /dev/kfd or /dev/dri/renderD*.
- Multi-GPU hang on systems with IOMMU enabled.
- amdgpu-install left a broken state (e.g. --accept-eula repo regression, partial DKMS).
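Matching against a closed list can be sketched as pure functions over the collected facts. The sketch below covers only four of the twelve entries. The `facts` keys (`gfx_target`, `wheel_arch_list`, `torch_hip`, `system_rocm`, and so on) are hypothetical names, and the arch list is assumed to have been fetched live rather than stored in the skill.

```python
from typing import Optional


def rocm_major(version: Optional[str]) -> Optional[int]:
    """Extract the leading major number from a version string, if there is one."""
    if not version:
        return None
    head = version.split(".", 1)[0]
    return int(head) if head.isdigit() else None


def match(facts: dict) -> list:
    """Return the known misconfigurations that the facts point to.

    An empty list means "no match": hand off to the upstream tracker
    rather than guessing.
    """
    findings = []
    gfx = facts.get("gfx_target")
    arch_list = facts.get("wheel_arch_list") or []
    override = (facts.get("env") or {}).get("HSA_OVERRIDE_GFX_VERSION")

    if gfx and arch_list and gfx not in arch_list:
        findings.append("gfx target not in the framework's build arch list")
    if override and gfx in arch_list:
        findings.append("HSA_OVERRIDE_GFX_VERSION set on a GPU with a native wheel")

    kfd = facts.get("kfd")
    if kfd and kfd["group"] not in facts.get("groups", []):
        findings.append("user not in the group that owns /dev/kfd")

    wheel = rocm_major(facts.get("torch_hip"))
    system = rocm_major(facts.get("system_rocm"))
    if wheel is not None and system is not None and wheel != system:
        findings.append("framework wheel built for a different ROCm major")
    return findings


if __name__ == "__main__":
    # Made-up facts for illustration only; not real support data.
    example = {
        "gfx_target": "gfx0000",
        "wheel_arch_list": ["gfx0000"],
        "env": {"HSA_OVERRIDE_GFX_VERSION": "0.0.0"},
        "kfd": {"group": "render", "mode": "0o660"},
        "groups": ["video"],
    }
    for finding in match(example):
        print("-", finding)
```

Keeping each check a plain predicate over already-collected facts means this step stays read-only, and missing facts simply skip a check instead of producing a false match.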
Adjacent problem: ROCm facts live in hand-written tables
Most of what this skill needs (supported GPUs, kernel ranges, ROCm releases, wheel arch lists, gfx families) is scattered across hand-typed tables in docs pages, READMEs, and release notes. Everyone re-parses the same matrix, and the copies drift apart.
The real fix is bigger than this skill: ROCm wants a single, agent-friendly source of truth that feeds both the docs and skills like rocm-doctor. Until that exists, we scrape rocm.docs.amd.com at run time. It's a workaround, not the endgame.
Why a skill, not a doc?
The information needed to debug a ROCm install is mostly already on rocm.docs.amd.com and in dozens of GitHub issues. What's missing is the opinionated decision tree a senior AMD engineer runs in their head: which check to run first, which fix to propose, which questions to skip. That tree, together with supplemental scripts that make it easy to run, is what this skill encodes.