Commit 217cb8e

Merge pull request #37 from amd/dholanda/rocm
`rocm-doctor` skill [Phase 0]
2 parents 35c381a + 67ab9db commit 217cb8e

7 files changed

Lines changed: 4234 additions & 1 deletion

File tree

.claude-plugin/marketplace.json

Lines changed: 6 additions & 0 deletions
@@ -25,6 +25,12 @@
2525
"source": "./skills/local-ai-use",
2626
"skills": "./",
2727
"description": "Route image generation, text-to-speech, and speech-to-text through a local AI Server to reduce token/cost usage."
28+
},
29+
{
30+
"name": "rocm-doctor",
31+
"source": "./skills/rocm-doctor",
32+
"skills": "./",
33+
"description": "Diagnose why ROCm, PyTorch, or llama.cpp isn't working on an AMD GPU. Matches the symptom against a fixed list of twelve known misconfigurations and proposes the next step."
2834
}
2935
]
3036
}

README.md

Lines changed: 1 addition & 1 deletion
@@ -200,7 +200,7 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for step-by-step instructions, the full a
200200

201201
## Status
202202

203-
This repository is in its early days. In-repo skills include `skills/local-ai-app-integration/` and `skills/local-ai-use/`, seeding the **Application integration** focus area, and `skills/apu-memory-tuner/`, seeding the **Hardware-native** focus area. The remaining skills are being built out incrementally alongside manifests and CI. Expect rapid iteration.
203+
This repository is in its early days. In-repo skills include `skills/local-ai-app-integration/` and `skills/local-ai-use/`, seeding the **Application integration** focus area, and `skills/apu-memory-tuner/` and `skills/rocm-doctor/`, seeding the **Hardware-native** focus area. The remaining skills are being built out incrementally alongside manifests and CI. Expect rapid iteration.
204204

205205
## License
206206

skills/rocm-doctor/SKILL.md

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@
1+
---
2+
name: rocm-doctor
3+
description: >-
4+
Diagnoses why ROCm, the HIP SDK, PyTorch, or llama.cpp is broken on an
5+
AMD GPU on Linux or Windows, and either applies a low-risk fix with
6+
consent or hands back the exact next step. Also routes Lemonade, LM
7+
Studio, and Ollama issues to the right upstream channel. Use when the
8+
user reports that ROCm or HIP isn't working, torch.cuda.is_available()
9+
is False on Ryzen AI, rocminfo or hipInfo can't see the GPU,
10+
or hits hipErrorNoBinaryForGpu,
11+
HSA_STATUS_ERROR_INVALID_ISA, invalid device function, missing
12+
amdhip64_6.dll, vcruntime140_1.dll, or libamdhip64.so, cannot open
13+
/dev/kfd, ROCk module not loaded, an Adrenalin driver too old for the
14+
HIP SDK, or a ROCm wheel that doesn't recognize gfx1151, gfx1150, or
15+
gfx1103; or mentions HSA_OVERRIDE_GFX_VERSION,
16+
HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH, render-group permissions,
17+
amdgpu blacklist, Secure Boot, iGPU/dGPU collisions, or multi-GPU
18+
hangs. Do not use for non-AMD GPUs, performance
19+
tuning, or ROCm-on-WSL2.
20+
---
21+
22+
# ROCm Doctor
23+
24+
Given a "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU" complaint,
25+
identify which **known misconfiguration** is the cause and either fix it
26+
or hand back the exact next step.
27+
28+
This is a diagnose-and-fix skill, not a setup or tuning skill. The
29+
catalog of failure modes is a **closed list** that lives in
30+
`reference.md` and `scripts/diagnose.py`: if the user's symptom doesn't
31+
match one of them, the skill explicitly routes upstream rather than
32+
guessing. New failure modes get added by editing the catalog, not by
33+
the agent inventing them at runtime.
34+
35+
## When to use this skill
36+
37+
Use it when **any** of the following is true:
38+
39+
- The user has an **AMD** GPU and a functional error with **PyTorch**,
40+
**llama.cpp**, or anything else built directly against the system ROCm
41+
(`/opt/rocm` or a pip wheel that bundles HIP). The skill examines the
42+
host and diagnoses against the catalog.
43+
- The user is on **Lemonade**, **LM Studio**, or **Ollama**. These apps
44+
ship their own ROCm and don't need a host-level examination, but the
45+
user often doesn't know *where* to report the problem -- the skill
46+
knows the right upstream channel for each (see
47+
[Framework routing](#framework-routing)) and hands it over.
48+
49+
Out of scope:
50+
51+
- NVIDIA / Intel / Apple Silicon GPUs. Exit cleanly and tell the user.
52+
- Fresh installs on a clean machine. That's a setup task; point at
53+
[`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html)
54+
(Linux) or the [HIP SDK installer](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html)
55+
(Windows).
56+
- Pure performance complaints. Those belong in `mi-tuner` /
57+
`omniperf-tune` / `apu-memory-tuner`.
58+
- **WSL2** (running Linux on top of Windows). The ROCm-on-WSL flow needs
59+
Adrenalin Pro plus the WSL kernel update on the Windows host -- those
60+
failure modes are not in this catalog. `examine.py` detects WSL via
61+
`/proc/version` and exits 2 with a route-out message; if the user wants
62+
WSL specifically, point them at <https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/docs/install/installryz/wsl/howto_wsl.html>.
63+
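The WSL check described above can be sketched as follows. This is a minimal illustration, not the actual code in `examine.py`: the function names are hypothetical, and the only assumption about the platform is that WSL kernels report a version string containing `microsoft` in `/proc/version`.

```python
from pathlib import Path


def looks_like_wsl(proc_version_text: str) -> bool:
    """Heuristic: WSL kernels include 'microsoft' in their version string."""
    return "microsoft" in proc_version_text.lower()


def running_under_wsl(path: str = "/proc/version") -> bool:
    """Read /proc/version if it exists; a missing file means 'not WSL'."""
    try:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
    except OSError:
        # No /proc/version (e.g. native Windows or macOS): not WSL.
        return False
    return looks_like_wsl(text)
```

A caller would exit with the documented code 2 and print the route-out message when `running_under_wsl()` returns `True`.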
64+
## Prerequisites
65+
66+
- **OS:** Linux **or** Windows (native). The catalog has 12 Linux entries
67+
(5 of which are also valid on Windows) and 3 Windows-only entries; the
68+
scripts pick the right subset for the host they run on.
69+
- **Linux tools the agent will invoke as part of examination** (best-effort;
70+
the script degrades when one is missing):
71+
- `lspci` (usually present on desktop distros)
72+
- `rocminfo` (when ROCm is installed)
73+
- `journalctl` or `dmesg` (for amdgpu kernel-ring evidence)
74+
- `python` / `python3` to introspect PyTorch
75+
- `llama-cli` / `llama-server` / `main` to introspect llama.cpp
76+
- **Windows tools the agent will invoke as part of examination**:
77+
- `powershell` (always present on Windows 10+) for `Get-CimInstance
78+
Win32_VideoController` / `Win32_Processor` and the env-scope reads.
79+
- `hipInfo.exe` from `%HIP_PATH%\bin` -- the Windows analog of `rocminfo`.
80+
Absence is itself a signal (see `fix-13-hip-sdk-missing`).
81+
- `setx` for env-var persistence and User-PATH edits (analog of editing
82+
`~/.bashrc` on Linux).
83+
- `python` to introspect PyTorch.
84+
- **Permissions:** examination is fully read-only and works as a regular
85+
user on both OSes. Linux fixes that need `sudo` are flagged in their
86+
recipe metadata; Windows fixes that touch the Machine env scope are
87+
flagged similarly and `apply_fix.py` does NOT self-elevate -- the user
88+
has to run an Administrator PowerShell when those are required.
89+
90+
Silent footguns to surface explicitly when relevant:
91+
92+
- `HSA_OVERRIDE_GFX_VERSION` -- forcing an unsupported gfx target works
93+
for `rocminfo`/`hipInfo` but causes page faults at runtime. Diagnosis
94+
`fix-2-unset-override` is the response when this is set on a GPU that
95+
already has a native wheel; on Windows it can be persisted in either
96+
the User or Machine env scope, so check both.
97+
- `HIP_VISIBLE_DEVICES` -- on dual-GPU systems (APU + dGPU) the iGPU is
98+
often index 0 and destabilises HIP unless explicitly hidden.
99+
- `HIP_PATH` (Windows) -- if the user has multiple HIP SDK versions
100+
installed under `C:\Program Files\AMD\ROCm\`, `HIP_PATH` decides which
101+
one PyTorch / hipInfo actually loads. Pointing it at the wrong major
102+
produces the same failure mode as `fix-8-wheel-rocm`.
103+
- `PYTORCH_ROCM_ARCH` -- only honored during a *build* of PyTorch. Setting
104+
it at runtime does nothing for a prebuilt wheel.
105+
- `LD_LIBRARY_PATH` (Linux) -- a wheel-bundled `libamdhip64.so` shadowed
106+
by a system one (or vice versa) gives confusing `cannot open shared
107+
object file` errors that look like fix-8 but are really a load-order
108+
bug. The Windows analog is `PATH` order: a stale HIP SDK bin directory
109+
earlier on PATH than the one matching `HIP_PATH`.
110+
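As an illustration of the env-var snapshot that the list above implies, here is a minimal sketch. The helper name `footgun_snapshot` and the tuple `FOOTGUN_VARS` are hypothetical and not taken from `examine.py`; the variable names themselves come from the list. It only reads the current process environment, so it would not see values persisted in the Windows User or Machine scope.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the "silent footguns" list above.
FOOTGUN_VARS = (
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "HIP_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
)


def footgun_snapshot(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the footgun variables that are set to a non-empty value."""
    env = os.environ if env is None else env
    return {name: env[name] for name in FOOTGUN_VARS if env.get(name)}
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to test against a canned environment.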
111+
## The three-step flow
112+
113+
Run these in order. The first two are read-only. The third asks before
114+
changing anything.
115+
116+
```
[ ] 1. Identify the framework, then examine (read-only).
[ ] 2. Diagnose: match examination + symptom against the catalog.
[ ] 3. Propose the fix; only apply with explicit consent; re-verify.
```
121+
122+
### Step 1: identify the framework and examine
123+
124+
If the user hasn't said, ask which framework they are running. Use the
125+
`AskQuestion` tool with PyTorch / llama.cpp / Lemonade / LM Studio /
126+
Ollama / other as the options. The routing in [Framework routing](#framework-routing)
127+
keys off the answer.
128+
129+
If the framework is in the "skip examination" bucket, jump straight to
130+
the upstream link and exit. Otherwise run:
131+
132+
```bash
python scripts/examine.py --framework pytorch --json > exam.json
```
135+
136+
Replace `pytorch` with `llama-cpp`, or pass `--framework auto` to let the
137+
script pick. Exit codes:
138+
139+
| Exit | Meaning | Next action |
140+
|---|---|---|
141+
| 0 | Examined; AMD GPU present | Continue to Step 2. |
142+
| 2 | Wrong platform (WSL, neither Linux nor Windows, no AMD GPU) | Stop. Route the user. |
143+
| 3 | Probes partially failed | Continue but warn the user. |
144+
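The exit-code table above maps naturally onto a small dispatch helper. The sketch below is one way an agent-side wrapper might branch on it; the action labels and function names are hypothetical, and treating undocumented exit codes as a hard stop is an assumption rather than documented behavior.

```python
import subprocess
import sys

# Exit codes as documented in the table above; the labels are illustrative.
EXAMINE_ACTIONS = {
    0: "continue",               # examined; AMD GPU present
    2: "route",                  # wrong platform: stop and route the user
    3: "continue-with-warning",  # probes partially failed
}


def next_action(exit_code: int) -> str:
    """Map an examine.py exit code to the next step."""
    # Assumption: any code outside the documented table is a hard stop.
    return EXAMINE_ACTIONS.get(exit_code, "stop-unexpected")


def run_examine(framework: str = "pytorch", out_path: str = "exam.json") -> str:
    """Run examine.py, save its JSON output, and return the next action."""
    with open(out_path, "w", encoding="utf-8") as out:
        proc = subprocess.run(
            [sys.executable, "scripts/examine.py",
             "--framework", framework, "--json"],
            stdout=out,
            check=False,  # non-zero exits carry meaning here
        )
    return next_action(proc.returncode)
```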
145+
For a quick read-only summary without piping JSON, drop `--json`:
146+
147+
```bash
python scripts/examine.py --framework pytorch
```
150+
151+
`examine.py` collects exactly the facts the diagnosis catalog needs.
152+
On Linux: OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd`
153+
status, `/dev/kfd` ownership and group, user's group membership, system
154+
ROCm version and install method, framework version and arch list, the
155+
silent-footgun env vars, container/IOMMU state, and recent `amdgpu`
156+
kernel log lines. On Windows: AMD adapters and gfx targets via
157+
`Win32_VideoController` + `hipInfo.exe`, the HIP SDK install path and
158+
version, the Adrenalin / kernel-mode driver version, MSVC redistributable
159+
presence, and the same env-var snapshot. It deliberately does NOT spawn
160+
heavy probes (no kernel launches, no model downloads).
161+
162+
### Step 2: diagnose
163+
164+
Hand the JSON snapshot plus the user's error text to `diagnose.py`:
165+
166+
```bash
python scripts/diagnose.py --exam exam.json \
  --symptom "HIP error: invalid device function on gfx1151"
```
170+
171+
The script runs every checker in the catalog, scores each from 0..100,
172+
and prints a ranked list. Each match has a stable `fix-N-...` id used by
173+
`apply_fix.py`.
174+
175+
Score tiers:
176+
177+
- `>= 75` (`HIGH`) -- propose the fix and (if auto-applicable) ask for
178+
consent to apply it.
179+
- `>= 50` (`LIKELY`) -- describe the match and ask the user to confirm one
180+
more piece of evidence before applying.
181+
- Below `50` -- print but do **not** act. If nothing scores `>= 50`, the
182+
script exits 1 with a single-line route to the right upstream tracker.
183+
Do not speculate.
184+
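The tier thresholds above can be expressed as a small function. This is a sketch of the documented thresholds only, not the scoring code in `diagnose.py`; the `BELOW-THRESHOLD` label and both function names are hypothetical.

```python
from typing import Iterable


def tier(score: int) -> str:
    """Classify a 0..100 checker score into the documented tiers."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 75:
        return "HIGH"
    if score >= 50:
        return "LIKELY"
    return "BELOW-THRESHOLD"  # printed, but never acted on


def should_route_upstream(scores: Iterable[int]) -> bool:
    """True when nothing reaches LIKELY, i.e. the 'exit 1 and route' case."""
    return all(score < 50 for score in scores)
```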
185+
JSON output (`--json`) is the same data the agent should use programmatically:
186+
187+
```bash
python scripts/diagnose.py --exam exam.json --symptom "..." --json
```
190+
191+
### Step 3: apply the fix (with consent)
192+
193+
Show the user the proposed fix (it's already printed by `diagnose.py`).
194+
If they consent, run:
195+
196+
```bash
python scripts/apply_fix.py --fix-id fix-4-render-group --dry-run
python scripts/apply_fix.py --fix-id fix-4-render-group --yes
```
200+
201+
`--dry-run` prints the exact commands without executing. `--yes` skips
202+
the interactive `[y/N]` prompt (only pass this after the user has agreed
203+
in chat).
204+
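The dry-run-then-apply sequence above can be captured in a helper that only emits the `--yes` invocation once consent has been given. This is a sketch: the function name and the `user_consented` parameter are hypothetical, and only the flags shown above (`--fix-id`, `--dry-run`, `--yes`) are used.

```python
import sys
from typing import List


def apply_fix_commands(fix_id: str, user_consented: bool) -> List[List[str]]:
    """Build the argv lists to run: always a dry run, then the real apply."""
    base = [sys.executable, "scripts/apply_fix.py", "--fix-id", fix_id]
    commands = [base + ["--dry-run"]]
    if user_consented:
        # Only skip the interactive prompt after the user agreed in chat.
        commands.append(base + ["--yes"])
    return commands
```

Each returned list can be handed to `subprocess.run` as-is; keeping the consent check in one place makes it harder to pass `--yes` by accident.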
205+
A subset of fixes are auto-applicable; the rest are deliberately
206+
print-only because the risk of a half-applied state is too high for an
207+
agent to take. To see which is which without consulting `reference.md`:
208+
209+
```bash
python scripts/apply_fix.py --list
```
212+
213+
That prints every `fix-id` with an `AUTO` or `PRINT-ONLY` tag. Auto
214+
fixes are bounded operations like unsetting an env var, adding the user
215+
to a group, or appending a single line to a shell rc. Print-only fixes
216+
involve reinstalling frameworks, editing GRUB, regenerating the
217+
initramfs, or moving system repo files; those need a human at the
218+
keyboard.
219+
220+
After every fix, re-run the `verify` command the recipe printed. Only
221+
declare success when the user's *original* failing command now succeeds
222+
(e.g. `torch.cuda.is_available()` returns `True`, `rocminfo` lists the
223+
GPU, the llama.cpp build runs).
224+
225+
## Framework routing
226+
227+
The skill's first decision is which framework the user runs. Some
228+
frameworks ship their own ROCm and bypass the system install; for those
229+
the right answer is "you're in the wrong place, here's where to file
230+
it", and the skill delivers that answer directly rather than running
231+
useless probes against the host.
232+
233+
| Framework | Examine the host? | Action |
234+
|---|---|---|
235+
| PyTorch (Linux ROCm wheel) | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
236+
| PyTorch (Windows TheRock wheel) | Yes | Same scripts; on Windows the catalog filters to the entries valid on both OSes plus the Windows-only entries. |
237+
| llama.cpp (built against system ROCm/HIP SDK) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
238+
| Lemonade | No -- ships its own ROCm | Route to <https://github.com/lemonade-sdk/lemonade/issues> and the Lemonade [Discord](https://discord.gg/5xXzkMu8Zk). |
239+
| LM Studio | No -- ships its own runtime | Route to <https://lmstudio.ai/docs/app> (in-app support; no public repo). |
240+
| Ollama | No -- ships its own runtime | Route to <https://github.com/ollama/ollama/issues> and the Ollama Discord. |
241+
| vLLM / SGLang | Out of scope until phase 1+ | Route to the project's own issue tracker. |
242+
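The routing table above amounts to a lookup. The sketch below encodes it as plain data, with the URLs copied from the table. The dictionary keys other than `pytorch` and `llama-cpp` (the documented `--framework` values) are hypothetical spellings, as are the function name and the returned labels.

```python
from typing import Optional, Tuple

# Frameworks built against the system ROCm / HIP SDK: examine the host.
EXAMINE = {"pytorch", "llama-cpp"}

# Frameworks that ship their own runtime: route upstream (URLs from the table).
UPSTREAM = {
    "lemonade": "https://github.com/lemonade-sdk/lemonade/issues",
    "lm-studio": "https://lmstudio.ai/docs/app",
    "ollama": "https://github.com/ollama/ollama/issues",
}


def route(framework: str) -> Tuple[str, Optional[str]]:
    """Return ('examine', None) or ('route-upstream', url-or-None)."""
    key = framework.strip().lower()
    if key in EXAMINE:
        return ("examine", None)
    # Unknown frameworks (e.g. vLLM, SGLang) go to their own tracker,
    # for which no URL is recorded here.
    return ("route-upstream", UPSTREAM.get(key))
```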
243+
If a Lemonade / LM Studio / Ollama user *does* have a host-level ROCm
244+
problem (rare), it shows up when their app fails AND a standalone
245+
`rocminfo` (Linux) / `hipInfo.exe` (Windows) also fails. Only then
246+
escalate to the full examination.
247+
248+
## Safety rules
249+
250+
- Read-only by default. Examination and diagnosis never change state.
251+
- Always print before applying. `apply_fix.py` shows every command before
252+
running it, even when `--yes` skips the consent prompt.
253+
- Never reboot, never touch BIOS, never flash firmware.
254+
- Never reinstall system packages without an interactive prompt or `--yes`.
255+
- Never set `HSA_OVERRIDE_GFX_VERSION` as the *first* fix when a native
256+
wheel exists. That is `fix-2-unset-override`'s entire reason for being.
257+
- Never silently fall back to a different fix when the requested one
258+
isn't applicable. Exit 3 and tell the user why.
259+
- When nothing in the catalog matches, **do not speculate**. Hand the
260+
user the upstream tracker URL from `diagnose.py --json`.
261+
262+
## Verification checklist
263+
264+
Mark this skill complete only when **all** are true:
265+
266+
- [ ] `python scripts/examine.py` exits 0 (or 3 with the user's explicit
267+
go-ahead to continue despite a partial probe).
268+
- [ ] `python scripts/diagnose.py --exam exam.json --symptom "..."` exits 0
269+
and surfaces exactly one HIGH-confidence diagnosis, OR it exits 1
270+
and the user has been routed to the right upstream tracker.
271+
- [ ] If a fix was applied: the recipe's `verify` command exits cleanly.
272+
- [ ] The user's *original* failing command now succeeds end-to-end (run
273+
it again in their original shell).
274+
- [ ] If any fix needed a re-login or reboot, the user has actually done
275+
it before success is declared.
276+
277+
If any box is unchecked, the failure isn't resolved -- say so out loud
278+
rather than declaring victory.
279+
280+
## Reference
281+
282+
For the full catalog of known misconfigurations, every fix-id and its
283+
verify command, the silent-footgun env-var reference, and the
284+
upstream-routing table in machine-readable form, see
285+
[reference.md](reference.md).
