 ---
 name: rocm-doctor
 description: >-
-  Diagnoses why ROCm, PyTorch, or llama.cpp isn't working on an AMD GPU
-  by matching the symptom against twelve known misconfigurations and
-  either applying a low-risk fix with consent or handing back the exact
-  next step. Use when the user says "ROCm/HIP isn't working",
-  "torch.cuda.is_available() is False on Radeon/Ryzen AI",
+  Diagnoses why ROCm, PyTorch, or llama.cpp fails on an AMD GPU by
+  matching symptoms against a closed catalog of known misconfigurations,
+  then either applies a low-risk fix with consent or hands back the
+  exact next step. Also routes Lemonade, LM Studio, and Ollama users to
+  the right upstream channel. Use when the user says "ROCm/HIP isn't
+  working", "torch.cuda.is_available() is False on Radeon/Ryzen AI",
   "rocminfo can't find my GPU", "hipErrorNoBinaryForGpu",
   "HSA_STATUS_ERROR_INVALID_ISA", "invalid device function",
   "Unable to open /dev/kfd", "ROCk module is NOT loaded",
-  "libamdhip64.so cannot open shared object file", "amdgpu-install broke
-  apt", "ROCm wheel doesn't see my gfx1151/gfx1150/gfx1103 (Strix Halo,
-  Phoenix)", "iGPU/dGPU collision", "multi-GPU hang"; or mentions
-  HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH,
-  render group / /dev/kfd permissions, amdgpu blacklist, or Secure Boot
-  blocking amdgpu. Do NOT use for non-AMD GPUs, fresh ROCm installs,
-  performance tuning, or Lemonade/LM Studio/Ollama -- those ship their
-  own ROCm; route upstream.
+  "libamdhip64.so cannot open shared object file", "ROCm wheel doesn't
+  see my gfx1151/gfx1150/gfx1103 (Strix Halo, Phoenix)", "iGPU/dGPU
+  collision", "multi-GPU hang on AMD"; or mentions HSA_OVERRIDE_GFX_VERSION,
+  HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
+  permissions, amdgpu blacklist, Secure Boot, or asks where to file a
+  Lemonade / LM Studio / Ollama issue. Do NOT use for non-AMD GPUs,
+  fresh installs, or performance tuning.
 ---

 # ROCm Doctor

 Given a "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU" complaint,
-identify which of a fixed list of **twelve known misconfigurations** is
-the cause and either fix it or hand back the exact next step.
+identify which **known misconfiguration** is the cause and either fix it
+or hand back the exact next step.

-This is a diagnose-and-fix skill, not a setup or tuning skill. The closed
-list is deliberate: if the user's symptom doesn't match one of the twelve,
-the skill explicitly routes upstream rather than guessing.
+This is a diagnose-and-fix skill, not a setup or tuning skill. The
+catalog of failure modes is a **closed list** that lives in
+`reference.md` and `scripts/diagnose.py`: if the user's symptom doesn't
+match one of them, the skill explicitly routes upstream rather than
+guessing. New failure modes get added by editing the catalog, not by
+the agent inventing them at runtime.

 ## When to use this skill

-Use it when **all** of the following are true:
+Use it when **any** of the following are true:

-- The user has an **AMD** GPU (APU or discrete). NVIDIA / Intel / Apple
-  Silicon are out of scope; exit cleanly and route the user.
-- The user's framework is **PyTorch**, **llama.cpp**, or anything else
-  built directly against the system ROCm (`/opt/rocm` or a pip wheel that
-  bundles HIP). Lemonade, LM Studio, and Ollama ship their own runtimes
-  and bypass the system install entirely; skip examination and route
-  upstream (see [Framework routing](#framework-routing)).
-- There is a **functional** error (import fails, `torch.cuda.is_available()`
-  is `False`, `rocminfo` errors, a kernel can't launch). Pure performance
-  complaints belong in `mi-tuner` / `omniperf-tune` / `apu-memory-tuner`.
+- The user has an **AMD** GPU and a functional error with **PyTorch**,
+  **llama.cpp**, or anything else built directly against the system ROCm
+  (`/opt/rocm` or a pip wheel that bundles HIP). The skill examines the
+  host and diagnoses against the catalog.
+- The user is on **Lemonade**, **LM Studio**, or **Ollama**. These apps
+  ship their own ROCm and don't need a host-level examination, but the
+  user often doesn't know *where* to report the problem -- the skill
+  knows the right upstream channel for each (see
+  [Framework routing](#framework-routing)) and hands it over.

-Do not use it for fresh installs on a clean machine. That is a setup task;
-point the user at `amdgpu-install` from the [AMD ROCm install
-guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
+Out of scope:
+
+- NVIDIA / Intel / Apple Silicon GPUs. Exit cleanly and tell the user.
+- Fresh installs on a clean machine. That's a setup task; point at
+  [`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
+- Pure performance complaints. Those belong in `mi-tuner` /
+  `omniperf-tune` / `apu-memory-tuner`.

 ## Prerequisites

@@ -60,9 +66,8 @@ guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/ins
   - `python` / `python3` to introspect PyTorch
   - `llama-cli` / `llama-server` / `main` to introspect llama.cpp
 - **Permissions:** examination is fully read-only and works as a regular
-  user. Some fixes (`fix-4-render-group`, `fix-5-amdgpu-load`,
-  `fix-7-stale-repos`, `fix-11-iommu`, `fix-12-installer`) need `sudo`;
-  the script always prints the command before asking for consent.
+  user. Several fixes need `sudo` (the recipe metadata flags this); the
+  script always prints the command before asking for consent.

 Silent footguns to surface explicitly when relevant:

@@ -85,7 +90,7 @@ changing anything.

 ```
 [ ] 1. Identify the framework, then examine (read-only).
-[ ] 2. Diagnose: match examination + symptom against the twelve known cases.
+[ ] 2. Diagnose: match examination + symptom against the catalog.
 [ ] 3. Propose the fix; only apply with explicit consent; re-verify.
 ```

@@ -118,10 +123,10 @@ For a quick read-only summary without piping JSON, drop `--json`:
 python scripts/examine.py --framework pytorch
 ```

-`examine.py` collects exactly the facts the twelve-case decision tree
-needs: OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd`
-status, `/dev/kfd` ownership and group, user's group membership, system
-ROCm version and install method, framework version and arch list, the
+`examine.py` collects exactly the facts the diagnosis catalog needs:
+OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd` status,
+`/dev/kfd` ownership and group, user's group membership, system ROCm
+version and install method, framework version and arch list, the
 silent-footgun env vars, container/IOMMU state, and recent `amdgpu`
 kernel log lines. It deliberately does NOT spawn heavy probes (no kernel
 launches, no model downloads).
@@ -135,8 +140,8 @@ python scripts/diagnose.py --exam exam.json \
   --symptom "HIP error: invalid device function on gfx1151"
 ```

-The script runs the twelve checkers, scores each from 0..100, and prints
-a ranked list. Each match has a stable `fix-N-...` id used by
+The script runs every checker in the catalog, scores each from 0..100,
+and prints a ranked list. Each match has a stable `fix-N-...` id used by
 `apply_fix.py`.

 Score tiers:
@@ -169,24 +174,20 @@ python scripts/apply_fix.py --fix-id fix-4-render-group --yes
 the interactive `[y/N]` prompt (only pass this after the user has agreed
 in chat).

-Five of the twelve fixes are auto-applicable; the rest are deliberately
+A subset of fixes are auto-applicable; the rest are deliberately
 print-only because the risk of a half-applied state is too high for an
-agent to take:
+agent to take. To see which is which without consulting `reference.md`:

-| Fix-id | Auto? | Why |
-|---|---|---|
-| `fix-1-arch` | Print-only | Reinstalls a framework; user must approve and pick the wheel index. |
-| `fix-2-unset-override` | Auto | Just unsets an env var + flags persistent rc lines. |
-| `fix-3-rocm-kernel` | Print-only | Upgrading kernels needs the user. |
-| `fix-4-render-group` | Auto | `usermod -a -G render,video $USER` is well-bounded. |
-| `fix-5-amdgpu-load` | Print-only | Editing modprobe.d + initramfs regen needs the user. |
-| `fix-6-path` | Auto | Appends one line to `~/.bashrc` / `~/.zshrc`. |
-| `fix-7-stale-repos` | Print-only | Moving repo files is destructive enough to require the user. |
-| `fix-8-wheel-rocm` | Print-only | Reinstalls a framework. |
-| `fix-9-igpu-dgpu` | Auto | Adds `export HIP_VISIBLE_DEVICES=N` (user supplies N via `--device-index`). |
-| `fix-10-container` | Print-only | Re-launches a container. |
-| `fix-11-iommu` | Print-only | Edits GRUB and reboots. |
-| `fix-12-installer` | Print-only | Reinstalls system packages. |
+```bash
+python scripts/apply_fix.py --list
+```
+
+That prints every `fix-id` with an `AUTO` or `PRINT-ONLY` tag. Auto
+fixes are bounded operations like unsetting an env var, adding the user
+to a group, or appending a single line to a shell rc. Print-only fixes
+involve reinstalling frameworks, editing GRUB, regenerating the
+initramfs, or moving system repo files; those need a human at the
+keyboard.

 After every fix, re-run the `verify` command the recipe printed. Only
 declare success when the user's *original* failing command now succeeds
@@ -196,21 +197,23 @@ GPU, the llama.cpp build runs).
 ## Framework routing

 The skill's first decision is which framework the user runs. Some
-frameworks ship their own ROCm and bypass the system install -- examining
-the host is the wrong question for them.
+frameworks ship their own ROCm and bypass the system install; for those
+the right answer is "you're in the wrong place, here's where to file
+it", and the skill delivers that answer directly rather than running
+useless probes against the host.

-| Framework | Examine the system? | Where to send the user |
+| Framework | Examine the host? | Action |
 |---|---|---|
-| PyTorch | Yes | `python scripts/examine.py --framework pytorch` |
-| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp` |
-| Lemonade | No -- ships its own ROCm | <https://github.com/lemonade-sdk/lemonade> + [Discord](https://discord.gg/5xXzkMu8Zk) |
-| LM Studio | No -- ships its own runtime | <https://lmstudio.ai/docs/app> + Discord |
-| Ollama | No -- ships its own runtime | <https://github.com/ollama/ollama> + Discord |
+| PyTorch | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
+| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
+| Lemonade | No -- ships its own ROCm | Route to <https://github.com/lemonade-sdk/lemonade/issues> and the Lemonade [Discord](https://discord.gg/5xXzkMu8Zk). |
+| LM Studio | No -- ships its own runtime | Route to <https://lmstudio.ai/docs/app> (in-app support; no public repo). |
+| Ollama | No -- ships its own runtime | Route to <https://github.com/ollama/ollama/issues> and the Ollama Discord. |
 | vLLM / SGLang | Out of scope until phase 1+ | Route to the project's own issue tracker. |

-If a Lemonade / LM Studio / Ollama user really does have a system ROCm
+If a Lemonade / LM Studio / Ollama user *does* have a host-level ROCm
 problem (rare), it shows up when their app fails AND a standalone
-`rocminfo` also fails. Only then run the full examination.
+`rocminfo` also fails. Only then escalate to the full examination.

 ## Safety rules

@@ -223,8 +226,8 @@ problem (rare), it shows up when their app fails AND a standalone
   wheel exists. That is `fix-2-unset-override`'s entire reason for being.
 - Never silently fall back to a different fix when the requested one
   isn't applicable. Exit 3 and tell the user why.
-- When nothing matches the twelve known cases, **do not speculate**. Hand
-  the user the upstream tracker URL from `diagnose.py --json`.
+- When nothing in the catalog matches, **do not speculate**. Hand the
+  user the upstream tracker URL from `diagnose.py --json`.

 ## Verification checklist

@@ -246,7 +249,7 @@ rather than declaring victory.

 ## Reference

-For the full table of twelve known misconfigurations, every fix-id and
-its verify command, the silent-footgun env-var reference, and the
+For the full catalog of known misconfigurations, every fix-id and its
+verify command, the silent-footgun env-var reference, and the
 upstream-routing table in machine-readable form, see
 [reference.md](reference.md).
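
Review note: the new "Framework routing" table in this diff reads naturally as a lookup from framework to action. Below is a minimal Python sketch of that idea. It is not taken from `scripts/`. The names `Route`, `ROUTES`, and `route` are hypothetical, the dictionary keys are illustrative (only `pytorch` and `llama-cpp` appear in the diff as `--framework` values), and the URLs and commands are copied from the added table rows. vLLM / SGLang are left out because the table gives no concrete URL for them.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    """One row of the routing table (hypothetical structure)."""
    examine_host: bool  # True: run the read-only host examination
    action: str         # command to run, or upstream URL to hand over


# Assumption: keys are illustrative labels, not confirmed CLI values,
# except "pytorch" and "llama-cpp", which the diff passes to --framework.
ROUTES = {
    "pytorch": Route(True, "python scripts/examine.py --framework pytorch"),
    "llama-cpp": Route(True, "python scripts/examine.py --framework llama-cpp"),
    "lemonade": Route(False, "https://github.com/lemonade-sdk/lemonade/issues"),
    "lm-studio": Route(False, "https://lmstudio.ai/docs/app"),
    "ollama": Route(False, "https://github.com/ollama/ollama/issues"),
}


def route(framework: str) -> Optional[Route]:
    """Return the table row for a framework, or None if it is not listed.

    None mirrors the closed-list rule: anything outside the table is
    routed to the project's own tracker rather than guessed at.
    """
    key = framework.strip().lower().replace(" ", "-")
    return ROUTES.get(key)
```

With this shape, `route("LM Studio")` yields a row with `examine_host` set to `False`, and an unlisted framework such as `"vllm"` yields `None`, so the caller can fall through to "route to the project's own issue tracker".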