
Commit 67f04d6

Cleaner
1 parent 7c6bec5 commit 67f04d6

5 files changed

Lines changed: 90 additions & 80 deletions


skills/rocm-doctor/SKILL.md

Lines changed: 75 additions & 72 deletions
@@ -1,52 +1,58 @@
11
---
22
name: rocm-doctor
33
description: >-
4-
Diagnoses why ROCm, PyTorch, or llama.cpp isn't working on an AMD GPU
5-
by matching the symptom against twelve known misconfigurations and
6-
either applying a low-risk fix with consent or handing back the exact
7-
next step. Use when the user says "ROCm/HIP isn't working",
8-
"torch.cuda.is_available() is False on Radeon/Ryzen AI",
4+
Diagnoses why ROCm, PyTorch, or llama.cpp fails on an AMD GPU by
5+
matching symptoms against a closed catalog of known misconfigurations,
6+
then either applies a low-risk fix with consent or hands back the
7+
exact next step. Also routes Lemonade, LM Studio, and Ollama users to
8+
the right upstream channel. Use when the user says "ROCm/HIP isn't
9+
working", "torch.cuda.is_available() is False on Radeon/Ryzen AI",
910
"rocminfo can't find my GPU", "hipErrorNoBinaryForGpu",
1011
"HSA_STATUS_ERROR_INVALID_ISA", "invalid device function",
1112
"Unable to open /dev/kfd", "ROCk module is NOT loaded",
12-
"libamdhip64.so cannot open shared object file", "amdgpu-install broke
13-
apt", "ROCm wheel doesn't see my gfx1151/gfx1150/gfx1103 (Strix Halo,
14-
Phoenix)", "iGPU/dGPU collision", "multi-GPU hang"; or mentions
15-
HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH,
16-
render group / /dev/kfd permissions, amdgpu blacklist, or Secure Boot
17-
blocking amdgpu. Do NOT use for non-AMD GPUs, fresh ROCm installs,
18-
performance tuning, or Lemonade/LM Studio/Ollama -- those ship their
19-
own ROCm; route upstream.
13+
"libamdhip64.so cannot open shared object file", "ROCm wheel doesn't
14+
see my gfx1151/gfx1150/gfx1103 (Strix Halo, Phoenix)", "iGPU/dGPU
15+
collision", "multi-GPU hang on AMD"; or mentions HSA_OVERRIDE_GFX_VERSION,
16+
HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
17+
permissions, amdgpu blacklist, Secure Boot, or asks where to file a
18+
Lemonade / LM Studio / Ollama issue. Do NOT use for non-AMD GPUs,
19+
fresh installs, or performance tuning.
2020
---
2121

2222
# ROCm Doctor
2323

2424
Given a "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU" complaint,
25-
identify which of a fixed list of **twelve known misconfigurations** is
26-
the cause and either fix it or hand back the exact next step.
25+
identify which **known misconfiguration** is the cause and either fix it
26+
or hand back the exact next step.
2727

28-
This is a diagnose-and-fix skill, not a setup or tuning skill. The closed
29-
list is deliberate: if the user's symptom doesn't match one of the twelve,
30-
the skill explicitly routes upstream rather than guessing.
28+
This is a diagnose-and-fix skill, not a setup or tuning skill. The
29+
catalog of failure modes is a **closed list** that lives in
30+
`reference.md` and `scripts/diagnose.py`: if the user's symptom doesn't
31+
match one of them, the skill explicitly routes upstream rather than
32+
guessing. New failure modes get added by editing the catalog, not by
33+
the agent inventing them at runtime.
3134

3235
## When to use this skill
3336

34-
Use it when **all** of the following are true:
37+
Use it when **any** of the following are true:
3538

36-
- The user has an **AMD** GPU (APU or discrete). NVIDIA / Intel / Apple
37-
Silicon are out of scope; exit cleanly and route the user.
38-
- The user's framework is **PyTorch**, **llama.cpp**, or anything else
39-
built directly against the system ROCm (`/opt/rocm` or a pip wheel that
40-
bundles HIP). Lemonade, LM Studio, and Ollama ship their own runtimes
41-
and bypass the system install entirely; skip examination and route
42-
upstream (see [Framework routing](#framework-routing)).
43-
- There is a **functional** error (import fails, `torch.cuda.is_available()`
44-
is `False`, `rocminfo` errors, a kernel can't launch). Pure performance
45-
complaints belong in `mi-tuner` / `omniperf-tune` / `apu-memory-tuner`.
39+
- The user has an **AMD** GPU and a functional error with **PyTorch**,
40+
**llama.cpp**, or anything else built directly against the system ROCm
41+
(`/opt/rocm` or a pip wheel that bundles HIP). The skill examines the
42+
host and diagnoses against the catalog.
43+
- The user is on **Lemonade**, **LM Studio**, or **Ollama**. These apps
44+
ship their own ROCm and don't need a host-level examination, but the
45+
user often doesn't know *where* to report the problem -- the skill
46+
knows the right upstream channel for each (see
47+
[Framework routing](#framework-routing)) and hands it over.
4648

47-
Do not use it for fresh installs on a clean machine. That is a setup task;
48-
point the user at `amdgpu-install` from the [AMD ROCm install
49-
guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
49+
Out of scope:
50+
51+
- NVIDIA / Intel / Apple Silicon GPUs. Exit cleanly and tell the user.
52+
- Fresh installs on a clean machine. That's a setup task; point at
53+
[`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
54+
- Pure performance complaints. Those belong in `mi-tuner` /
55+
`omniperf-tune` / `apu-memory-tuner`.
5056

5157
## Prerequisites
5258

@@ -60,9 +66,8 @@ guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/ins
6066
- `python` / `python3` to introspect PyTorch
6167
- `llama-cli` / `llama-server` / `main` to introspect llama.cpp
6268
- **Permissions:** examination is fully read-only and works as a regular
63-
user. Some fixes (`fix-4-render-group`, `fix-5-amdgpu-load`,
64-
`fix-7-stale-repos`, `fix-11-iommu`, `fix-12-installer`) need `sudo`;
65-
the script always prints the command before asking for consent.
69+
user. Several fixes need `sudo` (the recipe metadata flags this); the
70+
script always prints the command before asking for consent.
6671

6772
Silent footguns to surface explicitly when relevant:
6873

@@ -85,7 +90,7 @@ changing anything.
8590

8691
```
8792
[ ] 1. Identify the framework, then examine (read-only).
88-
[ ] 2. Diagnose: match examination + symptom against the twelve known cases.
93+
[ ] 2. Diagnose: match examination + symptom against the catalog.
8994
[ ] 3. Propose the fix; only apply with explicit consent; re-verify.
9095
```
9196

@@ -118,10 +123,10 @@ For a quick read-only summary without piping JSON, drop `--json`:
118123
python scripts/examine.py --framework pytorch
119124
```
120125

121-
`examine.py` collects exactly the facts the twelve-case decision tree
122-
needs: OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd`
123-
status, `/dev/kfd` ownership and group, user's group membership, system
124-
ROCm version and install method, framework version and arch list, the
126+
`examine.py` collects exactly the facts the diagnosis catalog needs:
127+
OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd` status,
128+
`/dev/kfd` ownership and group, user's group membership, system ROCm
129+
version and install method, framework version and arch list, the
125130
silent-footgun env vars, container/IOMMU state, and recent `amdgpu`
126131
kernel log lines. It deliberately does NOT spawn heavy probes (no kernel
127132
launches, no model downloads).
@@ -135,8 +140,8 @@ python scripts/diagnose.py --exam exam.json \
135140
--symptom "HIP error: invalid device function on gfx1151"
136141
```
137142

138-
The script runs the twelve checkers, scores each from 0..100, and prints
139-
a ranked list. Each match has a stable `fix-N-...` id used by
143+
The script runs every checker in the catalog, scores each from 0..100,
144+
and prints a ranked list. Each match has a stable `fix-N-...` id used by
140145
`apply_fix.py`.
141146

142147
Score tiers:
@@ -169,24 +174,20 @@ python scripts/apply_fix.py --fix-id fix-4-render-group --yes
169174
the interactive `[y/N]` prompt (only pass this after the user has agreed
170175
in chat).
171176

172-
Five of the twelve fixes are auto-applicable; the rest are deliberately
177+
A subset of fixes are auto-applicable; the rest are deliberately
173178
print-only because the risk of a half-applied state is too high for an
174-
agent to take:
179+
agent to take. To see which is which without consulting `reference.md`:
175180

176-
| Fix-id | Auto? | Why |
177-
|---|---|---|
178-
| `fix-1-arch` | Print-only | Reinstalls a framework; user must approve and pick the wheel index. |
179-
| `fix-2-unset-override` | Auto | Just unsets an env var + flags persistent rc lines. |
180-
| `fix-3-rocm-kernel` | Print-only | Upgrading kernels needs the user. |
181-
| `fix-4-render-group` | Auto | `usermod -a -G render,video $USER` is well-bounded. |
182-
| `fix-5-amdgpu-load` | Print-only | Editing modprobe.d + initramfs regen needs the user. |
183-
| `fix-6-path` | Auto | Appends one line to `~/.bashrc` / `~/.zshrc`. |
184-
| `fix-7-stale-repos` | Print-only | Moving repo files is destructive enough to require the user. |
185-
| `fix-8-wheel-rocm` | Print-only | Reinstalls a framework. |
186-
| `fix-9-igpu-dgpu` | Auto | Adds `export HIP_VISIBLE_DEVICES=N` (user supplies N via `--device-index`). |
187-
| `fix-10-container` | Print-only | Re-launches a container. |
188-
| `fix-11-iommu` | Print-only | Edits GRUB and reboots. |
189-
| `fix-12-installer` | Print-only | Reinstalls system packages. |
181+
```bash
182+
python scripts/apply_fix.py --list
183+
```
184+
185+
That prints every `fix-id` with an `AUTO` or `PRINT-ONLY` tag. Auto
186+
fixes are bounded operations like unsetting an env var, adding the user
187+
to a group, or appending a single line to a shell rc. Print-only fixes
188+
involve reinstalling frameworks, editing GRUB, regenerating the
189+
initramfs, or moving system repo files; those need a human at the
190+
keyboard.
190191

191192
After every fix, re-run the `verify` command the recipe printed. Only
192193
declare success when the user's *original* failing command now succeeds
@@ -196,21 +197,23 @@ GPU, the llama.cpp build runs).
196197
## Framework routing
197198

198199
The skill's first decision is which framework the user runs. Some
199-
frameworks ship their own ROCm and bypass the system install -- examining
200-
the host is the wrong question for them.
200+
frameworks ship their own ROCm and bypass the system install; for those
201+
the right answer is "you're in the wrong place, here's where to file
202+
it", and the skill delivers that answer directly rather than running
203+
useless probes against the host.
201204

202-
| Framework | Examine the system? | Where to send the user |
205+
| Framework | Examine the host? | Action |
203206
|---|---|---|
204-
| PyTorch | Yes | `python scripts/examine.py --framework pytorch` |
205-
| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp` |
206-
| Lemonade | No -- ships its own ROCm | <https://github.com/lemonade-sdk/lemonade> + [Discord](https://discord.gg/5xXzkMu8Zk) |
207-
| LM Studio | No -- ships its own runtime | <https://lmstudio.ai/docs/app> + Discord |
208-
| Ollama | No -- ships its own runtime | <https://github.com/ollama/ollama> + Discord |
207+
| PyTorch | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
208+
| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
209+
| Lemonade | No -- ships its own ROCm | Route to <https://github.com/lemonade-sdk/lemonade/issues> and the Lemonade [Discord](https://discord.gg/5xXzkMu8Zk). |
210+
| LM Studio | No -- ships its own runtime | Route to <https://lmstudio.ai/docs/app> (in-app support; no public repo). |
211+
| Ollama | No -- ships its own runtime | Route to <https://github.com/ollama/ollama/issues> and the Ollama Discord. |
209212
| vLLM / SGLang | Out of scope until phase 1+ | Route to the project's own issue tracker. |
210213

211-
If a Lemonade / LM Studio / Ollama user really does have a system ROCm
214+
If a Lemonade / LM Studio / Ollama user *does* have a host-level ROCm
212215
problem (rare), it shows up when their app fails AND a standalone
213-
`rocminfo` also fails. Only then run the full examination.
216+
`rocminfo` also fails. Only then escalate to the full examination.
214217

215218
## Safety rules
216219

@@ -223,8 +226,8 @@ problem (rare), it shows up when their app fails AND a standalone
223226
wheel exists. That is `fix-2-unset-override`'s entire reason for being.
224227
- Never silently fall back to a different fix when the requested one
225228
isn't applicable. Exit 3 and tell the user why.
226-
- When nothing matches the twelve known cases, **do not speculate**. Hand
227-
the user the upstream tracker URL from `diagnose.py --json`.
229+
- When nothing in the catalog matches, **do not speculate**. Hand the
230+
user the upstream tracker URL from `diagnose.py --json`.
228231

229232
## Verification checklist
230233

@@ -246,7 +249,7 @@ rather than declaring victory.
246249

247250
## Reference
248251

249-
For the full table of twelve known misconfigurations, every fix-id and
250-
its verify command, the silent-footgun env-var reference, and the
252+
For the full catalog of known misconfigurations, every fix-id and its
253+
verify command, the silent-footgun env-var reference, and the
251254
upstream-routing table in machine-readable form, see
252255
[reference.md](reference.md).
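
The Framework routing table in the `SKILL.md` diff above can be read as a plain lookup. The following is a minimal sketch of that reading, not code from this commit: `Route`, `ROUTES`, and `next_step` are hypothetical names, and only the `pytorch` and `llama-cpp` keys come from the `--framework` values shown in the diff. The other keys are assumptions. The URLs are the ones listed in the table.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    """Hypothetical record for one row of the routing table."""
    examine_host: bool       # run the read-only host examination?
    upstream: Optional[str]  # where to send the user otherwise


# Keys other than "pytorch" and "llama-cpp" are illustrative assumptions.
ROUTES = {
    "pytorch": Route(True, None),
    "llama-cpp": Route(True, None),
    "lemonade": Route(False, "https://github.com/lemonade-sdk/lemonade/issues"),
    "lm-studio": Route(False, "https://lmstudio.ai/docs/app"),
    "ollama": Route(False, "https://github.com/ollama/ollama/issues"),
}


def next_step(framework: str, rocminfo_fails: bool = False) -> str:
    """Return 'examine', 'route: <url>', or 'out of scope' for a framework."""
    route = ROUTES.get(framework.lower())
    if route is None:
        # e.g. vLLM / SGLang: send the user to the project's own tracker.
        return "out of scope"
    # Bundled-runtime apps escalate to a host examination only when a
    # standalone rocminfo also fails (the rare case noted in the diff).
    if route.examine_host or rocminfo_fails:
        return "examine"
    return f"route: {route.upstream}"
```

Keeping the routing as data rather than branching logic matches the commit's stated intent: new entries are added by editing a table, not by the agent improvising at runtime.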

skills/rocm-doctor/reference.md

Lines changed: 11 additions & 4 deletions
@@ -5,7 +5,7 @@ three-step flow in `SKILL.md` doesn't cover a decision.
55

66
## Contents
77

8-
- [The twelve known misconfigurations](#the-twelve-known-misconfigurations)
8+
- [The known-misconfigurations catalog](#the-known-misconfigurations-catalog)
99
- [Silent-footgun environment variables](#silent-footgun-environment-variables)
1010
- [Framework support matrix](#framework-support-matrix)
1111
- [Device support, phased](#device-support-phased)
@@ -17,13 +17,20 @@ three-step flow in `SKILL.md` doesn't cover a decision.
1717

1818
---
1919

20-
## The twelve known misconfigurations
20+
## The known-misconfigurations catalog
2121

2222
The closed list `diagnose.py` checks against. Each row maps to one
23-
`fix-N-...` recipe in `apply_fix.py`. **If a user's symptom doesn't match
24-
one of these twelve, the skill must not speculate** -- it exits 1 and
23+
`fix-N-...` recipe in `apply_fix.py`. **If a user's symptom doesn't
24+
match any of these, the skill must not speculate** -- it exits 1 and
2525
prints the upstream tracker URL from `_route_when_no_match`.
2626

27+
This catalog grows over time. To add a new failure mode: add a
28+
`check_N_*` function to `scripts/diagnose.py`, a `FixRecipe` with the
29+
matching `fix-id` to `scripts/apply_fix.py`'s `RECIPES`, and a row to
30+
the table below. The decision-tree contract -- score 0..100, emit the
31+
recipe's `verify` command on a hit, exit 1 + route upstream on a miss --
32+
stays the same regardless of catalog size.
33+
2734
| # | fix-id | Failure pattern | Typical signal | Default fix |
2835
|---|---|---|---|---|
2936
| 1 | `fix-1-arch` | GPU `gfx` target not in framework's compiled arch list | `hipErrorNoBinaryForGpu`, `HIP error: invalid device function`, `HSA_STATUS_ERROR_INVALID_ISA`, `torch.cuda.get_arch_list()` missing the GPU's gfx | Reinstall the framework from a wheel index that ships kernels for the GPU's gfx (TheRock per-gfx wheels are the recommended fallback). |
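
The decision-tree contract described in the `reference.md` hunk above (score 0..100, emit the recipe's `verify` command on a hit, exit 1 and route upstream on a miss) can be sketched as follows. This is an illustration under stated assumptions, not the contents of `diagnose.py`: the `Match` type, the `exam` keys, the scores, the `rocminfo` verify string, and the exit code 0 on a hit are all hypothetical. Only the `check_N_*` naming and the `fix-4-render-group` id appear in the diff.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Match:
    """Hypothetical result of one checker."""
    fix_id: str  # stable id reused by apply_fix.py
    score: int   # confidence, 0..100
    verify: str  # command to re-run after the fix


Checker = Callable[[dict, str], Optional[Match]]


def check_4_render_group(exam: dict, symptom: str) -> Optional[Match]:
    """Illustrative checker: user is not in the group that owns /dev/kfd."""
    kfd_group = exam.get("kfd_group")  # assumed key name
    if kfd_group is None or kfd_group in exam.get("user_groups", []):
        return None
    # A symptom that names /dev/kfd raises confidence (made-up numbers).
    score = 90 if "/dev/kfd" in symptom else 60
    return Match("fix-4-render-group", score, "rocminfo")


CHECKERS: List[Checker] = [check_4_render_group]


def diagnose(exam: dict, symptom: str,
             checkers: List[Checker] = CHECKERS) -> Tuple[int, List[Match]]:
    """Run every checker; return (exit_code, matches ranked by score)."""
    matches = []
    for checker in checkers:
        match = checker(exam, symptom)
        if match is not None:
            if not 0 <= match.score <= 100:
                raise ValueError(f"{match.fix_id}: score out of range")
            matches.append(match)
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    # Closed catalog: an empty list means exit 1 and route upstream,
    # never a guessed diagnosis.
    return (0 if ranked else 1), ranked
```

Under this shape, adding a failure mode means appending one function to `CHECKERS`; `diagnose` itself does not change, which is the property the hunk describes as staying the same regardless of catalog size.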

skills/rocm-doctor/scripts/apply_fix.py

Lines changed: 1 addition & 1 deletion
@@ -344,7 +344,7 @@ def run_hip_visible_devices(args, recipe: FixRecipe) -> int:
344344

345345

346346
# ---------------------------------------------------------------------------
347-
# Recipe registry. Mirrors the twelve diagnoses in `diagnose.py`. Only the
347+
# Recipe registry. Mirrors the diagnosis catalog in `diagnose.py`. Only the
348348
# small, safe, well-bounded fixes are auto-applicable; everything else is
349349
# advisory and prints the plan only.
350350
# ---------------------------------------------------------------------------

skills/rocm-doctor/scripts/diagnose.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@
1111
1. The JSON output of `examine.py` (machine state).
1212
2. Optionally the user's error text (symptom).
1313
14-
and returns a ranked list of matches against the twelve known
14+
and returns a ranked list of matches against the catalog of known
1515
misconfigurations in `reference.md`. Each match comes with:
1616
1717
- id : stable identifier reused by `apply_fix.py` (e.g. "fix-4-render-group").

skills/rocm-doctor/scripts/examine.py

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@
1010
and anything built against `/opt/rocm`, but NOT Lemonade / LM Studio /
1111
Ollama, which ship their own runtime).
1212
13-
The script collects the minimum set of facts needed to disambiguate the
14-
twelve known misconfigurations in `reference.md`. It never installs or
13+
The script collects the minimum set of facts needed to disambiguate
14+
every known misconfiguration in `reference.md`. It never installs or
1515
removes packages, never changes group membership, and never edits files.
1616
1717
Exit codes:
