Commit 042ff80

Add Windows to scope
1 parent 67f04d6 commit 042ff80

5 files changed

Lines changed: 1339 additions & 259 deletions

skills/rocm-doctor/SKILL.md

Lines changed: 64 additions & 31 deletions
Original file line number | Diff line number | Diff line change
@@ -1,22 +1,25 @@
11
---
22
name: rocm-doctor
33
description: >-
4-
Diagnoses why ROCm, PyTorch, or llama.cpp fails on an AMD GPU by
5-
matching symptoms against a closed catalog of known misconfigurations,
6-
then either applies a low-risk fix with consent or hands back the
7-
exact next step. Also routes Lemonade, LM Studio, and Ollama users to
8-
the right upstream channel. Use when the user says "ROCm/HIP isn't
9-
working", "torch.cuda.is_available() is False on Radeon/Ryzen AI",
10-
"rocminfo can't find my GPU", "hipErrorNoBinaryForGpu",
4+
Diagnoses why ROCm, the HIP SDK, PyTorch, or llama.cpp fails on an AMD
5+
GPU by matching symptoms against a closed catalog of known
6+
misconfigurations on Linux and Windows, then either applies a low-risk
7+
fix with consent or hands back the exact next step. Also routes
8+
Lemonade, LM Studio, and Ollama users to the right upstream channel.
9+
Use when the user says "ROCm/HIP isn't working", "torch.cuda.is_available()
10+
is False on Radeon/Ryzen AI", "rocminfo can't find my GPU",
11+
"hipInfo.exe can't see my Radeon", "amdhip64_6.dll could not be found",
12+
"vcruntime140_1.dll missing", "HIP SDK installer left things broken",
13+
"Adrenalin driver too old for the HIP SDK", "hipErrorNoBinaryForGpu",
1114
"HSA_STATUS_ERROR_INVALID_ISA", "invalid device function",
1215
"Unable to open /dev/kfd", "ROCk module is NOT loaded",
1316
"libamdhip64.so cannot open shared object file", "ROCm wheel doesn't
1417
see my gfx1151/gfx1150/gfx1103 (Strix Halo, Phoenix)", "iGPU/dGPU
1518
collision", "multi-GPU hang on AMD"; or mentions HSA_OVERRIDE_GFX_VERSION,
16-
HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
19+
HIP_VISIBLE_DEVICES, HIP_PATH, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
1720
permissions, amdgpu blacklist, Secure Boot, or asks where to file a
1821
Lemonade / LM Studio / Ollama issue. Do NOT use for non-AMD GPUs,
19-
fresh installs, or performance tuning.
22+
fresh installs, performance tuning, or ROCm-on-WSL2.
2023
---
2124

2225
# ROCm Doctor
@@ -50,38 +53,63 @@ Out of scope:
5053

5154
- NVIDIA / Intel / Apple Silicon GPUs. Exit cleanly and tell the user.
5255
- Fresh installs on a clean machine. That's a setup task; point at
53-
[`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
56+
[`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html)
57+
(Linux) or the [HIP SDK installer](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html)
58+
(Windows).
5459
- Pure performance complaints. Those belong in `mi-tuner` /
5560
`omniperf-tune` / `apu-memory-tuner`.
61+
- **WSL2** (running Linux on top of Windows). The ROCm-on-WSL flow needs
62+
Adrenalin Pro plus the WSL kernel update on the Windows host -- those
63+
failure modes are not in this catalog. `examine.py` detects WSL via
64+
`/proc/version` and exits 2 with a route-out message; if the user wants
65+
WSL specifically, point them at <https://rocm.docs.amd.com/projects/install-on-wsl/en/latest/>.
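
The WSL route-out described in this bullet can be pictured with a short sketch. It assumes the check is a case-insensitive search for `microsoft` in the kernel version string, which is a common heuristic for WSL kernels. The function names and the exact matching rule are illustrative assumptions and are not taken from `examine.py`.

```python
from pathlib import Path


def looks_like_wsl(proc_version_text: str) -> bool:
    """Heuristic (assumed): WSL kernels mention 'microsoft' in the version string."""
    return "microsoft" in proc_version_text.lower()


def read_proc_version(path: str = "/proc/version") -> str:
    """Return the kernel version string, or '' when the file cannot be read."""
    try:
        return Path(path).read_text(encoding="utf-8", errors="replace")
    except OSError:
        # Not Linux, or /proc is unavailable: treat as "no evidence of WSL".
        return ""
```

A caller would exit with code 2 and print the route-out message when `looks_like_wsl(read_proc_version())` is true.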
5666

5767
## Prerequisites
5868

59-
- **OS:** Linux. Phase 0 is Linux-only; on Windows, the HIP SDK / Adrenalin
60-
path is its own ecosystem and this skill cannot help.
61-
- **Tools the agent will invoke as part of examination** (best-effort; the
62-
script degrades when one is missing):
69+
- **OS:** Linux **or** Windows (native). The catalog has 12 Linux entries
70+
(5 of which are also valid on Windows) and 3 Windows-only entries; the
71+
scripts pick the right subset for the host they run on.
72+
- **Linux tools the agent will invoke as part of examination** (best-effort;
73+
the script degrades when one is missing):
6374
- `lspci` (always present on desktop distros)
6475
- `rocminfo` (when ROCm is installed)
6576
- `journalctl` or `dmesg` (for amdgpu kernel-ring evidence)
6677
- `python` / `python3` to introspect PyTorch
6778
- `llama-cli` / `llama-server` / `main` to introspect llama.cpp
79+
- **Windows tools the agent will invoke as part of examination**:
80+
- `powershell` (always present on Windows 10+) for `Get-CimInstance
81+
Win32_VideoController` / `Win32_Processor` and the env-scope reads.
82+
- `hipInfo.exe` from `%HIP_PATH%\bin` -- the Windows analog of `rocminfo`.
83+
Absence is itself a signal (see `fix-13-hip-sdk-missing`).
84+
- `setx` for env-var persistence and User-PATH edits (analog of editing
85+
`~/.bashrc` on Linux).
86+
- `python` to introspect PyTorch.
6887
- **Permissions:** examination is fully read-only and works as a regular
69-
user. Several fixes need `sudo` (the recipe metadata flags this); the
70-
script always prints the command before asking for consent.
88+
user on both OSes. Linux fixes that need `sudo` are flagged in their
89+
recipe metadata; Windows fixes that touch the Machine env scope are
90+
flagged similarly and `apply_fix.py` does NOT self-elevate -- the user
91+
has to run an Administrator PowerShell when those are required.
7192

7293
Silent footguns to surface explicitly when relevant:
7394

7495
- `HSA_OVERRIDE_GFX_VERSION` -- forcing an unsupported gfx target works
75-
for `rocminfo` but causes page faults at runtime. Diagnosis
96+
for `rocminfo`/`hipInfo.exe` but causes page faults at runtime. Diagnosis
7697
`fix-2-unset-override` is the response when this is set on a GPU that
77-
already has a native wheel.
98+
already has a native wheel; on Windows it can be persisted in either
99+
the User or Machine env scope, so check both.
78100
- `HIP_VISIBLE_DEVICES` -- on dual-GPU systems (APU + dGPU) the iGPU is
79101
often index 0 and destabilises HIP unless explicitly hidden.
102+
- `HIP_PATH` (Windows) -- if the user has multiple HIP SDK versions
103+
installed under `C:\Program Files\AMD\ROCm\`, `HIP_PATH` decides which
104+
one PyTorch / `hipInfo.exe` actually loads. Pointing it at the wrong major
105+
version produces the same failure mode as `fix-8-wheel-rocm`.
80106
- `PYTORCH_ROCM_ARCH` -- only honored during a *build* of PyTorch. Setting
81107
it at runtime does nothing for a prebuilt wheel.
82-
- `LD_LIBRARY_PATH` -- a wheel-bundled `libamdhip64.so` shadowed by a
83-
system one (or vice versa) gives confusing `cannot open shared object
84-
file` errors that look like fix-8 but are really a load-order bug.
108+
- `LD_LIBRARY_PATH` (Linux) -- a wheel-bundled `libamdhip64.so` shadowed
109+
by a system one (or vice versa) gives confusing `cannot open shared
110+
object file` errors that look like fix-8 but are really a load-order
111+
bug. The Windows analog is `PATH` order: a stale HIP SDK bin directory
112+
earlier on PATH than the one matching `HIP_PATH`.
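
The Windows `PATH`-order footgun in the last bullet can be checked with a small pure function. This is a minimal sketch, not the logic in `examine.py`: the function name is hypothetical, the install root is the `C:\Program Files\AMD\ROCm\` directory mentioned above, and the `<root>\<version>\bin` layout is an assumption.

```python
import ntpath

# Install root taken from the HIP_PATH bullet; the per-version layout is assumed.
SDK_ROOT = r"C:\Program Files\AMD\ROCm"


def stale_hip_bins_before_active(path_value: str, hip_path: str) -> list[str]:
    """Return HIP SDK bin dirs that appear on PATH ahead of %HIP_PATH%\\bin.

    If the active bin dir never appears on PATH, every SDK bin dir found
    is returned, since none of them matches HIP_PATH.
    """
    def norm(p: str) -> str:
        # Windows paths are case-insensitive and may carry quotes or a trailing slash.
        return ntpath.normcase(ntpath.normpath(p.strip().strip('"')))

    active_bin = norm(ntpath.join(hip_path, "bin"))
    root_prefix = norm(SDK_ROOT) + "\\"
    stale = []
    for entry in path_value.split(";"):
        if not entry.strip():
            continue
        candidate = norm(entry)
        if candidate == active_bin:
            break  # everything after the active dir cannot shadow it
        if candidate.startswith(root_prefix) and candidate.endswith("\\bin"):
            stale.append(entry.strip())
    return stale
```

Because it uses `ntpath` rather than `os.path`, the function parses Windows-style paths on any host, which makes it easy to unit-test off Windows. A non-empty result means an older SDK's DLLs are found first.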
85113

86114
## The three-step flow
87115

@@ -114,7 +142,7 @@ script pick. Exit codes:
114142
| Exit | Meaning | Next action |
115143
|---|---|---|
116144
| 0 | Examined; AMD GPU present | Continue to Step 2. |
117-
| 2 | Not Linux / no AMD GPU | Stop. Route the user. |
145+
| 2 | Unsupported host (WSL, neither Linux nor Windows, or no AMD GPU) | Stop. Route the user. |
118146
| 3 | Probes partially failed | Continue but warn the user. |
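
The exit-code table can be turned into a small dispatch helper. The sketch below only encodes the three codes listed in the table. The action labels are hypothetical, and treating any unlisted code as "stop" is an assumption the document does not state.

```python
# Exit codes from the table; the action labels are illustrative names only.
NEXT_ACTION = {
    0: "continue",               # examined, AMD GPU present: go to Step 2
    2: "stop",                   # unsupported host: route the user elsewhere
    3: "continue-with-warning",  # probes partially failed: proceed, but warn
}


def next_action(exit_code: int) -> str:
    """Map an examine.py exit code to the next step (unknown codes stop; assumed)."""
    return NEXT_ACTION.get(exit_code, "stop")
```

A wrapper would typically run the script with `subprocess.run(...)` and pass the resulting `returncode` to `next_action`.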
119147

120148
For a quick read-only summary without piping JSON, drop `--json`:
@@ -123,13 +151,16 @@ For a quick read-only summary without piping JSON, drop `--json`:
123151
python scripts/examine.py --framework pytorch
124152
```
125153

126-
`examine.py` collects exactly the facts the diagnosis catalog needs:
127-
OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd` status,
128-
`/dev/kfd` ownership and group, user's group membership, system ROCm
129-
version and install method, framework version and arch list, the
154+
`examine.py` collects exactly the facts the diagnosis catalog needs.
155+
On Linux: OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd`
156+
status, `/dev/kfd` ownership and group, user's group membership, system
157+
ROCm version and install method, framework version and arch list, the
130158
silent-footgun env vars, container/IOMMU state, and recent `amdgpu`
131-
kernel log lines. It deliberately does NOT spawn heavy probes (no kernel
132-
launches, no model downloads).
159+
kernel log lines. On Windows: AMD adapters and gfx targets via
160+
`Win32_VideoController` + `hipInfo.exe`, the HIP SDK install path and
161+
version, the Adrenalin / kernel-mode driver version, MSVC redistributable
162+
presence, and the same env-var snapshot. It deliberately does NOT spawn
163+
heavy probes (no kernel launches, no model downloads).
133164

134165
### Step 2: diagnose
135166

@@ -204,16 +235,18 @@ useless probes against the host.
204235

205236
| Framework | Examine the host? | Action |
206237
|---|---|---|
207-
| PyTorch | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
208-
| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
238+
| PyTorch (Linux ROCm wheel) | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
239+
| PyTorch (Windows TheRock wheel) | Yes | Same scripts; on Windows the catalog is filtered to the cross-platform (Linux and Windows) entries plus the Windows-only entries. |
240+
| llama.cpp (built against system ROCm/HIP SDK) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
209241
| Lemonade | No -- ships its own ROCm | Route to <https://github.com/lemonade-sdk/lemonade/issues> and the Lemonade [Discord](https://discord.gg/5xXzkMu8Zk). |
210242
| LM Studio | No -- ships its own runtime | Route to <https://lmstudio.ai/docs/app> (in-app support; no public repo). |
211243
| Ollama | No -- ships its own runtime | Route to <https://github.com/ollama/ollama/issues> and the Ollama Discord. |
212244
| vLLM / SGLang | Out of scope until phase 1+ | Route to the project's own issue tracker. |
213245

214246
If a Lemonade / LM Studio / Ollama user *does* have a host-level ROCm
215247
problem (rare), it shows up when their app fails AND a standalone
216-
`rocminfo` also fails. Only then escalate to the full examination.
248+
`rocminfo` (Linux) / `hipInfo.exe` (Windows) also fails. Only then
249+
escalate to the full examination.
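
The routing table and the escalation rule above reduce to one decision: examine the host or not. A minimal sketch follows. The `pytorch` and `llama-cpp` keys match the `--framework` values shown in the table; the keys for the bundled-runtime apps and the function name are hypothetical.

```python
# Frameworks that use the host ROCm / HIP SDK install, per the table.
HOST_FRAMEWORKS = {"pytorch", "llama-cpp"}
# Apps that ship their own runtime (key spellings are illustrative).
BUNDLED_RUNTIME_APPS = {"lemonade", "lm-studio", "ollama"}


def should_examine_host(framework: str,
                        app_failed: bool = False,
                        host_probe_failed: bool = False) -> bool:
    """Decide whether to run the full examination.

    host_probe_failed means a standalone rocminfo (Linux) or
    hipInfo.exe (Windows) run also failed.
    """
    name = framework.lower()
    if name in HOST_FRAMEWORKS:
        return True
    if name in BUNDLED_RUNTIME_APPS:
        # Escalate only when the app AND the standalone probe both fail.
        return app_failed and host_probe_failed
    # Anything else (for example vLLM / SGLang) is routed out.
    return False
```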
217250

218251
## Safety rules
219252
