 ---
 name: rocm-doctor
 description: >-
-  Diagnoses why ROCm, PyTorch, or llama.cpp fails on an AMD GPU by
-  matching symptoms against a closed catalog of known misconfigurations,
-  then either applies a low-risk fix with consent or hands back the
-  exact next step. Also routes Lemonade, LM Studio, and Ollama users to
-  the right upstream channel. Use when the user says "ROCm/HIP isn't
-  working", "torch.cuda.is_available() is False on Radeon/Ryzen AI",
-  "rocminfo can't find my GPU", "hipErrorNoBinaryForGpu",
+  Diagnoses why ROCm, the HIP SDK, PyTorch, or llama.cpp fails on an AMD
+  GPU by matching symptoms against a closed catalog of known
+  misconfigurations on Linux and Windows, then either applies a low-risk
+  fix with consent or hands back the exact next step. Also routes
+  Lemonade, LM Studio, and Ollama users to the right upstream channel.
+  Use when the user says "ROCm/HIP isn't working", "torch.cuda.is_available()
+  is False on Radeon/Ryzen AI", "rocminfo can't find my GPU",
+  "hipInfo.exe can't see my Radeon", "amdhip64_6.dll could not be found",
+  "vcruntime140_1.dll missing", "HIP SDK installer left things broken",
+  "Adrenalin driver too old for the HIP SDK", "hipErrorNoBinaryForGpu",
   "HSA_STATUS_ERROR_INVALID_ISA", "invalid device function",
   "Unable to open /dev/kfd", "ROCk module is NOT loaded",
   "libamdhip64.so cannot open shared object file", "ROCm wheel doesn't
   see my gfx1151/gfx1150/gfx1103 (Strix Halo, Phoenix)", "iGPU/dGPU
   collision", "multi-GPU hang on AMD"; or mentions HSA_OVERRIDE_GFX_VERSION,
-  HIP_VISIBLE_DEVICES, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
+  HIP_VISIBLE_DEVICES, HIP_PATH, PYTORCH_ROCM_ARCH, render-group / /dev/kfd
   permissions, amdgpu blacklist, Secure Boot, or asks where to file a
   Lemonade / LM Studio / Ollama issue. Do NOT use for non-AMD GPUs,
-  fresh installs, or performance tuning.
+  fresh installs, performance tuning, or ROCm-on-WSL2.
 ---

 # ROCm Doctor
@@ -50,38 +53,63 @@ Out of scope:

 - NVIDIA / Intel / Apple Silicon GPUs. Exit cleanly and tell the user.
 - Fresh installs on a clean machine. That's a setup task; point at
-  [`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html).
+  [`amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-overview.html)
+  (Linux) or the [HIP SDK installer](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html)
+  (Windows).
 - Pure performance complaints. Those belong in `mi-tuner` /
   `omniperf-tune` / `apu-memory-tuner`.
+- **WSL2** (running Linux on top of Windows). The ROCm-on-WSL flow needs
+  Adrenalin Pro plus the WSL kernel update on the Windows host -- those
+  failure modes are not in this catalog. `examine.py` detects WSL via
+  `/proc/version` and exits 2 with a route-out message; if the user wants
+  WSL specifically, point them at <https://rocm.docs.amd.com/projects/install-on-wsl/en/latest/>.

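The WSL route-out described in the added bullet could look roughly like the sketch below. It is an illustration only, not the code in `examine.py`: the helper names `looks_like_wsl` and `read_proc_version` are hypothetical, and the substring test rests on the assumption that WSL kernels report "microsoft" in `/proc/version`.

```python
from pathlib import Path


def looks_like_wsl(proc_version_text: str) -> bool:
    """Return True if a /proc/version string looks like a WSL kernel.

    Assumption: WSL kernels include "microsoft" in the version string
    (for example "...-microsoft-standard-WSL2").
    """
    return "microsoft" in proc_version_text.lower()


def read_proc_version(path: str = "/proc/version") -> str:
    """Read /proc/version; return an empty string where it cannot be read."""
    try:
        return Path(path).read_text(encoding="utf-8", errors="replace")
    except OSError:
        return ""


# Hypothetical usage inside an examination script:
#     if looks_like_wsl(read_proc_version()):
#         raise SystemExit(2)  # exit code 2 = stop and route the user
```

Keeping the string test separate from the file read makes the check easy to exercise on any host, including Windows, where `/proc/version` does not exist.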
 ## Prerequisites

-- **OS:** Linux. Phase 0 is Linux-only; on Windows, the HIP SDK / Adrenalin
-  path is its own ecosystem and this skill cannot help.
-- **Tools the agent will invoke as part of examination** (best-effort; the
-  script degrades when one is missing):
+- **OS:** Linux **or** Windows (native). The catalog has 12 Linux entries
+  (5 of which are also valid on Windows) and 3 Windows-only entries; the
+  scripts pick the right subset for the host they run on.
+- **Linux tools the agent will invoke as part of examination** (best-effort;
+  the script degrades when one is missing):
   - `lspci` (always present on desktop distros)
   - `rocminfo` (when ROCm is installed)
   - `journalctl` or `dmesg` (for amdgpu kernel-ring evidence)
   - `python` / `python3` to introspect PyTorch
   - `llama-cli` / `llama-server` / `main` to introspect llama.cpp
+- **Windows tools the agent will invoke as part of examination**:
+  - `powershell` (always present on Windows 10+) for `Get-CimInstance
+    Win32_VideoController` / `Win32_Processor` and the env-scope reads.
+  - `hipInfo.exe` from `%HIP_PATH%\bin` -- the Windows analog of `rocminfo`.
+    Absence is itself a signal (see `fix-13-hip-sdk-missing`).
+  - `setx` for env-var persistence and User-PATH edits (analog of editing
+    `~/.bashrc` on Linux).
+  - `python` to introspect PyTorch.
 - **Permissions:** examination is fully read-only and works as a regular
-  user. Several fixes need `sudo` (the recipe metadata flags this); the
-  script always prints the command before asking for consent.
+  user on both OSes. Linux fixes that need `sudo` are flagged in their
+  recipe metadata; Windows fixes that touch the Machine env scope are
+  flagged similarly and `apply_fix.py` does NOT self-elevate -- the user
+  has to run an Administrator PowerShell when those are required.

 Silent footguns to surface explicitly when relevant:

 - `HSA_OVERRIDE_GFX_VERSION` -- forcing an unsupported gfx target works
-  for `rocminfo` but causes page faults at runtime. Diagnosis
+  for `rocminfo` / `hipInfo` but causes page faults at runtime. Diagnosis
   `fix-2-unset-override` is the response when this is set on a GPU that
-  already has a native wheel.
+  already has a native wheel; on Windows it can be persisted in either
+  the User or Machine env scope, so check both.
 - `HIP_VISIBLE_DEVICES` -- on dual-GPU systems (APU + dGPU) the iGPU is
   often index 0 and destabilises HIP unless explicitly hidden.
+- `HIP_PATH` (Windows) -- if the user has multiple HIP SDK versions
+  installed under `C:\Program Files\AMD\ROCm\`, `HIP_PATH` decides which
+  one PyTorch / hipInfo actually loads. Pointing it at the wrong major
+  produces the same failure mode as `fix-8-wheel-rocm`.
 - `PYTORCH_ROCM_ARCH` -- only honored during a *build* of PyTorch. Setting
   it at runtime does nothing for a prebuilt wheel.
-- `LD_LIBRARY_PATH` -- a wheel-bundled `libamdhip64.so` shadowed by a
-  system one (or vice versa) gives confusing `cannot open shared object
-  file` errors that look like fix-8 but are really a load-order bug.
+- `LD_LIBRARY_PATH` (Linux) -- a wheel-bundled `libamdhip64.so` shadowed
+  by a system one (or vice versa) gives confusing `cannot open shared
+  object file` errors that look like fix-8 but are really a load-order
+  bug. The Windows analog is `PATH` order: a stale HIP SDK bin directory
+  earlier on PATH than the one matching `HIP_PATH`.

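A snapshot of the footgun variables listed above might be gathered along these lines. The function name `snapshot_footgun_env` and the returned dict shape are hypothetical, not taken from `examine.py`. The sketch reads only the current process environment, so on Windows it would not by itself tell the User scope from the Machine scope.

```python
import os

# Variable names taken from the footgun list above.
FOOTGUN_VARS = (
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "HIP_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
)


def snapshot_footgun_env(environ=None):
    """Return {name: value} for every footgun variable that is set.

    `environ` defaults to the live process environment; passing a plain
    dict makes the function easy to exercise without touching the host.
    """
    env = os.environ if environ is None else environ
    return {name: env[name] for name in FOOTGUN_VARS if name in env}
```

Unset variables are left out of the result rather than recorded as empty strings, so a consumer can tell "not set" apart from "set to an empty value".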
 ## The three-step flow

@@ -114,7 +142,7 @@ script pick. Exit codes:
 | Exit | Meaning | Next action |
 |---|---|---|
 | 0 | Examined; AMD GPU present | Continue to Step 2. |
-| 2 | Not Linux / no AMD GPU | Stop. Route the user. |
+| 2 | Wrong platform (WSL, neither Linux nor Windows, no AMD GPU) | Stop. Route the user. |
 | 3 | Probes partially failed | Continue but warn the user. |

 For a quick read-only summary without piping JSON, drop `--json`:
@@ -123,13 +151,16 @@ For a quick read-only summary without piping JSON, drop `--json`:
 python scripts/examine.py --framework pytorch
 ```

-`examine.py` collects exactly the facts the diagnosis catalog needs:
-OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd` status,
-`/dev/kfd` ownership and group, user's group membership, system ROCm
-version and install method, framework version and arch list, the
+`examine.py` collects exactly the facts the diagnosis catalog needs.
+On Linux: OS / kernel, AMD GPUs and gfx targets, `amdgpu` / `amdkfd`
+status, `/dev/kfd` ownership and group, user's group membership, system
+ROCm version and install method, framework version and arch list, the
 silent-footgun env vars, container/IOMMU state, and recent `amdgpu`
-kernel log lines. It deliberately does NOT spawn heavy probes (no kernel
-launches, no model downloads).
+kernel log lines. On Windows: AMD adapters and gfx targets via
+`Win32_VideoController` + `hipInfo.exe`, the HIP SDK install path and
+version, the Adrenalin / kernel-mode driver version, MSVC redistributable
+presence, and the same env-var snapshot. It deliberately does NOT spawn
+heavy probes (no kernel launches, no model downloads).

 ### Step 2: diagnose

@@ -204,16 +235,18 @@ useless probes against the host.

 | Framework | Examine the host? | Action |
 |---|---|---|
-| PyTorch | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
-| llama.cpp (built against system ROCm) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
+| PyTorch (Linux ROCm wheel) | Yes | `python scripts/examine.py --framework pytorch`, then `diagnose.py`. |
+| PyTorch (Windows TheRock wheel) | Yes | Same scripts; on Windows the catalog filters to Linux+Windows + Windows-only entries. |
+| llama.cpp (built against system ROCm/HIP SDK) | Yes | `python scripts/examine.py --framework llama-cpp`, then `diagnose.py`. |
 | Lemonade | No -- ships its own ROCm | Route to <https://github.com/lemonade-sdk/lemonade/issues> and the Lemonade [Discord](https://discord.gg/5xXzkMu8Zk). |
 | LM Studio | No -- ships its own runtime | Route to <https://lmstudio.ai/docs/app> (in-app support; no public repo). |
 | Ollama | No -- ships its own runtime | Route to <https://github.com/ollama/ollama/issues> and the Ollama Discord. |
 | vLLM / SGLang | Out of scope until phase 1+ | Route to the project's own issue tracker. |

 If a Lemonade / LM Studio / Ollama user *does* have a host-level ROCm
 problem (rare), it shows up when their app fails AND a standalone
-`rocminfo` also fails. Only then escalate to the full examination.
+`rocminfo` (Linux) / `hipInfo.exe` (Windows) also fails. Only then
+escalate to the full examination.

 ## Safety rules
