[Skill proposal] `rocm-doctor`

# `rocm-doctor` skill

## Skill metadata

| Field | Value |
|---|---|
| Proposed name | `rocm-doctor` |
| Already exists? | No, new |
| Catalog area | Hardware-native skills |
| Location | Path A (incubated here; may graduate to a product repo later) |

## One-sentence outcome

Given a user complaint of the form *"ROCm/PyTorch/llama.cpp isn't working on my AMD GPU"*, identify which of a fixed list of **known misconfigurations** is the cause and either fix it or hand the user a precise next step.

## Scope

### In scope

- Detect the user's setup: OS / distro / kernel, GPU `gfx` target, installed ROCm version, framework version (PyTorch / llama.cpp / etc.).
- Identify a fixed list of **known misconfigurations** (see next section) and either fix them or hand back exact next steps.
- Apply low-risk fixes on the user's behalf with consent (group membership, environment variables, `PATH`, reinstall a wheel from the
  right index, etc.).
- Route the user to the correct upstream channel when the issue is owned by another project (Lemonade, LM Studio, Ollama, vLLM, ...).

### Out of scope

- Deep-diving into internal ROCm bugs (kernel crashes, compiler ICEs, miscompiled kernels). Those belong on the relevant ROCm repo.
- Performance tuning. That's `mi-tuner`, `omniperf-tune`, `apu-memory-tuner`.
- Porting code (CUDA to HIP). That's `cuda-to-hip`.
- Holding a copy of any support matrix in the skill body. Matrices go stale
  fast; always fetch live from `rocm.docs.amd.com` at run time.
- Touching BIOS, flashing firmware, or auto-rebooting.
- Installing ROCm from scratch on a clean machine. That is a setup task,
  not a doctor task; point at the official installer instead.

## Device support, phased

| Phase | GPUs |
|---|---|
| 0 | Ryzen AI APUs (Strix Halo, Strix Point, Krackan, Phoenix, Hawk Point) |
| 1 | Instinct (Latest MI GPUs) |
| 2 | Radeon dGPUs |

## Framework support, phased

| Framework | Phase |
|---|---|
| PyTorch | 0 |
| llama.cpp | 0 |
| Lemonade | 0 |
| LM Studio | 0 |
| Ollama | 0 |
| vLLM | 1+ |
| SGLang | 1+ |

## High-level flow

```
[ ] 1. Identify the framework (ask if unclear).
[ ] 2. Decide whether system examination is needed for that framework.
       - Lemonade / LM Studio / Ollama -> skip, route to upstream.
       - Everything else -> continue.
[ ] 3. Run system examination (read-only).
[ ] 4. Match the symptoms against the known-misconfiguration list.
[ ] 5. Propose the fix; ask before applying anything that changes state.
[ ] 6. Re-verify after the fix; mark complete only when the originally
       failing command now succeeds.
```

Steps 1 through 4 are read-only and cheap. Step 5 is the only step that changes the system, and it always asks first.

## What the system examination collects

- OS name, distro, kernel version.
- AMD GPU(s) present, each one's `gfx` target.
- `amdgpu` kernel module status, `/dev/kfd` ownership and mode, user's
  group memberships.
- System ROCm version (if any) and how it was installed (`amdgpu-install`, distro packages, pip wheel, none).
- Framework version (PyTorch's `torch.version.hip`, llama.cpp build info, etc.) and which ROCm it links against.
- Relevant environment variables: `HSA_OVERRIDE_GFX_VERSION`, `HIP_VISIBLE_DEVICES`, `ROCM_PATH`, `PYTORCH_ROCM_ARCH`, `LD_LIBRARY_PATH`.

This is the minimum to disambiguate the 12 failure modes above. Anything
beyond that should justify its tokens.

## Known misconfigurations to catch

A closed list. If the user's symptom doesn't match one of these, the skill links to the right upstream tracker rather than guessing.

1. GPU `gfx` target not in the framework's build arch list.
2. `HSA_OVERRIDE_GFX_VERSION` set on a GPU that now has a native wheel.
3. ROCm version, Linux distro, and kernel form an unsupported triple.
4. User not in `render` / `video` groups, or `/dev/kfd` owned by the
   other group.
5. `amdgpu` kernel module not loaded or blacklisted.
6. ROCm binaries not on `PATH` after install.
7. Stale or conflicting APT / DNF repos from prior installer runs.
8. Framework wheel built for a different ROCm major than the system has.
9. iGPU enumerated alongside a dGPU and crashing the runtime.
10. Container can't see `/dev/kfd` or `/dev/dri/renderD*`.
11. Multi-GPU hang on systems with IOMMU enabled.
12. `amdgpu-install` left a broken state (e.g. `--accept-eula` repo
    regression, partial DKMS).

## Adjacent problem: ROCm facts live in hand-written tables

Most of what this skill needs (supported GPUs, kernel ranges, ROCm releases, wheel arch lists, gfx families) is scattered across hand-typed tables in docs pages, READMEs, and release notes. Everyone re-parses the same matrix, and they drift.

The real fix is bigger than this skill: ROCm wants a **single, agent-friendly source of truth** that feeds both the docs and skills like `rocm-doctor`. Until that exists, we scrape `rocm.docs.amd.com` at run time. It's a workaround, not the endgame.

## Why a skill, not a doc?

The information needed to debug a ROCm install is mostly already on [rocm.docs.amd.com](https://rocm.docs.amd.com) and in dozens of GitHub issues. What's missing is the **opinionated decision tree** a senior AMD engineer runs in their head: *which* check to run first, *which* fix to propose, *which* questions to skip (as well as supplemental scripts to make that easy/effective). That tree is what this skill encodes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Skill proposal] `rocm-doctor` #14

`rocm-doctor` skill

Skill metadata

One-sentence outcome

Scope

In scope

Out of scope

Device support, phased

Framework support, phased

High-level flow

What the system examination collects

Known misconfigurations to catch

Adjacent problem: ROCm facts live in hand-written tables

Why a skill, not a doc?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Field	Value
Proposed name	`rocm-doctor`
Already exists?	No, new
Catalog area	Hardware-native skills
Location	Path A (incubated here; may graduate to a product repo later)

Phase	GPUs
0	Ryzen AI APUs (Strix Halo, Strix Point, Krackan, Phoenix, Hawk Point)
1	Instinct (Latest MI GPUs)
2	Radeon dGPUs

Uh oh!

[Skill proposal] rocm-doctor #14

Description

rocm-doctor skill

Skill metadata

One-sentence outcome

Scope

In scope

Out of scope

Device support, phased

Framework support, phased

High-level flow

What the system examination collects

Known misconfigurations to catch

Adjacent problem: ROCm facts live in hand-written tables

Why a skill, not a doc?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[Skill proposal] `rocm-doctor` #14

`rocm-doctor` skill