rocm-doctor skill [Phase 0] #37
Merged
Important
This skill still needs to go through tests and design iterations before it is published. We are currently looking for someone on the ROCm team to own it.
rocm-doctor skill
Skill metadata
rocm-doctor
One-sentence outcome
Given a user complaint of the form "ROCm/PyTorch/llama.cpp isn't working on my AMD GPU", identify which of a fixed list of known misconfigurations is the cause and either fix it or hand the user a precise next step.
Scope
In scope
gfx target, installed ROCm version, framework version (PyTorch / llama.cpp / etc.).
PATH, reinstall a wheel from the right index, etc.).
Out of scope
mi-tuner, omniperf-tune, apu-memory-tuner.
cuda-to-hip.
fast; always fetch live from rocm.docs.amd.com at run time.
not a doctor task; point at the official installer instead.
Device support, phased
Framework support, phased
High-level flow
Steps 1 through 4 are read-only and cheap. Step 5 is the only step that changes the system, and it always asks first.
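The "always asks first" rule for the one mutating step could be enforced with a small gate like the one below. This is a minimal sketch, not the skill's actual implementation: `apply_fix`, its parameters, and the prompt wording are hypothetical names chosen for illustration.

```python
import subprocess


def apply_fix(description, command, confirm=input, run=subprocess.run):
    """Run a mutating command only after an explicit yes from the user.

    `confirm` and `run` are injectable so the gate can be exercised
    without touching the system (hypothetical design, not from the source).
    """
    answer = confirm(f"{description}\nRun {command!r}? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        return False  # declined: nothing was changed
    run(command, check=True)  # raises CalledProcessError if the fix fails
    return True
```

Defaulting to "no" on anything other than an explicit `y`/`yes` keeps an accidental Enter keypress from changing the system.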
What the system examination collects
gfx target.
amdgpu kernel module status, /dev/kfd ownership and mode, user's group memberships.
amdgpu-install, distro packages, pip wheel, none).
torch.version.hip, llama.cpp build info, etc.) and which ROCm it links against.
HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, ROCM_PATH, PYTORCH_ROCM_ARCH, LD_LIBRARY_PATH.
This is the minimum to disambiguate the 12 failure modes below. Anything beyond that should justify its tokens.
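Part of this read-only collection can be sketched with the Python standard library alone. The sketch below covers only the environment variables, `/dev/kfd` ownership and mode, group memberships, and the `amdgpu` module status. The function names and the shape of the returned dictionary are assumptions made for illustration, and reading `/proc/modules` is one possible way to check the module, not necessarily the one the skill uses.

```python
import grp
import os
import stat

# Environment variables named in the list above.
ENV_VARS = (
    "HSA_OVERRIDE_GFX_VERSION",
    "HIP_VISIBLE_DEVICES",
    "ROCM_PATH",
    "PYTORCH_ROCM_ARCH",
    "LD_LIBRARY_PATH",
)


def device_node_info(path="/dev/kfd"):
    """Return owning gid, group name, and mode of a node, or None if absent."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None  # the gid has no name on this system
    return {"gid": st.st_gid, "group": group, "mode": oct(stat.S_IMODE(st.st_mode))}


def user_groups():
    """Names of the groups the current process belongs to.

    Note: groups added since the last login are not visible here.
    """
    names = set()
    for gid in os.getgroups() + [os.getegid()]:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass
    return sorted(names)


def amdgpu_loaded(proc_modules="/proc/modules"):
    """True/False if the module list is readable, None if it is not.

    A driver built into the kernel would not appear in this list.
    """
    try:
        with open(proc_modules) as fh:
            return any(line.split()[0] == "amdgpu" for line in fh if line.strip())
    except OSError:
        return None


def collect_facts(environ=None):
    """Gather the read-only facts into one dictionary (hypothetical shape)."""
    environ = os.environ if environ is None else environ
    return {
        "env": {name: environ.get(name) for name in ENV_VARS},
        "kfd": device_node_info("/dev/kfd"),
        "groups": user_groups(),
        "amdgpu_loaded": amdgpu_loaded(),
    }
```

Every function returns `None` rather than raising when a file or device is missing, so the collection stays cheap and safe to run on a machine with no AMD GPU at all.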
Known misconfigurations to catch
A closed list. If the user's symptom doesn't match one of these, the skill links to the right upstream tracker rather than guessing.
gfx target not in the framework's build arch list.
HSA_OVERRIDE_GFX_VERSION set on a GPU that now has a native wheel.
render/video groups, or /dev/kfd owned by the other group.
amdgpu kernel module not loaded or blacklisted.
PATH after install.
/dev/kfd or /dev/dri/renderD*.
amdgpu-install left a broken state (e.g. --accept-eula repo regression, partial DKMS).
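The closed-list behaviour (match a known entry, otherwise hand off to an upstream tracker instead of guessing) might look like the sketch below. It covers only three of the entries above. The fact keys, the entry ids, and the rule that "has a native wheel" means "gfx target appears in the framework's arch list" are all assumptions made for illustration, and `tracker_url` is a placeholder supplied by the caller.

```python
def match_misconfigurations(facts):
    """Return ids of closed-list entries that the collected facts match."""
    hits = []
    env = facts.get("env", {})
    gfx = facts.get("gfx_target")
    arch_list = facts.get("framework_arch_list")

    # gfx target not in the framework's build arch list.
    if gfx and arch_list is not None and gfx not in arch_list:
        hits.append("gfx-not-in-arch-list")

    # HSA_OVERRIDE_GFX_VERSION set on a GPU that now has a native wheel
    # (assumption: "native wheel" == gfx target present in the arch list).
    if env.get("HSA_OVERRIDE_GFX_VERSION") and gfx and arch_list and gfx in arch_list:
        hits.append("stale-hsa-override")

    # amdgpu kernel module not loaded; None means "could not tell", so skip.
    if facts.get("amdgpu_loaded") is False:
        hits.append("amdgpu-not-loaded")

    return hits


def diagnose(facts, tracker_url):
    """Match against the closed list, or point upstream instead of guessing."""
    hits = match_misconfigurations(facts)
    if hits:
        return {"status": "matched", "ids": hits}
    return {"status": "unmatched", "next_step": f"Report upstream: {tracker_url}"}
```

Treating an unknown fact (`None`) as "no match" rather than as a failure keeps the skill from proposing a fix on evidence it never collected.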
Adjacent problem: ROCm facts live in hand-written tables
Most of what this skill needs (supported GPUs, kernel ranges, ROCm releases, wheel arch lists, gfx families) is scattered across hand-typed tables in docs pages, READMEs, and release notes. Everyone re-parses the same matrix, and they drift.
The real fix is bigger than this skill: ROCm wants a single, agent-friendly source of truth that feeds both the docs and skills like rocm-doctor. Until that exists, we scrape rocm.docs.amd.com at run time. It's a workaround, not the endgame.
Why a skill, not a doc?
The information needed to debug a ROCm install is mostly already on rocm.docs.amd.com and in dozens of GitHub issues. What's missing is the opinionated decision tree a senior AMD engineer runs in their head: which check to run first, which fix to propose, which questions to skip (as well as supplemental scripts to make that easy/effective). That tree is what this skill encodes.