feat: AMD Multi-GPU Support #750
Conversation
Lightheartdevs
left a comment
Audit Review — AMD Multi-GPU Support
Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.
Two bugs to fix, two things to verify:
Bug 1: Bash evaluation order in `_gpu_status` power reading

```bash
if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]
```

The `-eq` runs before the regex check. If `pw_uw` is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

```bash
if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]
```

Bug 2: Dead jq expression in `_gpu_reassign` auto-mode

The `llama_indices` assignment uses a jq filter that always returns an empty string:

```bash
llama_indices=$(echo "$assignment_json" | jq -r '
  [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
  "" ')
```

The actual index extraction happens in the while loop below. Remove the dead code.
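For comparison, the UUID-to-index mapping could also be done in a single jq pass, making the while loop unnecessary. A sketch — the JSON shape and field names below are assumptions for illustration, not the PR's actual schema, and it requires `jq` on PATH:

```shell
# Hypothetical assignment/topology JSON shape -- for illustration only.
assignment_json='{
  "gpu_assignment": {"services": {"llama_server": {"gpus": ["uuid-a", "uuid-c"]}}},
  "topology": {"gpus": [
    {"uuid": "uuid-a", "index": 0},
    {"uuid": "uuid-b", "index": 1},
    {"uuid": "uuid-c", "index": 2}]}
}'

# Map each assigned UUID back to its topology index, then join as CSV.
llama_indices=$(echo "$assignment_json" | jq -r '
  .topology.gpus as $topo
  | [.gpu_assignment.services.llama_server.gpus[]
     | . as $u | $topo[] | select(.uuid == $u) | .index]
  | map(tostring) | join(",")')
echo "$llama_indices"   # → 0,2
```

Either way, keeping one mechanism (jq or the loop, not both) avoids this kind of dead code.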
Verify: compute_subset rank_matrix indexing
In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

```python
discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)
```

Confirm `compute_subset` uses `gpu.index` (the original topology index) for `rank_matrix` lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
Verify: Old multigpu.yml filename references
The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.
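That check can be a one-liner from the repo root. A self-contained demo against a throwaway directory (in practice, point the grep at the actual checkout):

```shell
# Demo tree with one stale reference (stand-in for the real repo checkout).
tree=$(mktemp -d)
echo 'COMPOSE_FLAGS="-f docker-compose.multigpu.yml"' > "$tree/old-ref.sh"

# List any file still naming the old overlay; the dots are escaped so they
# match literal dots rather than arbitrary characters.
stale=$(grep -rl 'docker-compose\.multigpu\.yml' "$tree")
echo "${stale:-clean}"
rm -rf "$tree"
```

An empty result means no hardcoded references to the old filename survive the rename.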
Non-blocking notes (follow-up material)
- `gpu-database.json`: RX 7900 XTX/XT/GRE share device_id `0x744c` — the matcher must prioritize `name_patterns` over `device_ids`
- Missing schema entries for `ROCR_VISIBLE_DEVICES`, `VIDEO_GID`, `RENDER_GID`, `HSA_OVERRIDE_GFX_VERSION`
- `_gpu_reassign` interactive prompts have no `--yes` flag — will hang in non-interactive pipelines
- No tests for 0-GPU and 1-GPU edge cases, the `_detect_topo_sysfs` fallback, or mixed APU+dGPU in the API detailed view
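On the missing `--yes` flag: the usual pattern is a guard that short-circuits every prompt. A minimal sketch — the function and flag names here are hypothetical, not the PR's actual code:

```shell
# Hypothetical non-interactive guard for _gpu_reassign-style prompts.
ASSUME_YES=1   # a --yes CLI flag would set this

confirm() {
  # Skip the prompt entirely when running non-interactively.
  if [[ "$ASSUME_YES" == 1 ]]; then
    return 0
  fi
  read -rp "$1 [y/N] " reply
  [[ "$reply" =~ ^[Yy]$ ]]
}

result=$(confirm "Apply new GPU assignment?" && echo applied)
echo "$result"   # → applied
```

With the guard in place, CI and scripted installs never block on `read`.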
Good
- 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
- Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
- All 5 CLI GPU commands are now vendor-aware
- Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
- Docker Compose overlays follow existing patterns (`ROCR_VISIBLE_DEVICES` parallels `NVIDIA_VISIBLE_DEVICES`)
- `resolve-compose-stack.sh` correctly constructs vendor-specific overlay filenames
- All CI failures are pre-existing on main, none caused by this PR
Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

- Fixed Bug 1 (power reading eval order)
- Fixed Bug 2 (dead `llama_indices`)
- Verified `compute_subset` `rank_matrix` indexing
- Verified old `multigpu.yml` filename references
- Non-blocking notes

Thanks again for the review!
Lightheartdevs
left a comment
Audit: NEEDS DISCUSSION — scope and existing review
This is a substantial PR (+2,866 lines) touching installer phases, CLI, dashboard API, compose overlays, and tests. The architecture looks sound:
- Multi-backend topology detection (amd-smi, rocm-smi, sysfs fallback) with proper fallback chains
- `ROCR_VISIBLE_DEVICES` for GPU isolation in compose overlays
- `GROUP_ADD` for video/render GIDs is correct for AMD GPU access
- Impressive test coverage (25 BATS + 23 shell + 16 pytest)
- No `shell=True` in subprocess calls
Concerns:
- The `multigpu` → `multigpu-nvidia` rename is a breaking change if any automation references the old filename — verify `resolve-compose-stack.sh` handles it
- Several AMD GPUs share `device_id`s (e.g., RX 7900 XTX and XT both have `0x744c`) — the consumer code must use `name_patterns` to disambiguate. Verify matching priority.
- `amd-topo.sh` parses sysfs and amd-smi output — needs review for command injection in the parsing paths
- XCP virtual card filtering (MI300X) relies on an empty vendor file — hardware-specific, hard to test without real MI300X
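On the injection point: the first review notes PCI BDF strings are already regex-filtered. A sketch of that style of allow-list validation — the function name is hypothetical, and the pattern here is illustrative rather than copied from `amd-topo.sh`:

```shell
# Accept only canonical PCI BDF strings (domain:bus:device.function)
# before they are ever interpolated into a sysfs path or command line.
bdf_ok() {
  [[ "$1" =~ ^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$ ]]
}

bdf_ok "0000:03:00.0" && good=yes
bdf_ok '0000:03:00.0"; rm -rf /' || rejected=yes
echo "$good $rejected"   # → yes yes
```

Allow-listing the exact shape, rather than stripping "bad" characters, is the safer direction for parser-derived strings.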
Status: Already has changes requested. The outstanding review comments should be addressed before further deep audit. The core architecture is solid but the blast radius warrants careful incremental review.
Force-pushed from 9a834bb to 83bbb0c
Thanks for the thorough review! I made some adjustments, and here's where each concern lands after verification, with fixes in the latest commit.
…se overlays, CLI, and tests
… integer comparison
…GID/HSA_OVERRIDE_GFX_VERSION descriptions, update .env.example
…jection and pcie_width parsing in amd-topo
Force-pushed from 83bbb0c to 0aaf257
Lightheartdevs
left a comment
Appreciate the scope — end-to-end AMD multi-GPU is a real gap and the work is substantial: topology library, GPU assignment, compose overlays, CLI commands, dashboard monitoring, plus real-hardware testing on 4x MI300X. The architecture (three detection backends with fallback chain, link-ranking 0-100, per-service GPU isolation via ROCR_VISIBLE_DEVICES) is thoughtful.
Blocking issues:
- Merge state is CONFLICTING. Needs rebase on main before any meaningful review. The file list shows overlap with multiple recent PRs (`.env.schema.json`, `detection.sh`, `resolve-compose-stack.sh`, `dream-cli`, `routers/gpu.py`) — conflicts are likely non-trivial.
- Breaking rename: `docker-compose.multigpu.yml` → `docker-compose.multigpu-nvidia.yml`. This affects:
  - Any existing install's `.compose-flags` cache that lists the old filename
  - `docker-compose.multigpu.yml` exists at root and is referenced by `resolve-compose-stack.sh:138` via `(script_dir / "docker-compose.multigpu.yml").exists()`
  - Documentation that references the old name

  Need either (a) a migration step in `scripts/` that rewrites `.compose-flags` on first run, or (b) keeping the old name as a symlink/alias for a release cycle. Right now this will break any existing multi-GPU NVIDIA install on update.
- No CI coverage for AMD multi-GPU. The BATS/pytest tests are excellent for the pure-Python/shell pieces, but there's no smoke test that the overlay resolution actually produces a working compose stack. The `test-installer-contracts.sh` additions (+87 lines) help but don't validate runtime. Consider a dry-run smoke job in `.github/workflows/` that asserts the resolved compose is valid `docker compose config` output.
- `LEMONADE_LLAMACPP` env var — please add it to `.env.schema.json` with a description and allowed values. I see it used in the compose overlay but don't see a schema entry.
- First-time contributor, large feature. I don't know if that's a blocker for this repo, but please confirm there's a maintainer committed to ongoing AMD multi-GPU support. The maintenance surface (three detection backends, fixture refresh on new ROCm versions, MI300X XCP virtual card handling) is real.
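On the rename migration (option (a) above), the rewrite could be as small as a guarded `sed` over the cached flags. A sketch — the cache path and line format here are assumptions, and the demo uses a temp file in place of a real install's cache:

```shell
# Stand-in for an existing install's cached compose flags.
flags_cache=$(mktemp)
echo '-f docker-compose.yml -f docker-compose.multigpu.yml' > "$flags_cache"

# One-time migration: rewrite the old overlay name if it is still cached.
# The pattern cannot re-match the new name, so the step is idempotent.
if grep -q 'docker-compose\.multigpu\.yml' "$flags_cache"; then
  sed -i 's/docker-compose\.multigpu\.yml/docker-compose.multigpu-nvidia.yml/g' "$flags_cache"
fi

migrated=$(cat "$flags_cache")
echo "$migrated"
rm -f "$flags_cache"
```

Running this early in the installer would let existing multi-GPU NVIDIA installs update cleanly without the symlink/alias option.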
Non-blocking notes:
- Link ranking system is smart — XGMI > same-switch PCIe > cross-NUMA. Consider documenting the ranking numbers in `dream-server/docs/` so future changes don't silently invert them.
- Fixture files from real hardware (`tests/fixtures/amd/*`) are excellent. Please note which ROCm/amd-smi version they were captured from, so they can be refreshed as the tools evolve.
- `ROCR_VISIBLE_DEVICES` per-service isolation is the right call over relying on llama.cpp `--split-mode` alone.
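Even a tiny rank table in the docs would pin down the ordering. A sketch of the shape such documentation could take — the numbers and link-type names below are hypothetical, not the PR's actual 0-100 values:

```shell
# Hypothetical link-type -> rank mapping in a 0-100 scheme.
# Only the ordering matters here: XGMI > same-switch PCIe > cross-NUMA.
link_rank() {
  case "$1" in
    xgmi)        echo 100 ;;  # direct XGMI link: best bandwidth/latency
    pcie_switch) echo 60  ;;  # PCIe peers under the same switch
    pcie_host)   echo 40  ;;  # PCIe via host bridge, same NUMA node
    cross_numa)  echo 20  ;;  # traffic crosses a NUMA boundary
    *)           echo 0   ;;  # unknown / no link
  esac
}

link_rank xgmi
link_rank cross_numa
```

Writing the table down once means a future refactor that inverts two ranks gets caught in review rather than in benchmark regressions.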
Happy to re-review once the rebase + the rename migration + the LEMONADE_LLAMACPP schema are addressed. The core work is solid — don't want it to bitrot. Thanks for the contribution.
feat: AMD Multi-GPU Support
End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.
Previously, AMD support was limited to single-GPU. This branch implements end-to-end support for multiple AMD GPUs: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.
What was added
Hardware Detection
- Detects AMD GPUs (`vendor=0x1002` cards)

Topology Detection
Installer Integration
- Writes GPU settings to `.env` only when applicable

Docker Compose Overlays
- `--split-mode` passed via `--llamacpp-args` (Lemonade's official mechanism)
- `LEMONADE_LLAMACPP` env var (compatible with both Python and C++ Lemonade builds)
- Per-service isolation via `ROCR_VISIBLE_DEVICES`
- Renamed `multigpu` to `multigpu-nvidia`

CLI (`dream gpu` commands)
- All five commands (`status`, `topology`, `assign`, `reassign`, `monitor`) are AMD-aware
- Supports `LLAMA_SERVER_GPU_INDICES` and per-service `*_GPU_INDEX`

Dashboard API
- `GPU_ASSIGNMENT_JSON_B64`

Environment & Schema
- `LLAMA_SERVER_GPU_INDICES` — comma-separated GPU indices for `ROCR_VISIBLE_DEVICES`
- `COMFYUI_GPU_INDEX`, `WHISPER_GPU_INDEX`, `EMBEDDINGS_GPU_INDEX` — per-service GPU index
- `LLAMA_ARG_SPLIT_MODE`, `LLAMA_ARG_TENSOR_SPLIT` — llama.cpp multi-GPU parameters
- Documented in `.env.schema.json` and `.env.example`

Tests
Files Changed
- `installers/lib/amd-topo.sh` (new)
- `installers/lib/detection.sh`, `installers/phases/02-detection.sh`
- `installers/phases/03-features.sh`, `installers/phases/06-directories.sh`, `installers/phases/10-amd-tuning.sh`
- `docker-compose.multigpu-amd.yml` (new), `docker-compose.multigpu-nvidia.yml` (renamed)
- `extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml` (new), NVIDIA counterparts renamed
- `dream-cli`
- `scripts/assign_gpus.py`, `scripts/resolve-compose-stack.sh`
- `extensions/services/dashboard-api/gpu.py`
- `config/gpu-database.json`, `.env.schema.json`, `.env.example`
- `tests/bats-tests/amd-topo.bats`, `tests/test-amd-topo.sh`, `tests/fixtures/amd/*`, `extensions/services/dashboard-api/tests/test_gpu_amd.py`
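For reference, the new environment variables could appear in a `.env` like this. The values below are illustrative only, not recommendations (`layer` is one of llama.cpp's split modes, alongside `none` and `row`):

```shell
# Illustrative .env fragment -- values are examples, not recommendations.
LLAMA_SERVER_GPU_INDICES=0,1     # becomes ROCR_VISIBLE_DEVICES for llama-server
COMFYUI_GPU_INDEX=2              # per-service pinning
WHISPER_GPU_INDEX=2
EMBEDDINGS_GPU_INDEX=2
LLAMA_ARG_SPLIT_MODE=layer       # llama.cpp split mode: none | layer | row
LLAMA_ARG_TENSOR_SPLIT=0.5,0.5   # proportional split across the two GPUs
```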