
feat: AMD Multi-GPU Support #750

Open

y-coffee-dev wants to merge 6 commits into Light-Heart-Labs:main from y-coffee-dev:feat/amd-multi-gpu

Conversation

@y-coffee-dev
Contributor

feat: AMD Multi-GPU Support

End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.

Previously, AMD support was limited to a single GPU. This branch implements end-to-end support for multiple AMD GPUs: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.

What was added

Hardware Detection

  • Multi-GPU AMD detection via sysfs (counts all vendor=0x1002 cards)
  • Handles XCP virtual cards on MI300X (filters by non-empty vendor file)
  • Total VRAM aggregation across all detected AMD GPUs
  • Mixed APU + discrete GPU classification
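As a rough sketch of the detection rule (a hypothetical Python port; the real logic is bash in installers/lib/detection.sh):

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"

def count_amd_gpus(drm_root="/sys/class/drm"):
    """Count physical AMD GPUs under sysfs, skipping MI300X XCP
    virtual partitions (which expose an empty vendor file)."""
    count = 0
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in card.name:  # skip connector entries like card0-DP-1
            continue
        try:
            vendor = (card / "device" / "vendor").read_text().strip()
        except OSError:
            continue
        # XCP virtual cards have an empty vendor file; real GPUs report 0x1002
        if vendor == AMD_VENDOR_ID:
            count += 1
    return count
```

The same vendor check doubles as the XCP filter: a partition with an empty vendor file simply never equals 0x1002.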

Topology Detection

  • Full AMD topology library with three detection backends: amd-smi JSON, rocm-smi text, sysfs NUMA/IOMMU fallback
  • Inter-GPU link classification: XGMI, PCIe-SameSwitch, PCIe-HostBridge, PCIe-CrossNUMA
  • Link ranking system (0–100) for topology-aware GPU assignment
  • Per-GPU metadata: render node, GFX version, PCI BDF, VRAM, memory type (unified/discrete)
  • GPU identification with three fallback methods: amd-smi UUID, sysfs unique_id, composite PCI BDF
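The link classes and ranking can be illustrated like this (the rank numbers below are placeholders for illustration; the actual 0-100 scores are defined in amd-topo.sh):

```python
# Illustrative rank values only; the real 0-100 scores live in amd-topo.sh.
LINK_RANK = {
    "XGMI": 100,             # direct GPU-to-GPU fabric, best bandwidth
    "PCIE_SAME_SWITCH": 60,  # peers under one PCIe switch
    "PCIE_HOST_BRIDGE": 40,  # traffic crosses the host bridge
    "PCIE_CROSS_NUMA": 10,   # worst case: crosses NUMA domains
}

def best_pair(links):
    """Pick the GPU pair with the highest-ranked interconnect.
    `links` maps (gpu_a, gpu_b) tuples to link-type strings."""
    return max(links, key=lambda pair: LINK_RANK[links[pair]])
```

Topology-aware assignment then prefers, say, an XGMI-linked pair over two GPUs that can only talk across NUMA domains.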

Installer Integration

  • AMD topology detection phase (runs when GPU_COUNT > 1 and backend is AMD)
  • Vendor-aware GPU assignment extraction
  • AMD multi-GPU env vars written to .env only when applicable
  • Render node verification in AMD tuning phase

Docker Compose Overlays

  • AMD multi-GPU overlay for llama-server with Lemonade passthrough
  • --split-mode passed via --llamacpp-args (Lemonade's official mechanism)
  • ROCm backend selected via LEMONADE_LLAMACPP env var (compatible with both Python and C++ Lemonade builds)
  • Per-service GPU isolation via ROCR_VISIBLE_DEVICES
  • Renamed existing NVIDIA overlays from generic multigpu to multigpu-nvidia

CLI (dream gpu commands)

  • All five GPU commands (status, topology, assign, reassign, monitor) are AMD-aware
  • AMD GPU status table with VRAM, utilization, temperature, power via amd-smi
  • AMD topology display with GFX versions, memory types, and render nodes
  • AMD GPU reassignment writes LLAMA_SERVER_GPU_INDICES and per-service *_GPU_INDEX

Dashboard API

  • AMD GPU monitoring via amd-smi and sysfs hwmon
  • Per-GPU metrics: utilization, VRAM, temperature, power, fan speed
  • GPU assignment decoding from GPU_ASSIGNMENT_JSON_B64
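Decoding the assignment on the API side amounts to (the payload shape is assumed from the key names in this PR; the function name is illustrative):

```python
import base64
import json
import os

def load_gpu_assignment(env=os.environ):
    """Decode the base64-encoded JSON assignment the installer exports
    as GPU_ASSIGNMENT_JSON_B64. Returns None when the var is unset."""
    raw = env.get("GPU_ASSIGNMENT_JSON_B64")
    if not raw:
        return None
    return json.loads(base64.b64decode(raw))
```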

Environment & Schema

  • LLAMA_SERVER_GPU_INDICES — comma-separated GPU indices for ROCR_VISIBLE_DEVICES
  • COMFYUI_GPU_INDEX, WHISPER_GPU_INDEX, EMBEDDINGS_GPU_INDEX — per-service GPU index
  • LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT — llama.cpp multi-GPU parameters
  • All new env vars added to .env.schema.json and .env.example
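As a hypothetical sketch of how these vars translate into per-container visibility (the real wiring happens in the compose overlays, not in a function like this):

```python
def service_env(llama_indices, per_service):
    """Map the schema vars onto per-container ROCR_VISIBLE_DEVICES values:
    llama-server gets the comma-separated LLAMA_SERVER_GPU_INDICES list,
    each other service gets its single *_GPU_INDEX."""
    env = {"llama-server": {"ROCR_VISIBLE_DEVICES": llama_indices}}
    for service, idx in per_service.items():
        env[service] = {"ROCR_VISIBLE_DEVICES": str(idx)}
    return env
```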

Tests

  • 25 BATS unit tests for amd-topo.sh (render node, GFX version, GPU name, GPU ID, topology parsing)
  • 23 shell integration tests with fixture files (4-GPU XGMI, 2-GPU PCIe, field-based select, cross-tool agreement)
  • 16 pytest tests for dashboard-api AMD GPU monitoring
  • Fixture files extracted from real hardware for amd-smi JSON and rocm-smi text output
  • Real hardware tests on 4x AMD Instinct MI300X

Files Changed

Area                 Files
Topology detection   installers/lib/amd-topo.sh (new)
Hardware detection   installers/lib/detection.sh, installers/phases/02-detection.sh
Installer            installers/phases/03-features.sh, installers/phases/06-directories.sh, installers/phases/10-amd-tuning.sh
Compose overlays     docker-compose.multigpu-amd.yml (new), docker-compose.multigpu-nvidia.yml (renamed)
Service overlays     extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml (new), NVIDIA renamed
CLI                  dream-cli
Assignment           scripts/assign_gpus.py, scripts/resolve-compose-stack.sh
Dashboard API        extensions/services/dashboard-api/gpu.py
Config               config/gpu-database.json, .env.schema.json, .env.example
Tests                tests/bats-tests/amd-topo.bats, tests/test-amd-topo.sh, tests/fixtures/amd/*, extensions/services/dashboard-api/tests/test_gpu_amd.py

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit Review — AMD Multi-GPU Support

Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.

Two bugs to fix, two things to verify:

Bug 1: Bash evaluation order in _gpu_status power reading

if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]

The -eq runs before the regex check. If pw_uw is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]

Bug 2: Dead jq expression in _gpu_reassign auto-mode

The llama_indices assignment uses a jq filter that always returns empty string:

llama_indices=$(echo "$assignment_json" | jq -r '
    [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
    "" ')

The actual index extraction happens in the while loop below. Remove the dead code.

Verify: compute_subset rank_matrix indexing

In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)

Confirm compute_subset uses gpu.index (original topology index) for rank_matrix lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
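To see why this matters, a toy reproduction (Gpu, get_rank, and the matrix here are simplified stand-ins, not the real assign_gpus.py types):

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    index: int          # original topology index
    memory_type: str

def get_rank(rank_matrix, a, b):
    return rank_matrix[a][b]

def pair_rank_by_topology_index(gpus, rank_matrix):
    """Correct behavior: look up ranks by each GPU's original topology
    index, so filtering the list does not shift the lookups."""
    a, b = gpus[0].index, gpus[1].index
    return get_rank(rank_matrix, a, b)

# 4-GPU rank matrix: GPUs 1 and 3 are XGMI-linked (rank 100)
rank = [[0,  10, 10, 10],
        [10, 0,  10, 100],
        [10, 10, 0,  10],
        [10, 100, 10, 0]]

gpus = [Gpu(0, "unified"), Gpu(1, "discrete"),
        Gpu(2, "unified"), Gpu(3, "discrete")]
discrete = [g for g in gpus if g.memory_type == "discrete"]
```

Using g.index, the discrete pair correctly resolves to rank 100; a list-position lookup would read rank[0][1] and report 10 instead.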

Verify: Old multigpu.yml filename references

The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.

Non-blocking notes (follow-up material)

  • gpu-database.json: RX 7900 XTX/XT/GRE share device_id 0x744c — matcher must prioritize name_patterns over device_ids
  • Missing schema entries for ROCR_VISIBLE_DEVICES, VIDEO_GID, RENDER_GID, HSA_OVERRIDE_GFX_VERSION
  • _gpu_reassign interactive prompts have no --yes flag — will hang in non-interactive pipelines
  • No tests for 0-GPU and 1-GPU edge cases, _detect_topo_sysfs fallback, or mixed APU+dGPU in the API detailed view

Good

  • 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
  • Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
  • All 5 CLI GPU commands are now vendor-aware
  • Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
  • Docker Compose overlays follow existing patterns (ROCR_VISIBLE_DEVICES parallels NVIDIA_VISIBLE_DEVICES)
  • resolve-compose-stack.sh correctly constructs vendor-specific overlay filenames
  • All CI failures are pre-existing on main, none caused by this PR

@y-coffee-dev
Contributor Author

Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

Fixed Bug 1 (power reading eval order):
Great catch. You're right that -eq on a non-numeric value would cause issues before the regex guard gets a chance to run. In practice pw_uw is initialized to 0 and reset on cat failure, so sysfs would have to return a non-numeric string successfully for this to trigger, which is unlikely but not impossible. I swapped the order so the regex check runs first just in case.

Fixed Bug 2 (dead llama_indices):
Yep, that was a leftover from an earlier approach. The jq expression [...] as $uuids | "" literally evaluates to an empty string, and the variable is never referenced after assignment. I removed it.

Verified compute_subset rank_matrix indexing:
I went through this carefully: compute_subset() at line 130 does indices = [g.index for g in gpus] and then uses those indices for rank_matrix lookups via get_rank(rank_matrix, a, b). So when we filter to discrete-only GPUs, it's still looking up ranks by their original topology indices, not by list position, and we're good.

Verified old multigpu.yml filename references:
I grepped the full codebase. The old docker-compose.multigpu.yml / compose.multigpu.yaml do not appear in any code files, no references in dream-cli, resolve-compose-stack.sh, tests, or any production code. All good on this as well.

Non-blocking notes:
All fair observations. I went ahead and addressed the schema one: ROCR_VISIBLE_DEVICES was the only entry actually missing from .env.schema.json; VIDEO_GID, RENDER_GID, and HSA_OVERRIDE_GFX_VERSION were already there, so I just improved their descriptions and added defaults. I also updated .env.example with all four AMD-specific vars.

Thanks again for the review!

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit: NEEDS DISCUSSION — scope and existing review

This is a substantial PR (+2,866 lines) touching installer phases, CLI, dashboard API, compose overlays, and tests. The architecture looks sound:

  • Multi-backend topology detection (amd-smi, rocm-smi, sysfs fallback) with proper fallback chains
  • ROCR_VISIBLE_DEVICES for GPU isolation in compose overlays
  • GROUP_ADD for video/render GIDs is correct for AMD GPU access
  • Impressive test coverage (25 BATS + 23 shell + 16 pytest)
  • No shell=True in subprocess calls

Concerns:

  • The multigpu → multigpu-nvidia rename is a breaking change if any automation references the old filename — verify resolve-compose-stack.sh handles it
  • Several AMD GPUs share device_ids (e.g., RX 7900 XTX and XT both have 0x744c) — the consumer code must use name_patterns to disambiguate. Verify matching priority.
  • amd-topo.sh parses sysfs and amd-smi output — needs review for command injection in the parsing paths
  • XCP virtual card filtering (MI300X) relies on empty vendor file — hardware-specific, hard to test without real MI300X

Status: Already has changes requested. The outstanding review comments should be addressed before further deep audit. The core architecture is solid but the blast radius warrants careful incremental review.

@y-coffee-dev force-pushed the feat/amd-multi-gpu branch 2 times, most recently from 9a834bb to 83bbb0c on April 5, 2026
@y-coffee-dev
Contributor Author

Thanks for the thorough review! I made some adjustments, and here's where each concern lands after verification, with fixes in the latest commit.

multigpu to multigpu-nvidia rename

Verified. The old files existed on main and were renamed (identical content). resolve-compose-stack.sh was updated to construct filenames dynamically per backend (f"docker-compose.multigpu-{gpu_backend}.yml"). dream-cli has zero references to any multigpu filename. Updates go through git pull which handles renames transparently.

Shared device_id 0x744c (XTX / XT / GRE) - fixed

Valid concern, and it was introduced by this PR's new gpu-database.json entries. I traced through the matching logic in classify-hardware.sh and found two issues:

Substring collision: "RX 7900 XT" is a substring of "RX 7900 XTX". The old matcher broke on the first dual match, making JSON entry ordering load-bearing. Fixed by tracking the longest matched pattern. "RX 7900 XTX" (13 chars) now wins over "RX 7900 XT" (10 chars) regardless of entry order.

Empty name fallback: When sysfs product_name is unavailable, the old code fell back to the first device_id match (always XTX). Fixed by using VRAM proximity as a tiebreaker. The classifier already receives --vram-mb, so a 20 GB card now resolves to XT and a 16 GB card to GRE. When VRAM is also 0, it picks the smallest matching card, because under-provisioning is safer: a smaller model runs fine, while over-provisioning may crash the model loader.
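The two fixes can be sketched together in Python (the real matcher is classify-hardware.sh; the entry fields and function name here are assumptions for illustration):

```python
def classify_gpu(product_name, vram_mb, database):
    """Sketch of the matching rules described above.
    `database` entries: {"name": ..., "name_pattern": ..., "vram_mb": ...}."""
    # 1. Longest matching name_pattern wins, so "RX 7900 XTX"
    #    beats "RX 7900 XT" regardless of entry order.
    best = None
    if product_name:
        for entry in database:
            if entry["name_pattern"] in product_name:
                if best is None or len(entry["name_pattern"]) > len(best["name_pattern"]):
                    best = entry
    if best:
        return best["name"]
    # 2. No usable name: tie-break shared device_ids by VRAM proximity.
    if vram_mb:
        return min(database, key=lambda e: abs(e["vram_mb"] - vram_mb))["name"]
    # 3. Nothing to go on: pick the smallest card (under-provisioning is safer).
    return min(database, key=lambda e: e["vram_mb"])["name"]
```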

Added 35 test assertions to test-installer-contracts.sh covering the happy path as well as substring safety (XT must not match XTX and vice versa), tier/bandwidth correctness, VRAM tiebreaker with exact and approximate values, zero-VRAM fallback, the 0x7480 pair, name-only matches, and unknown GPU heuristic fallback.

Command injection in parsing paths

Walked every external data ingestion point in amd-topo.sh. No injection vector exists. All sysfs reads use kernel-controlled paths, jq calls use integer loop counters, awk calls use -v parameter passing, PCI BDFs are regex-filtered, zero uses of eval. Python side: zero shell=True in entire dashboard-api/.

Found one inconsistency: $vram_bytes was interpolated directly into an awk program string while the rest of the file used -v. Fixed it in this commit. Also added a missing grep -oP '^\d+' filter on pcie_width to match pcie_gen's existing filter. MI300X fixture data confirmed current_link_width can be "0" and current_link_speed can be "Unknown".

XCP virtual card filtering (MI300X)

Confirmed using real MI300X fixture data. The 4x MI300X system exposes 32 card entries: 4 real GPUs (card0, card8, card16, card24) with vendor=0x1002, and 28 XCP virtual partitions with empty vendor files. The standard vendor == "0x1002" check naturally filters them. All three code paths (detection.sh, amd-topo.sh, gpu.py) use the same mechanism. Python path is tested in test_gpu_amd.py, bash paths use identical logic. Already validated on real MI300X hardware.

Thanks again for the review!

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Appreciate the scope — end-to-end AMD multi-GPU is a real gap and the work is substantial: topology library, GPU assignment, compose overlays, CLI commands, dashboard monitoring, plus real-hardware testing on 4x MI300X. The architecture (three detection backends with fallback chain, link-ranking 0-100, per-service GPU isolation via ROCR_VISIBLE_DEVICES) is thoughtful.

Blocking issues:

  1. Merge state is CONFLICTING. Needs rebase on main before any meaningful review. The file list shows overlap with multiple recent PRs (.env.schema.json, detection.sh, resolve-compose-stack.sh, dream-cli, routers/gpu.py) — conflict likely non-trivial.

  2. Breaking rename: docker-compose.multigpu.yml → docker-compose.multigpu-nvidia.yml. This affects any existing install whose automation references the old filename.

    Need either: (a) a migration step in scripts/ that rewrites .compose-flags on first run, or (b) keeping the old name as a symlink/alias for a release cycle. Right now this will break any existing multi-GPU NVIDIA install on update.

  3. No CI coverage for AMD multi-GPU. The BATS/pytest tests are excellent for the pure-Python/shell pieces, but there's no smoke test that the overlay resolution actually produces a working compose stack. The test-installer-contracts.sh additions (+87 lines) help but don't validate runtime. Consider a dry-run smoke job in .github/workflows/ that asserts the resolved compose is valid docker compose config output.

  4. LEMONADE_LLAMACPP env var — please add to .env.schema.json with a description and allowed values. I see it's used in the compose overlay but don't see a schema entry.

  5. First-time contributor, large feature. I don't know if that's a blocker for this repo, but please confirm there's a maintainer committed to ongoing AMD multi-GPU support. The maintenance surface (three detection backends, fixture refresh on new ROCm versions, MI300X XCP virtual card handling) is real.

Non-blocking notes:

  • Link ranking system is smart — XGMI > same-switch PCIe > cross-NUMA. Consider documenting the ranking numbers in dream-server/docs/ so future changes don't silently invert them.
  • Fixture files from real hardware (tests/fixtures/amd/*) are excellent. Please note which ROCm/amd-smi version they were captured from, so they can be refreshed when the tools evolve.
  • ROCR_VISIBLE_DEVICES per-service isolation is the right call over relying on the llama.cpp --split-mode alone.

Happy to re-review once the rebase + the rename migration + the LEMONADE_LLAMACPP schema are addressed. The core work is solid — don't want it to bitrot. Thanks for the contribution.
