
feat: AMD Multi-GPU Support #750

Open

y-coffee-dev wants to merge 6 commits into Light-Heart-Labs:main from y-coffee-dev:feat/amd-multi-gpu

Conversation

@y-coffee-dev
Contributor

feat: AMD Multi-GPU Support

End-to-end multi-GPU support for AMD GPUs, matching the existing NVIDIA multi-GPU feature set.

Previously, AMD support was limited to a single GPU. This branch implements end-to-end support for multiple AMD GPUs: hardware discovery, topology analysis, GPU assignment, Docker Compose isolation, CLI management, and monitoring.

What was added

Hardware Detection

  • Multi-GPU AMD detection via sysfs (counts all vendor=0x1002 cards)
  • Handles XCP virtual cards on MI300X (filters by non-empty vendor file)
  • Total VRAM aggregation across all detected AMD GPUs
  • Mixed APU + discrete GPU classification
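As a rough sketch of the detection rule (a hypothetical Python port; the real logic is bash in installers/lib/detection.sh):

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"

def count_amd_gpus(drm_root="/sys/class/drm"):
    """Count physical AMD GPUs under sysfs, skipping MI300X XCP
    virtual partitions (which expose an empty vendor file)."""
    count = 0
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in card.name:  # skip connector entries like card0-DP-1
            continue
        try:
            vendor = (card / "device" / "vendor").read_text().strip()
        except OSError:
            continue
        # XCP virtual cards have an empty vendor file; real GPUs report 0x1002
        if vendor == AMD_VENDOR_ID:
            count += 1
    return count
```

The same vendor check doubles as the XCP filter: a partition with an empty vendor file simply never equals 0x1002.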

Topology Detection

  • Full AMD topology library with three detection backends: amd-smi JSON, rocm-smi text, sysfs NUMA/IOMMU fallback
  • Inter-GPU link classification: XGMI, PCIe-SameSwitch, PCIe-HostBridge, PCIe-CrossNUMA
  • Link ranking system (0–100) for topology-aware GPU assignment
  • Per-GPU metadata: render node, GFX version, PCI BDF, VRAM, memory type (unified/discrete)
  • GPU identification with three fallback methods: amd-smi UUID, sysfs unique_id, composite PCI BDF
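The link classes and ranking can be illustrated like this (the rank numbers below are placeholders for illustration; the actual 0-100 scores are defined in amd-topo.sh):

```python
# Illustrative rank values only; the real 0-100 scores live in amd-topo.sh.
LINK_RANK = {
    "XGMI": 100,             # direct GPU-to-GPU fabric, best bandwidth
    "PCIE_SAME_SWITCH": 60,  # peers under one PCIe switch
    "PCIE_HOST_BRIDGE": 40,  # traffic crosses the host bridge
    "PCIE_CROSS_NUMA": 10,   # worst case: crosses NUMA domains
}

def best_pair(links):
    """Pick the GPU pair with the highest-ranked interconnect.
    `links` maps (gpu_a, gpu_b) tuples to link-type strings."""
    return max(links, key=lambda pair: LINK_RANK[links[pair]])
```

Topology-aware assignment then prefers, say, an XGMI-linked pair over two GPUs that can only talk across NUMA domains.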

Installer Integration

  • AMD topology detection phase (runs when GPU_COUNT > 1 and backend is AMD)
  • Vendor-aware GPU assignment extraction
  • AMD multi-GPU env vars written to .env only when applicable
  • Render node verification in AMD tuning phase

Docker Compose Overlays

  • AMD multi-GPU overlay for llama-server with Lemonade passthrough
  • --split-mode passed via --llamacpp-args (Lemonade's official mechanism)
  • ROCm backend selected via LEMONADE_LLAMACPP env var (compatible with both Python and C++ Lemonade builds)
  • Per-service GPU isolation via ROCR_VISIBLE_DEVICES
  • Renamed existing NVIDIA overlays from generic multigpu to multigpu-nvidia

CLI (dream gpu commands)

  • All five GPU commands (status, topology, assign, reassign, monitor) are AMD-aware
  • AMD GPU status table with VRAM, utilization, temperature, power via amd-smi
  • AMD topology display with GFX versions, memory types, and render nodes
  • AMD GPU reassignment writes LLAMA_SERVER_GPU_INDICES and per-service *_GPU_INDEX

Dashboard API

  • AMD GPU monitoring via amd-smi and sysfs hwmon
  • Per-GPU metrics: utilization, VRAM, temperature, power, fan speed
  • GPU assignment decoding from GPU_ASSIGNMENT_JSON_B64
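Decoding the assignment on the API side amounts to (the payload shape is assumed from the key names in this PR; the function name is illustrative):

```python
import base64
import json
import os

def load_gpu_assignment(env=os.environ):
    """Decode the base64-encoded JSON assignment the installer exports
    as GPU_ASSIGNMENT_JSON_B64. Returns None when the var is unset."""
    raw = env.get("GPU_ASSIGNMENT_JSON_B64")
    if not raw:
        return None
    return json.loads(base64.b64decode(raw))
```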

Environment & Schema

  • LLAMA_SERVER_GPU_INDICES — comma-separated GPU indices for ROCR_VISIBLE_DEVICES
  • COMFYUI_GPU_INDEX, WHISPER_GPU_INDEX, EMBEDDINGS_GPU_INDEX — per-service GPU index
  • LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT — llama.cpp multi-GPU parameters
  • All new env vars added to .env.schema.json and .env.example
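As a hypothetical sketch of how these vars translate into per-container visibility (the real wiring happens in the compose overlays, not in a function like this):

```python
def service_env(llama_indices, per_service):
    """Map the schema vars onto per-container ROCR_VISIBLE_DEVICES values:
    llama-server gets the comma-separated LLAMA_SERVER_GPU_INDICES list,
    each other service gets its single *_GPU_INDEX."""
    env = {"llama-server": {"ROCR_VISIBLE_DEVICES": llama_indices}}
    for service, idx in per_service.items():
        env[service] = {"ROCR_VISIBLE_DEVICES": str(idx)}
    return env
```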

Tests

  • 25 BATS unit tests for amd-topo.sh (render node, GFX version, GPU name, GPU ID, topology parsing)
  • 23 shell integration tests with fixture files (4-GPU XGMI, 2-GPU PCIe, field-based select, cross-tool agreement)
  • 16 pytest tests for dashboard-api AMD GPU monitoring
  • Fixture files extracted from real hardware for amd-smi JSON and rocm-smi text output
  • Real hardware tests on 4x AMD Instinct MI300X

Files Changed

Area                 Files
Topology detection   installers/lib/amd-topo.sh (new)
Hardware detection   installers/lib/detection.sh, installers/phases/02-detection.sh
Installer            installers/phases/03-features.sh, installers/phases/06-directories.sh, installers/phases/10-amd-tuning.sh
Compose overlays     docker-compose.multigpu-amd.yml (new), docker-compose.multigpu-nvidia.yml (renamed)
Service overlays     extensions/services/{comfyui,whisper,embeddings}/compose.multigpu-amd.yaml (new), NVIDIA renamed
CLI                  dream-cli
Assignment           scripts/assign_gpus.py, scripts/resolve-compose-stack.sh
Dashboard API        extensions/services/dashboard-api/gpu.py
Config               config/gpu-database.json, .env.schema.json, .env.example
Tests                tests/bats-tests/amd-topo.bats, tests/test-amd-topo.sh, tests/fixtures/amd/*, extensions/services/dashboard-api/tests/test_gpu_amd.py

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit Review — AMD Multi-GPU Support

Strong PR. The architecture mirrors the NVIDIA multi-GPU system faithfully, the code is well-structured, and the test coverage is excellent (64+ tests across BATS, shell, and pytest with real MI300X hardware fixtures). No security concerns — sysfs reads are guarded, PCI BDF strings are regex-filtered, jq uses numeric variables, no shell injection vectors.

Two bugs to fix, two things to verify:

Bug 1: Bash evaluation order in _gpu_status power reading

if [[ "$pw_uw" -eq 0 || ! "$pw_uw" =~ ^[0-9]+$ ]]

The -eq runs before the regex check. If pw_uw is non-numeric (e.g., sysfs returns an error string), bash throws an integer comparison error. Swap the order:

if [[ ! "$pw_uw" =~ ^[0-9]+$ || "$pw_uw" -eq 0 ]]

Bug 2: Dead jq expression in _gpu_reassign auto-mode

The llama_indices assignment uses a jq filter that always returns empty string:

llama_indices=$(echo "$assignment_json" | jq -r '
    [.gpu_assignment.services.llama_server.gpus[]] as $uuids |
    "" ')

The actual index extraction happens in the while loop below. Remove the dead code.

Verify: compute_subset rank_matrix indexing

In the APU+dGPU hybrid path (assign_gpus.py ~line 488):

discrete_gpus = [g for g in llama_subset.gpus if g.memory_type == "discrete"]
discrete_subset = compute_subset(discrete_gpus, rank_matrix)

Confirm compute_subset uses gpu.index (original topology index) for rank_matrix lookups, not list position. If it uses list position, the filtered subset will produce wrong link rankings.
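To see why this matters, a toy reproduction (Gpu, get_rank, and the matrix here are simplified stand-ins, not the real assign_gpus.py types):

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    index: int          # original topology index
    memory_type: str

def get_rank(rank_matrix, a, b):
    return rank_matrix[a][b]

def pair_rank_by_topology_index(gpus, rank_matrix):
    """Correct behavior: look up ranks by each GPU's original topology
    index, so filtering the list does not shift the lookups."""
    a, b = gpus[0].index, gpus[1].index
    return get_rank(rank_matrix, a, b)

# 4-GPU rank matrix: GPUs 1 and 3 are XGMI-linked (rank 100)
rank = [[0,  10, 10, 10],
        [10, 0,  10, 100],
        [10, 10, 0,  10],
        [10, 100, 10, 0]]

gpus = [Gpu(0, "unified"), Gpu(1, "discrete"),
        Gpu(2, "unified"), Gpu(3, "discrete")]
discrete = [g for g in gpus if g.memory_type == "discrete"]
```

Using g.index, the discrete pair correctly resolves to rank 100; a list-position lookup would read rank[0][1] and report 10 instead.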

Verify: Old multigpu.yml filename references

The rename from docker-compose.multigpu.yml to docker-compose.multigpu-nvidia.yml is clean, but grep the full codebase for any hardcoded references to the old filename in dream-cli, docs, or tests.

Non-blocking notes (follow-up material)

  • gpu-database.json: RX 7900 XTX/XT/GRE share device_id 0x744c — matcher must prioritize name_patterns over device_ids
  • Missing schema entries for ROCR_VISIBLE_DEVICES, VIDEO_GID, RENDER_GID, HSA_OVERRIDE_GFX_VERSION
  • _gpu_reassign interactive prompts have no --yes flag — will hang in non-interactive pipelines
  • No tests for 0-GPU and 1-GPU edge cases, _detect_topo_sysfs fallback, or mixed APU+dGPU in the API detailed view

Good

  • 3-backend fallback chain (amd-smi -> rocm-smi -> sysfs) with graceful degradation
  • Hybrid APU+dGPU strategy correctly routes lightweight services to APU, freeing discrete VRAM for LLM
  • All 5 CLI GPU commands are now vendor-aware
  • Tests use real hardware fixtures (4x MI300X XGMI, 2-GPU PCIe) — no flaky mocks
  • Docker Compose overlays follow existing patterns (ROCR_VISIBLE_DEVICES parallels NVIDIA_VISIBLE_DEVICES)
  • resolve-compose-stack.sh correctly constructs vendor-specific overlay filenames
  • All CI failures are pre-existing on main, none caused by this PR

@y-coffee-dev
Contributor Author

Hey! Thank you for the thorough review and for the kind words, I really appreciate it!

Fixed Bug 1 (power reading eval order):
Great catch. You're right that -eq on a non-numeric value would cause issues before the regex guard gets a chance to run. In practice pw_uw is initialized to 0 and reset on cat failure, so sysfs would have to return a non-numeric string successfully for this to trigger, which is unlikely but not impossible. I swapped the order so the regex check runs first just in case.

Fixed Bug 2 (dead llama_indices):
Yep, that was a leftover from an earlier approach. The jq expression [...] as $uuids | "" literally evaluates to an empty string, and the variable is never referenced after assignment. I removed it.

Verified compute_subset rank_matrix indexing:
I went through this carefully: compute_subset() at line 130 does indices = [g.index for g in gpus] and then uses those indices for rank_matrix lookups via get_rank(rank_matrix, a, b). So when we filter to discrete-only GPUs, it's still looking up ranks by their original topology indices, not by list position, and we're good.

Verified old multigpu.yml filename references:
I grepped the full codebase. The old docker-compose.multigpu.yml / compose.multigpu.yaml do not appear in any code files, no references in dream-cli, resolve-compose-stack.sh, tests, or any production code. All good on this as well.

Non-blocking notes:
All fair observations. I went ahead and addressed the schema one: ROCR_VISIBLE_DEVICES was the only entry actually missing from .env.schema.json; VIDEO_GID, RENDER_GID, and HSA_OVERRIDE_GFX_VERSION were already there, so I just improved their descriptions and added defaults. I also updated .env.example with all four AMD-specific vars.

Thanks again for the review!

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Audit: NEEDS DISCUSSION — scope and existing review

This is a substantial PR (+2,866 lines) touching installer phases, CLI, dashboard API, compose overlays, and tests. The architecture looks sound:

  • Multi-backend topology detection (amd-smi, rocm-smi, sysfs fallback) with proper fallback chains
  • ROCR_VISIBLE_DEVICES for GPU isolation in compose overlays
  • GROUP_ADD for video/render GIDs is correct for AMD GPU access
  • Impressive test coverage (25 BATS + 23 shell + 16 pytest)
  • No shell=True in subprocess calls

Concerns:

  • The multigpu → multigpu-nvidia rename is a breaking change if any automation references the old filename — verify resolve-compose-stack.sh handles it
  • Several AMD GPUs share device_ids (e.g., RX 7900 XTX and XT both have 0x744c) — the consumer code must use name_patterns to disambiguate. Verify matching priority.
  • amd-topo.sh parses sysfs and amd-smi output — needs review for command injection in the parsing paths
  • XCP virtual card filtering (MI300X) relies on empty vendor file — hardware-specific, hard to test without real MI300X

Status: Already has changes requested. The outstanding review comments should be addressed before further deep audit. The core architecture is solid but the blast radius warrants careful incremental review.

@y-coffee-dev force-pushed the feat/amd-multi-gpu branch 2 times, most recently from 9a834bb to 83bbb0c on April 5, 2026
@y-coffee-dev
Contributor Author

Thanks for the thorough review! I made some adjustments, and here's where each concern lands after verification, with fixes in the latest commit.

multigpu to multigpu-nvidia rename

Verified. The old files existed on main and were renamed (identical content). resolve-compose-stack.sh was updated to construct filenames dynamically per backend (f"docker-compose.multigpu-{gpu_backend}.yml"). dream-cli has zero references to any multigpu filename. Updates go through git pull which handles renames transparently.

Shared device_id 0x744c (XTX / XT / GRE) - fixed

Valid concern, and it was introduced by this PR's new gpu-database.json entries. I traced through the matching logic in classify-hardware.sh and found two issues:

Substring collision: "RX 7900 XT" is a substring of "RX 7900 XTX". The old matcher broke on the first dual match, making JSON entry ordering load-bearing. Fixed by tracking the longest matched pattern. "RX 7900 XTX" (13 chars) now wins over "RX 7900 XT" (10 chars) regardless of entry order.

Empty name fallback: When sysfs product_name is unavailable, the old code fell back to the first device_id match (always XTX). Fixed by using VRAM proximity as a tiebreaker. The classifier already receives --vram-mb, so a 20 GB card now resolves to XT and a 16 GB card to GRE. When VRAM is also 0, it picks the smallest matching card, because under-provisioning is safer: a smaller model runs fine, while over-provisioning may crash the model loader.
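The two fixes can be sketched together in Python (the real matcher is classify-hardware.sh; the entry fields and function name here are assumptions for illustration):

```python
def classify_gpu(product_name, vram_mb, database):
    """Sketch of the matching rules described above.
    `database` entries: {"name": ..., "name_pattern": ..., "vram_mb": ...}."""
    # 1. Longest matching name_pattern wins, so "RX 7900 XTX"
    #    beats "RX 7900 XT" regardless of entry order.
    best = None
    if product_name:
        for entry in database:
            if entry["name_pattern"] in product_name:
                if best is None or len(entry["name_pattern"]) > len(best["name_pattern"]):
                    best = entry
    if best:
        return best["name"]
    # 2. No usable name: tie-break shared device_ids by VRAM proximity.
    if vram_mb:
        return min(database, key=lambda e: abs(e["vram_mb"] - vram_mb))["name"]
    # 3. Nothing to go on: pick the smallest card (under-provisioning is safer).
    return min(database, key=lambda e: e["vram_mb"])["name"]
```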

Added 35 test assertions to test-installer-contracts.sh covering the happy path as well as substring safety (XT must not match XTX and vice versa), tier/bandwidth correctness, VRAM tiebreaker with exact and approximate values, zero-VRAM fallback, the 0x7480 pair, name-only matches, and unknown GPU heuristic fallback.

Command injection in parsing paths

Walked every external data ingestion point in amd-topo.sh. No injection vector exists. All sysfs reads use kernel-controlled paths, jq calls use integer loop counters, awk calls use -v parameter passing, PCI BDFs are regex-filtered, zero uses of eval. Python side: zero shell=True in entire dashboard-api/.

Found one inconsistency: $vram_bytes was interpolated directly into an awk program string while the rest of the file used -v. Fixed it in this commit. Also added a missing grep -oP '^\d+' filter on pcie_width to match pcie_gen's existing filter. MI300X fixture data confirmed current_link_width can be "0" and current_link_speed can be "Unknown".

XCP virtual card filtering (MI300X)

Confirmed using real MI300X fixture data. The 4x MI300X system exposes 32 card entries: 4 real GPUs (card0, card8, card16, card24) with vendor=0x1002, and 28 XCP virtual partitions with empty vendor files. The standard vendor == "0x1002" check naturally filters them. All three code paths (detection.sh, amd-topo.sh, gpu.py) use the same mechanism. Python path is tested in test_gpu_amd.py, bash paths use identical logic. Already validated on real MI300X hardware.

Thanks again for the review!

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Appreciate the scope — end-to-end AMD multi-GPU is a real gap and the work is substantial: topology library, GPU assignment, compose overlays, CLI commands, dashboard monitoring, plus real-hardware testing on 4x MI300X. The architecture (three detection backends with fallback chain, link-ranking 0-100, per-service GPU isolation via ROCR_VISIBLE_DEVICES) is thoughtful.

Blocking issues:

  1. Merge state is CONFLICTING. Needs rebase on main before any meaningful review. The file list shows overlap with multiple recent PRs (.env.schema.json, detection.sh, resolve-compose-stack.sh, dream-cli, routers/gpu.py) — conflict likely non-trivial.

  2. Breaking rename: docker-compose.multigpu.yml → docker-compose.multigpu-nvidia.yml. This affects any existing install whose automation references the old filename.

    Need either: (a) a migration step in scripts/ that rewrites .compose-flags on first run, or (b) keeping the old name as a symlink/alias for a release cycle. Right now this will break any existing multi-GPU NVIDIA install on update.

  3. No CI coverage for AMD multi-GPU. The BATS/pytest tests are excellent for the pure-Python/shell pieces, but there's no smoke test that the overlay resolution actually produces a working compose stack. The test-installer-contracts.sh additions (+87 lines) help but don't validate runtime. Consider a dry-run smoke job in .github/workflows/ that asserts the resolved compose is valid docker compose config output.

  4. LEMONADE_LLAMACPP env var — please add to .env.schema.json with a description and allowed values. I see it's used in the compose overlay but don't see a schema entry.

  5. First-time contributor, large feature. I don't know if that's a blocker for this repo, but please confirm there's a maintainer committed to ongoing AMD multi-GPU support. The maintenance surface (three detection backends, fixture refresh on new ROCm versions, MI300X XCP virtual card handling) is real.

Non-blocking notes:

  • Link ranking system is smart — XGMI > same-switch PCIe > cross-NUMA. Consider documenting the ranking numbers in dream-server/docs/ so future changes don't silently invert them.
  • Fixture files from real hardware (tests/fixtures/amd/*) are excellent. Please note which ROCm/amd-smi version they were captured from, so they can be refreshed when the tools evolve.
  • ROCR_VISIBLE_DEVICES per-service isolation is the right call over relying on the llama.cpp --split-mode alone.

Happy to re-review once the rebase + the rename migration + the LEMONADE_LLAMACPP schema are addressed. The core work is solid — don't want it to bitrot. Thanks for the contribution.
