Commit c705077

docs(rocm): Document HSA_OVERRIDE_GFX_VERSION workaround for integrated GPUs
- Update supported GPU section to clarify integrated GPUs CAN work with override
- Add comprehensive HSA_OVERRIDE_GFX_VERSION workaround section
- Include override values table for common integrated GPUs (Phoenix1, Renoir, Cezanne)
- Update example pipeline to show PyTorch with HSA override
- Clarify limitations: suboptimal performance, good for dev/test only
- Update troubleshooting to reference workaround instead of "not supported"
- Add Docker test command for pre-deployment validation

This corrects previous documentation that stated integrated GPUs were completely unsupported. Testing confirms gfx1103 (Phoenix1/780M) works with HSA_OVERRIDE_GFX_VERSION=11.0.0 for PyTorch ROCm compute.
1 parent 6481867 commit c705077

File tree

1 file changed: +92 −8 lines

README.md

Lines changed: 92 additions & 8 deletions
````diff
@@ -647,8 +647,11 @@ lxc config device add juju-abc123-0 kfd unix-char source=/dev/kfd path=/dev/kfd
 - Must be added as separate device after GPU passthrough
 
 **⚠️ Supported AMD GPUs:**
-- **Discrete GPUs only**: RX 6000/7000 series, Radeon Pro, Instinct MI series
-- **NOT supported**: Integrated AMD GPUs (APUs like Phoenix1, Renoir, Cezanne)
+- **Discrete GPUs (fully supported)**: RX 6000/7000 series, Radeon Pro, Instinct MI series - work natively
+- **Integrated GPUs (requires workaround)**: APUs like Phoenix1 (gfx1103), Renoir, Cezanne
+  - **CAN work** with the `HSA_OVERRIDE_GFX_VERSION` environment variable (see below)
+  - ⚠️ Lower performance due to shared system memory
+  - Recommended for development/testing, not production ML workloads
 - Check ROCm compatibility: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
 
 **Everything else is automated!** The charm will:
````
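A note on the override value's shape: `HSA_OVERRIDE_GFX_VERSION` encodes a GFX target as `<major>.<minor>.<stepping>`, so `11.0.0` selects the `gfx1100` kernel binaries that the gfx1103 APU borrows. An illustrative sketch of that mapping (the helper name is ours, not a ROCm API; minor and stepping render as hex digits in the gfx name):

```python
# Illustrative only: HSA_OVERRIDE_GFX_VERSION is "<major>.<minor>.<stepping>",
# which ROCm maps to a gfx target name (minor/stepping are hex digits).
def override_to_gfx(override: str) -> str:
    major, minor, stepping = (int(part) for part in override.split("."))
    return f"gfx{major}{minor:x}{stepping:x}"

print(override_to_gfx("11.0.0"))  # gfx1100
print(override_to_gfx("9.0.0"))   # gfx900
```

For example, gfx90c hardware (Renoir/Cezanne) corresponds to `9.0.12` in this scheme, which is why the workaround overrides it down to `9.0.0` (gfx900), a target with full kernel support.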
````diff
@@ -713,11 +716,17 @@ jobs:
       image_resource:
         type: registry-image
         source:
-          repository: rocm/tensorflow
+          repository: rocm/pytorch
           tag: latest
       run:
-        path: python
-        args: ["-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]
+        path: sh
+        args:
+          - -c
+          - |
+            # For integrated AMD GPUs (Phoenix1/gfx1103, etc.)
+            export HSA_OVERRIDE_GFX_VERSION=11.0.0
+
+            python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print('Result:', x * 2)"
 ```
 
 ### Verifying ROCm GPU Access
````
````diff
@@ -739,9 +748,78 @@ fly -t local execute -c test-gpu.yml --tag=rocm
 
 - `rocm/dev-ubuntu-24.04:latest` - ROCm development base (~1.1GB)
 - `rocm/tensorflow:latest` - TensorFlow with ROCm
-- `rocm/pytorch:latest` - PyTorch with ROCm
+- `rocm/pytorch:latest` - PyTorch with ROCm (~6GB, includes PyTorch 2.5.1+rocm6.2)
 - `rocm/rocm-terminal:latest` - ROCm with utilities
 
+### HSA_OVERRIDE_GFX_VERSION Workaround for Integrated GPUs
+
+Integrated AMD GPUs (APUs) like Phoenix1 (gfx1103), Renoir, and Cezanne are not officially supported by ROCm, but can work with the `HSA_OVERRIDE_GFX_VERSION` environment variable.
+
+**Why it's needed:**
+- ROCm checks the GPU architecture (GFX version) and rejects unsupported GPUs
+- Integrated GPUs often use newer GFX versions without full ROCm kernel support
+- The override tells ROCm to use kernels built for a supported architecture
+
+**How to use:**
+
+```yaml
+jobs:
+- name: pytorch-rocm-integrated-gpu
+  plan:
+  - task: test-gpu
+    tags: [rocm]
+    config:
+      platform: linux
+      image_resource:
+        type: registry-image
+        source:
+          repository: rocm/pytorch
+          tag: latest
+      run:
+        path: sh
+        args:
+          - -c
+          - |
+            # Set override for gfx1103 (Phoenix1) - use gfx1100 kernels
+            export HSA_OVERRIDE_GFX_VERSION=11.0.0
+
+            # Your PyTorch code
+            python3 -c "
+            import torch
+            print('CUDA (ROCm) available:', torch.cuda.is_available())
+            x = torch.rand(5, 3).cuda()
+            y = x * 2
+            print('GPU computation succeeded!')
+            print('Result:', y)
+            "
+```
+
+**Override values for common integrated GPUs:**
+
+| GPU Architecture | GFX Version | Override Value |
+|------------------|-------------|----------------|
+| Phoenix1 (780M) | gfx1103 | `11.0.0` |
+| Renoir (4000 series) | gfx90c | `9.0.0` |
+| Cezanne (5000 series) | gfx90c | `9.0.0` |
+
+**Limitations:**
+- ⚠️ Uses suboptimal kernels → lower performance than discrete GPUs
+- ⚠️ Shared system memory → memory bandwidth limitations
+- ⚠️ May not support all ROCm features
+- ✅ Good for development, testing, and light compute workloads
+- ❌ Not recommended for production ML training
+
+**Testing on host (before deploying pipeline):**
+
+```bash
+# Test whether your integrated GPU works with the override
+docker run --rm -it --device=/dev/kfd --device=/dev/dri \
+  rocm/pytorch:latest sh -c "
+    export HSA_OVERRIDE_GFX_VERSION=11.0.0
+    python3 -c 'import torch; x = torch.rand(5,3).cuda(); print(x * 2)'
+  "
+```
+
 ### ROCm Troubleshooting
 
 **Worker shows "GPU enabled but no GPU detected"**
````
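The override values documented above can also be expressed as a simple lookup, e.g. when templating pipelines for mixed hardware. A minimal sketch (the dict and helper names are ours; the values come from the override table in this change):

```python
# Sketch: map a detected APU architecture to its documented
# HSA_OVERRIDE_GFX_VERSION value. Names here are illustrative.
APU_OVERRIDES = {
    "gfx1103": "11.0.0",  # Phoenix1 (780M)
    "gfx90c": "9.0.0",    # Renoir / Cezanne
}

def suggest_override(gfx_arch):
    """Return the documented override for a known APU, else None."""
    return APU_OVERRIDES.get(gfx_arch)

print(suggest_override("gfx1103"))  # 11.0.0
print(suggest_override("gfx1030"))  # None (discrete RX 6000: no override needed)
```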
````diff
@@ -759,10 +837,10 @@ fly -t local execute -c test-gpu.yml --tag=rocm
 - **Most common**: Missing `/dev/kfd` device
   - Check in container: `ls -la /dev/kfd`
   - Add if missing: `lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd`
-- **Integrated GPU**: AMD APUs (Phoenix1, Renoir, Cezanne) are NOT supported by ROCm compute
+- **Integrated GPU without override**: Try the `HSA_OVERRIDE_GFX_VERSION` workaround (see above)
   - Verify GPU model: `lspci | grep -i vga`
   - Check PCI ID: `cat /sys/class/drm/card*/device/uevent | grep PCI_ID`
-  - Only discrete AMD GPUs work (RX 6000/7000, Radeon Pro, MI series)
+  - For gfx1103 (Phoenix1): `export HSA_OVERRIDE_GFX_VERSION=11.0.0`
 - **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Usually indicates unsupported GPU or missing drivers
 
 **rocm-smi works but PyTorch doesn't detect GPU**
````
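The `uevent` check in that hunk prints lines such as `PCI_ID=1002:15BF` (vendor 0x1002 is AMD; the device ID identifies the GPU model, and 0x15BF is used here only as an example Phoenix APU ID). A quick sketch of interpreting that output (helper name is ours):

```python
# Parse a PCI_ID line as printed by `cat /sys/class/drm/card*/device/uevent`.
# Vendor 0x1002 is AMD's PCI vendor ID; the device ID names the GPU model.
def parse_pci_id(line: str) -> tuple[str, str]:
    vendor, device = line.strip().removeprefix("PCI_ID=").split(":")
    return vendor, device

vendor, device = parse_pci_id("PCI_ID=1002:15BF\n")  # example APU line
print("AMD GPU" if vendor == "1002" else "non-AMD GPU", "- device", device)
```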
````diff
@@ -787,6 +865,12 @@ fly -t local execute -c test-gpu.yml --tag=rocm
 - Use specific GPU ID: `lxc config device add ... gpu id=1` (not generic `gpu`)
 - Query GPU IDs: `lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'`
 
+**Integrated GPU performance issues**
+- If compute works but is slow, this is expected (shared memory bandwidth)
+- Consider discrete GPU for production workloads
+- Use integrated GPU for testing/development only
+- Monitor memory usage: integrated GPUs share system RAM
+
 ## Troubleshooting
 
 ### Charm Shows "Blocked" Status
````
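On the "monitor memory usage" point: since integrated GPUs draw from system RAM, host memory is the effective VRAM ceiling. A minimal sketch, assuming Linux's `/proc/meminfo` format (the sample string below stands in for a real read of that file):

```python
# Integrated GPUs share system RAM, so host memory is the effective VRAM
# ceiling. Parse MemTotal from /proc/meminfo-style text (Linux format).
def mem_total_kib(meminfo_text: str) -> int:
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])  # second field is the value in kB
    raise ValueError("MemTotal not found")

# On a real host: mem_total_kib(open("/proc/meminfo").read())
sample = "MemTotal:       32657112 kB\nMemFree:  1234 kB\n"
print(f"{mem_total_kib(sample) / 1048576:.1f} GiB shared with the iGPU")
```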
