AMD: device architecture checking at runtime

Under CUDA, we have `flamegpu::detail::compute_capability` to check at runtime whether the requested device(s) are compatible with the binary. It compares the runtime device compute capability against `__CUDA_ARCH_LIST__` (which is not perfect, but good enough for typical use), so that we can provide more helpful errors if the device is not compatible with the produced binary.
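
For reference, the kind of comparison described above can be sketched roughly as follows. This is an illustration only, not the actual `flamegpu::detail::compute_capability` code: `deviceIsCompatible` and its parameters are hypothetical names, `compiled_archs` stands in for the values of `__CUDA_ARCH_LIST__` (e.g. `700,800`), and the check assumes that any device at or above the lowest compiled architecture can run the binary, which is the "not perfect" part.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper, not the real FLAME GPU implementation.
// compiled_archs: values in the style of __CUDA_ARCH_LIST__, e.g. {700, 800}.
// cc_major / cc_minor: compute capability reported by the device at runtime.
inline bool deviceIsCompatible(const std::vector<int>& compiled_archs,
                               int cc_major, int cc_minor) {
    if (compiled_archs.empty()) {
        return false;  // nothing was compiled, so nothing can match
    }
    // Encode the device in the same form as __CUDA_ARCH__, e.g. 8.6 -> 860.
    const int device_arch = cc_major * 100 + cc_minor * 10;
    const int min_arch =
        *std::min_element(compiled_archs.begin(), compiled_archs.end());
    // Assumption: devices at or above the minimum compiled arch are usable.
    return device_arch >= min_arch;
}
```

With `{700, 800}` compiled, this would accept an 8.6 device and reject a 6.1 device.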

With HIP/ROCm, I can't find a good way to replicate this:

- There is no equivalent to `__CUDA_ARCH_LIST__`, which lists all of the architectures compiled into the binary.
- `__HIP_ARCH_*__` are defined to `0` or `1` in device passes, but we would need all of them in the host pass, so they are not appropriate.
- HIP architectures have `major`, `minor` and `subminor` versions, but `subminor` is not exposed numerically in the device properties at runtime.
    - The string version of an architecture is available at runtime, and does include the subminor.
- `--offload-arch` / `CMAKE_HIP_ARCHITECTURES` may include family options such as `gfx9-generic`, which is roughly the same as just targeting a major compute capability, and can then run on all minor architectures. However, `gfx9-generic` does not include all `gfx9XX` devices, as it does not support `gfx942` or `gfx950`.
    - Additionally, `gfx9-4-generic` supports `gfx942` and `gfx950`. There are non-trivial rules about what these map to, which feels very brittle to try and resolve ourselves / keep track of from the LLVM docs.
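
To illustrate the string-based route mentioned above, here is a minimal sketch of pulling numeric versions out of an architecture string such as the one in the `gcnArchName` field of `hipDeviceProp_t` (e.g. `gfx90a:sramecc+:xnack-`). `GfxVersion` and `parseGfxArch` are hypothetical names, and the parsing rule (last two characters are hex minor and subminor, everything before them is the decimal major) is an assumption based on the usual `gfx` naming pattern. It deliberately does not attempt the generic-target mapping, which is the brittle part.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type holding the numeric pieces of a "gfxNNN" name.
struct GfxVersion {
    int majorVersion;
    int minorVersion;
    int stepping;  // the "subminor" version
};

// Parse names like "gfx942", "gfx1030" or "gfx90a:sramecc+:xnack-".
// Returns false for anything that does not fit the assumed pattern,
// including generic targets such as "gfx9-generic".
inline bool parseGfxArch(const std::string& name, GfxVersion& out) {
    // Drop any ":feature" suffixes, keeping only the processor name.
    const std::string proc = name.substr(0, name.find(':'));
    const std::string prefix = "gfx";
    if (proc.size() < prefix.size() + 3 ||
        proc.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    const std::string digits = proc.substr(prefix.size());
    auto hexValue = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return 10 + (c - 'a');
        return -1;
    };
    // Assumption: the final two characters are minor and stepping, in hex.
    const int stepping = hexValue(digits[digits.size() - 1]);
    const int minorVersion = hexValue(digits[digits.size() - 2]);
    if (stepping < 0 || minorVersion < 0) {
        return false;
    }
    // Everything before those two characters is the decimal major version.
    int majorVersion = 0;
    for (std::size_t i = 0; i + 2 < digits.size(); ++i) {
        if (digits[i] < '0' || digits[i] > '9') return false;
        majorVersion = majorVersion * 10 + (digits[i] - '0');
    }
    out = GfxVersion{majorVersion, minorVersion, stepping};
    return true;
}
```

Even with the versions parsed, we would still need the list of compiled targets on the host and the generic-target rules to decide compatibility, so this only covers the runtime half of the problem.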


Due to this, I'm just not going to bother validating / checking the GPU architecture support at runtime, i.e. not implement `flamegpu::detail::compute_capability` for AMD in #1379. 

Ideally we would do this for better error messages (and for heterogeneous multi-GPU systems with partial compilation support), if we can find a reliable way to do so.

We could potentially have a no-op kernel which we launch immediately on device initialisation, catch the specific runtime error, and assume that it must be due to an invalid value, but that's pretty grim.

AMD: device architecture checking at runtime #1389
