AMD: device architecture checking at runtime #1389

@ptheywood

Under CUDA, we have flamegpu::detail::compute_capability to check at runtime whether the requested device(s) are compatible with the binary. It compares the runtime device compute capability against __CUDA_ARCH_LIST__ (which is not perfect, but good enough for typical use), so that we can provide more helpful errors if the device is not compatible with the produced binary.
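For context, the CUDA-side comparison can be sketched roughly as below. This is a minimal illustration and not the actual flamegpu::detail::compute_capability code: the function name `device_meets_minimum_arch` is hypothetical, and the "device must be at least the lowest compiled architecture" rule is an assumed simplification of what "good enough for typical use" means. In a real CUDA translation unit the list would come from something like `std::vector<int>{__CUDA_ARCH_LIST__}` and the major/minor values from the device properties; here they are plain arguments so the sketch compiles without the CUDA toolkit.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not FLAME GPU's implementation).
// compiled_archs: values in __CUDA_ARCH__ format, e.g. 700 for compute capability 7.0.
// major/minor:    the device's compute capability as reported at runtime.
// Assumed rule: the device is usable if it is at least the lowest compiled
// architecture. This ignores cases where only SASS (no PTX) was embedded.
inline bool device_meets_minimum_arch(const std::vector<int>& compiled_archs,
                                      int major, int minor) {
    if (compiled_archs.empty()) {
        return false;  // nothing was compiled, so nothing can be compatible
    }
    const int device_arch = major * 100 + minor * 10;
    const int min_arch =
        *std::min_element(compiled_archs.begin(), compiled_archs.end());
    return device_arch >= min_arch;
}

// Usage example: a binary built for 7.0 and 8.0.
inline void example_cuda_check() {
    const std::vector<int> compiled = {700, 800};
    assert(device_meets_minimum_arch(compiled, 8, 6));   // 8.6 device: accepted
    assert(!device_meets_minimum_arch(compiled, 6, 1));  // 6.1 device: rejected
}
```

The check depends on having every compiled architecture visible in the host pass, which is the part that has no obvious HIP counterpart.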

With HIP/ROCm, I can't find a good way to replicate this:

  • no equivalent to __CUDA_ARCH_LIST__, where all compiled architectures are in the binary
  • __HIP_ARCH_*__ are defined to 0 or 1 in device passes, but we need all of them in the host pass so this is not appropriate.
  • HIP architectures have major, minor and subminor versions, but subminor is not exposed numerically in the device properties at runtime
    • The string version of the architecture is available at runtime, and does include the subminor
  • --offload-arch / CMAKE_HIP_ARCHITECTURES may include generic family targets such as gfx9-generic, which is roughly equivalent to targeting a major compute capability that then runs on all of its minor architectures. However, gfx9-generic does not cover all gfx9XX devices: it does not support gfx942 or gfx950.
    • Additionally, gfx9-4-generic supports gfx942 and gfx950. The rules for what each generic target maps to are non-trivial, so resolving them ourselves (and keeping track of changes in the LLVM docs) feels very brittle.
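
To illustrate how far the runtime architecture string gets us, here is a minimal sketch of an exact-match check. It assumes the device name arrives in the form reported by `hipDeviceProp_t::gcnArchName` (for example `gfx90a:sramecc+:xnack-`), and that the build system somehow passes the compiled targets through to the host code as a list of strings; that plumbing, and the names `base_arch` and `is_exact_arch_compiled`, are hypothetical. It deliberately does not try to resolve generic targets, which is the brittle part described above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Drops target feature suffixes: "gfx90a:sramecc+:xnack-" -> "gfx90a".
// A name without ':' is returned unchanged.
inline std::string base_arch(const std::string& gcn_arch_name) {
    return gcn_arch_name.substr(0, gcn_arch_name.find(':'));
}

// Hypothetical helper: true only if the device's base architecture exactly
// matches one of the compiled targets. Generic targets are NOT expanded, so a
// binary built only for a generic target is reported as unsupported here.
inline bool is_exact_arch_compiled(const std::vector<std::string>& compiled_targets,
                                   const std::string& gcn_arch_name) {
    const std::string device = base_arch(gcn_arch_name);
    return std::any_of(compiled_targets.begin(), compiled_targets.end(),
                       [&device](const std::string& target) {
                           return base_arch(target) == device;
                       });
}

// Usage example with an assumed list of compiled targets.
inline void example_hip_check() {
    const std::vector<std::string> compiled = {"gfx90a", "gfx942"};
    assert(is_exact_arch_compiled(compiled, "gfx942:sramecc+:xnack-"));
    assert(!is_exact_arch_compiled(compiled, "gfx1030"));
}
```

An exact match like this would produce false negatives whenever a generic target is used, so it could at most back a warning rather than a hard error.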

Due to this, I'm just not going to bother validating / checking the GPU architecture support at runtime, i.e. not implement flamegpu::detail::compute_capability for AMD in #1379.

Ideally we would do this for better error messages (and for heterogeneous multi-GPU systems with partial compilation support), if we can find a reliable way to do it.

We could potentially launch a no-op kernel immediately on device initialisation, catch the specific runtime error, and assume that it must be due to an invalid value, but that's pretty grim.
