Better way to notify host code of an error in a ROCm kernel ? #47674

Open
@fwyzard

Currently, the only ways we have to let a ROCm kernel notify the host of a problem are:

  • assert(<false condition>)
  • abort()

Unfortunately, they behave exactly like their host counterparts, causing the application to terminate immediately.

At the very least this makes it quite hard to test that error conditions are handled correctly:

  • testRocmSoALayoutAndView_t tests that an out-of-bounds access is properly detected; it is, but it results in

    :0:rocdevice.cpp            :3020: 6791392435440d us:  Callback: Queue 0x14e920e00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
    

    which results in a unit test failure.

  • alpakaTestBufferROCmAsync tests that a device-side assertion can be triggered and detected; this results in

    src/HeterogeneousCore/AlpakaInterface/test/alpaka/testBuffer.dev.cc:38: auto (anonymous namespace)::testDeviceSideError(const Device &)::(anonymous class)::operator()(const Acc1D &, int *, size_t) const: Device-side assertion `data[index] != 0' failed.
    :0:rocdevice.cpp            :3020: 6791407414776d us:  Callback: Queue 0x1523e1400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
    GPU core dump created: gpucore.5065

Going through the HIP / ROCm documentation, I didn't find any method to notify the host of a device-side error condition without aborting the whole application.

Can we design something better using only client code ?
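
To make the question concrete, here is a rough sketch of one possible client-code approach: the kernel writes an error code into a status record instead of calling assert(), and the host checks the record after synchronising. This is a host-only C++ model, not HIP code: std::thread stands in for GPU threads, and every name in it (ErrorRecord, KernelError, report_error, kernel_body, launch_model) is hypothetical. On a real device the record would presumably have to live in memory visible to both host and device, which is an assumption not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical error codes; 0 means "no error recorded".
enum class KernelError : std::uint32_t { none = 0, zero_element = 1 };

// Hypothetical status record. In this model it is ordinary host memory;
// on a GPU it would need to be readable by the host after the kernel ends.
struct ErrorRecord {
  std::atomic<std::uint32_t> code{0};   // first error code reported
  std::atomic<std::uint64_t> index{0};  // element index that triggered it
};

// Called where the kernel would otherwise assert(): the first reporter wins,
// later reporters leave the record untouched.
inline void report_error(ErrorRecord& rec, KernelError err, std::size_t index) {
  std::uint32_t expected = 0;
  if (rec.code.compare_exchange_strong(expected, static_cast<std::uint32_t>(err))) {
    rec.index.store(index);
  }
}

// Stand-in for the kernel body run by one "thread" of a strided loop.
inline void kernel_body(ErrorRecord& rec, const int* data, std::size_t size,
                        std::size_t tid, std::size_t stride) {
  for (std::size_t i = tid; i < size; i += stride) {
    if (data[i] == 0) {  // replaces assert(data[i] != 0)
      report_error(rec, KernelError::zero_element, i);
      return;  // stop this thread only; the process keeps running
    }
  }
}

// Stand-in for "launch, then synchronise": run the body on a few host
// threads and join them, after which the host may inspect the record.
inline void launch_model(ErrorRecord& rec, const std::vector<int>& data,
                         std::size_t threads) {
  assert(threads > 0);  // host-side precondition of the model itself
  std::vector<std::thread> pool;
  for (std::size_t t = 0; t < threads; ++t) {
    pool.emplace_back(kernel_body, std::ref(rec), data.data(), data.size(),
                      t, threads);
  }
  for (auto& th : pool) th.join();
}
```

After launch_model returns, a test could check rec.code and rec.index and fail or pass without the process being killed. Note that this only covers conditions the kernel checks explicitly; it would not turn a genuine out-of-bounds access that raises a hardware exception into a recoverable error.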
