Description
Currently the ways we have to let a ROCm kernel notify the host of problems are:
- `assert(<false condition>)`
- `abort()`

Unfortunately they behave exactly like their host counterparts, causing the application to terminate immediately.
At the very least this makes it quite hard to test that error conditions are handled correctly:
- `testRocmSoALayoutAndView_t` tests that an out-of-bounds access is properly detected; it is, but it results in

      :0:rocdevice.cpp :3020: 6791392435440d us: Callback: Queue 0x14e920e00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

  which results in a unit test failure.
- `alpakaTestBufferROCmAsync` tests that a device-side assertion can be triggered and detected; this results in

      src/HeterogeneousCore/AlpakaInterface/test/alpaka/testBuffer.dev.cc:38: auto (anonymous namespace)::testDeviceSideError(const Device &)::(anonymous class)::operator()(const Acc1D &, int *, size_t) const: Device-side assertion `data[index] != 0' failed.
      :0:rocdevice.cpp :3020: 6791407414776d us: Callback: Queue 0x1523e1400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
      GPU core dump created: gpucore.5065

Going through the HIP / ROCm documentation, I didn't find any method to notify the host of a device-side error condition without aborting the whole application.
Can we design something better using only client code?
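
To make the question more concrete, here is one possible shape for a client-side mechanism: a shared error record that a kernel fills in instead of trapping, and that the host inspects after synchronisation. This is only a sketch, written as a host-only C++ simulation so that it compiles without ROCm. All names (`ErrorRecord`, `softCheck`, `kernelBody`, `launchAndCheck`) are hypothetical. In a real HIP build I would expect the record to live in host-accessible memory (for example allocated with `hipHostMalloc`) and the compare-exchange to become a device atomic such as `atomicCAS`, but that mapping is an assumption and has not been tested on a device.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical error codes; 0 is reserved for "no error".
enum ErrorCode : int { kNoError = 0, kOutOfBounds = 1, kZeroElement = 2 };

// Hypothetical error record shared between the simulated kernel and the host.
struct ErrorRecord {
  std::atomic<int> code{kNoError};
  std::size_t index = 0;  // thread index that reported the first error
};

// Stand-in for assert(): records the first failure and returns false, so the
// caller can return early instead of trapping.
inline bool softCheck(bool ok, ErrorRecord& record, ErrorCode code, std::size_t index) {
  if (ok) return true;
  int expected = kNoError;
  // Only the first failing thread wins the exchange and fills in the details.
  if (record.code.compare_exchange_strong(expected, code)) {
    record.index = index;
  }
  return false;
}

// Body of the simulated kernel, called once per thread index.
inline void kernelBody(std::size_t index, const int* data, std::size_t size,
                       int* out, ErrorRecord& record) {
  if (!softCheck(index < size, record, kOutOfBounds, index)) return;
  if (!softCheck(data[index] != 0, record, kZeroElement, index)) return;
  out[index] = 100 / data[index];
}

// Simulated launch: runs the body sequentially for `threads` indices and
// returns the first recorded error code. `out` must hold at least
// data.size() elements.
inline int launchAndCheck(const std::vector<int>& data, std::vector<int>& out,
                          std::size_t threads, ErrorRecord& record) {
  for (std::size_t i = 0; i < threads; ++i) {
    kernelBody(i, data.data(), data.size(), out.data(), record);
  }
  // In a HIP build, the host would synchronise here before reading the record.
  return record.code.load();
}
```

With something like this, a unit test could launch the kernel with an invalid input and check the returned code, rather than expecting the process to abort. The obvious cost is that every kernel has to carry the extra record argument and return early by hand.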