Description
Describe the issue
The DML EP currently does not handle device-removal errors well; it only throws a generic exception via ORT_THROW_IF_FAILED:
onnxruntime/onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlCommandRecorder.cpp
Line 371 in d55ade0
ORT_THROW_IF_FAILED(m_dmlDevice->GetDeviceRemovedReason());
For WebNN, however, if the underlying device is removed inside the DML EP, any further WebNN execution crashes.
The key points are:
A device-removal error can plausibly occur at many places in the DML EP.
You can't recover from device removal except by releasing the affected device and all its children, then re-creating the DirectML device from scratch. For more details, see https://learn.microsoft.com/en-us/windows/ai/directml/dml-errors
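The recovery rule above (release everything, then re-create from scratch) can be sketched in portable C++. This is only an illustration of the pattern, not ORT or DirectML code: `DeviceRemovedError`, `DeviceContext`, and `RecoveringRunner` are hypothetical names standing in for the real exception, the DirectML device with all objects created from it, and whatever layer would own them.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for an exception that carries a device-removal HRESULT.
struct DeviceRemovedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the DirectML device plus every child object
// (command recorder, allocators, compiled operators) created from it.
struct DeviceContext {
    int generation;
};

class RecoveringRunner {
public:
    using Factory = std::function<std::unique_ptr<DeviceContext>()>;

    explicit RecoveringRunner(Factory factory)
        : factory_(std::move(factory)), context_(factory_()) {}

    // Runs `work` against the current context. If it reports device removal,
    // the old context is released before a new one is created, and the work
    // is retried exactly once; a second failure propagates to the caller.
    template <typename Work>
    auto Run(Work&& work) -> decltype(work(std::declval<DeviceContext&>())) {
        try {
            return work(*context_);
        } catch (const DeviceRemovedError&) {
            context_.reset();       // release the device and all its children
            context_ = factory_();  // re-create everything from scratch
            return work(*context_);
        }
    }

private:
    Factory factory_;
    std::unique_ptr<DeviceContext> context_;
};
```

In this sketch the caller never holds a reference to the old context across a failure, which is the property the DirectML guidance requires: nothing created from the removed device may be reused.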
/cc @fdwr @RafaelCintron @huningxin
To reproduce
You can add some code to explicitly call RemoveDevice to emulate a device-removal scenario.
- For example, insert the code below just above DmlCommandRecorder::ResourceBarrier.
Microsoft::WRL::ComPtr<ID3D12Device5> m_d3dDevice_5;
ORT_THROW_IF_FAILED(m_d3dDevice->QueryInterface(IID_PPV_ARGS(&m_d3dDevice_5)));
m_d3dDevice_5->RemoveDevice();
- Rebuild ORT with the DML EP and copy the built DLLs to "C:\Program Files\<your folder>"
- Launch Chrome Canary and manually select the DML EP with the --webnn-ort-ep-device=<ep_name>,<hardware_vendor_id>,<hardware_device_id> flag, for example:
"%LOCALAPPDATA%\Google\Chrome SxS\Application\chrome.exe" --enable-features=WebNNOnnxRuntime,WebMachineLearningNeuralNetwork --webnn-ort-library-path-for-testing="C:\Program Files\<your folder>" --allow-third-party-modules --webnn-ort-ep-device=DmlExecutionProvider,0x8086,0x4680
- Navigate to https://wpt.live/webnn/conformance_tests/abs.https.any.html?gpu to run some WebNN tests on the ORT DML EP. The crash occurs, and the following error log appears on the about://gpu page:
Name:'DmlFusedNode_0_0' Status Message: C:\Users\webnn\workspace\mingming\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2312)\onnxruntime.dll!00007FF8265EF904: (caller: 00007FF826623DFC) Exception(2) tid(5ab4) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
[35368:23220:1119/133202.046:ERROR:services\webnn\ort\graph_impl_ort.cc:108] : [WebNN] Failed to call ort_api->Run(session_.get(), nullptr, input_names.data(), input_tensors.data(), input_names.size(), output_names.data(), output_names.size(), output_tensors.data()): [WebNN] ORT status error code: 1 error message: C:\Users\webnn\workspace\mingming\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(374)\onnxruntime.dll!00007FF826613C20: (caller: 00007FF826580C56) Exception(3) tid(5ab4) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
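Both log entries carry the code 887A0005, which is DXGI_ERROR_DEVICE_REMOVED. One way a caller could tell this apart from ordinary failures is to classify the HRESULT before deciding how to react. The sketch below is an assumption about how that might look, not existing ORT code: the constant values are the documented DXGI error codes, while `ParseHresult`, `IsDeviceLost`, and the choice of which codes count as "device lost" are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Documented DXGI error codes, redefined here so the sketch builds without
// the Windows SDK headers.
constexpr std::uint32_t kDxgiErrorDeviceRemoved = 0x887A0005u;
constexpr std::uint32_t kDxgiErrorDeviceHung = 0x887A0006u;
constexpr std::uint32_t kDxgiErrorDeviceReset = 0x887A0007u;
constexpr std::uint32_t kDxgiErrorDriverInternalError = 0x887A0020u;

// Hypothetical helper: converts the hex code printed in the log
// (for example "887A0005") into a numeric HRESULT value.
inline std::uint32_t ParseHresult(const std::string& hex) {
    return static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16));
}

// Hypothetical helper: true for codes that mean the device is gone and must
// be re-created. Which codes belong in this set is an assumption.
constexpr bool IsDeviceLost(std::uint32_t hr) {
    return hr == kDxgiErrorDeviceRemoved || hr == kDxgiErrorDeviceHung ||
           hr == kDxgiErrorDeviceReset || hr == kDxgiErrorDriverInternalError;
}
```

With such a check, the code from the log above would be recognized as a device loss, while an unrelated failure such as E_FAIL (0x80004005) would not.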
Urgency
No response
Platform
Windows
OS Version
at least 24H2
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
main branch: 1851b73
ONNX Runtime API
C
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response