[ORT GPU (DML EP)][WebNN] Handle device-removal error in DML EP #26606

@mingmingtasd

Describe the issue

The DML EP currently does not handle device-removal errors well; it only throws a generic exception via ORT_THROW_IF_FAILED:

ORT_THROW_IF_FAILED(m_dmlDevice->GetDeviceRemovedReason());

For WebNN, however, if the underlying device is removed in the DML EP, continuing to run WebNN causes a crash.

The key point is that a device-removal error can occur at many places in the DML EP.

You can't recover from device removal except by releasing the affected device and all its children, then re-creating the DirectML device from scratch. For more details, see https://learn.microsoft.com/en-us/windows/ai/directml/dml-errors
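The recovery rule above (release the children, release the device, re-create from scratch) can be sketched roughly as follows. This is a portable mock, not DML EP code: `FakeDevice`, `DeviceContext`, and `RecoverIfRemoved` are hypothetical names introduced only for illustration, and a real implementation would hold the actual D3D12 and DirectML device objects and query `GetDeviceRemovedReason` on them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Local stand-ins so the sketch builds without Windows headers (assumption).
using HResult = std::uint32_t;
constexpr HResult kOk = 0u;
constexpr HResult kDeviceRemoved = 0x887A0005u;  // value of DXGI_ERROR_DEVICE_REMOVED

// Hypothetical stand-in for the underlying device.
struct FakeDevice {
    bool removed = false;
    HResult GetDeviceRemovedReason() const { return removed ? kDeviceRemoved : kOk; }
};

// Hypothetical owner of the device and of everything created from it.
class DeviceContext {
public:
    using Factory = std::function<std::unique_ptr<FakeDevice>()>;

    explicit DeviceContext(Factory factory)
        : factory_(std::move(factory)), device_(factory_()) {}

    FakeDevice& device() { return *device_; }
    void AddChild(int resource_id) { children_.push_back(resource_id); }
    std::size_t child_count() const { return children_.size(); }
    int generation() const { return generation_; }

    // Returns true if the device had been removed and was re-created.
    bool RecoverIfRemoved() {
        if (device_->GetDeviceRemovedReason() == kOk) return false;
        children_.clear();     // release every child of the lost device first
        device_.reset();       // then release the device itself
        device_ = factory_();  // re-create the device from scratch
        ++generation_;         // callers can detect that old resources are stale
        return true;
    }

private:
    Factory factory_;  // declared before device_ so it is initialized first
    std::unique_ptr<FakeDevice> device_;
    std::vector<int> children_;
    int generation_ = 0;
};
```

A caller would construct `DeviceContext` with a factory such as `[] { return std::make_unique<FakeDevice>(); }` and call `RecoverIfRemoved()` after a failed operation. The generation counter is one possible way to let higher layers (for example a WebNN context) notice that previously created resources are no longer valid.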

/cc @fdwr @RafaelCintron @huningxin

To reproduce

To emulate a device-removal scenario, add code that explicitly calls RemoveDevice.

  1. For example, insert the code below just above DmlCommandRecorder::ResourceBarrier:
     // Debug only: force a device removal to emulate the failure.
     Microsoft::WRL::ComPtr<ID3D12Device5> m_d3dDevice_5;
     ORT_THROW_IF_FAILED(m_d3dDevice->QueryInterface(IID_PPV_ARGS(&m_d3dDevice_5)));
     m_d3dDevice_5->RemoveDevice();
  2. Rebuild ORT with the DML EP and copy the built DLLs to "C:\Program Files\<your folder>".
  3. Launch Chrome Canary and manually select the DML EP with the --webnn-ort-ep-device=<ep_name>,<hardware_vendor_id>,<hardware_device_id> flag, for example:
     "%LOCALAPPDATA%\Google\Chrome SxS\Application\chrome.exe" --enable-features=WebNNOnnxRuntime,WebMachineLearningNeuralNetwork --webnn-ort-library-path-for-testing="C:\Program Files\<your folder>" --allow-third-party-modules --webnn-ort-ep-device=DmlExecutionProvider,0x8086,0x4680
  4. Navigate to https://wpt.live/webnn/conformance_tests/abs.https.any.html?gpu to run some WebNN tests on the ORT DML EP. The crash occurs, and the following error log appears on the about://gpu page:
Name:'DmlFusedNode_0_0' Status Message: C:\Users\webnn\workspace\mingming\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2312)\onnxruntime.dll!00007FF8265EF904: (caller: 00007FF826623DFC) Exception(2) tid(5ab4) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.

[35368:23220:1119/133202.046:ERROR:services\webnn\ort\graph_impl_ort.cc:108] : [WebNN] Failed to call ort_api->Run(session_.get(), nullptr, input_names.data(), input_tensors.data(), input_names.size(), output_names.data(), output_names.size(), output_tensors.data()): [WebNN] ORT status error code: 1 error message: C:\Users\webnn\workspace\mingming\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(374)\onnxruntime.dll!00007FF826613C20: (caller: 00007FF826580C56) Exception(3) tid(5ab4) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
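Both log entries carry the code 887A0005, which is the value of DXGI_ERROR_DEVICE_REMOVED. One way a caller could separate this case from ordinary failures, instead of treating every failed HRESULT alike, is sketched below. The constants are redefined locally so the snippet builds without Windows headers; `FailureKind` and `Classify` are hypothetical names, and grouping the hung, reset, and driver-internal-error codes with device removal is an assumption of this sketch, not something the DML EP does today.

```cpp
#include <cassert>
#include <cstdint>

// Values of the corresponding DXGI error codes, redefined locally (assumption:
// real code would include the Windows SDK headers instead).
constexpr std::uint32_t kDxgiErrorDeviceRemoved = 0x887A0005u;
constexpr std::uint32_t kDxgiErrorDeviceHung = 0x887A0006u;
constexpr std::uint32_t kDxgiErrorDeviceReset = 0x887A0007u;
constexpr std::uint32_t kDxgiErrorDriverInternalError = 0x887A0020u;

// Hypothetical classification used only in this sketch.
enum class FailureKind { kNone, kDeviceLost, kOther };

inline FailureKind Classify(std::uint32_t hr) {
    // An HRESULT with the high bit clear is a success code.
    if ((hr & 0x80000000u) == 0u) return FailureKind::kNone;
    switch (hr) {
        case kDxgiErrorDeviceRemoved:
        case kDxgiErrorDeviceHung:
        case kDxgiErrorDeviceReset:
        case kDxgiErrorDriverInternalError:
            // The device is gone; it must be released and re-created.
            return FailureKind::kDeviceLost;
        default:
            // Any other failure keeps the existing generic error path.
            return FailureKind::kOther;
    }
}
```

With such a classification, `Classify(0x887A0005u)` yields `FailureKind::kDeviceLost`, which a layer above the EP could map to a dedicated "device lost" status rather than a crash.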

Urgency

No response

Platform

Windows

OS Version

at least 24H2

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main branch: 1851b73

ONNX Runtime API

C

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response


Labels

ep:DML (issues related to the DirectML execution provider), ep:WebNN (WebNN execution provider)
