Describe the issue
(This concerns an inference crash; I am not sure whether it belongs in the performance or the training category.)
When we tested model inference, everything ran fine on CPU, but the process crashes when running on AMD GPUs with:
```
MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/hipoc/hipoc_kernel.cpp:104: Failed to launch kernel: invalid argument
2024-04-04 09:54:04.328859649 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2024-04-04 09:54:04.328884596 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
Traceback (most recent call last):
  File "test_onnx.py", line 34, in <module>
    pred_onx = sess.run([output_name], {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask, input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask})[0]
  File "/work1/yfeng/yfeng/.cache/pytriton/python_backend_interpreter/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
```
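For reference (not part of the original report): the failing MIOpen call corresponds to standard inference-mode batch normalization as defined by the ONNX `BatchNormalization` operator. A plain NumPy sketch of that formula, assuming the ONNX default `epsilon=1e-5`, can serve as a baseline to cross-check GPU outputs once the kernel-launch issue is resolved:

```python
import numpy as np

def batch_norm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """Reference inference-mode BatchNormalization:
    y = scale * (x - mean) / sqrt(var + epsilon) + bias,
    with per-channel parameters broadcast over axis 1 of an (N, C, W) input."""
    shape = (1, -1, 1)  # broadcast C-length vectors over (N, C, W)
    return (scale.reshape(shape) * (x - mean.reshape(shape))
            / np.sqrt(var.reshape(shape) + epsilon) + bias.reshape(shape))

# Dummy tensors with the same layout as the pf_features input below
x = np.zeros((1, 20, 100), dtype=np.float32)
scale = np.ones(20, dtype=np.float32)
bias = np.zeros(20, dtype=np.float32)
mean = np.zeros(20, dtype=np.float32)
var = np.ones(20, dtype=np.float32)
y = batch_norm_inference(x, scale, bias, mean, var)
```

With identity parameters and all-zero input, `y` is all zeros and keeps the input shape, which makes it easy to diff against the CPU and GPU execution-provider outputs.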
To reproduce
The environment was installed with pip:
```
pip install https://download.onnxruntime.ai/onnxruntime_training-1.18.0.dev20240330001%2Brocm60-cp38-cp38-manylinux_2_28_x86_64.whl
```
We also tested 1.17.0 and hit exactly the same issue.
This is the code we used to reproduce the crash:
```python
import onnxruntime as rt
import numpy as np

path = "model.onnx"
providers = ["ROCMExecutionProvider"]

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = rt.InferenceSession(path, sess_options=sess_options, providers=providers)
print("provider: ", sess.get_providers())

input_name0 = sess.get_inputs()[0].name
input_name1 = sess.get_inputs()[1].name
input_name2 = sess.get_inputs()[2].name
input_name3 = sess.get_inputs()[3].name
input_name4 = sess.get_inputs()[4].name
input_name5 = sess.get_inputs()[5].name
output_name = sess.get_outputs()[0].name

# All-zero dummy inputs with the shapes the model expects
nevts = 1
pf_points = np.zeros((nevts, 2, 100), dtype=np.float32)
pf_features = np.zeros((nevts, 20, 100), dtype=np.float32)
pf_mask = np.zeros((nevts, 1, 100), dtype=np.float32)
sv_points = np.zeros((nevts, 2, 10), dtype=np.float32)
sv_features = np.zeros((nevts, 11, 10), dtype=np.float32)
sv_mask = np.zeros((nevts, 1, 10), dtype=np.float32)

pred_onx = sess.run(
    [output_name],
    {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask,
     input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask},
)[0]
```
Urgency
This is blocking our studies on AMD GPUs, and the error output gives too little information for us to debug it ourselves without digging into the ONNX Runtime internals.
Any help is greatly appreciated!
Platform
Linux
OS Version
Rocky-Linux-9
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
MIGraphX
Execution Provider Library Version
rocm-6.0.2
Model File
The model file we used is uploaded here: https://cernbox.cern.ch/s/WlTh9V9gfaou2cU
Is this a quantized model?
No