
[Performance] Non-zero status code and MIOPEN failure when running inference on AMD GPUs. #20203

Open
@yongbinfeng

Description


Describe the issue

(This is about an inference crash; we are not sure whether it belongs in the performance or training category.)

When we tested model inference, everything ran fine on CPUs, but the process crashed on AMD GPUs with:

MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/hipoc/hipoc_kernel.cpp:104: Failed to launch kernel: invalid argument
2024-04-04 09:54:04.328859649 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_); 
2024-04-04 09:54:04.328884596 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_); 
Traceback (most recent call last):
  File "test_onnx.py", line 34, in <module>
    pred_onx = sess.run([output_name], {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask, input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask})[0]
  File "/work1/yfeng/yfeng/.cache/pytriton/python_backend_interpreter/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_); 
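For reference, BatchNormalization in inference mode is just a per-channel affine transform, matching the expression in the failing `BatchNormalizationForwardInferenceHelper` call. A minimal NumPy sketch (our own reference implementation, not ONNX Runtime code; the shapes mirror the `pf_features` input below) can be used to cross-check CPU outputs:

```python
import numpy as np

def batch_norm_inference(x, scale, b, mean, var, epsilon=1e-5):
    """Reference BatchNormalization (inference mode) for (N, C, L) tensors.

    Per channel c: y = scale[c] * (x - mean[c]) / sqrt(var[c] + eps) + b[c]
    """
    # reshape per-channel parameters so they broadcast over (N, C, L)
    shape = (1, -1, 1)
    return (scale.reshape(shape) * (x - mean.reshape(shape))
            / np.sqrt(var.reshape(shape) + epsilon) + b.reshape(shape))

# shapes mirroring the pf_features input in the repro script (hypothetical mapping)
x = np.zeros((1, 20, 100), dtype=np.float32)
scale = np.ones(20, dtype=np.float32)
b = np.zeros(20, dtype=np.float32)
mean = np.zeros(20, dtype=np.float32)
var = np.ones(20, dtype=np.float32)

y = batch_norm_inference(x, scale, b, mean, var)
print(y.shape)  # (1, 20, 100)
```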

To reproduce

The environment was installed with pip:

pip install https://download.onnxruntime.ai/onnxruntime_training-1.18.0.dev20240330001%2Brocm60-cp38-cp38-manylinux_2_28_x86_64.whl

We also tested 1.17.0 and hit exactly the same issue.

This is the code we used to reproduce the crash:

import onnxruntime as rt
import numpy as np

path = "model.onnx"

providers = ["ROCMExecutionProvider"]
sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = rt.InferenceSession(path, sess_options=sess_options, providers=providers)
print("provider: ", sess.get_providers())
input_name0 = sess.get_inputs()[0].name
input_name1 = sess.get_inputs()[1].name
input_name2 = sess.get_inputs()[2].name
input_name3 = sess.get_inputs()[3].name
input_name4 = sess.get_inputs()[4].name
input_name5 = sess.get_inputs()[5].name

output_name = sess.get_outputs()[0].name

nevts = 1
pf_points = np.zeros((nevts, 2, 100)).astype(np.float32)
pf_features = np.zeros((nevts, 20, 100)).astype(np.float32)
pf_mask = np.zeros((nevts, 1, 100)).astype(np.float32)
sv_points = np.zeros((nevts, 2, 10)).astype(np.float32)
sv_features = np.zeros((nevts, 11, 10)).astype(np.float32)
sv_mask = np.zeros((nevts, 1, 10)).astype(np.float32)

pred_onx = sess.run([output_name], {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask, input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask})[0]

Urgency

This is blocking our studies on AMD GPUs, and the error output is too limited for us to debug on our own without digging deeply into the ONNX Runtime code.

Any help is really appreciated!

Platform

Linux

OS Version

Rocky-Linux-9

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

MIGraphX

Execution Provider Library Version

rocm-6.0.2

Model File

The model file we used is uploaded here: https://cernbox.cern.ch/s/WlTh9V9gfaou2cU

Is this a quantized model?

No


Labels

ep:MIGraphX (issues related to AMD MIGraphX execution provider)
ep:ROCm (questions/issues related to ROCm execution provider)
stale (issues that have not been addressed in a while; categorized by a bot)
