Describe the issue
(This concerns an inference crash; I am not sure whether it belongs in the performance or the training category.)
When we tested model inference, everything ran fine on CPU, but the process crashes when running on AMD GPUs with:
```
MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/hipoc/hipoc_kernel.cpp:104: Failed to launch kernel: invalid argument
2024-04-04 09:54:04.328859649 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2024-04-04 09:54:04.328884596 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
Traceback (most recent call last):
  File "test_onnx.py", line 34, in <module>
    pred_onx = sess.run([output_name], {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask, input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask})[0]
  File "/work1/yfeng/yfeng/.cache/pytriton/python_backend_interpreter/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_8' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=t004-005.hpcfund ; file=/build/Release/amdgpu/onnxruntime/core/providers/rocm/nn/batch_norm.cc ; line=166 ; expr=BatchNormalizationForwardInferenceHelper( GetMiopenHandle(p_op_kernel_context), miopen_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
```
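For reference (not part of the original report): the failing MIOpen call corresponds to standard inference-mode batch normalization as defined by the ONNX `BatchNormalization` operator. A plain NumPy sketch of that formula, assuming the ONNX default `epsilon=1e-5`, can serve as a baseline to cross-check GPU outputs once the kernel-launch issue is resolved:

```python
import numpy as np

def batch_norm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """Reference inference-mode BatchNormalization:
    y = scale * (x - mean) / sqrt(var + epsilon) + bias,
    with per-channel parameters broadcast over axis 1 of an (N, C, W) input."""
    shape = (1, -1, 1)  # broadcast C-length vectors over (N, C, W)
    return (scale.reshape(shape) * (x - mean.reshape(shape))
            / np.sqrt(var.reshape(shape) + epsilon) + bias.reshape(shape))

# Dummy tensors with the same layout as the pf_features input below
x = np.zeros((1, 20, 100), dtype=np.float32)
scale = np.ones(20, dtype=np.float32)
bias = np.zeros(20, dtype=np.float32)
mean = np.zeros(20, dtype=np.float32)
var = np.ones(20, dtype=np.float32)
y = batch_norm_inference(x, scale, bias, mean, var)
```

With identity parameters and all-zero input, `y` is all zeros and keeps the input shape, which makes it easy to diff against the CPU and GPU execution-provider outputs.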
To reproduce
The environment was installed with pip:
```
pip install https://download.onnxruntime.ai/onnxruntime_training-1.18.0.dev20240330001%2Brocm60-cp38-cp38-manylinux_2_28_x86_64.whl
```
We also tested 1.17.0 and hit exactly the same issue.
This is the code we used to reproduce the crash:
```python
import onnxruntime as rt
import numpy as np

path = "model.onnx"
providers = ["ROCMExecutionProvider"]

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = rt.InferenceSession(path, sess_options=sess_options, providers=providers)
print("provider: ", sess.get_providers())

input_name0 = sess.get_inputs()[0].name
input_name1 = sess.get_inputs()[1].name
input_name2 = sess.get_inputs()[2].name
input_name3 = sess.get_inputs()[3].name
input_name4 = sess.get_inputs()[4].name
input_name5 = sess.get_inputs()[5].name
output_name = sess.get_outputs()[0].name

# All-zero dummy inputs with the shapes the model expects
nevts = 1
pf_points = np.zeros((nevts, 2, 100), dtype=np.float32)
pf_features = np.zeros((nevts, 20, 100), dtype=np.float32)
pf_mask = np.zeros((nevts, 1, 100), dtype=np.float32)
sv_points = np.zeros((nevts, 2, 10), dtype=np.float32)
sv_features = np.zeros((nevts, 11, 10), dtype=np.float32)
sv_mask = np.zeros((nevts, 1, 10), dtype=np.float32)

pred_onx = sess.run(
    [output_name],
    {input_name0: pf_points, input_name1: pf_features, input_name2: pf_mask,
     input_name3: sv_points, input_name4: sv_features, input_name5: sv_mask},
)[0]
```
Urgency
This is blocking our studies on AMD GPUs, and the error output gives too little information for us to debug it ourselves without digging into the ONNX Runtime internals.
Any help is greatly appreciated!
Platform
Linux
OS Version
Rocky-Linux-9
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
MIGraphX
Execution Provider Library Version
rocm-6.0.2
Model File
The model file we used is uploaded here: https://cernbox.cern.ch/s/WlTh9V9gfaou2cU
Is this a quantized model?
No