Describe the issue
I built ONNX Runtime v1.20.1 from source on a Jetson Orin Nano devkit (JetPack 6.1) and observe significantly degraded performance with the CUDA execution provider: inference is 7-8x slower than running the same model through standalone TensorRT. Profiling with Nsight shows that the Tensor Cores are not being used.
The model is not unusual (unfortunately, I can't share it, but it is a relatively simple UNet), and a TensorRT engine created from the same model does use the Tensor Cores. With this build of ONNX Runtime, no model seems to use them.
Of note, this issue does not occur with a Jetson AGX Orin devkit running JetPack 5.1.
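Tensor Core usage can be double-checked with Nsight Compute; a sketch of such a check is below (the metric name is an assumption that may vary by GPU architecture, and ./inference_app is a placeholder for the test binary):
# Count tensor-pipe instructions executed during the run; 0 means no Tensor Core usage.
# Metric name assumed valid for Ampere (Orin); adjust for other architectures.
sudo ncu --metrics sm__inst_executed_pipe_tensor.sum --target-processes all ./inference_app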
To reproduce
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
git checkout $RELEASE_TAG
git submodule update --init --recursive
./build.sh --config Release --update --build --build_shared_lib --parallel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /lib/aarch64-linux-gnu
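As a sanity check that the build picked up the JetPack 6.1 libraries, the CUDA EP shared library can be inspected (the path assumes build.sh's default output directory; adjust if --build_dir was passed):
ldd build/Linux/Release/libonnxruntime_providers_cuda.so | grep -Ei 'cuda|cudnn'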
The CUDA execution provider is configured with the following options:
cudaOptions.device_id = 0;
cudaOptions.arena_extend_strategy = 1;
cudaOptions.gpu_mem_limit = 4LL * 1024 * 1024 * 1024;
cudaOptions.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::OrtCudnnConvAlgoSearchDefault;
cudaOptions.do_copy_in_default_stream = 1;
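For completeness, a minimal sketch of how these options are attached to a session via the C++ API ("model.onnx" stands in for the actual model, which cannot be shared):
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cuda-ep-repro");

    // Same settings as listed above, via the OrtCUDAProviderOptions struct.
    OrtCUDAProviderOptions cudaOptions{};
    cudaOptions.device_id = 0;
    cudaOptions.arena_extend_strategy = 1;
    cudaOptions.gpu_mem_limit = 4LL * 1024 * 1024 * 1024;  // 4 GiB
    cudaOptions.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchDefault;
    cudaOptions.do_copy_in_default_stream = 1;

    Ort::SessionOptions sessionOptions;
    sessionOptions.AppendExecutionProvider_CUDA(cudaOptions);

    Ort::Session session(env, "model.onnx", sessionOptions);
    return 0;
}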
Urgency
No response
Platform
Other / Unknown
OS Version
JetPack 6.1
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
v1.21.0
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.6
Model File
No response
Is this a quantized model?
No