Describe the issue
I built ONNX Runtime v1.20.1 from source on a Jetson Orin Nano devkit (JetPack 6.1) and observe significantly degraded performance with the CUDA execution provider: inference is 7-8x slower than running the same model through standalone TensorRT. Profiling with Nsight shows that the Tensor Cores are not being used.
The model is not unusual (unfortunately, I can't share it, but it is a relatively simple UNet), and a TensorRT engine created from the same model does use the Tensor Cores. With this build of ONNX Runtime, no model seems to use them.
Of note, this issue does not occur with a Jetson AGX Orin devkit running JetPack 5.1.
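Tensor Core usage can be double-checked with Nsight Compute; a sketch of such a check is below (the metric name is an assumption that may vary by GPU architecture, and ./inference_app is a placeholder for the test binary):
# Count tensor-pipe instructions executed during the run; 0 means no Tensor Core usage.
# Metric name assumed valid for Ampere (Orin); adjust for other architectures.
sudo ncu --metrics sm__inst_executed_pipe_tensor.sum --target-processes all ./inference_app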
To reproduce
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
git checkout $RELEASE_TAG
git submodule update --init --recursive
./build.sh --config Release --update --build --build_shared_lib --parallel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /lib/aarch64-linux-gnu
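As a sanity check that the build picked up the JetPack 6.1 libraries, the CUDA EP shared library can be inspected (the path assumes build.sh's default output directory; adjust if --build_dir was passed):
ldd build/Linux/Release/libonnxruntime_providers_cuda.so | grep -Ei 'cuda|cudnn'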
The CUDA execution provider is configured with the following options:
cudaOptions.device_id = 0;
cudaOptions.arena_extend_strategy = 1;
cudaOptions.gpu_mem_limit = 4LL * 1024 * 1024 * 1024;
cudaOptions.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::OrtCudnnConvAlgoSearchDefault;
cudaOptions.do_copy_in_default_stream = 1;
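For completeness, a minimal sketch of how these options are attached to a session via the C++ API ("model.onnx" stands in for the actual model, which cannot be shared):
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cuda-ep-repro");

    // Same settings as listed above, via the OrtCUDAProviderOptions struct.
    OrtCUDAProviderOptions cudaOptions{};
    cudaOptions.device_id = 0;
    cudaOptions.arena_extend_strategy = 1;
    cudaOptions.gpu_mem_limit = 4LL * 1024 * 1024 * 1024;  // 4 GiB
    cudaOptions.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchDefault;
    cudaOptions.do_copy_in_default_stream = 1;

    Ort::SessionOptions sessionOptions;
    sessionOptions.AppendExecutionProvider_CUDA(cudaOptions);

    Ort::Session session(env, "model.onnx", sessionOptions);
    return 0;
}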
Urgency
No response
Platform
Other / Unknown
OS Version
JetPack 6.1
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
v1.21.0
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.6
Model File
No response
Is this a quantized model?
No