
Error While Creating ONNX Session with CUDA Execution Provider #22980

Open
@akashGEHC

Description

Describe the issue

We are encountering an issue while creating an ONNX session using the CUDA Execution Provider in a Kubernetes (k8s) environment.

Context

Our C++ application performs GPU-based inferencing with ONNX Runtime using the CUDA and TensorRT execution providers, linking against the shared ONNX Runtime (ORT) libraries.

Code Snippet

Here’s the relevant code:

// Initialize the ONNX Runtime environment
auto env = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_VERBOSE, "InferenceUtil");

// Set up Ort session options
Ort::SessionOptions session_options;

// Register the CUDA execution provider with default options (device 0)
OrtCUDAProviderOptions cuda_options = {};
session_options.AppendExecutionProvider_CUDA(cuda_options);

session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);

// Load the ONNX model from the specified path
const std::string model_path_str = details::model_path(); // Fetch the model path

session = std::make_unique<Ort::Session>(*env, model_path_str.c_str(), session_options);
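
For reference, here is a minimal self-contained sketch (not our production code) of how the same session creation can surface the provider failure earlier: Ort::GetAvailableProviders() lists the providers compiled into the linked library, and wrapping construction in a try/catch over Ort::Exception captures the CUDA error before inference starts. The main() wrapper and the "model.onnx" path are placeholders for illustration.

#include <iostream>
#include <string>
#include <vector>
#include <onnxruntime_cxx_api.h>

int main() {
    // List the execution providers compiled into the linked ORT library.
    // If "CUDAExecutionProvider" is missing here, the CPU-only package is linked.
    for (const std::string& provider : Ort::GetAvailableProviders()) {
        std::cout << "Available provider: " << provider << "\n";
    }

    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "InferenceUtil");
    Ort::SessionOptions session_options;
    OrtCUDAProviderOptions cuda_options{};

    try {
        session_options.AppendExecutionProvider_CUDA(cuda_options);
        // "model.onnx" is a placeholder path for this sketch.
        Ort::Session session(env, "model.onnx", session_options);
    } catch (const Ort::Exception& e) {
        // CUDA failure 100 surfaces here when no device is visible to the process.
        std::cerr << "ORT error " << e.GetOrtErrorCode() << ": " << e.what() << "\n";
        return 1;
    }
    return 0;
}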

CMake file:

target_link_libraries(${PROJECT_NAME}
  -L${IMPORT_DIR_ONNX}/cuda/lib64 -lcudart
  -L${IMPORT_DIR_ONNX}/cuda/lib64 -lcudnn
  -L${IMPORT_DIR_ONNX}/onnxruntime-linux-x64-gpu-1.18.0/lib -lonnxruntime
)

Error Logs

CUDA failure 100: no CUDA-capable device is detected ;

initInfer() :: Exception caught while initializing inference:
/tmp/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
/tmp/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
CUDA failure 100: no CUDA-capable device is detected ; GPU=0 ; hostname=algorithm-runner-6c467684d-wqrnb ; file=/tmp/onnxruntime/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id);
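
Since the failure comes from cudaSetDevice, a standalone CUDA runtime check (a sketch, independent of ONNX Runtime; the file name device_check.cu is ours) can confirm whether the pod sees any GPU at all. Error 100 is cudaErrorNoDevice, and in Kubernetes the usual suspects are a missing nvidia.com/gpu resource request on the pod, an absent NVIDIA device plugin, or NVIDIA_VISIBLE_DEVICES not exposing the device to the container.

#include <cstdio>
#include <cuda_runtime.h>

// Build: nvcc device_check.cu -o device_check (or g++ with -lcudart)
int main() {
    int device_count = 0;
    cudaError_t status = cudaGetDeviceCount(&device_count);
    if (status != cudaSuccess) {
        // Error 100 (cudaErrorNoDevice) here matches the ORT failure above.
        std::printf("cudaGetDeviceCount failed: %d (%s)\n",
                    static_cast<int>(status), cudaGetErrorString(status));
        return 1;
    }
    std::printf("Visible CUDA devices: %d\n", device_count);

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            std::printf("Device %d: %s\n", i, prop.name);
        }
    }
    return 0;
}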

Environment Details

  • ONNX Runtime Version: 1.18.0

  • CUDA Version: 12.2

  • cuDNN Version: 8.9.7

  • OS: SLES 15.5, running in a Kubernetes environment

  • Hardware:

    • GPU: Tesla T4
    • Driver Version: 550.107.02
    • CUDA Version from nvidia-smi: 12.4

Diagnostics

Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Output of nvidia-smi -L:

GPU 0: Tesla T4 (UUID: GPU-06b02b25-f8bb-b475-f628-805b3984d63f)

Metadata

Labels

ep:CUDA (issues related to the CUDA execution provider)
stale (issues that have not been addressed in a while; categorized by a bot)
