Description
Describe the issue
When axis != rank - 1, the CUDA Softmax implementation transposes the input, calls the softmax kernel, and then transposes the result back:
auto temp_input = Tensor::Create(X->DataType(), TensorShape(transposed_input_dims), alloc);
// Perform the transpose
ORT_RETURN_IF_ERROR(Transpose::DoTranspose(GetDeviceProp(),
                                           Stream(ctx),
                                           GetCublasHandle(ctx),
                                           permutation, *X, *temp_input));
transposed_input = std::move(temp_input);
// Allocate memory for the intermediate output
intermediate_output = Tensor::Create(Y->DataType(), TensorShape(transposed_input_dims), alloc);
temp_input and intermediate_output are allocated by the temp allocator, but they are bound to a null Stream.
So when a session is run from multiple threads, these buffers may be used by multiple threads (and therefore multiple streams) at the same time, which can produce wrong results.
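The suspected hazard can be illustrated with a small host-only toy model. This is a sketch under assumptions, not ONNX Runtime code: `ToyArena`, `Chunk`, `kNoStream`, and `SecondRunAliasesPendingBuffer` are hypothetical names, and the reuse rule is a simplified guess at how a stream-aware arena decides whether a freed chunk may be handed out again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: none of these types are ONNX Runtime classes.
constexpr int kNoStream = -1;  // stands in for a null Stream binding

struct Chunk {
  int owner_stream = kNoStream;  // stream recorded when the chunk was allocated
  bool in_use = false;           // host-side ownership
  bool work_pending = false;     // an async kernel may still touch the memory
};

class ToyArena {
 public:
  explicit ToyArena(std::size_t n) : chunks_(n) {}

  // Returns the index of a chunk, or -1 if no chunk can be handed out.
  int Alloc(int stream) {
    for (std::size_t i = 0; i < chunks_.size(); ++i) {
      Chunk& c = chunks_[i];
      if (c.in_use) continue;
      // A free chunk tagged with a different stream is skipped while its
      // work is pending. A chunk tagged kNoStream carries no stream
      // information, so this check can never protect it.
      const bool other_stream =
          c.owner_stream != kNoStream && c.owner_stream != stream;
      if (other_stream && c.work_pending) continue;
      c.in_use = true;
      c.owner_stream = stream;
      return static_cast<int>(i);
    }
    return -1;
  }

  void LaunchAsync(int idx) { chunks_[idx].work_pending = true; }
  void Free(int idx) { chunks_[idx].in_use = false; }
  bool Pending(int idx) const { return chunks_[idx].work_pending; }

 private:
  std::vector<Chunk> chunks_;
};

// Models two concurrent Run() calls that use different streams.
// tag_a and tag_b are the stream tags passed to the allocator.
// Returns true if the second call receives the first call's buffer
// while the first call's kernel is still pending.
bool SecondRunAliasesPendingBuffer(int tag_a, int tag_b) {
  ToyArena arena(2);
  const int a = arena.Alloc(tag_a);
  arena.LaunchAsync(a);
  arena.Free(a);  // the host-side free returns before the kernel finishes
  const int b = arena.Alloc(tag_b);
  return b == a && arena.Pending(b);
}
```

In this model, `SecondRunAliasesPendingBuffer(kNoStream, kNoStream)` returns true: both runs get the same chunk while work is pending. `SecondRunAliasesPendingBuffer(1, 2)` returns false, because the stream tag lets the arena skip the busy chunk. Whether the real allocator behaves this way is exactly what this issue asks maintainers to confirm.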
To reproduce
Run a single session concurrently from multiple threads.
Urgency
yes
Platform
Linux
OS Version
CentOS 7
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
v1.19
ONNX Runtime API
C++
Architecture
X86
Execution Provider
CUDA
Execution Provider Library Version
No response