
Perf issue: Torchsharp is slower than pytorch on cuda on some operators #1442

Open

Description

@LittleLittleCloud

I ran some benchmark tests to compare the performance of TorchSharp and PyTorch, both using libtorch 2.2.1 + CUDA 12.1, and I noticed that TorchSharp is slower than PyTorch on most operators. Below are the benchmark results.

TorchSharp

[screenshot: TorchSharp benchmark results]

PyTorch

[screenshot: PyTorch benchmark results]

Observation

I can achieve comparable results between TorchSharp and PyTorch if I replace each operator with its in-place version. The performance also becomes much better if I explicitly dispose the tensors created during each test.

For example, in the add benchmark, TorchSharp runs nearly as fast as PyTorch if I use tensor.add_ instead of tensor.add.

Considering that the major difference between an operator and its in-place counterpart is that the in-place version does not create a new Tensor object, the main overhead likely happens in the Tensor constructor.
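
For reference, here is a minimal sketch of the two workarounds mentioned above, using TorchSharp's in-place Tensor.add_ and the torch.NewDisposeScope() dispose-scope API; the exact numbers will of course differ per machine:

using System;
using TorchSharp;

var device = torch.CUDA;
var repeatTime = 10000;

var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

// Variant 1: in-place add; no new Tensor object is allocated per iteration.
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    a.add_(b);
}
Console.WriteLine("Time taken for add_ (in-place): " + (DateTime.Now - startTime).TotalSeconds);

// Variant 2: out-of-place add, but temporaries created inside the scope are
// disposed at the end of each iteration instead of waiting for the GC/finalizer.
startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    using var scope = torch.NewDisposeScope();
    var c = a + b;
}
Console.WriteLine("Time taken for add + dispose scope: " + (DateTime.Now - startTime).TotalSeconds);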

Source code

using System;
using TorchSharp;

// Initialize CUDA device
var device = torch.CUDA;

var repeatTime = 10000;
// Test randn
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    var _ = torch.randn(new long[] { 1000, 1000 }, device: device);
}

Console.WriteLine("Time taken for randn: " + (DateTime.Now - startTime).TotalSeconds);

// Test matmul
startTime = DateTime.Now;
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.matmul(a, b);
}

Console.WriteLine("Time taken for matmul: " + (DateTime.Now - startTime).TotalSeconds);

// Test concat
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.cat(new[] { a, b }, 0);
}

Console.WriteLine("Time taken for concat: " + (DateTime.Now - startTime).TotalSeconds);

// Test slice
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a[.., 0..500];
}

Console.WriteLine("Time taken for slice: " + (DateTime.Now - startTime).TotalSeconds);

// Test add
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a + b;
}

Console.WriteLine("Time taken for add: " + (DateTime.Now - startTime).TotalSeconds);

# benchmark script for pytorch on cuda

import torch
import time
repeat = 10000
total_time = 0
start_time = time.time()
for _ in range(repeat):
    a = torch.randn(1000, 1000).cuda()
print("Time taken for randn: " , time.time()-start_time)

start_time = time.time()
# test matmul
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = torch.matmul(a, b)

print("Time taken for matmul: ", time.time()-start_time)

start_time = time.time()

# test concat   
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = torch.cat((a, b), 0)

print("Time taken for concat: ", time.time()-start_time)

start_time = time.time()
# test slice
a = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a[:, 0:500]

print("Time taken for slice: ", time.time()-start_time)

start_time = time.time()
# test add
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a + b

print("Time taken for add: ", time.time()-start_time)

Labels: enhancement (New feature or request)
