
Perf issue: Torchsharp is slower than pytorch on cuda on some operators #1442

Open

Description

@LittleLittleCloud

I ran some benchmark tests to compare the performance of TorchSharp and PyTorch, both using libtorch 2.2.1 + CUDA 12.1, and I noticed that TorchSharp is slower than PyTorch on most operators. Below are the benchmark results.

TorchSharp

[screenshot: TorchSharp benchmark results]

PyTorch

[screenshot: PyTorch benchmark results]

Observation

I can achieve comparable results between TorchSharp and PyTorch if I replace each operator with its in-place version. The performance also becomes much better if I explicitly dispose the tensors created during each test.

For example, in the add benchmark, TorchSharp runs nearly as fast as PyTorch if I use tensor.add_ instead of tensor.add.

Considering that the major difference between an operator and its in-place counterpart is that the in-place version does not create a new Tensor object, the main overhead likely happens in the Tensor constructor.
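
For reference, here is a minimal sketch of the two workarounds mentioned above, using TorchSharp's in-place Tensor.add_ and the torch.NewDisposeScope() dispose-scope API; the exact numbers will of course differ per machine:

using System;
using TorchSharp;

var device = torch.CUDA;
var repeatTime = 10000;

var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

// Variant 1: in-place add; no new Tensor object is allocated per iteration.
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    a.add_(b);
}
Console.WriteLine("Time taken for add_ (in-place): " + (DateTime.Now - startTime).TotalSeconds);

// Variant 2: out-of-place add, but temporaries created inside the scope are
// disposed at the end of each iteration instead of waiting for the GC/finalizer.
startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    using var scope = torch.NewDisposeScope();
    var c = a + b;
}
Console.WriteLine("Time taken for add + dispose scope: " + (DateTime.Now - startTime).TotalSeconds);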

Source code

using System;
using TorchSharp;

// Initialize CUDA device
var device = torch.CUDA;

var repeatTime = 10000;
// Test randn
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    var _ = torch.randn(new long[] { 1000, 1000 }, device: device);
}

Console.WriteLine("Time taken for randn: " + (DateTime.Now - startTime).TotalSeconds);

// Test matmul
startTime = DateTime.Now;
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.matmul(a, b);
}

Console.WriteLine("Time taken for matmul: " + (DateTime.Now - startTime).TotalSeconds);

// Test concat
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.cat(new[] { a, b }, 0);
}

Console.WriteLine("Time taken for concat: " + (DateTime.Now - startTime).TotalSeconds);

// Test slice
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a[.., 0..500];
}

Console.WriteLine("Time taken for slice: " + (DateTime.Now - startTime).TotalSeconds);

// Test add
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a + b;
}

Console.WriteLine("Time taken for add: " + (DateTime.Now - startTime).TotalSeconds);

# benchmark script for pytorch on cuda

import torch
import time
repeat = 10000
total_time = 0
start_time = time.time()
for _ in range(repeat):
    a = torch.randn(1000, 1000).cuda()
print("Time taken for randn: " , time.time()-start_time)

start_time = time.time()
# test matmul
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = torch.matmul(a, b)

print("Time taken for matmul: ", time.time()-start_time)

start_time = time.time()

# test concat   
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = torch.cat((a, b), 0)

print("Time taken for concat: ", time.time()-start_time)

start_time = time.time()
# test slice
a = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a[:, 0:500]

print("Time taken for slice: ", time.time()-start_time)

start_time = time.time()
# test add
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a + b

print("Time taken for add: ", time.time()-start_time)

Labels: enhancement (New feature or request)
