
Conversation

@oschulz (Contributor) commented Jan 9, 2026

transpose! performance can be very important when changing data layout to match the access pattern of kernels, etc.

With

using CUDA, LinearAlgebra, BenchmarkTools

A = cu(rand(Float32, 1000, 10000));
B = similar(A, reverse(size(A)));  # destination with the transposed shape

@benchmark (transpose!($B, $A); CUDA.synchronize())  # synchronize to include the full kernel runtime

Before (current master branch on NVIDIA GH200):

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  172.672 μs …  1.437 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     175.648 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   176.020 μs ± 14.891 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

After (this PR on NVIDIA GH200):

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  41.536 μs … 285.217 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.904 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   44.114 μs ±   2.743 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

This implicitly speeds up copy(transpose(A)) as well.
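
For example, materializing a lazy transpose takes the same path (an illustrative one-liner, reusing the A defined above):

At = copy(transpose(A))  # allocates the destination and fills it via transpose!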

For A = cu(rand(Float32, 5000, 50000)) I get 622 μs with this PR (3.9 ms with the current master branch) on a GH200 96GB. That array holds 1 GB of data, so this amounts to about 1.6 TB/s of array throughput; since a transpose reads and writes every element, the actual memory traffic is roughly twice that. The total HBM3 bandwidth on the device is about 4 TB/s, so we reach about 80% of the maximum bandwidth for large arrays, including the CUBLAS call overhead, which seems good for a transposing read/write.
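
As a quick sanity check on those figures, here is the bandwidth arithmetic in plain Julia, using the numbers quoted above:

nbytes = 5000 * 50000 * sizeof(Float32)  # 1 GB of array data
t = 622e-6                               # measured transpose! time in seconds
nbytes / t / 1e12                        # ≈ 1.6 TB/s array throughput (one direction)
2 * nbytes / t / 1e12                    # ≈ 3.2 TB/s read + write traffic, ~80% of ~4 TB/s HBM3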

Note: the specialization (and speed-up) is limited to the element types supported by CUDA.CUBLAS.geam! (Float32, Float64, ComplexF32, ComplexF64, the same as in CUBLAS itself).
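
For context, CUBLAS geam computes C = alpha*op(A) + beta*op(B), so a transposing copy falls out by setting beta = 0. A minimal sketch of such a specialization, assuming CUDA.jl's geam! wrapper (the names GeamEltype and fast_transpose! are illustrative, not necessarily the exact code merged in this PR):

using CUDA, LinearAlgebra

# Element types accepted by CUBLAS geam; this alias is ours, for illustration only.
const GeamEltype = Union{Float32, Float64, ComplexF32, ComplexF64}

# B .= transpose(A) via geam!: C = alpha*op(A) + beta*op(B). With beta = 0 the
# second input matrix is never read, so A is passed again just to satisfy the signature.
function fast_transpose!(B::CuMatrix{T}, A::CuMatrix{T}) where {T<:GeamEltype}
    size(B) == reverse(size(A)) ||
        throw(DimensionMismatch("expected size(B) == $(reverse(size(A)))"))
    CUDA.CUBLAS.geam!('T', 'T', one(T), A, zero(T), A, B)
    return B
end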

@github-actions bot commented Jan 9, 2026

Your PR no longer requires formatting changes. Thank you for your contribution!

@github-actions bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current (f9f2170) | Previous (0c00b83) | Ratio |
|---|---|---|---|
| latency/precompile | 55072220968.5 ns | 55103441406.5 ns | 1.00 |
| latency/ttfp | 7843956780 ns | 7854970810.5 ns | 1.00 |
| latency/import | 4144304823 ns | 4142920660.5 ns | 1.00 |
| integration/volumerhs | 9623768 ns | 9624895.5 ns | 1.00 |
| integration/byval/slices=1 | 146744 ns | 147201 ns | 1.00 |
| integration/byval/slices=3 | 425890 ns | 426000 ns | 1.00 |
| integration/byval/reference | 145064 ns | 145105 ns | 1.00 |
| integration/byval/slices=2 | 286292 ns | 286632 ns | 1.00 |
| integration/cudadevrt | 103758 ns | 103846 ns | 1.00 |
| kernel/indexing | 14224 ns | 14265 ns | 1.00 |
| kernel/indexing_checked | 15049 ns | 14925 ns | 1.01 |
| kernel/occupancy | 688.7 ns | 783.1 ns | 0.88 |
| kernel/launch | 2161.6 ns | 2262.3 ns | 0.96 |
| kernel/rand | 16665 ns | 16624 ns | 1.00 |
| array/reverse/1d | 19931 ns | 20261 ns | 0.98 |
| array/reverse/2dL_inplace | 66801 ns | 66981 ns | 1.00 |
| array/reverse/1dL | 70128 ns | 70447 ns | 1.00 |
| array/reverse/2d | 21888 ns | 22244 ns | 0.98 |
| array/reverse/1d_inplace | 9704 ns | 11580 ns | 0.84 |
| array/reverse/2d_inplace | 13264 ns | 13344 ns | 0.99 |
| array/reverse/2dL | 74104 ns | 74284 ns | 1.00 |
| array/reverse/1dL_inplace | 66763 ns | 67021 ns | 1.00 |
| array/copy | 20579 ns | 20733 ns | 0.99 |
| array/iteration/findall/int | 157516 ns | 159065 ns | 0.99 |
| array/iteration/findall/bool | 139333 ns | 141350 ns | 0.99 |
| array/iteration/findfirst/int | 160879 ns | 162741 ns | 0.99 |
| array/iteration/findfirst/bool | 161758 ns | 164024 ns | 0.99 |
| array/iteration/scalar | 71683 ns | 72819 ns | 0.98 |
| array/iteration/logical | 214701.5 ns | 220064.5 ns | 0.98 |
| array/iteration/findmin/1d | 92317.5 ns | 56834 ns | 1.62 |
| array/iteration/findmin/2d | 121380.5 ns | 98602 ns | 1.23 |
| array/reductions/reduce/Int64/1d | 42865.5 ns | 44369 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1 | 44771 ns | 46082 ns | 0.97 |
| array/reductions/reduce/Int64/dims=2 | 61660 ns | 62261 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1L | 88890 ns | 89560 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 87782 ns | 88709 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 37618 ns | 38170.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 42189 ns | 43662 ns | 0.97 |
| array/reductions/reduce/Float32/dims=2 | 59829 ns | 60196 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52589 ns | 52890 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2L | 72009 ns | 72931 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42892 ns | 44486 ns | 0.96 |
| array/reductions/mapreduce/Int64/dims=1 | 45481 ns | 51105 ns | 0.89 |
| array/reductions/mapreduce/Int64/dims=2 | 61652 ns | 61916 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 88975 ns | 89497 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 87872.5 ns | 88980 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 36904 ns | 37944 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 51967 ns | 52429 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 60232 ns | 60388 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52649 ns | 53049 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 71909.5 ns | 72843 ns | 0.99 |
| array/broadcast | 20332 ns | 20274 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11081 ns | 11225 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 215005 ns | 218396.5 ns | 0.98 |
| array/copyto!/gpu_to_cpu | 281551 ns | 284648 ns | 0.99 |
| array/accumulate/Int64/1d | 125027 ns | 125449 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83624 ns | 84251 ns | 0.99 |
| array/accumulate/Int64/dims=2 | 158026 ns | 158690 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709832 ns | 1709941.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966120 ns | 967026.5 ns | 1.00 |
| array/accumulate/Float32/1d | 109005 ns | 109856 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 80878 ns | 81373 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 148084 ns | 148536 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1619138 ns | 1619811 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698601 ns | 699285.5 ns | 1.00 |
| array/construct | 1289.25 ns | 1296.2 ns | 0.99 |
| array/random/randn/Float32 | 44516 ns | 48633 ns | 0.92 |
| array/random/randn!/Float32 | 24799 ns | 25237 ns | 0.98 |
| array/random/rand!/Int64 | 27315 ns | 27465 ns | 0.99 |
| array/random/rand!/Float32 | 8823.5 ns | 8946 ns | 0.99 |
| array/random/rand/Int64 | 29632 ns | 30454 ns | 0.97 |
| array/random/rand/Float32 | 13189 ns | 13364.5 ns | 0.99 |
| array/permutedims/4d | 55082 ns | 55600 ns | 0.99 |
| array/permutedims/2d | 53869 ns | 54423 ns | 0.99 |
| array/permutedims/3d | 54783 ns | 55435 ns | 0.99 |
| array/sorting/1d | 2758005 ns | 2759622.5 ns | 1.00 |
| array/sorting/by | 3344784 ns | 3345835 ns | 1.00 |
| array/sorting/2d | 1080734 ns | 1082443 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.1 ns | 1033 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 7440 ns | 7095.4 ns | 1.05 |
| cuda/synchronization/stream/blocking | 845.6 ns | 848.8 ns | 1.00 |
| cuda/synchronization/context/auto | 1183.9 ns | 1163.5 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 8196.4 ns | 7702.2 ns | 1.06 |
| cuda/synchronization/context/blocking | 925.6 ns | 910.1 ns | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.

@oschulz (Contributor, Author) commented Jan 10, 2026

Not sure, but I think the test failures may be unrelated?

@kshyatt added the cuda array and performance labels Jan 12, 2026
@kshyatt (Member) commented Jan 12, 2026

Looks unrelated, I've retried CI.

@codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (0c00b83) to head (f9f2170).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3015      +/-   ##
==========================================
+ Coverage   89.43%   89.45%   +0.01%     
==========================================
  Files         148      148              
  Lines       12991    12995       +4     
==========================================
+ Hits        11619    11625       +6     
+ Misses       1372     1370       -2     


@oschulz (Contributor, Author) commented Jan 12, 2026

Thanks @kshyatt! Now it passes.

@kshyatt merged commit da38676 into JuliaGPU:master Jan 12, 2026
3 checks passed
