
Conversation

@oschulz (Contributor) commented Jan 9, 2026

transpose! performance can be very important when changing data layout to match the access pattern of kernels, etc.

With

using CUDA, LinearAlgebra, BenchmarkTools

A = cu(rand(Float32, 1000, 10000));
B = similar(A, reverse(size(A)));  # destination with the transposed shape

@benchmark (transpose!($B, $A); CUDA.synchronize())  # synchronize to include the full kernel runtime

Before (current master branch on NVIDIA GH200):

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  172.672 μs …  1.437 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     175.648 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   176.020 μs ± 14.891 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

After (this PR on NVIDIA GH200):

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  41.536 μs … 285.217 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.904 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   44.114 μs ±   2.743 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

This implicitly speeds up copy(transpose(A)) as well.
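
For example, materializing a lazy transpose takes the same path (an illustrative one-liner, reusing the A defined above):

At = copy(transpose(A))  # allocates the destination and fills it via transpose!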

For A = cu(rand(Float32, 5000, 50000)) I get 622 μs with this PR (3.9 ms with the current master branch) on a GH200 96GB. That array holds 1 GB of data, so this amounts to about 1.6 TB/s of array throughput; since a transpose reads and writes every element, the actual memory traffic is roughly twice that. The total HBM3 bandwidth on the device is about 4 TB/s, so we reach about 80% of the maximum bandwidth for large arrays, including the CUBLAS call overhead, which seems good for a transposing read/write.
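
As a quick sanity check on those figures, here is the bandwidth arithmetic in plain Julia, using the numbers quoted above:

nbytes = 5000 * 50000 * sizeof(Float32)  # 1 GB of array data
t = 622e-6                               # measured transpose! time in seconds
nbytes / t / 1e12                        # ≈ 1.6 TB/s array throughput (one direction)
2 * nbytes / t / 1e12                    # ≈ 3.2 TB/s read + write traffic, ~80% of ~4 TB/s HBM3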

Note: the specialization (and speed-up) is limited to the element types supported by CUDA.CUBLAS.geam! (Float32, Float64, ComplexF32, ComplexF64, the same as in CUBLAS itself).
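
For context, CUBLAS geam computes C = alpha*op(A) + beta*op(B), so a transposing copy falls out by setting beta = 0. A minimal sketch of such a specialization, assuming CUDA.jl's geam! wrapper (the names GeamEltype and fast_transpose! are illustrative, not necessarily the exact code merged in this PR):

using CUDA, LinearAlgebra

# Element types accepted by CUBLAS geam; this alias is ours, for illustration only.
const GeamEltype = Union{Float32, Float64, ComplexF32, ComplexF64}

# B .= transpose(A) via geam!: C = alpha*op(A) + beta*op(B). With beta = 0 the
# second input matrix is never read, so A is passed again just to satisfy the signature.
function fast_transpose!(B::CuMatrix{T}, A::CuMatrix{T}) where {T<:GeamEltype}
    size(B) == reverse(size(A)) ||
        throw(DimensionMismatch("expected size(B) == $(reverse(size(A)))"))
    CUDA.CUBLAS.geam!('T', 'T', one(T), A, zero(T), A, B)
    return B
end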

@github-actions bot commented Jan 9, 2026

Your PR no longer requires formatting changes. Thank you for your contribution!

@github-actions bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current (f9f2170) | Previous (0c00b83) | Ratio |
|---|---|---|---|
| latency/precompile | 55072220968.5 ns | 55103441406.5 ns | 1.00 |
| latency/ttfp | 7843956780 ns | 7854970810.5 ns | 1.00 |
| latency/import | 4144304823 ns | 4142920660.5 ns | 1.00 |
| integration/volumerhs | 9623768 ns | 9624895.5 ns | 1.00 |
| integration/byval/slices=1 | 146744 ns | 147201 ns | 1.00 |
| integration/byval/slices=3 | 425890 ns | 426000 ns | 1.00 |
| integration/byval/reference | 145064 ns | 145105 ns | 1.00 |
| integration/byval/slices=2 | 286292 ns | 286632 ns | 1.00 |
| integration/cudadevrt | 103758 ns | 103846 ns | 1.00 |
| kernel/indexing | 14224 ns | 14265 ns | 1.00 |
| kernel/indexing_checked | 15049 ns | 14925 ns | 1.01 |
| kernel/occupancy | 688.7 ns | 783.1 ns | 0.88 |
| kernel/launch | 2161.6 ns | 2262.3 ns | 0.96 |
| kernel/rand | 16665 ns | 16624 ns | 1.00 |
| array/reverse/1d | 19931 ns | 20261 ns | 0.98 |
| array/reverse/2dL_inplace | 66801 ns | 66981 ns | 1.00 |
| array/reverse/1dL | 70128 ns | 70447 ns | 1.00 |
| array/reverse/2d | 21888 ns | 22244 ns | 0.98 |
| array/reverse/1d_inplace | 9704 ns | 11580 ns | 0.84 |
| array/reverse/2d_inplace | 13264 ns | 13344 ns | 0.99 |
| array/reverse/2dL | 74104 ns | 74284 ns | 1.00 |
| array/reverse/1dL_inplace | 66763 ns | 67021 ns | 1.00 |
| array/copy | 20579 ns | 20733 ns | 0.99 |
| array/iteration/findall/int | 157516 ns | 159065 ns | 0.99 |
| array/iteration/findall/bool | 139333 ns | 141350 ns | 0.99 |
| array/iteration/findfirst/int | 160879 ns | 162741 ns | 0.99 |
| array/iteration/findfirst/bool | 161758 ns | 164024 ns | 0.99 |
| array/iteration/scalar | 71683 ns | 72819 ns | 0.98 |
| array/iteration/logical | 214701.5 ns | 220064.5 ns | 0.98 |
| array/iteration/findmin/1d | 92317.5 ns | 56834 ns | 1.62 |
| array/iteration/findmin/2d | 121380.5 ns | 98602 ns | 1.23 |
| array/reductions/reduce/Int64/1d | 42865.5 ns | 44369 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1 | 44771 ns | 46082 ns | 0.97 |
| array/reductions/reduce/Int64/dims=2 | 61660 ns | 62261 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1L | 88890 ns | 89560 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 87782 ns | 88709 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 37618 ns | 38170.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 42189 ns | 43662 ns | 0.97 |
| array/reductions/reduce/Float32/dims=2 | 59829 ns | 60196 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52589 ns | 52890 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2L | 72009 ns | 72931 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42892 ns | 44486 ns | 0.96 |
| array/reductions/mapreduce/Int64/dims=1 | 45481 ns | 51105 ns | 0.89 |
| array/reductions/mapreduce/Int64/dims=2 | 61652 ns | 61916 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 88975 ns | 89497 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 87872.5 ns | 88980 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 36904 ns | 37944 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 51967 ns | 52429 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 60232 ns | 60388 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52649 ns | 53049 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 71909.5 ns | 72843 ns | 0.99 |
| array/broadcast | 20332 ns | 20274 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11081 ns | 11225 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 215005 ns | 218396.5 ns | 0.98 |
| array/copyto!/gpu_to_cpu | 281551 ns | 284648 ns | 0.99 |
| array/accumulate/Int64/1d | 125027 ns | 125449 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83624 ns | 84251 ns | 0.99 |
| array/accumulate/Int64/dims=2 | 158026 ns | 158690 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709832 ns | 1709941.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966120 ns | 967026.5 ns | 1.00 |
| array/accumulate/Float32/1d | 109005 ns | 109856 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 80878 ns | 81373 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 148084 ns | 148536 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1619138 ns | 1619811 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698601 ns | 699285.5 ns | 1.00 |
| array/construct | 1289.25 ns | 1296.2 ns | 0.99 |
| array/random/randn/Float32 | 44516 ns | 48633 ns | 0.92 |
| array/random/randn!/Float32 | 24799 ns | 25237 ns | 0.98 |
| array/random/rand!/Int64 | 27315 ns | 27465 ns | 0.99 |
| array/random/rand!/Float32 | 8823.5 ns | 8946 ns | 0.99 |
| array/random/rand/Int64 | 29632 ns | 30454 ns | 0.97 |
| array/random/rand/Float32 | 13189 ns | 13364.5 ns | 0.99 |
| array/permutedims/4d | 55082 ns | 55600 ns | 0.99 |
| array/permutedims/2d | 53869 ns | 54423 ns | 0.99 |
| array/permutedims/3d | 54783 ns | 55435 ns | 0.99 |
| array/sorting/1d | 2758005 ns | 2759622.5 ns | 1.00 |
| array/sorting/by | 3344784 ns | 3345835 ns | 1.00 |
| array/sorting/2d | 1080734 ns | 1082443 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.1 ns | 1033 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 7440 ns | 7095.4 ns | 1.05 |
| cuda/synchronization/stream/blocking | 845.6 ns | 848.8 ns | 1.00 |
| cuda/synchronization/context/auto | 1183.9 ns | 1163.5 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 8196.4 ns | 7702.2 ns | 1.06 |
| cuda/synchronization/context/blocking | 925.6 ns | 910.1 ns | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.

@oschulz (Contributor, Author) commented Jan 10, 2026

Not sure, but I think the test failures may be unrelated?

@kshyatt added the cuda array and performance labels Jan 12, 2026
@kshyatt (Member) commented Jan 12, 2026

Looks unrelated, I've retried CI.

@codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (0c00b83) to head (f9f2170).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3015      +/-   ##
==========================================
+ Coverage   89.43%   89.45%   +0.01%     
==========================================
  Files         148      148              
  Lines       12991    12995       +4     
==========================================
+ Hits        11619    11625       +6     
+ Misses       1372     1370       -2     


@oschulz (Contributor, Author) commented Jan 12, 2026

Thanks @kshyatt! Now it passes.

@kshyatt merged commit da38676 into JuliaGPU:master Jan 12, 2026
3 checks passed
