Specialize transpose! for CuMatrix #3015
Conversation
Your PR no longer requires formatting changes. Thank you for your contribution!
Force-pushed from 96b2a10 to ca3ed39, then from ca3ed39 to f9f2170.
**CUDA.jl Benchmarks**

| Benchmark suite | Current: f9f2170 | Previous: 0c00b83 | Ratio |
|---|---|---|---|
| latency/precompile | 55072220968.5 ns | 55103441406.5 ns | 1.00 |
| latency/ttfp | 7843956780 ns | 7854970810.5 ns | 1.00 |
| latency/import | 4144304823 ns | 4142920660.5 ns | 1.00 |
| integration/volumerhs | 9623768 ns | 9624895.5 ns | 1.00 |
| integration/byval/slices=1 | 146744 ns | 147201 ns | 1.00 |
| integration/byval/slices=3 | 425890 ns | 426000 ns | 1.00 |
| integration/byval/reference | 145064 ns | 145105 ns | 1.00 |
| integration/byval/slices=2 | 286292 ns | 286632 ns | 1.00 |
| integration/cudadevrt | 103758 ns | 103846 ns | 1.00 |
| kernel/indexing | 14224 ns | 14265 ns | 1.00 |
| kernel/indexing_checked | 15049 ns | 14925 ns | 1.01 |
| kernel/occupancy | 688.6711409395973 ns | 783.1441441441441 ns | 0.88 |
| kernel/launch | 2161.5555555555557 ns | 2262.3333333333335 ns | 0.96 |
| kernel/rand | 16665 ns | 16624 ns | 1.00 |
| array/reverse/1d | 19931 ns | 20261 ns | 0.98 |
| array/reverse/2dL_inplace | 66801 ns | 66981 ns | 1.00 |
| array/reverse/1dL | 70128 ns | 70447 ns | 1.00 |
| array/reverse/2d | 21888 ns | 22244 ns | 0.98 |
| array/reverse/1d_inplace | 9704 ns | 11580 ns | 0.84 |
| array/reverse/2d_inplace | 13264 ns | 13344 ns | 0.99 |
| array/reverse/2dL | 74104 ns | 74284 ns | 1.00 |
| array/reverse/1dL_inplace | 66763 ns | 67021 ns | 1.00 |
| array/copy | 20579 ns | 20733 ns | 0.99 |
| array/iteration/findall/int | 157516 ns | 159065 ns | 0.99 |
| array/iteration/findall/bool | 139333 ns | 141350 ns | 0.99 |
| array/iteration/findfirst/int | 160879 ns | 162741 ns | 0.99 |
| array/iteration/findfirst/bool | 161758 ns | 164024 ns | 0.99 |
| array/iteration/scalar | 71683 ns | 72819 ns | 0.98 |
| array/iteration/logical | 214701.5 ns | 220064.5 ns | 0.98 |
| array/iteration/findmin/1d | 92317.5 ns | 56834 ns | 1.62 |
| array/iteration/findmin/2d | 121380.5 ns | 98602 ns | 1.23 |
| array/reductions/reduce/Int64/1d | 42865.5 ns | 44369 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1 | 44771 ns | 46082 ns | 0.97 |
| array/reductions/reduce/Int64/dims=2 | 61660 ns | 62261 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1L | 88890 ns | 89560 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 87782 ns | 88709 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 37618 ns | 38170.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 42189 ns | 43662 ns | 0.97 |
| array/reductions/reduce/Float32/dims=2 | 59829 ns | 60196 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52589 ns | 52890 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2L | 72009 ns | 72931 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42892 ns | 44486 ns | 0.96 |
| array/reductions/mapreduce/Int64/dims=1 | 45481 ns | 51105 ns | 0.89 |
| array/reductions/mapreduce/Int64/dims=2 | 61652 ns | 61916 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 88975 ns | 89497 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 87872.5 ns | 88980 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 36904 ns | 37944 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 51967 ns | 52429 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 60232 ns | 60388 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52649 ns | 53049 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 71909.5 ns | 72843 ns | 0.99 |
| array/broadcast | 20332 ns | 20274 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11081 ns | 11225 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 215005 ns | 218396.5 ns | 0.98 |
| array/copyto!/gpu_to_cpu | 281551 ns | 284648 ns | 0.99 |
| array/accumulate/Int64/1d | 125027 ns | 125449 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83624 ns | 84251 ns | 0.99 |
| array/accumulate/Int64/dims=2 | 158026 ns | 158690 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709832 ns | 1709941.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966120 ns | 967026.5 ns | 1.00 |
| array/accumulate/Float32/1d | 109005 ns | 109856 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 80878 ns | 81373 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 148084 ns | 148536 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1619138 ns | 1619811 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698601 ns | 699285.5 ns | 1.00 |
| array/construct | 1289.25 ns | 1296.2 ns | 0.99 |
| array/random/randn/Float32 | 44516 ns | 48633 ns | 0.92 |
| array/random/randn!/Float32 | 24799 ns | 25237 ns | 0.98 |
| array/random/rand!/Int64 | 27315 ns | 27465 ns | 0.99 |
| array/random/rand!/Float32 | 8823.5 ns | 8946 ns | 0.99 |
| array/random/rand/Int64 | 29632 ns | 30454 ns | 0.97 |
| array/random/rand/Float32 | 13189 ns | 13364.5 ns | 0.99 |
| array/permutedims/4d | 55082 ns | 55600 ns | 0.99 |
| array/permutedims/2d | 53869 ns | 54423 ns | 0.99 |
| array/permutedims/3d | 54783 ns | 55435 ns | 0.99 |
| array/sorting/1d | 2758005 ns | 2759622.5 ns | 1.00 |
| array/sorting/by | 3344784 ns | 3345835 ns | 1.00 |
| array/sorting/2d | 1080734 ns | 1082443 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.1 ns | 1033 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 7440 ns | 7095.4 ns | 1.05 |
| cuda/synchronization/stream/blocking | 845.578947368421 ns | 848.7717391304348 ns | 1.00 |
| cuda/synchronization/context/auto | 1183.9 ns | 1163.5 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 8196.400000000001 ns | 7702.2 ns | 1.06 |
| cuda/synchronization/context/blocking | 925.6388888888889 ns | 910.1 ns | 1.02 |
This comment was automatically generated by a workflow using github-action-benchmark.
Not sure, but I think the test failures may be unrelated?
Looks unrelated, I've retried CI.
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #3015      +/-   ##
==========================================
+ Coverage   89.43%   89.45%   +0.01%
==========================================
  Files         148      148
  Lines       12991    12995       +4
==========================================
+ Hits        11619    11625       +6
+ Misses       1372     1370       -2
```

View full report in Codecov by Sentry.
Thanks @kshyatt! Now it passes.
`transpose!` performance can be very important when changing data layout to match the access pattern of kernels, etc. With this PR, `transpose!` for a `CuMatrix` with a CUBLAS-supported element type dispatches to `CUDA.CUBLAS.geam!` instead of the generic fallback.
Before (current master branch on NVIDIA GH200):
After (this PR on NVIDIA GH200):
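The before/after timing output itself is not reproduced above. A hypothetical script for collecting such numbers (assuming BenchmarkTools.jl; not the exact commands used in the PR) could look like:

```julia
using CUDA, LinearAlgebra, BenchmarkTools

A = CUDA.rand(Float32, 5000, 50000)
B = CuMatrix{Float32}(undef, 50000, 5000)

# Time the in-place transpose; CUDA.@sync waits for the GPU work to finish,
# so the kernel execution time is measured rather than just the async launch.
@btime CUDA.@sync transpose!($B, $A)
```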
This implicitly speeds up `copy(transpose(A))` as well. For `A = cu(rand(Float32, 5000, 50000))` I get 622 μs with this PR (3.9 ms with current master branch) on a GH200 96GB, so 1.6 TB/s of data transposed, i.e. roughly 3.2 TB/s of combined read/write traffic. That seems good for a transposing read/write: total HBM3 bandwidth on the device is about 4 TB/s, so we can reach about 80% of the maximum bandwidth for large arrays, including the CUBLAS call overhead.

Note: the specialization (and speed-up) is limited to the datatypes supported by `CUDA.CUBLAS.geam!` (`Float32`, `Float64`, `ComplexF32`, `ComplexF64`, the same as in CUBLAS itself).
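For illustration, the core of such a specialization can be sketched using the existing `CUBLAS.geam!` wrapper, which computes `C = alpha*op(A) + beta*op(B)`. This is a minimal sketch of the approach, not necessarily the exact code merged in this PR:

```julia
using CUDA
using LinearAlgebra

# Element types for which CUBLAS provides geam (matrix addition/transposition).
const CublasGeamEltype = Union{Float32, Float64, ComplexF32, ComplexF64}

# Out-of-place transpose through geam!: with beta = 0 the second input operand
# is not read, so the destination matrix can be passed in its place.
function transpose_geam!(B::CuMatrix{T}, A::CuMatrix{T}) where {T<:CublasGeamEltype}
    size(B) == reverse(size(A)) ||
        throw(DimensionMismatch("destination must have size $(reverse(size(A)))"))
    CUDA.CUBLAS.geam!('T', 'N', one(T), A, zero(T), B, B)
    return B
end
```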