
Use GPU blit for large SharedStorage GPU→GPU copies (>32MB) #716

Merged
christiangnrd merged 4 commits into JuliaGPU:main from KaanKesginLW:fix/sharedstorage-copyto-gpu-blit on Dec 11, 2025

Conversation

KaanKesginLW (Contributor) commented Dec 4, 2025

SharedStorage GPU→GPU copyto! currently always uses CPU memcpy, which becomes a bottleneck for large buffers. This PR adds a size-based heuristic that uses GPU blit for copies larger than 32MB, achieving up to 3.6x speedup.

Problem

PR #445 introduced CPU memcpy for SharedStorage copies to avoid ObjectiveC.jl overhead. This is beneficial for small copies where the overhead dominates, but causes significant performance regression for large copies where GPU blit would be much faster due to its higher memory bandwidth.

Solution

Use a size-based threshold (32MB) to choose the faster path (sketched below the list):

  • ≤32MB: CPU memcpy (avoids API overhead)
  • >32MB: GPU blit (full memory bandwidth)
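
Conceptually the change is just a size check in front of the two existing copy implementations. The following is a minimal illustrative sketch; memcpy_path! and blit_path! are hypothetical stand-ins, not the actual Metal.jl internals:

const GPU_BLIT_THRESHOLD = 32 * 2^20   # 32 MiB, in bytes

memcpy_path!(dst, src, n) = nothing    # placeholder for the CPU memcpy path (PR #445)
blit_path!(dst, src, n)   = nothing    # placeholder for the GPU blit path

function shared_gpu_to_gpu_copyto!(dst, src, nbytes)
    if nbytes <= GPU_BLIT_THRESHOLD
        memcpy_path!(dst, src, nbytes)  # fixed overhead dominates below the threshold
    else
        blit_path!(dst, src, nbytes)    # blit bandwidth wins above it
    end
    return dst
end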

Why 32MB?

The threshold is based on amortizing the fixed overhead against the bandwidth differential:

  • Fixed overhead (command buffer creation, etc.): ~235 μs
  • CPU memcpy bandwidth: ~53 GB/s
  • GPU blit bandwidth: ~180 GB/s

Using the crossover formula threshold = overhead / (1/cpu_bw - 1/gpu_bw), the theoretical crossover is ~17MB. We use 32MB as a conservative margin to ensure no regression at the boundary.
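
Plugging the numbers above into that formula is a quick REPL check (not part of the PR itself):

overhead = 235e-6          # s, fixed command-buffer overhead
cpu_bw   = 53e9            # B/s, CPU memcpy bandwidth
gpu_bw   = 180e9           # B/s, GPU blit bandwidth
crossover = overhead / (1/cpu_bw - 1/gpu_bw)   # bytes
println(crossover / 2^20)                      # ≈ 16.8 MiB, i.e. the ~17MB crossover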

Benchmark Results

GPU→GPU SharedStorage (improved by this change):

| Size | Old (CPU memcpy) | New (heuristic) | Speedup |
|------|------------------|-----------------|---------|
| 32 MB | 0.62 ms (54 GB/s) | 0.67 ms (50 GB/s) | ~1.0x |
| 64 MB | 3.28 ms (20 GB/s) | 1.19 ms (57 GB/s) | 2.8x |
| 128 MB | 2.72 ms (49 GB/s) | 1.25 ms (108 GB/s) | 2.2x |
| 256 MB | 6.54 ms (41 GB/s) | 2.08 ms (129 GB/s) | 3.1x |
| 512 MB | 9.98 ms (54 GB/s) | 3.12 ms (172 GB/s) | 3.2x |
| 1024 MB | 21.55 ms (50 GB/s) | 5.98 ms (180 GB/s) | 3.6x |
| 2048 MB | 39.96 ms (54 GB/s) | 11.74 ms (183 GB/s) | 3.4x |
| 4096 MB | 79.75 ms (54 GB/s) | 23.34 ms (184 GB/s) | 3.4x |
| 8192 MB | 159.67 ms (54 GB/s) | 46.07 ms (186 GB/s) | 3.5x |

Measured on M2 Max.

Community Benchmark Script

Copy-paste into a Julia REPL to test on your device:

using Metal

println("Device: ", Metal.device().name)
println("Testing SharedStorage GPU→GPU copyto! performance\n")

sizes_mb = [16, 32, 64, 128, 256, 512, 1024, 2048]

println("| Size (MB) | Time (ms) | Bandwidth (GB/s) |")
println("|-----------|-----------|------------------|")

for size_mb in sizes_mb
    n = size_mb * 1024^2 ÷ sizeof(Float32)

    src = MtlArray{Float32, 1, Metal.SharedStorage}(rand(Float32, n))
    dst = MtlArray{Float32, 1, Metal.SharedStorage}(undef, n)
    Metal.synchronize()

    # Warmup
    for _ in 1:3
        copyto!(dst, src)
        Metal.synchronize()
    end

    # Benchmark (10 iterations)
    times = Float64[]
    for _ in 1:10
        Metal.synchronize()
        t = @elapsed begin
            copyto!(dst, src)
            Metal.synchronize()
        end
        push!(times, t)
    end

    time_ms = minimum(times) * 1000
    bytes = n * sizeof(Float32)
    bandwidth = bytes / minimum(times) / 1e9

    println("| $size_mb | $(round(time_ms, digits=2)) | $(round(bandwidth, digits=1)) |")

    src = dst = nothing
    GC.gc(false)
end

Results on M2 Max (main vs PR):

| Size (MB) | main (GB/s) | PR (GB/s) | Speedup |
|-----------|-------------|-----------|---------|
| 16 | 30 | 40 | 1.3x |
| 32 | 41 | 46 | 1.1x |
| 64 | 54 | 82 | 1.5x |
| 128 | 54 | 111 | 2.0x |
| 256 | 55 | 164 | 3.0x |
| 512 | 55 | 176 | 3.2x |
| 1024 | 54 | 182 | 3.4x |
| 2048 | 54 | 185 | 3.5x |

The main branch plateaus at ~54 GB/s due to CPU memcpy, while the PR achieves ~180-185 GB/s via GPU blit for large copies.

The 32MB threshold can be adjusted if community testing on other devices suggests a different value.

Key Points

  1. No regression for small copies: ≤32MB uses the same CPU memcpy path as before (see the quick check below)
  2. Significant speedup for large copies: 2.8-3.6x faster for >32MB
  3. CPU↔GPU paths unchanged: only GPU→GPU SharedStorage copies are affected
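
As a quick sanity check of points 1 and 2, the snippet below copies one buffer on each side of the 32MB threshold and verifies the results match. Sizes and constructors mirror the benchmark script above; this is an illustrative check, not part of the PR's test suite:

using Metal

for mib in (8, 64)   # 8 MiB takes the memcpy path, 64 MiB the blit path
    n = mib * 2^20 ÷ sizeof(Float32)
    src = MtlArray{Float32, 1, Metal.SharedStorage}(rand(Float32, n))
    dst = MtlArray{Float32, 1, Metal.SharedStorage}(undef, n)
    copyto!(dst, src)
    Metal.synchronize()
    @assert Array(dst) == Array(src)
end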

github-actions bot (Contributor) commented Dec 4, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/test/array.jl b/test/array.jl
index c8b45b3b..f55e7114 100644
--- a/test/array.jl
+++ b/test/array.jl
@@ -69,33 +69,33 @@ end
 end
 
 @testset "copyto!" begin
-    @testset "$S" for S in [Metal.PrivateStorage, Metal.SharedStorage]
-        @testset "$T" for T in [Float16, Float32, Bool, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8]
-            dim = (1000, 17, 10)
-            A = rand(T, dim)
-            mtlA = mtl(A; storage = S)
-
-            #cpu -> gpu
-            res = Metal.zeros(T, dim; storage = S)
-            copyto!(res, A)
-            @test Array(res) == Array(A)
-
-            #gpu -> cpu
-            res = zeros(T, dim)
-            copyto!(res, mtlA)
-            @test Array(res) == Array(mtlA)
+        @testset "$S" for S in [Metal.PrivateStorage, Metal.SharedStorage]
+            @testset "$T" for T in [Float16, Float32, Bool, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8]
+                dim = (1000, 17, 10)
+                A = rand(T, dim)
+                mtlA = mtl(A; storage = S)
+
+                #cpu -> gpu
+                res = Metal.zeros(T, dim; storage = S)
+                copyto!(res, A)
+                @test Array(res) == Array(A)
+
+                #gpu -> cpu
+                res = zeros(T, dim)
+                copyto!(res, mtlA)
+                @test Array(res) == Array(mtlA)
+
+                #gpu -> gpu
+                res = Metal.zeros(T, dim; storage = S)
+                copyto!(res, mtlA)
+                @test Array(res) == Array(mtlA)
+            end
 
-            #gpu -> gpu
-            res = Metal.zeros(T, dim; storage = S)
+            # Large array, only test Float32
+            A = rand(Float32, 32 * 2^20)
+            mtlA = mtl(A; storage = S)
+            res = similar(A)
             copyto!(res, mtlA)
-            @test Array(res) == Array(mtlA)
-        end
-
-        # Large array, only test Float32
-        A = rand(Float32, 32 * 2^20)
-        mtlA = mtl(A; storage = S)
-        res = similar(A)
-        copyto!(res, mtlA)
         @test Array(res) == Array(mtlA)
     end
 end

KaanKesginLW force-pushed the fix/sharedstorage-copyto-gpu-blit branch from 79ebfd3 to 2c16d2e on December 4, 2025 17:02
christiangnrd (Member) commented:

This is interesting! Would you mind sharing the script you used to benchmark so we can ask people with various devices to test the effect of this change? Ideally it'd be something anyone with an Apple-silicon Mac can copy-paste into a Julia REPL.

codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.19%. Comparing base (1c1115e) to head (16a0986).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/array.jl | 57.14% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #716      +/-   ##
==========================================
- Coverage   81.26%   81.19%   -0.08%     
==========================================
  Files          62       62              
  Lines        2899     2904       +5     
==========================================
+ Hits         2356     2358       +2     
- Misses        543      546       +3     

github-actions bot (Contributor) left a comment:

Metal Benchmarks

Details
Benchmark suite Current: 16a0986 Previous: 1c1115e Ratio
latency/precompile 25140585125 ns 24830600583 ns 1.01
latency/ttfp 2304812042 ns 2266835375 ns 1.02
latency/import 1465330542 ns 1435733583 ns 1.02
integration/metaldevrt 841312.5 ns 844458 ns 1.00
integration/byval/slices=1 1585375 ns 1559000 ns 1.02
integration/byval/slices=3 19245250 ns 8227271 ns 2.34
integration/byval/reference 1581875 ns 1552167 ns 1.02
integration/byval/slices=2 2718208 ns 2584417 ns 1.05
kernel/indexing 495333 ns 592729 ns 0.84
kernel/indexing_checked 487583 ns 615500 ns 0.79
kernel/launch 11416.5 ns 11750 ns 0.97
kernel/rand 517791 ns 554042 ns 0.93
array/construct 6167 ns 6167 ns 1
array/broadcast 553854 ns 592958 ns 0.93
array/random/randn/Float32 921416.5 ns 795792 ns 1.16
array/random/randn!/Float32 583375 ns 614917 ns 0.95
array/random/rand!/Int64 540209 ns 550625 ns 0.98
array/random/rand!/Float32 549291.5 ns 581916 ns 0.94
array/random/rand/Int64 879729 ns 751542 ns 1.17
array/random/rand/Float32 837042 ns 672583 ns 1.24
array/accumulate/Int64/1d 1304917 ns 1252916.5 ns 1.04
array/accumulate/Int64/dims=1 1859208.5 ns 1802021 ns 1.03
array/accumulate/Int64/dims=2 2207812.5 ns 2126375 ns 1.04
array/accumulate/Int64/dims=1L 12272542 ns 11668167 ns 1.05
array/accumulate/Int64/dims=2L 9717208 ns 9627062.5 ns 1.01
array/accumulate/Float32/1d 1080292 ns 1118979 ns 0.97
array/accumulate/Float32/dims=1 1601416 ns 1522875 ns 1.05
array/accumulate/Float32/dims=2 1964917 ns 1825895.5 ns 1.08
array/accumulate/Float32/dims=1L 10341500 ns 10050958 ns 1.03
array/accumulate/Float32/dims=2L 7416666.5 ns 7212875 ns 1.03
array/reductions/reduce/Int64/1d 1301187 ns 1530166 ns 0.85
array/reductions/reduce/Int64/dims=1 1119208.5 ns 1075500 ns 1.04
array/reductions/reduce/Int64/dims=2 1160792 ns 1170250 ns 0.99
array/reductions/reduce/Int64/dims=1L 2034958.5 ns 2007666.5 ns 1.01
array/reductions/reduce/Int64/dims=2L 3929458 ns 4172166 ns 0.94
array/reductions/reduce/Float32/1d 737458 ns 1034979 ns 0.71
array/reductions/reduce/Float32/dims=1 804667 ns 807791 ns 1.00
array/reductions/reduce/Float32/dims=2 832750 ns 841792 ns 0.99
array/reductions/reduce/Float32/dims=1L 1336000 ns 1306334 ns 1.02
array/reductions/reduce/Float32/dims=2L 1812021 ns 1780833 ns 1.02
array/reductions/mapreduce/Int64/1d 1328937.5 ns 1549333 ns 0.86
array/reductions/mapreduce/Int64/dims=1 1113000 ns 1103709 ns 1.01
array/reductions/mapreduce/Int64/dims=2 1160959 ns 1335541 ns 0.87
array/reductions/mapreduce/Int64/dims=1L 2008667 ns 2016417 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 3593334 ns 3600500 ns 1.00
array/reductions/mapreduce/Float32/1d 802709 ns 1006834 ns 0.80
array/reductions/mapreduce/Float32/dims=1 812312.5 ns 817916.5 ns 0.99
array/reductions/mapreduce/Float32/dims=2 831542 ns 848917 ns 0.98
array/reductions/mapreduce/Float32/dims=1L 1351541.5 ns 1296750 ns 1.04
array/reductions/mapreduce/Float32/dims=2L 1794667 ns 1795584 ns 1.00
array/private/copyto!/gpu_to_gpu 549208 ns 638708 ns 0.86
array/private/copyto!/cpu_to_gpu 753542 ns 790458.5 ns 0.95
array/private/copyto!/gpu_to_cpu 687749.5 ns 804083 ns 0.86
array/private/iteration/findall/int 1577625 ns 1618458.5 ns 0.97
array/private/iteration/findall/bool 1488854 ns 1439479.5 ns 1.03
array/private/iteration/findfirst/int 2101041 ns 2030729 ns 1.03
array/private/iteration/findfirst/bool 2030541 ns 2006541 ns 1.01
array/private/iteration/scalar 3399374.5 ns 4026208.5 ns 0.84
array/private/iteration/logical 2678166.5 ns 2529542 ns 1.06
array/private/iteration/findmin/1d 2271583 ns 2190979.5 ns 1.04
array/private/iteration/findmin/2d 1552417 ns 1497562.5 ns 1.04
array/private/copy 819750 ns 588562.5 ns 1.39
array/shared/copyto!/gpu_to_gpu 84417 ns 83375 ns 1.01
array/shared/copyto!/cpu_to_gpu 83833 ns 81917 ns 1.02
array/shared/copyto!/gpu_to_cpu 83916 ns 82292 ns 1.02
array/shared/iteration/findall/int 1571000 ns 1585625 ns 0.99
array/shared/iteration/findall/bool 1488208.5 ns 1438375 ns 1.03
array/shared/iteration/findfirst/int 1707437.5 ns 1653917 ns 1.03
array/shared/iteration/findfirst/bool 1640229.5 ns 1613416 ns 1.02
array/shared/iteration/scalar 207125 ns 201042 ns 1.03
array/shared/iteration/logical 2415812.5 ns 2304458.5 ns 1.05
array/shared/iteration/findmin/1d 1897000 ns 1799125 ns 1.05
array/shared/iteration/findmin/2d 1556958 ns 1499000 ns 1.04
array/shared/copy 216750 ns 243292 ns 0.89
array/permutedims/4d 2480041 ns 2349854 ns 1.06
array/permutedims/2d 1191125 ns 1132417 ns 1.05
array/permutedims/3d 1768833.5 ns 1647166 ns 1.07
metal/synchronization/stream 19292 ns 18833 ns 1.02
metal/synchronization/context 20292 ns 19834 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

KaanKesginLW (Contributor, Author) commented Dec 4, 2025

Copy-paste benchmark script and results table added to the PR description above.

christiangnrd (Member) commented:

I asked in the Slack and the results seem to show that 32MiB is a reasonable threshold.

I also added a test.

Copy-paste script (to be run from a version of Metal.jl that doesn't include this optimization)
using Printf, Metal, Random; begin

println("Device: $(Metal.device().name) ($(Metal.num_gpu_cores()) cores)")
println("Testing SharedStorage GPU→GPU copyto! performance\n")

sizes_mb = [16, 32, 64, 128, 256, 512, 1024, 2048]
if Sys.total_memory() >= 16*2^30
    push!(sizes_mb, 4096)
end
if Sys.total_memory() >= 32*2^30
    push!(sizes_mb, 8192)
end

println("| Size | CPU Bandwidth | GPU Bandwidth |")
println("| (MB) |     (GB/s)    |     (GB/s)    |")
println("|------|---------------|---------------|")

for size_mb in sizes_mb
    n = size_mb * 1024^2 ÷ sizeof(Float32)

    src = rand!(MtlArray{Float32, 1, Metal.SharedStorage}(undef, n))
    dst = MtlArray{Float32, 1, Metal.SharedStorage}(undef, n)
    Metal.synchronize()

    # Warmup
    for _ in 1:3
        copyto!(dst, src)
        Metal.synchronize()
    end

    # Benchmark (10 iterations)
    cpu_times = Float64[]
    for _ in 1:10
        Metal.synchronize()
        t = @elapsed begin
            copyto!(dst, src)
            Metal.synchronize()
        end
        GC.gc(false)
        push!(cpu_times, t)
    end

    cpu_time_ms = minimum(cpu_times) * 1000
    bytes = n * sizeof(Float32) * 2  # read + write
    cpu_bandwidth = bytes / minimum(cpu_times) / 1e9

    src = dst = nothing
    GC.gc(true)

    src = rand!(MtlArray{Float32, 1, Metal.PrivateStorage}(undef, n))
    dst = MtlArray{Float32, 1, Metal.PrivateStorage}(undef, n)
    Metal.synchronize()

    # Warmup
    for _ in 1:3
        copyto!(dst, src)
        Metal.synchronize()
    end

    # Benchmark (10 iterations)
    gpu_times = Float64[]
    for _ in 1:10
        Metal.synchronize()
        t = @elapsed begin
            copyto!(dst, src)
            Metal.synchronize()
        end
        GC.gc(false)
        push!(gpu_times, t)
    end

    gpu_time_ms = minimum(gpu_times) * 1000
    gpu_bandwidth = bytes / minimum(gpu_times) / 1e9

    src = dst = nothing
    GC.gc(true)

    @printf "| %4d | %13.1f | %13.1f |\n" size_mb cpu_bandwidth gpu_bandwidth
end
end

christiangnrd force-pushed the fix/sharedstorage-copyto-gpu-blit branch from 43e4269 to 7559e52 on December 8, 2025 02:53
christiangnrd (Member) left a comment:

I think this is good to go! @maleadt do you think this is a reasonable heuristic?

christiangnrd added the performance ("Gotta go fast.") label on Dec 8, 2025
maleadt (Member) left a comment:

LGTM (except for the "Factored out for testing" that I presume still needs to be updated).

christiangnrd force-pushed the fix/sharedstorage-copyto-gpu-blit branch from 7559e52 to 2b09aee on December 11, 2025 13:20
KaanKesginLW and others added 4 commits December 11, 2025 12:30
SharedStorage GPU->GPU copies previously always used CPU memcpy, which is
slower than GPU blit for large buffers. This commit adds a size-based
heuristic:

- Small copies (<=32MB): Use CPU memcpy (avoids GPU command overhead)
- Large copies (>32MB): Use GPU blit (2-3x faster bandwidth)

Benchmark results:
- 64 MB:   2.2x faster (1.28ms -> 0.58ms)
- 256 MB:  3.1x faster (5.03ms -> 1.63ms)
- 1024 MB: 3.4x faster (20.2ms -> 6.0ms)

The 32MB threshold was determined empirically as the crossover point
where GPU blit becomes faster than CPU memcpy.
christiangnrd force-pushed the fix/sharedstorage-copyto-gpu-blit branch from 2b09aee to 16a0986 on December 11, 2025 16:30
christiangnrd merged commit 67d668c into JuliaGPU:main on Dec 11, 2025
17 checks passed