Add distributed reductions #4497


Merged · 9 commits into main from ncc-ss/distributed-reduction · May 22, 2025

Conversation

@navidcy (Member) commented May 12, 2025

This PR adds reductions for distributed fields.

Once merged, this should allow #4470 to go through.
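
For context, here is a rough sketch of the kind of user-facing code this enables. The architecture and grid setup follow Oceananigans' distributed examples, but the exact signatures here are assumed, not taken from this PR:

using MPI, Oceananigans
MPI.Init()

# Split the domain across ranks; Partition(2) partitions the x-dimension over 2 ranks.
arch = Distributed(CPU(), partition=Partition(2))
grid = RectilinearGrid(arch, size=(16, 16, 16), extent=(1, 1, 1))

c = CenterField(grid)
set!(c, (x, y, z) -> x + y + z)

# With this PR, reductions over distributed fields work even when the
# reduced dimension is partitioned across ranks.
@show maximum(c)

Run it under MPI, e.g. mpiexec -n 2 julia --project script.jl.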

@navidcy added the "distributed 🕸️ Our plan for total cluster domination" label on May 12, 2025
Comment on lines 108 to 117
function partition_dimensions(arch::Distributed)
    R = ranks(arch)
    dims = []
    for r in eachindex(R)
        if R[r] > 1
            push!(dims, r)
        end
    end
    return tuple(dims...)
end

Collaborator:

maybe we should find a more elegant way to compute this?

Member:

also docstring? what is it doing?

Collaborator:

checking which dimensions are "partitioned"
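
For reference, a documented and slightly more compact equivalent could look like this; a sketch only, not necessarily what was merged:

"""
    partition_dimensions(arch::Distributed)

Return a tuple of the dimensions along which the domain is partitioned
across more than one rank.
"""
function partition_dimensions(arch::Distributed)
    R = ranks(arch)
    return Tuple(d for d in eachindex(R) if R[d] > 1)
end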

@navidcy (Member, Author) commented May 12, 2025

@glwagner (Member):

How do you reduce across a dimension that is partitioned? You need to gather at the end (and then scatter to all devices)?

@simone-silvestri (Collaborator):

we compute the reduction locally and then allreduce through all devices
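
For illustration, the pattern described here, reduce locally and then combine across ranks, looks like this in plain MPI.jl (an illustrative sketch, not the PR's code):

using MPI
MPI.Init()
comm = MPI.COMM_WORLD

local_data = rand(4, 4)                         # each rank's local piece of the field
local_sum  = sum(local_data)                    # reduce locally first...
global_sum = MPI.Allreduce(local_sum, +, comm)  # ...then allreduce so every rank holds the result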

@glwagner (Member):

we compute the reduction locally and then allreduce through all devices

can you point to those lines?

@simone-silvestri (Collaborator) commented May 13, 2025

the reduction is here

return maybe_all_reduce!($(all_reduce_op), r)

which is this function over here
function maybe_all_reduce!(op, f::ReducedAbstractField)
    reduced_dims = reduced_dimensions(f)
    partition_dims = partition_dimensions(f)
    if any([dim ∈ partition_dims for dim in reduced_dims])
        all_reduce!(op, interior(f), architecture(f))
    end
    return f
end

It has to be done only if at least one of the dimensions we are reducing over is partitioned across different ranks; otherwise it is not necessary.
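
A minimal illustration of that condition, with made-up dimension tuples:

# The all-reduce is only required when a reduced dimension is also a partitioned one.
reduced_dims   = (3,)     # e.g. reducing over z
partition_dims = (1,)     # e.g. the domain is split along x
any(dim -> dim in partition_dims, reduced_dims)   # false: no communication needed

reduced_dims = (1, 2)     # reducing over x and y instead
any(dim -> dim in partition_dims, reduced_dims)   # true: all_reduce! is required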

@glwagner (Member):

ok nice, for some reason I didn't see that

@navidcy (Member, Author) commented May 16, 2025

@simone-silvestri any idea why this comes up?
https://buildkite.com/clima/oceananigans-distributed/builds/7902#0196ca9b-cf01-44ee-bc78-8b4226e78e05/207-1426

  ArgumentError: Illegal conversion of a CUDA.DeviceMemory to a Ptr{Float64}

@simone-silvestri (Collaborator) commented May 20, 2025

@simone-silvestri any idea why this comes up? https://buildkite.com/clima/oceananigans-distributed/builds/7902#0196ca9b-cf01-44ee-bc78-8b4226e78e05/207-1426

  ArgumentError: Illegal conversion of a CUDA.DeviceMemory to a Ptr{Float64}

Looks like using parent(field) rather than interior(field) works in the reduction. In hindsight this is quite clear, since CUDA-aware MPI does not support passing non-contiguous data.
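
A quick way to see why, using a plain Julia array as a stand-in for the field's data (illustrative only):

A = zeros(10, 10, 10)
V = view(A, 2:9, 2:9, 2:9)   # like interior(field): a strided, non-contiguous view
Base.iscontiguous(V)          # false, and MPI wants a contiguous buffer
parent(V) === A               # true: parent gives the full contiguous storage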

@navidcy merged commit dd118fb into main on May 22, 2025
58 checks passed
@navidcy deleted the ncc-ss/distributed-reduction branch on May 22, 2025 at 03:41