Add distributed reductions #4497


Merged · 9 commits into main from ncc-ss/distributed-reduction · May 22, 2025

Conversation

@navidcy (Member) commented May 12, 2025

This PR adds reductions for distributed fields.

Once merged, this should allow #4470 to go through.
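
For context, here is a rough sketch of the kind of user-facing code this enables. The architecture and grid setup follow Oceananigans' distributed examples, but the exact signatures here are assumed, not taken from this PR:

using MPI, Oceananigans
MPI.Init()

# Split the domain across ranks; Partition(2) partitions the x-dimension over 2 ranks.
arch = Distributed(CPU(), partition=Partition(2))
grid = RectilinearGrid(arch, size=(16, 16, 16), extent=(1, 1, 1))

c = CenterField(grid)
set!(c, (x, y, z) -> x + y + z)

# With this PR, reductions over distributed fields work even when the
# reduced dimension is partitioned across ranks.
@show maximum(c)

Run it under MPI, e.g. mpiexec -n 2 julia --project script.jl.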

@navidcy added the "distributed 🕸️ Our plan for total cluster domination" label on May 12, 2025
Comment on lines 108 to 117
function partition_dimensions(arch::Distributed)
    R = ranks(arch)
    dims = []
    for r in eachindex(R)
        if R[r] > 1
            push!(dims, r)
        end
    end
    return tuple(dims...)
end

Collaborator:

maybe we should find a more elegant way to compute this?

Member:

also docstring? what is it doing?

Collaborator:

checking which dimensions are "partitioned"
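
For reference, a documented and slightly more compact equivalent could look like this; a sketch only, not necessarily what was merged:

"""
    partition_dimensions(arch::Distributed)

Return a tuple of the dimensions along which the domain is partitioned
across more than one rank.
"""
function partition_dimensions(arch::Distributed)
    R = ranks(arch)
    return Tuple(d for d in eachindex(R) if R[d] > 1)
end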

@navidcy (Member, Author) commented May 12, 2025

@glwagner (Member):

How do you reduce across a dimension that is partitioned? You need to gather at the end (and then scatter to all devices)?

@simone-silvestri (Collaborator):

we compute the reduction locally and then allreduce through all devices
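
For illustration, the pattern described here, reduce locally and then combine across ranks, looks like this in plain MPI.jl (an illustrative sketch, not the PR's code):

using MPI
MPI.Init()
comm = MPI.COMM_WORLD

local_data = rand(4, 4)                         # each rank's local piece of the field
local_sum  = sum(local_data)                    # reduce locally first...
global_sum = MPI.Allreduce(local_sum, +, comm)  # ...then allreduce so every rank holds the result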

@glwagner (Member):

we compute the reduction locally and then allreduce through all devices

can you point to those lines?

@simone-silvestri (Collaborator) commented May 13, 2025

the reduction is here

return maybe_all_reduce!($(all_reduce_op), r)

which is this function over here
function maybe_all_reduce!(op, f::ReducedAbstractField)
    reduced_dims = reduced_dimensions(f)
    partition_dims = partition_dimensions(f)
    if any([dim ∈ partition_dims for dim in reduced_dims])
        all_reduce!(op, interior(f), architecture(f))
    end
    return f
end

It has to be done only if at least one of the dimensions we are reducing over is partitioned across different ranks; otherwise it is not necessary.
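
A minimal illustration of that condition, with made-up dimension tuples:

# The all-reduce is only required when a reduced dimension is also a partitioned one.
reduced_dims   = (3,)     # e.g. reducing over z
partition_dims = (1,)     # e.g. the domain is split along x
any(dim -> dim in partition_dims, reduced_dims)   # false: no communication needed

reduced_dims = (1, 2)     # reducing over x and y instead
any(dim -> dim in partition_dims, reduced_dims)   # true: all_reduce! is required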

@glwagner (Member):

ok nice, for some reason I didn't see that

@navidcy (Member, Author) commented May 16, 2025

@simone-silvestri any idea why this comes up?
https://buildkite.com/clima/oceananigans-distributed/builds/7902#0196ca9b-cf01-44ee-bc78-8b4226e78e05/207-1426

  ArgumentError: Illegal conversion of a CUDA.DeviceMemory to a Ptr{Float64}

@simone-silvestri (Collaborator) commented May 20, 2025

@simone-silvestri any idea why this comes up? https://buildkite.com/clima/oceananigans-distributed/builds/7902#0196ca9b-cf01-44ee-bc78-8b4226e78e05/207-1426

  ArgumentError: Illegal conversion of a CUDA.DeviceMemory to a Ptr{Float64}

Looks like using parent(field) rather than interior(field) works in the reduction. In hindsight this is quite clear, since CUDA-aware MPI does not support passing non-contiguous data.
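
A quick way to see why, using a plain Julia array as a stand-in for the field's data (illustrative only):

A = zeros(10, 10, 10)
V = view(A, 2:9, 2:9, 2:9)   # like interior(field): a strided, non-contiguous view
Base.iscontiguous(V)          # false, and MPI wants a contiguous buffer
parent(V) === A               # true: parent gives the full contiguous storage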

@navidcy merged commit dd118fb into main on May 22, 2025
58 checks passed
@navidcy deleted the ncc-ss/distributed-reduction branch on May 22, 2025 at 03:41