Skip to content

Move CUDA stuff to an extension #4499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Conversation

michel2323
Copy link
Collaborator

@michel2323 michel2323 commented May 12, 2025

This PR isolates CUDA into src/arch_cuda.jl. This removes any direct CUDA calls in the remaining Oceananigans code base. That feel can either serve as a template for a new GPU architecture or for a future CUDA extension. @vchuravy

@glwagner
Copy link
Member

Possibly, we should simply implement a CUDA extension in this PR with appropriate organization of the code and get on with the breaking change!

tl;dr then after this is merged, anybody doing computations on nvidia GPU has to write

using Oceananigans
using CUDA

@glwagner
Copy link
Member

@simone-silvestri curious to hear your thoughts

@simone-silvestri
Copy link
Collaborator

I think it's a good idea. It provides templates to add new architectures and makes the code completely architecture agnostic. the extra using CUDA is a small price to pay.

@navidcy navidcy added the GPU 👾 Where Oceananigans gets its powers from label May 13, 2025
@michel2323 michel2323 force-pushed the ms/ka branch 2 times, most recently from 69eb545 to 59a441d Compare May 16, 2025 16:08
@glwagner
Copy link
Member

@michel2323 let us know when this is ready for prime time

Comment on lines -288 to +287
isnothing(devices) ? device!(node_rank % ndevices()) : device!(devices[node_rank+1])
isnothing(devices) ? device!(child_architecture, node_rank % ndevices(child_architecture)) : device!(child_architecture, devices[node_rank+1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michel2323
Copy link
Collaborator Author

michel2323 commented Jun 11, 2025

@glwagner For the failing tests, we have 4 in total

  • oceanangians-distributed: I think it can't find the commit because it's run from a fork.
  • cpu-turbulence-closure-tests: Bus error. No idea man.
  • gpu-multi-region-tests: That's the hard one I have to sit on. I dived into it and it's definitely my changes. However, I don't understand what is different at runtime that triggers this.
  • Documentation something.

@michel2323
Copy link
Collaborator Author

I think the main problem is that I haven't figured out what getdevice actually means across all objects where it is implemented. In particular, there's a bunch of getdevice(somearray). @vchuravy How would that look like with KA? Do the arrays know on which device they are?

@vchuravy
Copy link
Collaborator

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

@michel2323
Copy link
Collaborator Author

michel2323 commented Jun 12, 2025

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

@simone-silvestri @glwagner How do you want to proceed with these? Can this be rewritten to only use stuff from Architectures?

https://github.com/michel2323/Oceananigans.jl/blob/2e5f75498e8fa7896a91241351c0e2bac9904adc/src/Utils/multi_region_transformation.jl#L54-L70

@michel2323
Copy link
Collaborator Author

@michel2323 let us know when this is ready for prime time

Finally ready. The documentation breaks due to something unrelated I think. buildkite/oceananigans-distributed isn't run because the code comes from a fork.

@michel2323 michel2323 requested a review from glwagner June 19, 2025 18:23
device(a::GPU) = a.device
device!(::CPU, i) = KA.device!(CPU(), i+1)
device!(::CPU) = nothing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single arg device!?

Copy link
Collaborator Author

@michel2323 michel2323 Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to how the CPU was handled as a device in the multi-region code(e.g., here and here). As discussed with @glwagner, I tried not to touch the multi-region code. This will be fixed in an upcoming PR together with a consistent terminology for backend, arch, and device.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vchuravy we may want to chat, but as far as I can tell there is basically an inconsistency in the concept of "device" in Oceananigans (we use the backend and "device id" interchangeably). It's a mess and we need to clean this up; I was thinking we would do this in a subsequent PR, but another option is to do it prior to this PR...

@michel2323 I don't see a call to device! in the code you linked. There is a call to switch_device! -- is that a synonym for device!?

break
end
end
if backend isa Nothing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if backend isa Nothing
if backend === nothing

regional_return_values = Vector(undef, length(devs))
for (r, dev) in enumerate(devs)
switch_device!(dev)
# switch_device!(dev)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to switch devices don't you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we plan to discontinue support for multi-device MultiRegion (eg move towards the requirement that all multi-device code uses MPI). I don't think multi-device functionality is tested either...

Comment on lines +8 to +15
if isdefined(Main, :CUDA)
try
return versioninfo_with_gpu(GPU())
catch
return "No GPU device found."
end
else
return ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glwagner
Copy link
Member

Seems like this is getting close which is exciting!

@glwagner glwagner changed the title Isolate CUDA Move CUDA stuff to an extension Jun 27, 2025
@glwagner
Copy link
Member

@siddharthabishnu can you take a look at the cubed sphere / multi region stuff here, it will affect you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GPU 👾 Where Oceananigans gets its powers from
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants