Move CUDA stuff to an extension #4499

michel2323 · 2025-05-12T14:39:14Z

This PR isolates CUDA into src/arch_cuda.jl. This removes any direct CUDA calls from the remaining Oceananigans code base. That file can serve either as a template for a new GPU architecture or as the basis for a future CUDA extension. @vchuravy

src/arch_cuda.jl

glwagner · 2025-05-12T15:10:26Z

Possibly, we should simply implement a CUDA extension in this PR with appropriate organization of the code and get on with the breaking change!

tl;dr: after this is merged, anybody doing computations on NVIDIA GPUs has to write

using Oceananigans
using CUDA

glwagner · 2025-05-12T15:10:37Z

@simone-silvestri curious to hear your thoughts

simone-silvestri · 2025-05-12T17:05:14Z

I think it's a good idea. It provides templates to add new architectures and makes the code completely architecture agnostic. The extra using CUDA is a small price to pay.
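A minimal sketch of the pattern described above, assuming nothing about Oceananigans' real types: the core code defines generic functions and calls them through an architecture argument, and a backend (here standing in for an extension) only adds methods. ToyCPU, ToyGPU, and pick_device are hypothetical names made up for illustration; ndevices and device! mirror the names in this PR but are defined locally here.

```julia
# All types below are hypothetical stand-ins, not Oceananigans' actual architectures.
abstract type AbstractToyArchitecture end

struct ToyCPU <: AbstractToyArchitecture end

struct ToyGPU <: AbstractToyArchitecture
    ndev::Int   # number of devices this toy backend pretends to see
end

# "Core package": generic functions with CPU methods only.
ndevices(::ToyCPU) = 1
device!(::ToyCPU, i) = nothing   # nothing to switch on the CPU

# "Extension": adds methods for its own architecture type.
ndevices(arch::ToyGPU) = arch.ndev
function device!(arch::ToyGPU, i)
    0 <= i < arch.ndev || throw(ArgumentError("device id $i out of range"))
    return i   # a real backend would switch the active device here
end

# Architecture-agnostic caller: no backend-specific calls appear here.
pick_device(arch, rank) = device!(arch, rank % ndevices(arch))
```

With this layout, supporting another backend means adding a new type and two methods; pick_device itself does not change.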

src/Utils/versioninfo.jl

glwagner · 2025-05-24T17:23:03Z

@michel2323 let us know when this is ready for prime time

glwagner · 2025-06-11T19:40:54Z

src/DistributedComputations/distributed_architectures.jl

-        isnothing(devices) ? device!(node_rank % ndevices()) : device!(devices[node_rank+1])
+        isnothing(devices) ? device!(child_architecture, node_rank % ndevices(child_architecture)) : device!(child_architecture, devices[node_rank+1])


@simone-silvestri
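For readers following the diff above, the selection logic can be sketched in isolation. This is a stand-alone illustration, not the Oceananigans implementation: select_device_id is a hypothetical helper, and it assumes node_rank is a 0-based rank within the node, as the node_rank+1 index in the diff suggests.

```julia
# Hypothetical helper, not part of Oceananigans: returns the device id that a
# process with 0-based local rank `node_rank` would select.
function select_device_id(node_rank::Integer, ndev::Integer, devices=nothing)
    if isnothing(devices)
        # No explicit list: wrap ranks around the devices visible on the node.
        return node_rank % ndev
    else
        # Explicit list: Julia indexing is 1-based, hence the +1.
        return devices[node_rank + 1]
    end
end

select_device_id(5, 4)           # rank 5 on a 4-device node wraps around to id 1
select_device_id(1, 4, (2, 3))   # explicit list: rank 1 takes devices[2], i.e. 3
```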

src/Architectures.jl

michel2323 · 2025-06-11T20:34:01Z

@glwagner For the failing tests, we have 4 in total

oceananigans-distributed: I think it can't find the commit because it's run from a fork.
cpu-turbulence-closure-tests: Bus error. No idea man.
gpu-multi-region-tests: That's the hard one I have to sit on. I dived into it and it's definitely my changes. However, I don't understand what is different at runtime that triggers this.
Documentation something.

michel2323 · 2025-06-12T15:59:37Z

I think the main problem is that I haven't figured out what getdevice actually means across all objects where it is implemented. In particular, there are a bunch of getdevice(somearray) calls. @vchuravy What would that look like with KA? Do the arrays know which device they are on?

vchuravy · 2025-06-12T17:31:29Z

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

michel2323 · 2025-06-12T17:53:32Z

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

@simone-silvestri @glwagner How do you want to proceed with these? Can this be rewritten to only use stuff from Architectures?

https://github.com/michel2323/Oceananigans.jl/blob/2e5f75498e8fa7896a91241351c0e2bac9904adc/src/Utils/multi_region_transformation.jl#L54-L70


michel2323 · 2025-06-19T18:23:02Z

@michel2323 let us know when this is ready for prime time

Finally ready. The documentation breaks due to something unrelated I think. buildkite/oceananigans-distributed isn't run because the code comes from a fork.

vchuravy · 2025-06-25T06:48:16Z

src/Architectures.jl

 device(a::GPU) = a.device
+device!(::CPU, i) = KA.device!(CPU(), i+1)
+device!(::CPU) = nothing


Single arg device!?

This is due to how the CPU was handled as a device in the multi-region code (e.g., here and here). As discussed with @glwagner, I tried not to touch the multi-region code. This will be fixed in an upcoming PR together with a consistent terminology for backend, arch, and device.

@vchuravy we may want to chat, but as far as I can tell there is basically an inconsistency in the concept of "device" in Oceananigans (we use the backend and "device id" interchangeably). It's a mess and we need to clean this up; I was thinking we would do this in a subsequent PR, but another option is to do it prior to this PR...

@michel2323 I don't see a call to device! in the code you linked. There is a call to switch_device! -- is that a synonym for device!?

vchuravy · 2025-06-25T06:49:58Z

src/Utils/multi_region_transformation.jl

+            break
+        end
+    end
+    if backend isa Nothing


Suggested change

-    if backend isa Nothing
+    if backend === nothing
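A small stand-alone sketch of the check under discussion. first_backend is a hypothetical function loosely modeled on the loop in the diff (search, break on the first hit, then test for nothing); it is not the code in multi_region_transformation.jl. Both spellings give the same answer for any value, so the suggestion reads as a matter of convention.

```julia
# Hypothetical stand-in for the loop in the diff: return the first entry that
# is not `nothing`, or `nothing` when every entry is.
function first_backend(candidates)
    backend = nothing
    for c in candidates
        if c !== nothing
            backend = c
            break
        end
    end
    return backend
end

backend = first_backend((nothing, nothing))

backend === nothing   # identity comparison against the singleton `nothing`
backend isa Nothing   # type check; agrees because `Nothing` has one instance
isnothing(backend)    # Base helper with the same meaning
```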

vchuravy · 2025-06-25T06:50:41Z

src/Utils/multi_region_transformation.jl

    regional_return_values = Vector(undef, length(devs))
    for (r, dev) in enumerate(devs)
-        switch_device!(dev)
+        # switch_device!(dev)


You still need to switch devices don't you?

We plan to discontinue support for multi-device MultiRegion (e.g., move towards the requirement that all multi-device code uses MPI). I don't think multi-device functionality is tested either...

vchuravy · 2025-06-25T06:51:50Z

src/Utils/versioninfo.jl

+    if isdefined(Main, :CUDA)
+        try
+            return versioninfo_with_gpu(GPU())
+        catch
+            return "No GPU device found."
+        end
+    else
+        return ""


JuliaGPU/KernelAbstractions.jl#617 (comment)

glwagner · 2025-06-26T16:47:15Z

Seems like this is getting close which is exciting!

src/MultiRegion/multi_region_utils.jl

src/Fields/set!.jl

ext/OceananigansAMDGPUExt.jl

ext/OceananigansCUDAExt.jl

glwagner · 2025-06-27T14:23:22Z

@siddharthabishnu can you take a look at the cubed sphere / multi region stuff here, it will affect you

michel2323 mentioned this pull request May 12, 2025

add KA.get_backend(dev) JuliaGPU/CUDA.jl#2779

Closed

vchuravy reviewed May 12, 2025

src/arch_cuda.jl

navidcy added the GPU 👾 Where Oceananigans gets its powers from label May 13, 2025

michel2323 force-pushed the ms/ka branch 2 times, most recently from 69eb545 to 59a441d Compare May 16, 2025 16:08

vchuravy reviewed May 19, 2025

src/Utils/versioninfo.jl

michel2323 force-pushed the ms/ka branch from bad037c to 56820c7 Compare May 19, 2025 14:52

michel2323 force-pushed the ms/ka branch 2 times, most recently from 4bbb99e to 3e84145 Compare June 5, 2025 18:04

glwagner mentioned this pull request Jun 10, 2025

Extend FFTBasedPoissonSolver to work on AMDGPU #4593

Open

glwagner reviewed Jun 11, 2025

src/Architectures.jl

michel2323 mentioned this pull request Jun 12, 2025

Distributed working with AMD issue #4597

Open

michel2323 force-pushed the ms/ka branch from fc45a83 to 2e5f754 Compare June 12, 2025 15:27

michel2323 force-pushed the ms/ka branch from 266435e to 5e68a49 Compare June 18, 2025 14:14

michel2323 added 4 commits June 19, 2025 10:04

Isolate CUDA

0eeb10f

Create a CUDA extension

146bbec

Add basic CUDA extension test

d000990

Rebase and various fixes

44b312d

* versioninfo * dispatch fixes * Enable tests * Field broadcast fix

michel2323 force-pushed the ms/ka branch from 22126a0 to ae48a9c Compare June 19, 2025 15:05

Fix docs

9b6104b

michel2323 requested a review from glwagner June 19, 2025 18:23

backend -> device for now

36f28c0

vchuravy reviewed Jun 25, 2025

vchuravy approved these changes Jun 25, 2025

CI, why did I comment out this line?

150bd0f

glwagner added 2 commits June 27, 2025 08:10

Update OceananigansAMDGPUExt.jl

85f154e

Update OceananigansCUDAExt.jl

f862d5f