Scaling of a ClimaOcean simulation on multiple GPUs
#665
Replies: 3 comments
-
I kick off this thread with 3 reports: the 1-GPU (serial) and the 2- and 3-GPU (parallel) cases on Tartarus.
The scaling is rather clean, with good communication-computation overlap even though no action has been taken to load balance the domain, leading to a strong scaling efficiency of 94.5% going from 1 to 2 GPUs and 92.6% from 1 to 3.
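For reference, these numbers follow the usual strong-scaling definition, E(N) = T(1) / (N × T(N)); here is a minimal sketch with placeholder wall times chosen only to reproduce the quoted percentages (not the actual Tartarus measurements):

```julia
# Strong scaling efficiency: time on 1 GPU divided by (N × time on N GPUs)
strong_scaling_efficiency(t1, tN, N) = t1 / (N * tN)

# Hypothetical wall times (seconds), not measured values
t1, t2, t3 = 100.0, 52.9, 36.0
strong_scaling_efficiency(t1, t2, 2)   # ≈ 0.945 → 94.5% on 2 GPUs
strong_scaling_efficiency(t1, t3, 3)   # ≈ 0.926 → 92.6% on 3 GPUs
```

System details on Tartarus: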
```
julia> MPI.versioninfo()
MPIPreferences:
binary: system
abi: OpenMPI
libmpi: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
mpiexec: mpiexec
Package versions
MPI.jl: 0.20.23
MPIPreferences.jl: 0.1.11
Library information:
libmpi: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
libmpi dlpath: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
MPI version: 3.1.0
Library version:
Open MPI v4.1.5, package: Open MPI ssilvest@tartarus Distribution, ident: 4.1.5, repo rev: v4.1.5, Feb 23, 2023
MPI launcher: mpiexec
MPI launcher path: /usr/bin/mpiexec
```

```
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 12.9, artifact installation
- driver 535.247.1 for 12.2
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+535.247.1
Julia packages:
- CUDA: 5.9.1
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.2.2+0
- CUDA_Runtime_jll: 0.19.2+0
Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7
Environment:
- JULIA_CUDA_MEMORY_POOL: none
4 devices:
0: NVIDIA TITAN V (sm_70, 8.242 GiB / 12.000 GiB available)
1: NVIDIA TITAN V (sm_70, 8.106 GiB / 12.000 GiB available)
2: NVIDIA TITAN V (sm_70, 4.231 GiB / 12.000 GiB available)
3: NVIDIA TITAN V (sm_70, 11.770 GiB / 12.000 GiB available)
```

@taimoorsohail waiting for your reports
-
Here are the 1-GPU (serial) and the 2- and 3-GPU (parallel) cases on Gadi.
The scaling efficiencies are far lower on this machine... The system I am running on has the following specs:

```
julia> MPI.versioninfo()
MPIPreferences:
binary: system
abi: OpenMPI
libmpi: libmpi
mpiexec: mpiexec
Package versions
MPI.jl: 0.20.23
MPIPreferences.jl: 0.1.11
Library information:
libmpi: libmpi
libmpi dlpath: /apps/openmpi/4.1.7/lib/libmpi.so
MPI version: 3.1.0
Library version:
Open MPI v4.1.7, package: Open MPI apps@gadi-cpu-clx-1415.gadi.nci.org.au Distribution, ident: 4.1.7, repo rev: v4.1.7, Oct 31, 2024
MPI launcher: mpiexec
MPI launcher path: /apps/openmpi/4.1.7/bin/mpiexec
```

```
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 12.6, local installation
- driver 575.57.8 for 12.9
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+575.57.8
Julia packages:
- CUDA: 5.8.5
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0
Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7
Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false
Preferences:
- CUDA_Runtime_jll.version: 12.6
- CUDA_Runtime_jll.local: true
3 devices:
0: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
1: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
2: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
```
-
With @taimoorsohail we looked quite thoroughly at the profiles and concluded that the problem is not necessarily MPI, but rather the fact that the ranks proceed asynchronously while the MPI step requires synchronization between neighboring ranks. This issue might be masked in my tests, given that the GPUs I test on are a little slower. Looking at the profiles, it is actually clear that Gadi is much faster than Tartarus in terms of memory passing, with the cudaMemcpyPeerToPeer calls taking on the order of milliseconds on Tartarus and microseconds on Gadi.

@taimoorsohail I have whipped up a load-balancing function which we can test, inspired by the work I did for the scaling of Oceananigans here. I am testing it on my end, even though I don't think I will see much benefit given that my scaling was already quite good. But please give it a try; if it works, we can think about implementing something like this directly in Oceananigans.

```julia
using Oceananigans   # for TripolarGrid, Distributed, Partition, ImmersedBoundaryGrid, GridFittedBottom, CPU
using Oceananigans.Architectures: architecture, device, on_architecture
using Oceananigans.DistributedComputations
using Oceananigans.DistributedComputations: reconstruct_global_grid, Sizes, child_architecture
using Oceananigans.Grids: cpu_face_constructor_z, halo_size
using Oceananigans.OrthogonalSphericalShellGrids: TripolarGridOfSomeKind
using Oceananigans.ImmersedBoundaries: immersed_cell
using KernelAbstractions: @kernel, @index
"""
    load_balanced_tripolar_grid(serial_grid::TripolarGridOfSomeKind, arch::Distributed)
Return a new tripolar grid whose meridional (y) slabs are redistributed to balance computational load
across ranks.
Arguments
- `serial_grid::TripolarGridOfSomeKind`: A tripolar grid (optionally an `ImmersedBoundaryGrid`) built on a
  single, non-distributed architecture; it is used to evaluate the load of every meridional slab.
- `arch::Distributed`: The distributed architecture across whose meridional ranks the grid should be
  partitioned.
Returns
- A `TripolarGrid` (or `ImmersedBoundaryGrid` when the input is immersed) with an updated distributed
architecture that partitions the meridional dimension according to measured per-slab load.
Behavior and details
- This routine only supports redistribution along the meridional (y) axis. If the input architecture
is already distributed along the zonal (x) axis (i.e., `arch.ranks[1] != 1`), the function emits
a warning and returns the original grid unchanged.
- A load metric per y-slab is computed on the serial grid: a device kernel (`_assess_y_load!`) populates
  `load_per_y_slab`, and that metric is used by `calculate_local_N` to determine the slab count assigned
  to each meridional rank.
- A new `Distributed` architecture is created with `Partition(y = Sizes(...))` based on computed
local slab counts, and a `TripolarGrid` is constructed using the same vertical faces and conformal
mapping parameters as the original global grid.
Example
# Rebalance a serially-constructed tripolar grid across the ranks of a distributed architecture:
# balanced_grid = load_balanced_tripolar_grid(serial_grid, arch)
"""
function load_balanced_tripolar_grid(serial_grid::TripolarGridOfSomeKind, arch::Distributed)
if arch.ranks[1] != 1
@warn "Tripolar grid load balancing currently only supports distribution along the meridional dimension. \n The grid will not be load balanced."
return serial_grid
end
child_arch = architecture(serial_grid)
# Computing the total load per meridional slab
Nx, Ny, Nz = size(serial_grid)
Hx, Hy, Hz = halo_size(serial_grid)
load_per_y_slab = on_architecture(child_arch, zeros(Int, Ny))
loop! = _assess_y_load!(device(child_arch), 512, Ny)
loop!(load_per_y_slab, serial_grid)
load_per_y_slab = on_architecture(CPU(), load_per_y_slab)
local_Ny = calculate_local_N(load_per_y_slab, Ny, arch.ranks[2])
balanced_arch = Distributed(child_arch, partition = Partition(y = Sizes(local_Ny...)))
meridional_rank = balanced_arch.local_index[2]
@handshake @info "Y - slab decomposition with $(local_Ny[meridional_rank]) elements for rank $meridional_rank "
z_faces = cpu_face_constructor_z(serial_grid)
grid = TripolarGrid(balanced_arch;
size = (Nx, Ny, Nz),
halo = (Hx, Hy, Hz),
z = z_faces,
southernmost_latitude = serial_grid.conformal_mapping.southernmost_latitude,
first_pole_longitude = serial_grid.conformal_mapping.first_pole_longitude,
north_poles_latitude = serial_grid.conformal_mapping.north_poles_latitude)
if serial_grid isa ImmersedBoundaryGrid
grid = ImmersedBoundaryGrid(grid, GridFittedBottom(serial_grid.immersed_boundary.bottom_height); active_cells_map = true)
end
return grid
end
@kernel function _assess_y_load!(load_per_slab, grid)
j = @index(Global, Linear)
for i in 1:size(grid, 1)
for k in 1:size(grid, 3)
@inbounds load_per_slab[j] += ifelse(immersed_cell(i, j, k, grid), 0, 1)
end
end
end
function calculate_local_N(load_per_slab, N, ranks)
active_cells = sum(load_per_slab)
active_load = active_cells / ranks
local_N = zeros(Int, ranks) # fill the local N with the active load
idx = 1
for r in 1:ranks-1
local_load = 0
while local_load <= active_load
local_load += load_per_slab[idx]
local_N[r] += 1
idx += 1
end
end
local_N[end] = N - sum(local_N[1:end-1])
return local_N
end
```

Note that, with this function, the grid above now has to be built like:

```julia
@info "Creating grid"
serial_grid = TripolarGrid(child_architecture(arch); size = (Nx, Ny, Nz), z = (-6000, 0), halo = (7, 7, 4))
serial_grid = analytical_immersed_grid(serial_grid)
grid = load_balanced_tripolar_grid(serial_grid, arch)
```

For example, if using 2 GPUs, a load-balanced grid with the settings above will have

```julia
arch = Distributed(GPU(), partition = Partition(y = Sizes(263, 277)))
```

which means that GPU-0 should have 5% fewer elements than GPU-1.

```
[ Info: Y - slab decomposition with 176 elements for rank 1
[ Info: Y - slab decomposition with 175 elements for rank 2
[ Info: Y - slab decomposition with 189 elements for rank 3
```
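If useful, here is a quick, hypothetical CPU-only sanity check of the slab-partitioning logic; it assumes `calculate_local_N` from the snippet above is in scope and uses a synthetic load profile rather than the real bathymetry:

```julia
# Synthetic per-slab load: heavier "ocean" slabs mid-domain, lighter towards the ends
Ny, ranks = 560, 2
load_per_y_slab = [round(Int, 100 * sin(π * j / Ny)) for j in 1:Ny]

local_N = calculate_local_N(load_per_y_slab, Ny, ranks)

@assert sum(local_N) == Ny   # every slab is assigned to exactly one rank
@show local_N                # slab counts per rank
# Per-rank loads should come out roughly equal:
@show sum(load_per_y_slab[1:local_N[1]]) sum(load_per_y_slab[local_N[1]+1:end])
```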
-
Now that the repository is a bit more mature, there is interest in running global ocean simulations on multiple GPUs.
Unfortunately, MPI simulations are a bit trickier than your typical run-of-the-mill Julia simulation and, especially, the performance results might be machine-dependent (which, in general, is not what we want).
With @taimoorsohail, we have decided to open a discussion to share nsys reports from running the SAME simulation across different machines, so that we can figure out whether the performance is machine-dependent or whether there is something to fix, and where. This is also interesting for @aklocker42, who is reporting very poor scaling performance trying to run a distributed simulation on the new Norwegian cluster, Olivia.
For whoever wants to investigate the reports we will post on this thread: they can be opened with the NVIDIA Nsight Systems software, which can be downloaded from this link.
The test we have landed on is a one-third-of-a-degree ocean simulation on a tripolar grid.
The script we are running is this one.
To run the script, it is necessary to have an nsight installation on the server, which should be shipped with any CUDA installation. To make sure we collect data for kernels, MPI, and GC, nsys should be launched with the appropriate trace options, and the corresponding environment variable should be exported.
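For reference, a typical invocation might look like the sketch below; this is not necessarily the exact command used for the reports in this thread, the script name is a placeholder, and setting `JULIA_NVTX_CALLBACKS=gc` assumes the GC annotations come from NVTX.jl:

```
# Ask NVTX.jl to emit ranges for Julia's garbage collector
export JULIA_NVTX_CALLBACKS=gc

# %q{OMPI_COMM_WORLD_RANK} expands to the Open MPI rank, giving one report per process
mpiexec -np 2 nsys profile --trace=cuda,nvtx,mpi \
    --output=report.%q{OMPI_COMM_WORLD_RANK} \
    julia --project distributed_simulation.jl   # placeholder script name
```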
I will post here a report of the execution of the code on 1 and 2 GPUs on Tartarus.
Maybe also of interest for @jackdfranklin, @Mikolaj-A-Kowalski, and @TomMelt, who will start looking more in depth at the kernel performance of Oceananigans (not ClimaOcean per se), to see the performance characteristics of a typical Oceananigans run.
@xkykai is also running on multiple GPUs on Google Cloud. @xkykai, I know you are very busy now :), but if you want to distract yourself for 30 minutes, you can run this minimal test case and share a report.