Scaling of a ClimaOcean simulation on multiple GPUs
#665
Replies: 3 comments
-
I kick off this thread with 3 reports: the 1-GPU (serial) and the 2- and 3-GPU (parallel) cases on Tartarus.
The scaling is rather clean, with good communication-computation overlap even though no action has been taken to load balance the domain, leading to a strong scaling efficiency of 94.5% going from 1 to 2 GPUs and 92.6% from 1 to 3.
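For reference, these numbers follow the usual strong-scaling definition, E(N) = T(1) / (N × T(N)); here is a minimal sketch with placeholder wall times chosen only to reproduce the quoted percentages (not the actual Tartarus measurements):

```julia
# Strong scaling efficiency: time on 1 GPU divided by (N × time on N GPUs)
strong_scaling_efficiency(t1, tN, N) = t1 / (N * tN)

# Hypothetical wall times (seconds), not measured values
t1, t2, t3 = 100.0, 52.9, 36.0
strong_scaling_efficiency(t1, t2, 2)   # ≈ 0.945 → 94.5% on 2 GPUs
strong_scaling_efficiency(t1, t3, 3)   # ≈ 0.926 → 92.6% on 3 GPUs
```

System details on Tartarus: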
```
julia> MPI.versioninfo()
MPIPreferences:
binary: system
abi: OpenMPI
libmpi: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
mpiexec: mpiexec
Package versions
MPI.jl: 0.20.23
MPIPreferences.jl: 0.1.11
Library information:
libmpi: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
libmpi dlpath: /storage4/simone/TestsScaling/openmpi/lib/libmpi.so
MPI version: 3.1.0
Library version:
Open MPI v4.1.5, package: Open MPI ssilvest@tartarus Distribution, ident: 4.1.5, repo rev: v4.1.5, Feb 23, 2023
MPI launcher: mpiexec
MPI launcher path: /usr/bin/mpiexec
```

```
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 12.9, artifact installation
- driver 535.247.1 for 12.2
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+535.247.1
Julia packages:
- CUDA: 5.9.1
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.2.2+0
- CUDA_Runtime_jll: 0.19.2+0
Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7
Environment:
- JULIA_CUDA_MEMORY_POOL: none
4 devices:
0: NVIDIA TITAN V (sm_70, 8.242 GiB / 12.000 GiB available)
1: NVIDIA TITAN V (sm_70, 8.106 GiB / 12.000 GiB available)
2: NVIDIA TITAN V (sm_70, 4.231 GiB / 12.000 GiB available)
3: NVIDIA TITAN V (sm_70, 11.770 GiB / 12.000 GiB available)
```

@taimoorsohail waiting for your reports
-
Here are the 1-GPU (serial) and the 2- and 3-GPU (parallel) cases on Gadi.
The scaling efficiencies are far lower on this machine... The system I am running on has the following specs:

```
julia> MPI.versioninfo()
MPIPreferences:
binary: system
abi: OpenMPI
libmpi: libmpi
mpiexec: mpiexec
Package versions
MPI.jl: 0.20.23
MPIPreferences.jl: 0.1.11
Library information:
libmpi: libmpi
libmpi dlpath: /apps/openmpi/4.1.7/lib/libmpi.so
MPI version: 3.1.0
Library version:
Open MPI v4.1.7, package: Open MPI apps@gadi-cpu-clx-1415.gadi.nci.org.au Distribution, ident: 4.1.7, repo rev: v4.1.7, Oct 31, 2024
MPI launcher: mpiexec
MPI launcher path: /apps/openmpi/4.1.7/bin/mpiexec
```

```
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 12.6, local installation
- driver 575.57.8 for 12.9
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+575.57.8
Julia packages:
- CUDA: 5.8.5
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0
Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7
Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false
Preferences:
- CUDA_Runtime_jll.version: 12.6
- CUDA_Runtime_jll.local: true
3 devices:
0: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
1: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
2: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
```
-
With @taimoorsohail we looked quite thoroughly at the profiles and concluded that the problem is not necessarily MPI, but rather the fact that the ranks proceed asynchronously while the MPI step requires synchronization between neighboring ranks. This issue might be masked in my tests, given that the GPUs I test on are a little slower. Looking at the profiles, it is actually clear that Gadi is much faster than Tartarus in terms of memory passing, with the cudaMemcpyPeerToPeer calls taking on the order of milliseconds on Tartarus and microseconds on Gadi.

@taimoorsohail I have whipped up a load-balancing function which we can test, inspired by the work I did for the scaling of Oceananigans here. I am testing it on my end, even though I don't think I will see much benefit given that my scaling was already quite good. But please give it a try; if it works, we can think about implementing something like this directly in Oceananigans.

```julia
using Oceananigans   # for TripolarGrid, Distributed, Partition, ImmersedBoundaryGrid, GridFittedBottom, CPU
using Oceananigans.Architectures: architecture, device, on_architecture
using Oceananigans.DistributedComputations
using Oceananigans.DistributedComputations: reconstruct_global_grid, Sizes, child_architecture
using Oceananigans.Grids: cpu_face_constructor_z, halo_size
using Oceananigans.OrthogonalSphericalShellGrids: TripolarGridOfSomeKind
using Oceananigans.ImmersedBoundaries: immersed_cell
using KernelAbstractions: @kernel, @index
"""
    load_balanced_tripolar_grid(serial_grid::TripolarGridOfSomeKind, arch::Distributed)
Return a new tripolar grid whose meridional (y) slabs are redistributed to balance computational load
across ranks.
Arguments
- `serial_grid::TripolarGridOfSomeKind`: A tripolar grid (optionally an `ImmersedBoundaryGrid`) built on a
  single, non-distributed architecture; it is used to evaluate the load of every meridional slab.
- `arch::Distributed`: The distributed architecture across whose meridional ranks the grid should be
  partitioned.
Returns
- A `TripolarGrid` (or `ImmersedBoundaryGrid` when the input is immersed) with an updated distributed
architecture that partitions the meridional dimension according to measured per-slab load.
Behavior and details
- This routine only supports redistribution along the meridional (y) axis. If the input architecture
is already distributed along the zonal (x) axis (i.e., `arch.ranks[1] != 1`), the function emits
a warning and returns the original grid unchanged.
- A load metric per y-slab is computed on the serial grid: a device kernel (`_assess_y_load!`) populates
  `load_per_y_slab`, and that metric is used by `calculate_local_N` to determine the slab count assigned
  to each meridional rank.
- A new `Distributed` architecture is created with `Partition(y = Sizes(...))` based on computed
local slab counts, and a `TripolarGrid` is constructed using the same vertical faces and conformal
mapping parameters as the original global grid.
Example
# Rebalance a serially-constructed tripolar grid across the ranks of a distributed architecture:
# balanced_grid = load_balanced_tripolar_grid(serial_grid, arch)
"""
function load_balanced_tripolar_grid(serial_grid::TripolarGridOfSomeKind, arch::Distributed)
if arch.ranks[1] != 1
@warn "Tripolar grid load balancing currently only supports distribution along the meridional dimension. \n The grid will not be load balanced."
return serial_grid
end
child_arch = architecture(serial_grid)
# Computing the total load per meridional slab
Nx, Ny, Nz = size(serial_grid)
Hx, Hy, Hz = halo_size(serial_grid)
load_per_y_slab = on_architecture(child_arch, zeros(Int, Ny))
loop! = _assess_y_load!(device(child_arch), 512, Ny)
loop!(load_per_y_slab, serial_grid)
load_per_y_slab = on_architecture(CPU(), load_per_y_slab)
local_Ny = calculate_local_N(load_per_y_slab, Ny, arch.ranks[2])
balanced_arch = Distributed(child_arch, partition = Partition(y = Sizes(local_Ny...)))
meridional_rank = balanced_arch.local_index[2]
@handshake @info "Y - slab decomposition with $(local_Ny[meridional_rank]) elements for rank $meridional_rank "
z_faces = cpu_face_constructor_z(serial_grid)
grid = TripolarGrid(balanced_arch;
size = (Nx, Ny, Nz),
halo = (Hx, Hy, Hz),
z = z_faces,
southernmost_latitude = serial_grid.conformal_mapping.southernmost_latitude,
first_pole_longitude = serial_grid.conformal_mapping.first_pole_longitude,
north_poles_latitude = serial_grid.conformal_mapping.north_poles_latitude)
if serial_grid isa ImmersedBoundaryGrid
grid = ImmersedBoundaryGrid(grid, GridFittedBottom(serial_grid.immersed_boundary.bottom_height); active_cells_map = true)
end
return grid
end
@kernel function _assess_y_load!(load_per_slab, grid)
j = @index(Global, Linear)
for i in 1:size(grid, 1)
for k in 1:size(grid, 3)
@inbounds load_per_slab[j] += ifelse(immersed_cell(i, j, k, grid), 0, 1)
end
end
end
function calculate_local_N(load_per_slab, N, ranks)
active_cells = sum(load_per_slab)
active_load = active_cells / ranks
local_N = zeros(Int, ranks) # fill the local N with the active load
idx = 1
for r in 1:ranks-1
local_load = 0
while local_load <= active_load
local_load += load_per_slab[idx]
local_N[r] += 1
idx += 1
end
end
local_N[end] = N - sum(local_N[1:end-1])
return local_N
end
```

Note that, with this function, the grid above now has to be built like:

```julia
@info "Creating grid"
serial_grid = TripolarGrid(child_architecture(arch); size = (Nx, Ny, Nz), z = (-6000, 0), halo = (7, 7, 4))
serial_grid = analytical_immersed_grid(serial_grid)
grid = load_balanced_tripolar_grid(serial_grid, arch)
```

For example, if using 2 GPUs, a load-balanced grid with the settings above will have

```julia
arch = Distributed(GPU(), partition = Partition(y = Sizes(263, 277)))
```

which means that GPU-0 should have 5% fewer elements than GPU-1.

```
[ Info: Y - slab decomposition with 176 elements for rank 1
[ Info: Y - slab decomposition with 175 elements for rank 2
[ Info: Y - slab decomposition with 189 elements for rank 3
```
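If useful, here is a quick, hypothetical CPU-only sanity check of the slab-partitioning logic; it assumes `calculate_local_N` from the snippet above is in scope and uses a synthetic load profile rather than the real bathymetry:

```julia
# Synthetic per-slab load: heavier "ocean" slabs mid-domain, lighter towards the ends
Ny, ranks = 560, 2
load_per_y_slab = [round(Int, 100 * sin(π * j / Ny)) for j in 1:Ny]

local_N = calculate_local_N(load_per_y_slab, Ny, ranks)

@assert sum(local_N) == Ny   # every slab is assigned to exactly one rank
@show local_N                # slab counts per rank
# Per-rank loads should come out roughly equal:
@show sum(load_per_y_slab[1:local_N[1]]) sum(load_per_y_slab[local_N[1]+1:end])
```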
-
Now that the repository is a bit more mature, there is interest in running global ocean simulations on multiple GPUs.
Unfortunately, MPI simulations are a bit trickier than your typical run-of-the-mill Julia simulation and, especially, the performance results might be machine-dependent (which, in general, is not what we want).
With @taimoorsohail, we have decided to open a discussion to share nsys reports from running the SAME simulation across different machines, so that we can figure out whether the performance is machine-dependent or whether there is something to fix, and where. This is also interesting for @aklocker42, who is reporting very poor scaling performance trying to run a distributed simulation on the new Norwegian cluster, Olivia.
For whoever wants to investigate the reports we will post on this thread: they can be opened with the NVIDIA Nsight Systems software, which can be downloaded from this link.
The test we have landed on is a one-third-of-a-degree ocean simulation on a tripolar grid.
The script we are running is this one.
To run the script, it is necessary to have an nsight installation on the server, which should be shipped with any CUDA installation. To make sure we collect data for kernels, MPI, and GC, nsys should be launched with the appropriate trace options, and the corresponding environment variable should be exported.
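For reference, a typical invocation might look like the sketch below; this is not necessarily the exact command used for the reports in this thread, the script name is a placeholder, and setting `JULIA_NVTX_CALLBACKS=gc` assumes the GC annotations come from NVTX.jl:

```
# Ask NVTX.jl to emit ranges for Julia's garbage collector
export JULIA_NVTX_CALLBACKS=gc

# %q{OMPI_COMM_WORLD_RANK} expands to the Open MPI rank, giving one report per process
mpiexec -np 2 nsys profile --trace=cuda,nvtx,mpi \
    --output=report.%q{OMPI_COMM_WORLD_RANK} \
    julia --project distributed_simulation.jl   # placeholder script name
```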
I will post here a report of the execution of the code on 1 and 2 GPUs on Tartarus.
Maybe also of interest for @jackdfranklin, @Mikolaj-A-Kowalski, and @TomMelt, who will start looking more in depth at the kernel performance of Oceananigans (not ClimaOcean per se), to see the performance characteristics of a typical Oceananigans run.
@xkykai is also running on multiple GPUs on Google Cloud. @xkykai, I know you are very busy now :), but if you want to distract yourself for 30 minutes, you can run this minimal test case and share a report.