
Tests for Distributed architecture with AnelasticDynamics #441

Open
glwagner wants to merge 29 commits into main from glw/distributed-tests

Conversation

@glwagner
Member

CompressibleDynamics does not work yet.

glwagner and others added 10 commits January 14, 2026 12:12
Add true multi-rank (mpiexec -n 4) tests that validate AtmosphereModel
across Partition(4,1,1), Partition(2,2,1), and Partition(1,4,1) using
JLD2Writer/FieldTimeSeries for distributed output comparison.

Fixes required for distributed support:
- Use DistributedFourierTridiagonalPoissonSolver for multi-rank grids
  with FullyConnected topology
- Skip MPI halo communication for column fields (reference state
  profiles) via only_local_halos=true
- Add architecture(::AtmosphereModel) so JLD2Writer adds rank suffixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
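
A minimal sketch of the last fix, assuming AtmosphereModel stores its grid in a grid field (the actual definition in Breeze may differ):

using Oceananigans.Architectures: architecture

# Let output writers query the architecture (and hence the local rank) from the model,
# so that JLD2Writer can append a rank suffix to each output file in a distributed run.
architecture(model::AtmosphereModel) = architecture(model.grid)
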
CompressibleDynamics requires hydrostatically balanced density
initialization and a small fixed Δt (acoustic CFL limit), so the
test helpers are refactored to dispatch on dynamics_type for initial
conditions, time stepping, and MPI script generation.

Tests cover single-rank inline comparison and multi-rank MPI (mpiexec
-n 4) with all three partition layouts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
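
A hedged sketch of the kind of dispatch the test helpers use (the function name and values are illustrative, not the exact ones in test/distributed.jl):

# Illustrative only: pick the test time step from the dynamics type.
test_timestep(::AnelasticDynamics)    = 10.0   # limited only by the advective CFL
test_timestep(::CompressibleDynamics) = 0.05   # small fixed Δt for the acoustic CFL limit
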
…ogies

AtmosphereModel validates that periodic dimensions have halo >= 2, but
Oceananigans also requires halo <= size. Tests with size=1 in periodic x/y
dimensions cannot satisfy both constraints. Fix by increasing grid sizes
from 1 to 2 in horizontal dimensions, or switching to Flat topology for
truly 1D column tests (dcmip2016_kessler, diagnostics hydrostatic pressure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
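
For example, a truly 1D column test can use a grid like the following (sizes are illustrative); with Flat horizontal dimensions there are no periodic directions, so the halo constraints do not apply:

using Oceananigans

column_grid = RectilinearGrid(size = 22, z = (0, 2200),
                              topology = (Flat, Flat, Bounded))
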
@codecov

codecov Bot commented Feb 14, 2026

Codecov Report

❌ Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Files with missing lines              Patch %   Lines
src/AtmosphereModels/set_to_mean.jl   66.66%    1 Missing ⚠️


glwagner and others added 5 commits February 14, 2026 07:58
Use escape_string() when interpolating file paths into generated Julia
scripts so backslashes in Windows temp paths are not parsed as escape
sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
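
For instance, a minimal sketch of the pattern (the generated line is hypothetical):

# escape_string doubles backslashes, so a Windows temp path round-trips through the
# generated source code instead of being parsed as escape sequences.
path = joinpath(mktempdir(), "output.jld2")
script = """
filename = "$(escape_string(path))"
"""
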
Use (Flat, Flat, Bounded) topology for the single-cell parcel grid to
satisfy the minimum halo >= 2 requirement for periodic dimensions added
in this PR. A stationary parcel has no spatial extent in x/y, so Flat
is the correct topology.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread test/distributed.jl
Member


Ok, this takes a lot of time to run, 30-40 minutes on average. How about we run it in a single separate job, maybe only on Julia v1.12 on Ubuntu?

Member Author


Let's review the tests carefully and keep only exactly what is needed.

Member Author


Ok, I think I made the tests a lot cheaper (one was doing 1e4 iterations). That said, I think a separate job is a good idea.

Comment thread test/distributed.jl Outdated
glwagner and others added 8 commits February 14, 2026 20:14
The temp directory from mktempdir() is cleaned up automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove validate_velocity_interpolation_halo and MINIMUM_VELOCITY_INTERPOLATION_HALO
  (the algorithm no longer requires halo >= 2 for periodic dimensions)
- Remove only_local_halos=true from fill_halo_regions! calls on Flat
  column fields (Flat dimensions don't have MPI communication anyway)
- Remove is_column_field helper (no longer needed)
- Revert test grid sizes from (2, 2, ...) back to (1, 1, ...)
- Revert stationary_parcel_model.jl grid back to original Periodic topology

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove all single-rank Distributed(Partition(1,1)) tests (Section A)
  since they don't actually test MPI communication
- Keep only multi-rank MPI tests launched via mpiexec
- Cap all simulations at stop_iteration=100 instead of stop_time to
  avoid expensive runs (the full physics test was doing ~10k iterations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
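
Capping by iteration count rather than stop time looks roughly like this (assuming a model has already been built; the Δt is illustrative):

# Stop after 100 iterations regardless of the physical time reached.
simulation = Simulation(model; Δt = 0.05, stop_iteration = 100)
run!(simulation)
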
Add only_local_halos=true to fill_halo_regions! calls on reference state
column fields (Field{Nothing, Nothing, Center}). Oceananigans' distributed
halo communication does not handle Flat-located dimensions correctly,
causing BoundsError when running with --check-bounds=yes on distributed
grids. This affects AnelasticDynamics reference density, pressure, and
temperature fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
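
A hedged sketch of the pattern (the variable name is illustrative, and grid is assumed to be a distributed RectilinearGrid):

using Oceananigans
using Oceananigans.BoundaryConditions: fill_halo_regions!

# Reference-state profiles are column fields, reduced in x and y.
ρᵣ = Field{Nothing, Nothing, Center}(grid)

# Fill only the local halos, skipping the MPI exchange that mishandles the reduced dimensions.
fill_halo_regions!(ρᵣ; only_local_halos = true)
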
Resolve conflicts:
- atmosphere_model.jl: keep both architecture() method and _timestepper_uses_dynamics
- update_atmosphere_model_state.jl: take main's extended periodic diagnostic indices
- reference_states.jl: take main's generalized hydrostatic profiles with only_local_halos
- test/Project.toml: include both MPI and Logging dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@glwagner
Member Author

@giordano the distributed tests pass locally, but I am not sure where to add them for CI purposes. How should we add the distributed tests?

@giordano
Member

I'll take a look tomorrow.

@giordano
Member

giordano commented Mar 18, 2026

the distributed tests pass locally

Not for me:

Output generated during execution of 'distributed':
┌ ERROR: LoadError: BoundsError: attempt to access 1×1×22 Array{Float64, 3} at index [4:6, 1:1, 1:22]
│ Stacktrace:
│   [1] throw_boundserror(A::Array{Float64, 3}, I::Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}})
│     @ Base ./essentials.jl:15
│   [2] checkbounds
│     @ ./abstractarray.jl:699 [inlined]
│   [3] view
│     @ ./subarray.jl:214 [inlined]
│   [4] _fill_west_send_buffer!(c::Array{Float64, 3}, buff::Oceananigans.DistributedComputations.OneDBuffer{Array{Float64, 3}}, Hx::Int64, Hy::Int64, Nx::Int64, Ny::Int64)
│     @ Oceananigans.DistributedComputations ~/.julia/packages/Oceananigans/br1rI/src/DistributedComputations/communication_buffers.jl:403
│   [5] fill_send_buffers!
│     @ ~/.julia/packages/Oceananigans/br1rI/src/DistributedComputations/communication_buffers.jl:299 [inlined]
│   [6] distributed_fill_halo_event!(::OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, ::Oceananigans.BoundaryConditions.DistributedFillHalo{Oceananigans.BoundaryConditions.WestAndEast}, …; async::Bool, only_local_halos::Bool, kwargs::@Kwargs{})
│     @ Oceananigans.DistributedComputations …
[...]
Error in testset CompressibleDynamics — Partition(4, 1, 1):
Test Failed at /Users/mose/.julia/dev/Breeze/test/distributed.jl:227
  Expression: isapprox(interior(fts_serial[n]), interior(fts_distributed[n]); rtol, atol)
   Evaluated: isapprox([1.1680322885513306 1.1682422161102295 … 1.1682422161102295 1.1680322885513306; 1.1682422161102295 1.1684203147888184 … 1.1684203147888184 1.1682422161102295; … ; 1.1682422161102295 1.1684203147888184 … 1.1684203147888184 1.1682422161102295; 1.1680322885513306 1.1682422161102295 … 1.1682422161102295 1.1680322885513306;;; 1.1621028184890747 1.1623116731643677 … 1.1623116731643677 1.1621028184890747; 1.1623116731643677 1.1624889373779297 … 1.1624889373779297 1.1623116731643677; … ; 1.1623116731643677 1.1624889373779297 … 1.1624889373779297 1.1623116731643677; 1.1621028184890747 1.1623116731643677 … 1.1623116731643677 1.1621028184890747;;; 1.1561819314956665 1.1563897132873535 … 1.1563897132873535 1.1561819314956665; 1.1563897132873535 1.1565660238265991 … 1.1565660238265991 1.1563897132873535; … ; 1.1563897132873535 1.1565660238265991 … 1.1565660238265991 1.1563897132873535; 1.1561819314956665 1.1563897132873535 … 1.1563897132873535 1.1561819314956665;;; … ;;; 1.0922424793243408 1.0924389362335205 … 1.0924389362335205 1.0922424793243408; 1.0924389362335205 1.0926055908203125 … 1.0926055908203125 1.0924389362335205; … ; 1.0924389362335205 1.0926055908203125 … 1.0926055908203125 1.0924389362335205; 1.0922424793243408 1.0924389362335205 … 1.0924389362335205 1.0922424793243408;;; 1.0865380764007568 1.0867334604263306 … 1.0867334604263306 1.0865380764007568; 1.0867334604263306 1.0868991613388062 … 1.0868991613388062 1.0867334604263306; … ; 1.0867334604263306 1.0868991613388062 … 1.0868991613388062 1.0867334604263306; 1.0865380764007568 1.0867334604263306 … 1.0867334604263306 1.0865380764007568;;; 1.0808604955673218 1.081054925918579 … 1.081054925918579 1.0808604955673218; 1.081054925918579 1.0812197923660278 … 1.0812197923660278 1.081054925918579; … ; 1.081054925918579 1.0812197923660278 … 1.0812197923660278 1.081054925918579; 1.0808604955673218 1.081054925918579 … 1.081054925918579 1.0808604955673218], [1.1680322885513306 1.1682422161102295 … 1.1682422161102295 1.1680322885513306; 1.1682422161102295 1.1684203147888184 … 1.1684203147888184 1.1682422161102295; … ; 1.1682422161102295 1.1684203147888184 … 1.1684203147888184 1.1682422161102295; 1.1680322885513306 1.1682422161102295 … 1.1682422161102295 1.1680322885513306;;; 1.1621028184890747 1.1623116731643677 … 1.1623116731643677 1.1621028184890747; 1.1623116731643677 1.1624889373779297 … 1.1624889373779297 1.1623116731643677; … ; 1.1623116731643677 1.1624889373779297 … 1.1624889373779297 1.1623116731643677; 1.1621028184890747 1.1623116731643677 … 1.1623116731643677 1.1621028184890747;;; 1.1561819314956665 1.1563897132873535 … 1.1563897132873535 1.1561819314956665; 1.1563897132873535 1.1565660238265991 … 1.1565660238265991 1.1563897132873535; … ; 1.1563897132873535 1.1565660238265991 … 1.1565660238265991 1.1563897132873535; 1.1561819314956665 1.1563897132873535 … 1.1563897132873535 1.1561819314956665;;; … ;;; 1.0922424793243408 1.0924389362335205 … 1.0924389362335205 1.0922424793243408; 1.0924389362335205 1.0926055908203125 … 1.0926055908203125 1.0924389362335205; … ; 1.0924389362335205 1.0926055908203125 … 1.0926055908203125 1.0924389362335205; 1.0922424793243408 1.0924389362335205 … 1.0924389362335205 1.0922424793243408;;; 1.0865380764007568 1.0867334604263306 … 1.0867334604263306 1.0865380764007568; 1.0867334604263306 1.0868991613388062 … 1.0868991613388062 1.0867334604263306; … ; 1.0867334604263306 1.0868991613388062 … 1.0868991613388062 1.0867334604263306; 1.0865380764007568 1.0867334604263306 … 1.0867334604263306 
1.0865380764007568;;; 1.0808604955673218 1.081054925918579 … 1.081054925918579 1.0808604955673218; 1.081054925918579 1.0812197923660278 … 1.0812197923660278 1.081054925918579; … ; 1.081054925918579 1.0812197923660278 … 1.0812197923660278 1.081054925918579; 1.0808604955673218 1.081054925918579 … 1.081054925918579 1.0808604955673218]; rtol = 2.220446049250313e-15, atol = 2.220446049250313e-14)

Error in testset CompressibleDynamics — Partition(4, 1, 1):
Test Failed at /Users/mose/.julia/dev/Breeze/test/distributed.jl:227
  Expression: isapprox(interior(fts_serial[n]), interior(fts_distributed[n]); rtol, atol)
   Evaluated: isapprox([1.1673718690872192 1.167680025100708 … 1.167680025100708 1.1673718690872192; 1.167680025100708 1.167941689491272 … 1.167941689491272 1.167680025100708; … ; 1.167680025100708 1.167941689491272 … 1.167941689491272 1.167680025100708; 1.1673718690872192 1.167680025100708 … 1.167680025100708 1.1673718690872192;;; 1.1614562273025513 1.1617629528045654 … 1.1617629528045654 1.1614562273025513; 1.1617629528045654 1.1620233058929443 … 1.1620233058929443 1.1617629528045654; … ; 1.1617629528045654 1.1620233058929443 … 1.1620233058929443 1.1617629528045654; 1.1614562273025513 1.1617629528045654 … 1.1617629528045654 1.1614562273025513;;; 1.1555393934249878 1.1558445692062378 … 1.1558445692062378 1.1555393934249878; 1.1558445692062378 1.1561037302017212 … 1.1561037302017212 1.1558445692062378; … ; 1.1558445692062378 1.1561037302017212 … 1.1561037302017212 1.1558445692062378; 1.1555393934249878 1.1558445692062378 … 1.1558445692062378 1.1555393934249878;;; … ;;; 1.0916345119476318 1.0919231176376343 … 1.0919231176376343 1.0916345119476318; 1.0919231176376343 1.0921680927276611 … 1.0921680927276611 1.0919231176376343; … ; 1.0919231176376343 1.0921680927276611 … 1.0921680927276611 1.0919231176376343; 1.0916345119476318 1.0919231176376343 … 1.0919231176376343 1.0916345119476318;;; 1.0859341621398926 1.0862212181091309 … 1.0862212181091309 1.0859341621398926; 1.0862212181091309 1.0864648818969727 … 1.0864648818969727 1.0862212181091309; … ; 1.0862212181091309 1.0864648818969727 … 1.0864648818969727 1.0862212181091309; 1.0859341621398926 1.0862212181091309 … 1.0862212181091309 1.0859341621398926;;; 1.0802695751190186 1.0805552005767822 … 1.0805552005767822 1.0802695751190186; 1.0805552005767822 1.0807976722717285 … 1.0807976722717285 1.0805552005767822; … ; 1.0805552005767822 1.0807976722717285 … 1.0807976722717285 1.0805552005767822; 1.0802695751190186 1.0805552005767822 … 1.0805552005767822 1.0802695751190186], [1.1673718690872192 1.167680025100708 … 1.167680025100708 1.1673718690872192; 1.167680025100708 1.167941689491272 … 1.167941689491272 1.167680025100708; … ; 1.167680025100708 1.167941689491272 … 1.167941689491272 1.167680025100708; 1.1673718690872192 1.167680025100708 … 1.167680025100708 1.1673718690872192;;; 1.1614562273025513 1.1617629528045654 … 1.1617629528045654 1.1614562273025513; 1.1617629528045654 1.1620233058929443 … 1.1620233058929443 1.1617629528045654; … ; 1.1617629528045654 1.1620233058929443 … 1.1620233058929443 1.1617629528045654; 1.1614562273025513 1.1617629528045654 … 1.1617629528045654 1.1614562273025513;;; 1.1555393934249878 1.1558445692062378 … 1.1558445692062378 1.1555393934249878; 1.1558445692062378 1.1561037302017212 … 1.1561037302017212 1.1558445692062378; … ; 1.1558445692062378 1.1561037302017212 … 1.1561037302017212 1.1558445692062378; 1.1555393934249878 1.1558445692062378 … 1.1558445692062378 1.1555393934249878;;; … ;;; 1.0916345119476318 1.0919231176376343 … 1.0919231176376343 1.0916345119476318; 1.0919231176376343 1.0921680927276611 … 1.0921680927276611 1.0919231176376343; … ; 1.0919231176376343 1.0921680927276611 … 1.0921680927276611 1.0919231176376343; 1.0916345119476318 1.0919231176376343 … 1.0919231176376343 1.0916345119476318;;; 1.0859341621398926 1.0862212181091309 … 1.0862212181091309 1.0859341621398926; 1.0862212181091309 1.0864648818969727 … 1.0864648818969727 1.0862212181091309; … ; 1.0862212181091309 1.0864648818969727 … 1.0864648818969727 1.0862212181091309; 1.0859341621398926 1.0862212181091309 … 1.0862212181091309 1.0859341621398926;;; 
1.0802695751190186 1.0805552005767822 … 1.0805552005767822 1.0802695751190186; 1.0805552005767822 1.0807976722717285 … 1.0807976722717285 1.0805552005767822; … ; 1.0805552005767822 1.0807976722717285 … 1.0807976722717285 1.0805552005767822; 1.0802695751190186 1.0805552005767822 … 1.0805552005767822 1.0802695751190186]; rtol = 2.220446049250313e-15, atol = 2.220446049250313e-14)

etc for loads of tests.

For reference:

julia> versioninfo()
Julia Version 1.12.5
Commit 5fe89b8ddc1 (2026-02-09 16:05 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m1)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 4 virtual cores)

@giordano giordano added testing 🧪 distributed 🔀 the more the merrier, as they say labels Mar 18, 2026
@fergu
Contributor

fergu commented Mar 18, 2026

I see an identical error on the main branch, too (using a case I'm working on - not a test) with a Distributed( GPU() ) run.

ERROR: LoadError: BoundsError: attempt to access 1×1×134 CuArray{Float64, 3, CUDA.DeviceMemory} at index [4:163, 4:6, 1:134]
Stacktrace:
  [1] throw_boundserror(A::CuArray{Float64, 3, CUDA.DeviceMemory}, I::Tuple{UnitRange{Int64}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}})
    @ Base ./essentials.jl:15
  [2] checkbounds
    @ ./abstractarray.jl:699 [inlined]
  [3] view
    @ /glade/work/kjferguson/apps/derecho/julia/juliaup/depot/packages/GPUArrays/3a5jB/src/host/base.jl:348 [inlined]
  [4] _fill_south_send_buffer!(c::CuArray{Float64, 3, CUDA.DeviceMemory}, buff::Oceananigans.DistributedComputations.TwoDBuffer{CuArray{Float64, 3, CUDA.DeviceMemory}}, Hx::Int64, Hy::Int64, Nx::Int64, Ny::Int64)
    @ Oceananigans.DistributedComputations /glade/work/kjferguson/apps/derecho/julia/juliaup/depot/packages/Oceananigans/br1rI/src/DistributedComputations/communication_buffers.jl:419
  [5] fill_send_buffers!
    @ /glade/work/kjferguson/apps/derecho/julia/juliaup/depot/packages/Oceananigans/br1rI/src/DistributedComputations/communication_buffers.jl:307 [inlined]
[...]

There are some messages printing on top of each other due to MPI, but that _fill_*_send_buffer! call shows up in both.

julia> versioninfo()
Julia Version 1.12.5
Commit 5fe89b8ddc1 (2026-02-09 16:05 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7543 32-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver3)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 128 virtual cores)
Environment:
  <snip - contains environment variables to use system installed MPI and CUDA on NCAR's Derecho>

@fergu
Contributor

fergu commented Mar 18, 2026

In my particular case, the error is thrown at this line:

state = ReferenceState(grid, constants, potential_temperature = 300.0)

Looking at the call stack, it looks like it leaves Breeze.jl for Oceananigans.jl at line 353 of Thermodynamics.jl. Since this (I assume) works in Oceananigans, the conflict must be somewhere upstream of that?

Just for completeness,

grid = RectilinearGrid(Distributed(GPU(), partition = Partition(; x = 2, y = 2, z = 1));
                       size = (320, 320, 128),
                       x = (0, 5120.0),
                       y = (0, 5120.0),
                       z = (0, 2048.0),
                       topology = (Periodic, Periodic, Bounded))

and

constants = ThermodynamicConstants()

@fergu
Contributor

fergu commented Mar 18, 2026

Sorry - I just saw that glw/distributed-tests addressed this issue using only_local_halos 🤪. It looks like there were two spots in ReferenceState() where that kwarg wasn't added to the fill_halo_regions! calls on Field{Nothing, Nothing, Center} fields. Adding those in let my case get as far as setting up an AtmosphereModel(...), though MPI throws an assertion error about a bad address at that point. However, that might be specific to my environment, since I'm relying on MPITrampoline and I'm not totally sure I've got that set up right yet.

I do see that 0526c73 explicitly removed only_local_halos=true from those lines - so maybe I'm not thinking of this right?

glwagner and others added 4 commits March 20, 2026 13:54
- Remove fill_halo_regions!(dynamics.density) and
  fill_halo_regions!(prognostic_fields(model.formulation)) from
  compute_auxiliary_dynamics_variables! for CompressibleDynamics.
  Both fields are prognostic and already filled by the async
  prognostic fill in update_state!, synchronized by the momentum
  fill in compute_velocities!.

- Remove fill_halo_regions!(model.formulation) from
  compute_auxiliary_thermodynamic_variables!. The formulation
  (potential temperature density) is a prognostic field already
  included in the async prognostic fill.

Eliminates 3 redundant halo fills per RK3 stage (9 per time step),
reducing distributed communication overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@glwagner
Member Author

In my particular case, the error is thrown at this line: state = ReferenceState(grid, constants, potential_temperature = 300.0)

I think this is a legit bug; it is fixed by CliMA/Oceananigans.jl#5434.

For CompressibleDynamics, temperature is recomputed from the equation
of state in compute_auxiliary_dynamics_variables! which fills T halos.
The earlier fill in compute_auxiliary_thermodynamic_variables! is
redundant and wastes one fill_halo_regions! per RK3 stage.

Add _fill_thermodynamic_halos! dispatch: default fills T halos,
CompressibleDynamics method is a no-op. Saves 3 halo fills per
timestep (1 per stage × 3 stages).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
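
A minimal sketch of the dispatch described above (the signatures are illustrative; the real methods act on the model's fields):

# Default: fill temperature halos after computing T from the thermodynamic state.
_fill_thermodynamic_halos!(dynamics, T) = fill_halo_regions!(T)

# CompressibleDynamics recomputes T (and fills its halos) later, in
# compute_auxiliary_dynamics_variables!, so the earlier fill can be skipped.
_fill_thermodynamic_halos!(::CompressibleDynamics, T) = nothing
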

Labels

distributed 🔀 the more the merrier, as they say testing 🧪

3 participants