Add MPI parallelism via ImplicitGlobalGrid extension#281

Open
b-fg wants to merge 9 commits into master from mpi-igg

Conversation

b-fg (Member) commented Apr 15, 2026

Claude-assisted MPI extension using ImplicitGlobalGrid.jl

This branch is a clean implementation of #236, but without including the AbstractBC and AbstractFlow types. It incorporates MPI via IGG while keeping the changes as minimal as possible. Tests have also been adjusted to the new N+4 array sizes.

Some pending tasks are:

  • Fix docstrings
  • Metrics using reductions such as energy or force should reduce globally
  • Change loc offset strategy: OffsetBody wraps AbstractBody, and loc includes offset value
  • Add MPI tests
  • Check multi-GPU
  • Check parallel performance: serial vs mpiexec -n 1, multithreading (KA) + MPI (IGG) (see #281 (comment))
  • Explore hide comms (Mostly relevant on GPUs, but requires big refactor of range functions, effectively splitting them between inner cells and halo cells)
  • Check performance vs master in serial: within noise levels (±10%) (see #281 (comment))
  • Check AD in IGG
  • Wait for IGG support on GMG

Summary

  • Introduce AbstractParMode dispatch pattern (Serial/Parallel) for MPI support without method overwriting, preserving precompilation
  • Adopt N+4 staggered grid layout (2 ghost cells per side), making BC! and MPI halo exchange operate on the same 4 indices — no save/restore logic needed
  • Add WaterLilyMPIExt (triggered by ImplicitGlobalGrid + MPI): global reductions, halo exchange (IGG for fine grid, direct MPI for multigrid), exitBC!, wallBC_L!, init_waterlily_mpi, global_offset
  • Dispatch save! in WaterLilyWriteVTKExt through par_mode[] for parallel PVTI output

Design

All MPI hooks dispatch through par_mode[] (defaults to Serial()). The extension sets par_mode[] = Parallel(comm) and adds new methods — no method overwriting. This keeps the serial code path unchanged and lets the extension precompile normally.
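The pattern above can be sketched in a few lines. This is a minimal illustration only: the type and function names follow the PR text, but the field layout and exact signatures are assumptions, and the MPI method is shown as a comment.

```julia
# Sketch of the AbstractParMode dispatch pattern (names from the PR text,
# signatures assumed). Serial methods live in the package; an extension
# would *add* Parallel methods, never overwrite existing ones.
abstract type AbstractParMode end
struct Serial <: AbstractParMode end
struct Parallel{C} <: AbstractParMode
    comm::C          # would hold the MPI communicator
end

# Global mode flag, defaulting to Serial(); the MPI extension would set
# par_mode[] = Parallel(comm) at init time.
const par_mode = Ref{AbstractParMode}(Serial())

global_sum(x) = global_sum(par_mode[], x)   # hook dispatches through par_mode[]
global_sum(::Serial, x) = sum(x)            # serial: plain local reduction
# The extension would add (sketch only):
# global_sum(p::Parallel, x) = MPI.Allreduce(sum(x), +, p.comm)

global_sum([1, 2, 3])
```

Because the extension only adds new methods on its own `Parallel` type, no method of the base package is invalidated, which is what keeps precompilation intact.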

The N+4 layout gives symmetric ghost/boundary cells (indices 1,2 left; M-1,M right). BC! sets Dirichlet/Neumann at these indices; the MPI halo overwrites the same indices at rank-internal boundaries. At physical boundaries (no MPI neighbor), BC! values are preserved. This eliminates the need for saving and restoring boundary values around halo exchanges.
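A toy 1-D picture of why this removes the save/restore logic (assumed and heavily simplified; the real BC! handles Dirichlet/Neumann per velocity component and dimension):

```julia
# Toy 1-D illustration of the N+4 ghost layout: interior cells at 3:M-2,
# two ghost cells on each side (1,2 and M-1,M).
function bc!(a)
    M = length(a)
    a[1] = a[2] = a[3]          # left ghosts filled from interior index 3
    a[M] = a[M-1] = a[M-2]      # right ghosts filled from interior index M-2
    a
end

# At a rank-internal face the halo exchange overwrites exactly these same
# indices (1,2 or M-1,M) with neighbor data; at a physical face no halo
# message arrives, so the BC values survive -- nothing to save or restore.
a = collect(1.0:8.0)            # M = 8 -> interior cells 3:6
bc!(a)                          # ghosts now mirror the nearest interior cell
```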

Key internal changes:

  • comm!/velocity_comm! replace bare perBC! in the Poisson solver — MPI rank-internal boundaries have L≠0 and need halo exchange regardless of periodicity
  • pin_pressure! removes the global mean inside solver! (not project!), using global_sum/global_length for MPI-consistent pinning
  • wallBC_L! dispatches on par_mode[] so only physical walls get L=0 (MPI-internal faces keep L=1)
  • _apply_offset wraps the body's map function with the rank-local coordinate offset at Simulation construction — user SDFs are written in global coordinates, identical to serial
  • conv_diff! simplified: boundary-specific stencils (ϕuP, ϕuL, ϕuR, lowerBoundary!, upperBoundary!) removed since N+4 ghost cells provide the extra stencil points
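The coordinate-offset idea in the _apply_offset bullet can be sketched as follows. The `OffsetSDF` name and call signature here are hypothetical stand-ins for the PR's _apply_offset/OffsetBody machinery:

```julia
# Hypothetical sketch: user geometry is written in *global* coordinates,
# and a wrapper shifts rank-local coordinates to global ones before
# evaluating the signed distance function.
struct OffsetSDF{F,V}
    sdf::F       # user's global-coordinate SDF
    offset::V    # this rank's global origin (zero in serial)
end
(b::OffsetSDF)(x, t) = b.sdf(x .+ b.offset, t)

circle(x, t) = sqrt(sum(abs2, x .- 64)) - 16   # circle of radius 16 at (64,64)
local_body = OffsetSDF(circle, (32, 0))        # rank whose subdomain starts at x=32
# local coordinate (32,64) maps to global (64,64), the circle center:
local_body((32, 64), 0)  # -16.0
```

The user-facing SDF is thus identical in serial and MPI runs; only the wrapper's offset changes per rank.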

Dispatched hooks

| Hook | Serial | Parallel (MPI extension) |
|------|--------|--------------------------|
| global_dot, global_sum, global_length, global_min | local operation | MPI.Allreduce |
| global_perdot | dot product (full or interior) | MPI.Allreduce of local |
| scalar_halo!, velocity_halo! | no-op | IGG update_halo! or direct MPI |
| comm!, velocity_comm! | perBC! | MPI halo exchange |
| wallBC_L! | zero L at faces 3 and M-1 | zero only at physical walls |
| exitBC! | local convective exit | global MPI reductions for flux balance |
| divisible | N>4 | N>8 (stricter for multigrid) |
| global_offset | zero | rank-local origin from IGG |

Validation

Serial and parallel (4 ranks) produce quantitatively matching results:

  • 2D circle (Re=250): velocity rms|ΔU| ~ 2.7×10⁻⁵
  • 3D sphere (160×128×128, Re=250): velocity rms|ΔU| ~ 1.1×10⁻⁵

Remaining max|Δp| differences are confined to body-interior cells where the Poisson operator has iD=0 (unconstrained pressure) — a known immersed-boundary characteristic, not a serial/parallel discrepancy.

  Introduce AbstractParMode dispatch pattern for MPI support without
  method overwriting, enabling normal precompilation. All parallel hooks
  (reductions, halo exchange, wall BCs, pressure pinning) dispatch through
  par_mode[] — serial defines Serial() methods, WaterLilyMPIExt adds
  Parallel() methods.

  Key changes:
  - N+4 staggered grid layout (2 ghost cells per side) replacing N+2,
    making BC! and MPI halo exchange operate on the same indices
  - Symmetric BC! that sets indices 1,2 and M-1,M (no save/restore needed)
  - comm!/velocity_comm! hooks separate domain BCs from communication
  - Poisson solver uses global reductions (global_dot, global_sum) and
    comm! instead of local dot products and perBC!
  - pin_pressure! removes global mean in solver! (not project!)
  - wallBC_L! dispatches for MPI-aware physical wall detection
  - _apply_offset wraps body map with rank-local coordinate offset
  - Simplified conv_diff! (removed boundary-specific stencils)
  - VTK writer dispatches save! on par_mode[] for parallel PVTI output
  - WaterLilyMPIExt: halo exchange (IGG for fine grid, direct MPI for
    multigrid), exitBC!, init_waterlily_mpi, global_offset
codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 36.26761% with 181 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines missing |
|--------------------------|---------|---------------|
| ext/WaterLilyMPIExt.jl | 0.00% | 125 ⚠️ |
| ext/WaterLilyWriteVTKExt.jl | 5.71% | 33 ⚠️ |
| src/util.jl | 81.53% | 12 ⚠️ |
| src/Body.jl | 37.50% | 5 ⚠️ |
| src/WaterLily.jl | 78.94% | 4 ⚠️ |
| src/Metrics.jl | 83.33% | 2 ⚠️ |
| File | Coverage | Δ |
|------|----------|---|
| src/Flow.jl | 100.00% <100.00%> | (ø) |
| src/MultiLevelPoisson.jl | 98.64% <100.00%> | (ø) |
| src/Poisson.jl | 97.84% <100.00%> | (ø) |
| src/Metrics.jl | 89.24% <83.33%> | (-0.87%) ⬇️ |
| src/WaterLily.jl | 86.53% <78.94%> | (+1.24%) ⬆️ |
| src/Body.jl | 89.13% <37.50%> | (-10.87%) ⬇️ |
| src/util.jl | 77.05% <81.53%> | (+2.62%) ⬆️ |
| ext/WaterLilyWriteVTKExt.jl | 40.00% <5.71%> | (-60.00%) ⬇️ |
| ext/WaterLilyMPIExt.jl | 0.00% <0.00%> | (ø) |

  Add missing docstrings for exported and internal functions (set_backend,
  global_dot/sum/length/min, scalar_halo!, velocity_halo!, global_perdot,
  L₂, conv_diff!, BDIM!, project!, CFL, _apply_offset, up/down/restrict/
  restrictL/restrictML/restrictL!, Vcycle!, MultiLevelPoisson solver!).

  Fix inaccuracies and typos and update existing docstrings with MPI-relevant details
  - Add _has_neighbors flag to skip halo exchange when no MPI neighbors
    exist (e.g. np=1 non-periodic), eliminating IGG update_halo! overhead
    and unnecessary buffer copies
  - Precompute global_length(inside(x)) in Poisson struct (inslen field)
    to avoid one MPI Allreduce per residual! call
  - Change parallel _divisible threshold from N>8 to N>4 (same as serial)
    since coarse-level comm cost is negligible with the _has_neighbors
    short-circuit
b-fg commented Apr 15, 2026

Performance — mpi-igg branch

Serial performance vs master

Benchmarked with WaterLily-Benchmarks suite: 100 sim_steps, Float32, CPUx01/CPUx04/GPU-NVIDIA.
Commit a52d84a (mpi-igg) vs 29f9f43 (master, tagged v1.6.1).

TGV (no body, no remeasure)

| Backend | log2p | mpi-igg [s] | master [s] | Overhead |
|---------|-------|-------------|------------|----------|
| CPUx01 | 6 | 4.07 | 4.45 | -9% (faster) |
| CPUx01 | 7 | 29.42 | 28.33 | +4% |
| CPUx04 | 6 | 2.81 | 2.77 | +1% |
| CPUx04 | 7 | 17.41 | 16.97 | +3% |
| GPU | 6 | 0.44 | 0.40 | +8% |
| GPU | 7 | 2.82 | 2.67 | +6% |

Sphere (3D body, no remeasure)

| Backend | log2p | mpi-igg [s] | master [s] | Overhead |
|---------|-------|-------------|------------|----------|
| CPUx01 | 3 | 4.51 | 4.13 | +9% |
| CPUx01 | 4 | 40.63 | 39.15 | +4% |
| CPUx04 | 3 | 2.50 | 2.37 | +5% |
| CPUx04 | 4 | 17.74 | 17.28 | +3% |
| GPU | 3 | 0.41 | 0.38 | +8% |
| GPU | 4 | 3.04 | 2.86 | +6% |

Jelly (3D body, remeasure every step)

| Backend | log2p | mpi-igg [s] | master [s] | Overhead |
|---------|-------|-------------|------------|----------|
| CPUx01 | 5 | 4.19 | 3.50 | +20% |
| CPUx01 | 6 | 23.34 | 26.49 | -12% (faster) |
| CPUx04 | 5 | 2.91 | 2.86 | +2% |
| CPUx04 | 6 | 14.38 | 15.32 | -6% (faster) |
| GPU | 5 | 0.54 | 0.48 | +12% |
| GPU | 6 | 1.98 | 2.03 | -2% (faster) |

Summary

  • CPU large grids: at parity or slightly faster than master
  • CPU small grids: ~5-10% overhead from N+4 arrays (proportionally more cells)
  • GPU: consistent ~6-8% overhead from larger arrays and kernel launch overhead
  • Jelly p=5 outlier: +20% overhead on CPUx01, from the small grid plus per-step remeasure amplifying the N+4 cost

Optimizations applied

  1. BC!/perBC! kernel fusion: Fused pairs of ghost cell @loop calls into thick-slice operations (e.g. slice(N,1:2,j) instead of separate slice(N,1,j) and slice(N,2,j)). Halved kernel launch count in BC!/perBC!.
  2. Neumann BC mirror fix: Changed tangential Neumann from constant extrapolation (a[1]=a[3], a[2]=a[3]) to second-order symmetric reflection (a[2]=a[3], a[1]=a[4]), matching master's stencil extended to 2 ghost cells.
  3. Eliminated boundary corrections in conv_diff!: N+4 layout means inside_u covers wall faces — no separate lowerBoundary!/upperBoundary! kernels needed (4 fewer launches per conv_diff! call vs master).
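For item 2, the two ghost-cell fills differ as in this toy 1-D sketch (illustrative only; the actual BC! operates on slices of N-dimensional fields):

```julia
# Toy comparison of the two tangential Neumann fills. Interior cells start
# at index 3; the wall face sits between indices 2 and 3.
constant_extrap!(a) = (a[1] = a[2] = a[3]; a)       # old: zeroth-order copy
mirror!(a)          = (a[2] = a[3]; a[1] = a[4]; a) # new: reflect about the face

a = [0.0, 0.0, 1.0, 2.0]   # two ghosts, then interior values 1.0, 2.0
mirror!(a)                  # -> [2.0, 1.0, 1.0, 2.0], symmetric about the face
```

The mirrored fill makes the ghost profile a reflection of the interior, so the normal derivative vanishes at the face to second order, consistent with master's single-ghost-cell a[1]=a[2].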

Kernel launch count comparison (non-periodic)

Counted per function call for n=2 (2D) and n=3 (3D):

| Function | Master 2D/3D | mpi-igg (fused) 2D/3D |
|----------|--------------|-----------------------|
| BC! | 10 / 21 | 8 / 12 |
| conv_diff! (per component) | 8 / 12 | 4 / 6 |
| wallBC_L! (per MG level) | 0 / 0 | 4 / 6 |
| perBC! (per call, per dir) | 2 / 2 | 2 / 2 |

In 3D, master's BC! has 21 launches (3 Dirichlet + 2 Neumann per (i,j) pair, 9 pairs) vs mpi-igg's 12 (2 per pair). The conv_diff! savings are also larger in 3D: master needs lowerBoundary!/upperBoundary! corrections (2 extra per direction), while mpi-igg's N+4 layout covers wall faces in inside_u.

mpi-igg has fewer total kernel launches than master despite the N+4 layout.


MPI Strong Scalability

Setup

  • Cases: 2D circle (Re=250) and 3D sphere (Re=3700, exitBC)
  • Backend: SIMD (single-threaded per rank, -t 1)
  • Ranks: np=1,2,4,8 on shared-memory node (--bind-to none)
  • MPI: OpenMPI via MPI.jl + ImplicitGlobalGrid.jl
  • Reproduce: bash bench/run_scaling.sh

Fixes applied

  1. _has_neighbors cache — skip halo exchange when no MPI neighbors (np=1 non-periodic)
  2. inslen precomputation — avoid Allreduce in residual! for constant value
  3. _divisible threshold — changed from N>8 to N>4 to match serial multigrid depth

2D Circle (Re=250)

Small grid (192x128)

| Config | ms/step | Speedup |
|--------|---------|---------|
| Serial | 1.72 | 1.00x |
| np=1 | 1.80 | 0.96x |
| np=2 | 1.59 | 1.08x |
| np=4 | 1.53 | 1.12x |
| np=8 | 3.52 | 0.49x |

Grid too small — compute per rank is dwarfed by MPI latency at np=8.

Large grid (768x512)

| Config | ms/step | Speedup |
|--------|---------|---------|
| Serial | 28.8 | 1.00x |
| np=1 | 27.9 | 1.03x |
| np=2 | 16.4 | 1.76x |
| np=4 | 9.9 | 2.91x |
| np=8 | 12.7 | 2.27x |

Strong scaling up to np=4 (2.91x). np=8 sees diminishing returns as communication overhead grows relative to the smaller per-rank subdomains.

3D Sphere (Re=3700, exitBC)

Small grid (128x48x48)

| Config | ms/step | Speedup |
|--------|---------|---------|
| Serial | 51.7 | 1.00x |
| np=1 | 53.3 | 0.97x |
| np=2 | 37.2 | 1.39x |
| np=4 | 24.6 | 2.10x |
| np=8 | 30.0 | 1.72x |

np=8 regresses vs np=4 due to higher solver iterations (1.5 avg vs 1.1) on the coarser multigrid hierarchy.

Large grid (256x96x96)

| Config | ms/step | Speedup |
|--------|---------|---------|
| Serial | 437.4 | 1.00x |
| np=1 | 442.3 | 0.99x |
| np=2 | 317.5 | 1.38x |
| np=4 | 175.5 | 2.49x |
| np=8 | 169.8 | 2.58x |

np=8 still improving at this grid size — 2.58x speedup. The 3D domain decomposition (2x2x2 topology) distributes the larger problem more effectively.


Hybrid MPI + KernelAbstractions threading

Tested on 768x512 grid with KA backend.

Serial KA baselines (no MPI)

| Threads | ms/step |
|---------|---------|
| 1 | 68.9 |
| 2 | 39.4 |
| 4 | 25.2 |
| 8 | 27.8 |
| 14 | 17.9 |

Threading saturates around 4 threads (~25 ms); beyond that, memory bandwidth contention limits gains, with only the ~14-thread run improving further (17.9 ms).

Hybrid MPI+KA (8 total slots, --bind-to none)

| np x threads | ms/step |
|--------------|---------|
| 1 x 8 | 24.4 |
| 2 x 4 | 27.3 |
| 4 x 2 | 26.3 |
| 8 x 1 | 26.2 |

All configurations in the 24-27ms range. Pure threading slightly beats hybrid splits.

Higher parallelism (~14-16 total slots)

| np x threads | ms/step |
|--------------|---------|
| 1 x 14 | 18.0 |
| 2 x 7 | 21.3 |
| 4 x 4 | 21.7 |
| 8 x 2 | 22.7 |

SIMD+MPI comparison (single-threaded per rank)

| Config | ms/step |
|--------|---------|
| Serial SIMD | 29.7 |
| np=4 SIMD t=1 | 11.5 |
| np=8 SIMD t=1 | 13.7 |

Conclusions

  • SIMD+MPI dominates on CPU: np=4 SIMD (11.5ms) is 1.6x faster than the best KA hybrid (1x14 at 18.0ms). SIMD vectorization outperforms KA's threading model on CPU.
  • KA threading saturates early: Beyond 4 threads, gains are marginal due to memory bandwidth contention.
  • MPI adds overhead in KA hybrid: For a fixed resource budget, pure threading slightly beats any hybrid split.
  • Recommended CPU strategy: Use SIMD backend with one MPI rank per core. KA+threads is better suited for GPU backends.

Changes made

  1. BC!/perBC! kernel fusion — thick-slice @loop calls halve kernel launches
  2. Neumann BC mirror fix — second-order symmetric reflection for N+4 ghost cells
  3. _has_neighbors cache in WaterLilyMPIExt.jl — skip halo when no neighbors
  4. inslen precomputation in Poisson struct — avoid Allreduce in residual!
  5. _divisible threshold changed from N>8 to N>4 — match serial MG depth

…imitive, MPI-aware Metrics

  Replace per-type _apply_offset methods with a generic OffsetBody wrapper
  that shifts coordinates before measure/sdf evaluation. Any AbstractBody
  subtype now works in MPI automatically without implementing _apply_offset.
  AutoBody.jl is unchanged from master.

  Add global_allreduce as the single MPI reduction primitive (identity in
  serial, Allreduce in MPI). All existing reductions (global_dot, global_sum,
  global_perdot, global_length) now build on it, reducing the MPI extension
  from 6 dispatch hooks to 2 (global_allreduce + global_min).

  Make Metrics functions (pressure_force, viscous_force, pressure_moment)
  MPI-aware via global_allreduce. Change interp and ω_θ offset from kwarg
  to positional arg for GPU safety. Add global_offset default arg to
  applyV!/applyS! so user scripts get correct global coordinates.
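The single-primitive idea from that commit can be sketched like this (a hedged illustration; the names follow the commit message, but the real signatures likely differ, and the MPI method is only shown as a comment):

```julia
# Sketch: one reduction primitive, identity in serial. An MPI extension
# would replace it with an Allreduce over the communicator, e.g.
# global_allreduce(val, op) = MPI.Allreduce(val, op, comm)
global_allreduce(val, op) = val

# All other reductions build on the primitive, so the extension only
# needs to override global_allreduce (plus global_min in the PR).
global_sum(x)    = global_allreduce(sum(x), +)
global_length(x) = global_allreduce(length(x), +)
global_dot(x, y) = global_allreduce(sum(x .* y), +)

global_dot([1, 2], [3, 4])  # 1*3 + 2*4 = 11
```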
b-fg commented Apr 16, 2026

PR -/+ LOC breakdown

     File                                           Added Removed  Net
     ----------------------------------------------------------------------
     ext/WaterLilyMPIExt.jl                          +177      +0  +177
     ext/WaterLilyWriteVTKExt.jl                      +49      -3   +46
     src/Body.jl                                      +14      -2   +12
     src/Flow.jl                                       +6     -15    -9
     src/Metrics.jl                                   +27     -14   +13
     src/MultiLevelPoisson.jl                          +8      -8    +0
     src/Poisson.jl                                   +36     -28    +8
     src/WaterLily.jl                                 +21      -3   +18
     src/util.jl                                      +86     -46   +40
     ----------------------------------------------------------------------
     src/ total                                      +198    -116   +82
     ext/ total                                      +226      -3  +223
     TOTAL                                           +424    -119  +305

The core WaterLily changes are only +82 net lines of code — mostly dispatch hooks in util.jl (+40), OffsetBody in Body.jl (+12), MPI-aware Metrics (+13), Simulation constructor offset wrapping (+18), and Poisson inslen precompute (+8). The bulk of the new code (+223) is the MPI extension itself, which is self-contained in ext/.

b-fg added 3 commits April 16, 2026 13:15
The N+4 layout carries 2 ghost cells per side (4 per dimension), doubling
the @loop calls in BC! and perBC! vs N+2. Fuse pairs of ghost cell copies into
single thick-slice kernels: both left ghosts read from the same interior
cell (index 3 or N-2), so they can share one kernel launch.

Adds a range-accepting slice(dims, i::AbstractUnitRange, j) overload.
BC! goes from 36 → 20 kernel launches (3D), perBC! from 4 → 2 per dir.
Profiled improvement: BC! ~1.7x slower → ~1.0-1.1x vs master.
      The tangential Neumann BC was using constant extrapolation (a[1]=a[3],
      a[2]=a[3]) instead of symmetric reflection about the wall face (a[2]=a[3],
      a[1]=a[4]). This gives second-order accuracy consistent with master's
      single-ghost-cell Neumann (a[1]=a[2])
b-fg commented Apr 16, 2026

Serial performance comparison vs master. Results are within noise levels, I'd say (±5-10% in time). Note that the current implementation requires N+4 cells per dimension instead of N+2, but also that we have fewer BC-related kernel launches. For a 2D case:

  ┌────────────────────────────────┬────────────────────┬────────────────────┐
  │            Function            │    Master (2D)     │ mpi-igg fused (2D) │
  ├────────────────────────────────┼────────────────────┼────────────────────┤
  │ BC!                            │ 10                 │ 8                  │
  ├────────────────────────────────┼────────────────────┼────────────────────┤
  │ conv_diff! (×2)                │ 16                 │ 8                  │
  ├────────────────────────────────┼────────────────────┼────────────────────┤
  │ lowerBoundary!/upperBoundary!  │ 8 (included above) │ 0                  │
  ├────────────────────────────────┼────────────────────┼────────────────────┤
  │ wallBC_L! (per restrict level) │ 0                  │ 4                  │
  ├────────────────────────────────┼────────────────────┼────────────────────┤
  │ perBC! (per call)              │ 2/dir              │ 2/dir (same)       │
  └────────────────────────────────┴────────────────────┴────────────────────┘
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        4607 │   0.00 │     4.07 │           155.18 │     1.09 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        4675 │   0.00 │     4.45 │           169.59 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │     1733707 │   0.00 │     2.81 │           107.09 │     1.58 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │     2039313 │   0.00 │     2.77 │           105.79 │     1.60 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     1980707 │   0.00 │     0.44 │            16.60 │    10.21 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     2258699 │   0.00 │     0.40 │            15.38 │    11.03 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        5007 │   0.00 │    29.42 │           140.28 │     0.96 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        4607 │   0.00 │    28.33 │           135.10 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │     1873507 │   0.00 │    17.41 │            83.04 │     1.63 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │     2032707 │   0.00 │    16.97 │            80.91 │     1.67 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     2145609 │   0.00 │     2.82 │            13.43 │    10.06 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     2243109 │   0.00 │     2.67 │            12.72 │    10.62 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        3807 │   0.00 │     4.51 │           152.98 │     0.92 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        3807 │   0.00 │     4.13 │           140.07 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │     1507107 │   0.00 │     2.50 │            84.82 │     1.65 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │     1748907 │   0.00 │     2.37 │            80.43 │     1.74 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     1687107 │   0.00 │     0.41 │            13.81 │    10.14 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     1923604 │   0.00 │     0.38 │            12.74 │    11.00 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        4207 │   0.00 │    40.63 │           172.21 │     0.96 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        4207 │   0.00 │    39.15 │           165.94 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │     1646907 │   0.00 │    17.74 │            75.20 │     2.21 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │     1888707 │   0.00 │    17.28 │            73.23 │     2.27 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     1851775 │   0.00 │     3.04 │            12.91 │    12.86 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     2088609 │   0.00 │     2.86 │            12.13 │    13.68 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        8259 │   0.00 │     4.19 │           319.93 │     0.84 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        5817 │   0.00 │     3.50 │           267.37 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │     5863903 │   0.70 │     2.91 │           221.81 │     1.21 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │     5269745 │   0.71 │     2.86 │           218.16 │     1.23 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     3974641 │   0.00 │     0.54 │            41.33 │     6.47 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     3283039 │   0.00 │     0.48 │            36.99 │     7.23 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   mpi-igg │ 1.11.5 │   Float32 │        6533 │   0.00 │    23.34 │           222.58 │     1.14 │
│     CPUx01 │    v1.6.1 │ 1.11.5 │   Float32 │        7863 │   0.00 │    26.49 │           252.66 │     1.00 │
│     CPUx04 │   mpi-igg │ 1.11.5 │   Float32 │    11658457 │   0.33 │    14.38 │           137.18 │     1.84 │
│     CPUx04 │    v1.6.1 │ 1.11.5 │   Float32 │    12258738 │   0.30 │    15.32 │           146.11 │     1.73 │
│ GPU-NVIDIA │   mpi-igg │ 1.11.5 │   Float32 │     3459079 │   0.00 │     1.98 │            18.88 │    13.38 │
│ GPU-NVIDIA │    v1.6.1 │ 1.11.5 │   Float32 │     4232609 │   0.00 │     2.03 │            19.32 │    13.08 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

b-fg added 2 commits April 16, 2026 18:07
…r GPU+SIMD mismatch

  Move check_nthreads() into __init__() so it runs at module load time (not just
  precompilation), and guard with jl_generating_output to suppress the spurious
  single-thread warning that always fired during precompilation. Add check_mem()
  to error early when mem=CuArray/ROCArray is used with the SIMD backend.