Introduce `AbstractParMode` dispatch pattern for MPI support without
method overwriting, enabling normal precompilation. All parallel hooks
(reductions, halo exchange, wall BCs, pressure pinning) dispatch through
`par_mode[]` — serial defines `Serial()` methods, `WaterLilyMPIExt` adds
`Parallel()` methods.
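The pattern can be sketched roughly as follows — a minimal stand-alone sketch, not the actual WaterLily source; `global_dot` is used as an illustrative hook and the real signatures differ:

```julia
# Minimal sketch of the AbstractParMode dispatch pattern (illustrative names).
abstract type AbstractParMode end
struct Serial <: AbstractParMode end
struct Parallel{C} <: AbstractParMode
    comm::C            # the extension would store the MPI communicator here
end

# Global switch; the extension flips it at init: par_mode[] = Parallel(comm)
const par_mode = Ref{AbstractParMode}(Serial())

# Serial hook: a plain local reduction
global_dot(a, b, ::Serial) = sum(a .* b)
global_dot(a, b) = global_dot(a, b, par_mode[])

# WaterLilyMPIExt would ADD a method instead of overwriting:
# global_dot(a, b, p::Parallel) = MPI.Allreduce(sum(a .* b), +, p.comm)
```

Because the extension only adds `Parallel` methods, the serial code path is untouched and both packages precompile normally.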
Key changes:
- N+4 staggered grid layout (2 ghost cells per side) replacing N+2,
making BC! and MPI halo exchange operate on the same indices
- Symmetric BC! that sets indices 1,2 and M-1,M (no save/restore needed)
- comm!/velocity_comm! hooks separate domain BCs from communication
- Poisson solver uses global reductions (global_dot, global_sum) and
comm! instead of local dot products and perBC!
- pin_pressure! removes global mean in solver! (not project!)
- wallBC_L! dispatches for MPI-aware physical wall detection
- _apply_offset wraps body map with rank-local coordinate offset
- Simplified conv_diff! (removed boundary-specific stencils)
- VTK writer dispatches save! on par_mode[] for parallel PVTI output
- WaterLilyMPIExt: halo exchange (IGG for fine grid, direct MPI for
multigrid), exitBC!, init_waterlily_mpi, global_offset
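As one concrete example of the hooks listed above, the pressure-pinning step could look like this — serial fallbacks only, with simplified bodies; the real WaterLily signatures differ:

```julia
# Serial fallbacks; the MPI extension would add Parallel methods wrapping
# these local results in MPI.Allreduce so every rank agrees on the mean.
global_sum(a) = sum(a)
global_length(a) = length(a)

# Remove the global mean so the singular Poisson system has a pinned solution.
function pin_pressure!(p::AbstractArray)
    p .-= global_sum(p) / global_length(p)
    return p
end
```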
Add missing docstrings for exported and internal functions (`set_backend`, `global_dot`/`global_sum`/`global_length`/`global_min`, `scalar_halo!`, `velocity_halo!`, `global_perdot`, `L₂`, `conv_diff!`, `BDIM!`, `project!`, `CFL`, `_apply_offset`, `up`/`down`/`restrict`/`restrictL`/`restrictML`/`restrictL!`, `Vcycle!`, `MultiLevelPoisson` `solver!`). Fix inaccuracies and typos, and update existing docstrings with MPI-relevant details.
- Add _has_neighbors flag to skip halo exchange when no MPI neighbors
exist (e.g. np=1 non-periodic), eliminating IGG update_halo! overhead
and unnecessary buffer copies
- Precompute global_length(inside(x)) in Poisson struct (inslen field)
to avoid one MPI Allreduce per residual! call
- Change parallel _divisible threshold from N>8 to N>4 (same as serial)
since coarse-level comm cost is negligible with the _has_neighbors
short-circuit
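The `_has_neighbors` short-circuit amounts to a one-line guard; a simplified, serially-runnable sketch (names follow the PR, bodies are illustrative):

```julia
# Set once at init from the MPI cartesian topology; false when np=1 and the
# domain is non-periodic, i.e. no rank has anything to exchange.
const _has_neighbors = Ref(false)

function scalar_halo!(a::AbstractArray)
    _has_neighbors[] || return a   # skip update_halo! and all buffer copies
    # ... IGG update_halo!(a) or direct MPI send/recv would run here ...
    return a
end
```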
Performance — mpi-igg branch

Serial performance vs master

Benchmarked with TGV (no body, no remeasure)
Sphere (3D body, no remeasure)
Jelly (3D body, remeasure every step)
Summary
Optimizations applied
Kernel launch count comparison (non-periodic)

Counted per function call for n=2 (2D) and n=3 (3D):
In 3D, master's BC! has 21 launches (3 Dirichlet + 2 Neumann per (i,j) pair, 9 pairs) vs mpi-igg's 12 (2 per pair). The mpi-igg has fewer total kernel launches than master despite the N+4 layout.

MPI Strong Scalability

Setup
Fixes applied
2D Circle (Re=250)

Small grid (192x128)
Grid too small — compute per rank is dwarfed by MPI latency at np=8.

Large grid (768x512)
Strong scaling up to np=4 (2.91x). np=8 sees diminishing returns as communication overhead grows relative to the smaller per-rank subdomains.

3D Sphere (Re=3700, exitBC)

Small grid (128x48x48)
np=8 regresses vs np=4 due to higher solver iterations (1.5 avg vs 1.1) on the coarser multigrid hierarchy.

Large grid (256x96x96)
np=8 still improving at this grid size — 2.58x speedup. The 3D domain decomposition (2x2x2 topology) distributes the larger problem more effectively.

Hybrid MPI + KernelAbstractions threading

Tested on 768x512 grid with KA backend.

Serial KA baselines (no MPI)
Threading saturates around 4 threads (~25ms). Beyond that, memory bandwidth contention limits gains until ~14 threads.

Hybrid MPI+KA (8 total slots)
| np x threads | ms/step |
|---|---|
| 1 x 8 | 24.4 |
| 2 x 4 | 27.3 |
| 4 x 2 | 26.3 |
| 8 x 1 | 26.2 |
All configurations in the 24-27ms range. Pure threading slightly beats hybrid splits.
Higher parallelism (~14-16 total slots)
| np x threads | ms/step |
|---|---|
| 1 x 14 | 18.0 |
| 2 x 7 | 21.3 |
| 4 x 4 | 21.7 |
| 8 x 2 | 22.7 |
SIMD+MPI comparison (single-threaded per rank)
| Config | ms/step |
|---|---|
| Serial SIMD | 29.7 |
| np=4 SIMD t=1 | 11.5 |
| np=8 SIMD t=1 | 13.7 |
Conclusions
- SIMD+MPI dominates on CPU: np=4 SIMD (11.5ms) is 1.6x faster than the best KA hybrid (1x14 at 18.0ms). SIMD vectorization outperforms KA's threading model on CPU.
- KA threading saturates early: Beyond 4 threads, gains are marginal due to memory bandwidth contention.
- MPI adds overhead in KA hybrid: For a fixed resource budget, pure threading slightly beats any hybrid split.
- Recommended CPU strategy: Use SIMD backend with one MPI rank per core. KA+threads is better suited for GPU backends.
Changes made
- `BC!`/`perBC!` kernel fusion — thick-slice `@loop` calls halve kernel launches
- Neumann BC mirror fix — second-order symmetric reflection for N+4 ghost cells
- `_has_neighbors` cache in `WaterLilyMPIExt.jl` — skip halo exchange when no neighbors
- `inslen` precomputation in `Poisson` struct — avoid Allreduce in `residual!`
- `_divisible` threshold changed from `N>8` to `N>4` — match serial MG depth
…imitive, MPI-aware Metrics

Replace per-type `_apply_offset` methods with a generic `OffsetBody` wrapper that shifts coordinates before `measure`/`sdf` evaluation. Any `AbstractBody` subtype now works in MPI automatically without implementing `_apply_offset`. AutoBody.jl is unchanged from master.

Add `global_allreduce` as the single MPI reduction primitive (identity in serial, Allreduce in MPI). All existing reductions (`global_dot`, `global_sum`, `global_perdot`, `global_length`) now build on it, reducing the MPI extension from 6 dispatch hooks to 2 (`global_allreduce` + `global_min`).

Make Metrics functions (`pressure_force`, `viscous_force`, `pressure_moment`) MPI-aware via `global_allreduce`. Change `interp` and `ω_θ` offset from kwarg to positional arg for GPU safety. Add `global_offset` default arg to `applyV!`/`applyS!` so user scripts get correct global coordinates.
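The `OffsetBody` idea reduces to shifting the query point before delegating. A self-contained sketch with a stand-in `AbstractBody` and sphere SDF — the real WaterLily `measure`/`sdf` take more arguments:

```julia
abstract type AbstractBody end

# Generic wrapper: shift rank-local coordinates by the rank's global offset
# before evaluating the wrapped body — no per-type _apply_offset needed.
struct OffsetBody{B<:AbstractBody,V} <: AbstractBody
    body::B
    offset::V
end
sdf(o::OffsetBody, x, t) = sdf(o.body, x .+ o.offset, t)

# Stand-in body: sphere of radius r centered at c, SDF in global coordinates
struct Sphere{T} <: AbstractBody
    c::Vector{T}
    r::T
end
sdf(s::Sphere, x, t) = sqrt(sum(abs2, x .- s.c)) - s.r
```

On a rank whose subdomain starts at global coordinate (2, 0), evaluating the wrapped body at local point (1, 0) queries the SDF at global (3, 0) — user SDFs stay written in global coordinates, identical to serial.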
PR -/+ LOC breakdown File Added Removed Net
----------------------------------------------------------------------
ext/WaterLilyMPIExt.jl +177 +0 +177
ext/WaterLilyWriteVTKExt.jl +49 -3 +46
src/Body.jl +14 -2 +12
src/Flow.jl +6 -15 -9
src/Metrics.jl +27 -14 +13
src/MultiLevelPoisson.jl +8 -8 +0
src/Poisson.jl +36 -28 +8
src/WaterLily.jl +21 -3 +18
src/util.jl +86 -46 +40
----------------------------------------------------------------------
src/ total +198 -116 +82
ext/ total +226 -3 +223
TOTAL +424 -119 +305

The core WaterLily changes are only +82 net lines of code — mostly dispatch hooks in util.jl (+40), OffsetBody in Body.jl (+12), MPI-aware Metrics (+13), Simulation constructor offset wrapping (+18), and Poisson inslen precompute (+8). The bulk of the new code (+223) is the MPI extension itself, which is self-contained in ext/.
The N+4 layout requires 2 ghost cells per side, doubling the `@loop` calls in `BC!` and `perBC!` vs N+2. Fuse pairs of ghost-cell copies into single thick-slice kernels: both left ghosts read from the same interior cell (index 3 or N-2), so they can share one kernel launch. Adds a range-accepting `slice(dims, i::AbstractUnitRange, j)` overload. `BC!` goes from 36 → 20 kernel launches (3D), `perBC!` from 4 → 2 per dir. Profiled improvement: `BC!` ~1.7x slower → ~1.0-1.1x vs master.
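A simplified 2D stand-in for the fused thick-slice fill — the real code uses `@loop` over the range-accepting `slice` overload; indices follow the N+4 convention:

```julia
# One sweep over ghost rows 1:2 replaces two single-slice kernel launches;
# both ghost layers copy from the same interior row (index 3).
function left_ghosts!(a::AbstractMatrix)
    for I in CartesianIndices((1:2, axes(a, 2)))
        a[I] = a[3, I[2]]
    end
    return a
end
```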
The tangential Neumann BC was using constant extrapolation (`a[1]=a[3]`, `a[2]=a[3]`) instead of symmetric reflection about the wall face (`a[2]=a[3]`, `a[1]=a[4]`). This gives second-order accuracy consistent with master's single-ghost-cell Neumann (`a[1]=a[2]`).
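In 1D along one boundary the fix looks like this (sketch; `a` holds the N+4 ghost layout with the interior starting at index 3):

```julia
# Symmetric reflection about the wall face between cells 2 and 3:
# ghost 2 mirrors interior 3, ghost 1 mirrors interior 4 (second order),
# rather than constant extrapolation a[1] = a[2] = a[3].
function neumann_left!(a::AbstractVector)
    a[2] = a[3]
    a[1] = a[4]
    return a
end
```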
Serial performance comparison vs master. Results are within noise levels, roughly ±5-10% on timings. Noting that the current implementation requires …

┌────────────────────────────────┬────────────────────┬────────────────────┐
│ Function │ Master (2D) │ mpi-igg fused (2D) │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ BC! │ 10 │ 8 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ conv_diff! (×2) │ 16 │ 8 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ lowerBoundary!/upperBoundary! │ 8 (included above) │ 0 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ wallBC_L! (per restrict level) │ 0 │ 4 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ perBC! (per call) │ 2/dir │ 2/dir (same) │
└────────────────────────────────┴────────────────────┴────────────────────┘
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 4.07 │ 155.18 │ 1.09 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4675 │ 0.00 │ 4.45 │ 169.59 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1733707 │ 0.00 │ 2.81 │ 107.09 │ 1.58 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2039313 │ 0.00 │ 2.77 │ 105.79 │ 1.60 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1980707 │ 0.00 │ 0.44 │ 16.60 │ 10.21 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2258699 │ 0.00 │ 0.40 │ 15.38 │ 11.03 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 5007 │ 0.00 │ 29.42 │ 140.28 │ 0.96 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 28.33 │ 135.10 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1873507 │ 0.00 │ 17.41 │ 83.04 │ 1.63 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2032707 │ 0.00 │ 16.97 │ 80.91 │ 1.67 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 2145609 │ 0.00 │ 2.82 │ 13.43 │ 10.06 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2243109 │ 0.00 │ 2.67 │ 12.72 │ 10.62 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.51 │ 152.98 │ 0.92 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.13 │ 140.07 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1507107 │ 0.00 │ 2.50 │ 84.82 │ 1.65 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1748907 │ 0.00 │ 2.37 │ 80.43 │ 1.74 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1687107 │ 0.00 │ 0.41 │ 13.81 │ 10.14 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 1923604 │ 0.00 │ 0.38 │ 12.74 │ 11.00 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 40.63 │ 172.21 │ 0.96 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 39.15 │ 165.94 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1646907 │ 0.00 │ 17.74 │ 75.20 │ 2.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1888707 │ 0.00 │ 17.28 │ 73.23 │ 2.27 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1851775 │ 0.00 │ 3.04 │ 12.91 │ 12.86 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2088609 │ 0.00 │ 2.86 │ 12.13 │ 13.68 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 8259 │ 0.00 │ 4.19 │ 319.93 │ 0.84 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 5817 │ 0.00 │ 3.50 │ 267.37 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 5863903 │ 0.70 │ 2.91 │ 221.81 │ 1.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 5269745 │ 0.71 │ 2.86 │ 218.16 │ 1.23 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 3974641 │ 0.00 │ 0.54 │ 41.33 │ 6.47 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 3283039 │ 0.00 │ 0.48 │ 36.99 │ 7.23 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 6533 │ 0.00 │ 23.34 │ 222.58 │ 1.14 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 7863 │ 0.00 │ 26.49 │ 252.66 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 11658457 │ 0.33 │ 14.38 │ 137.18 │ 1.84 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 12258738 │ 0.30 │ 15.32 │ 146.11 │ 1.73 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 3459079 │ 0.00 │ 1.98 │ 18.88 │ 13.38 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 4232609 │ 0.00 │ 2.03 │ 19.32 │ 13.08 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
…r GPU+SIMD mismatch

Move `check_nthreads()` into `__init__()` so it runs at module load time (not just precompilation), and guard with `jl_generating_output` to suppress the spurious single-thread warning that always fired during precompilation. Add `check_mem()` to error early when `mem=CuArray`/`ROCArray` is used with the SIMD backend.
Claude-assisted MPI extension using ImplicitGlobalGrid.jl
This branch is a clean implementation of #236 but without including the `AbstractBC` and `AbstractFlow` types. It incorporates MPI via IGG, keeping it as minimal as possible. Tests have also been adjusted to the new N+4 array sizes.

Some pending tasks are:
- `loc` offset strategy: `OffsetBody` wraps `AbstractBody`, and `loc` includes the `offset` value
- Explore hiding comms (mostly relevant on GPUs, but requires a big refactor of range functions, effectively splitting them between inner cells and halo cells)

Summary
- `AbstractParMode` dispatch pattern (`Serial`/`Parallel`) for MPI support without method overwriting, preserving precompilation
- `BC!` and MPI halo exchange operate on the same 4 indices — no save/restore logic needed
- `WaterLilyMPIExt` (triggered by `ImplicitGlobalGrid` + `MPI`): global reductions, halo exchange (IGG for fine grid, direct MPI for multigrid), `exitBC!`, `wallBC_L!`, `init_waterlily_mpi`, `global_offset`
- `save!` in `WaterLilyWriteVTKExt` dispatches through `par_mode[]` for parallel PVTI output

Design
All MPI hooks dispatch through `par_mode[]` (defaults to `Serial()`). The extension sets `par_mode[] = Parallel(comm)` and adds new methods — no method overwriting. This keeps the serial code path unchanged and lets the extension precompile normally.

The N+4 layout gives symmetric ghost/boundary cells (indices 1,2 left; M-1,M right). `BC!` sets Dirichlet/Neumann at these indices; the MPI halo overwrites the same indices at rank-internal boundaries. At physical boundaries (no MPI neighbor), `BC!` values are preserved. This eliminates the need for saving and restoring boundary values around halo exchanges.

Key internal changes:
- `comm!`/`velocity_comm!` replace bare `perBC!` in the Poisson solver — MPI rank-internal boundaries have L≠0 and need halo exchange regardless of periodicity
- `pin_pressure!` removes the global mean inside `solver!` (not `project!`), using `global_sum`/`global_length` for MPI-consistent pinning
- `wallBC_L!` dispatches on `par_mode[]` so only physical walls get L=0 (MPI-internal faces keep L=1)
- `_apply_offset` wraps the body's `map` function with the rank-local coordinate offset at `Simulation` construction — user SDFs are written in global coordinates, identical to serial
- `conv_diff!` simplified: boundary-specific stencils (`ϕuP`, `ϕuL`, `ϕuR`, `lowerBoundary!`, `upperBoundary!`) removed since N+4 ghost cells provide the extra stencil points

Dispatched hooks
- `global_dot`, `global_sum`, `global_length`, `global_min` — `MPI.Allreduce`
- `global_perdot` — `MPI.Allreduce` of the local result
- `scalar_halo!`, `velocity_halo!` — `update_halo!` or direct MPI
- `comm!`, `velocity_comm!` — `perBC!` in serial, halo exchange in parallel
- `wallBC_L!`
- `exitBC!`
- `divisible` — `N>4` in serial, `N>8` in parallel (stricter for multigrid)
- `global_offset`

Validation
Serial and parallel (4 ranks) produce quantitatively matching results. Remaining max|Δp| differences are confined to body-interior cells where the Poisson operator has `iD=0` (unconstrained pressure) — a known immersed-boundary characteristic, not a serial/parallel discrepancy.