Introduce `AbstractParMode` dispatch pattern for MPI support without
method overwriting, enabling normal precompilation. All parallel hooks
(reductions, halo exchange, wall BCs, pressure pinning) dispatch through
`par_mode[]` — serial defines `Serial()` methods, `WaterLilyMPIExt` adds
`Parallel()` methods.
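The pattern can be sketched roughly as follows — a minimal stand-alone sketch, not the actual WaterLily source; `global_dot` is used as an illustrative hook and the real signatures differ:

```julia
# Minimal sketch of the AbstractParMode dispatch pattern (illustrative names).
abstract type AbstractParMode end
struct Serial <: AbstractParMode end
struct Parallel{C} <: AbstractParMode
    comm::C            # the extension would store the MPI communicator here
end

# Global switch; the extension flips it at init: par_mode[] = Parallel(comm)
const par_mode = Ref{AbstractParMode}(Serial())

# Serial hook: a plain local reduction
global_dot(a, b, ::Serial) = sum(a .* b)
global_dot(a, b) = global_dot(a, b, par_mode[])

# WaterLilyMPIExt would ADD a method instead of overwriting:
# global_dot(a, b, p::Parallel) = MPI.Allreduce(sum(a .* b), +, p.comm)
```

Because the extension only adds `Parallel` methods, the serial code path is untouched and both packages precompile normally.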
Key changes:
- N+4 staggered grid layout (2 ghost cells per side) replacing N+2,
making BC! and MPI halo exchange operate on the same indices
- Symmetric BC! that sets indices 1,2 and M-1,M (no save/restore needed)
- comm!/velocity_comm! hooks separate domain BCs from communication
- Poisson solver uses global reductions (global_dot, global_sum) and
comm! instead of local dot products and perBC!
- pin_pressure! removes global mean in solver! (not project!)
- wallBC_L! dispatches for MPI-aware physical wall detection
- _apply_offset wraps body map with rank-local coordinate offset
- Simplified conv_diff! (removed boundary-specific stencils)
- VTK writer dispatches save! on par_mode[] for parallel PVTI output
- WaterLilyMPIExt: halo exchange (IGG for fine grid, direct MPI for
multigrid), exitBC!, init_waterlily_mpi, global_offset
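As one concrete example of the hooks listed above, the pressure-pinning step could look like this — serial fallbacks only, with simplified bodies; the real WaterLily signatures differ:

```julia
# Serial fallbacks; the MPI extension would add Parallel methods wrapping
# these local results in MPI.Allreduce so every rank agrees on the mean.
global_sum(a) = sum(a)
global_length(a) = length(a)

# Remove the global mean so the singular Poisson system has a pinned solution.
function pin_pressure!(p::AbstractArray)
    p .-= global_sum(p) / global_length(p)
    return p
end
```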
Add missing docstrings for exported and internal functions (`set_backend`, `global_dot`/`global_sum`/`global_length`/`global_min`, `scalar_halo!`, `velocity_halo!`, `global_perdot`, `L₂`, `conv_diff!`, `BDIM!`, `project!`, `CFL`, `_apply_offset`, `up`/`down`/`restrict`/`restrictL`/`restrictML`/`restrictL!`, `Vcycle!`, `MultiLevelPoisson` `solver!`). Fix inaccuracies and typos, and update existing docstrings with MPI-relevant details.
- Add _has_neighbors flag to skip halo exchange when no MPI neighbors
exist (e.g. np=1 non-periodic), eliminating IGG update_halo! overhead
and unnecessary buffer copies
- Precompute global_length(inside(x)) in Poisson struct (inslen field)
to avoid one MPI Allreduce per residual! call
- Change parallel _divisible threshold from N>8 to N>4 (same as serial)
since coarse-level comm cost is negligible with the _has_neighbors
short-circuit
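The `_has_neighbors` short-circuit amounts to a one-line guard; a simplified, serially-runnable sketch (names follow the PR, bodies are illustrative):

```julia
# Set once at init from the MPI cartesian topology; false when np=1 and the
# domain is non-periodic, i.e. no rank has anything to exchange.
const _has_neighbors = Ref(false)

function scalar_halo!(a::AbstractArray)
    _has_neighbors[] || return a   # skip update_halo! and all buffer copies
    # ... IGG update_halo!(a) or direct MPI send/recv would run here ...
    return a
end
```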
Performance — mpi-igg branch

Serial performance vs master

Benchmarked with TGV (no body, no remeasure)
Sphere (3D body, no remeasure)
Jelly (3D body, remeasure every step)
Summary
Optimizations applied
Kernel launch count comparison (non-periodic)

Counted per function call for n=2 (2D) and n=3 (3D):
In 3D, master's BC! has 21 launches (3 Dirichlet + 2 Neumann per (i,j) pair, 9 pairs) vs mpi-igg's 12 (2 per pair). The mpi-igg has fewer total kernel launches than master despite the N+4 layout.

MPI Strong Scalability

Setup
Fixes applied
2D Circle (Re=250)

Small grid (192x128)
Grid too small — compute per rank is dwarfed by MPI latency at np=8.

Large grid (768x512)
Strong scaling up to np=4 (2.91x). np=8 sees diminishing returns as communication overhead grows relative to the smaller per-rank subdomains.

3D Sphere (Re=3700, exitBC)

Small grid (128x48x48)
np=8 regresses vs np=4 due to higher solver iterations (1.5 avg vs 1.1) on the coarser multigrid hierarchy.

Large grid (256x96x96)
np=8 still improving at this grid size — 2.58x speedup. The 3D domain decomposition (2x2x2 topology) distributes the larger problem more effectively.

Hybrid MPI + KernelAbstractions threading

Tested on 768x512 grid with KA backend.

Serial KA baselines (no MPI)
Threading saturates around 4 threads (~25ms). Beyond that, memory bandwidth contention limits gains until ~14 threads.

Hybrid MPI+KA (8 total slots)
| np x threads | ms/step |
|---|---|
| 1 x 8 | 24.4 |
| 2 x 4 | 27.3 |
| 4 x 2 | 26.3 |
| 8 x 1 | 26.2 |
All configurations in the 24-27ms range. Pure threading slightly beats hybrid splits.
Higher parallelism (~14-16 total slots)
| np x threads | ms/step |
|---|---|
| 1 x 14 | 18.0 |
| 2 x 7 | 21.3 |
| 4 x 4 | 21.7 |
| 8 x 2 | 22.7 |
SIMD+MPI comparison (single-threaded per rank)
| Config | ms/step |
|---|---|
| Serial SIMD | 29.7 |
| np=4 SIMD t=1 | 11.5 |
| np=8 SIMD t=1 | 13.7 |
Conclusions
- SIMD+MPI dominates on CPU: np=4 SIMD (11.5ms) is 1.6x faster than the best KA hybrid (1x14 at 18.0ms). SIMD vectorization outperforms KA's threading model on CPU.
- KA threading saturates early: Beyond 4 threads, gains are marginal due to memory bandwidth contention.
- MPI adds overhead in KA hybrid: For a fixed resource budget, pure threading slightly beats any hybrid split.
- Recommended CPU strategy: Use SIMD backend with one MPI rank per core. KA+threads is better suited for GPU backends.
Changes made
- `BC!`/`perBC!` kernel fusion — thick-slice `@loop` calls halve kernel launches
- Neumann BC mirror fix — second-order symmetric reflection for N+4 ghost cells
- `_has_neighbors` cache in `WaterLilyMPIExt.jl` — skip halo exchange when no neighbors
- `inslen` precomputation in `Poisson` struct — avoid Allreduce in `residual!`
- `_divisible` threshold changed from `N>8` to `N>4` — match serial MG depth
…imitive, MPI-aware Metrics

Replace per-type `_apply_offset` methods with a generic `OffsetBody` wrapper that shifts coordinates before `measure`/`sdf` evaluation. Any `AbstractBody` subtype now works in MPI automatically without implementing `_apply_offset`. AutoBody.jl is unchanged from master.

Add `global_allreduce` as the single MPI reduction primitive (identity in serial, Allreduce in MPI). All existing reductions (`global_dot`, `global_sum`, `global_perdot`, `global_length`) now build on it, reducing the MPI extension from 6 dispatch hooks to 2 (`global_allreduce` + `global_min`).

Make Metrics functions (`pressure_force`, `viscous_force`, `pressure_moment`) MPI-aware via `global_allreduce`. Change `interp` and `ω_θ` offset from kwarg to positional arg for GPU safety. Add `global_offset` default arg to `applyV!`/`applyS!` so user scripts get correct global coordinates.
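The `OffsetBody` idea reduces to shifting the query point before delegating. A self-contained sketch with a stand-in `AbstractBody` and sphere SDF — the real WaterLily `measure`/`sdf` take more arguments:

```julia
abstract type AbstractBody end

# Generic wrapper: shift rank-local coordinates by the rank's global offset
# before evaluating the wrapped body — no per-type _apply_offset needed.
struct OffsetBody{B<:AbstractBody,V} <: AbstractBody
    body::B
    offset::V
end
sdf(o::OffsetBody, x, t) = sdf(o.body, x .+ o.offset, t)

# Stand-in body: sphere of radius r centered at c, SDF in global coordinates
struct Sphere{T} <: AbstractBody
    c::Vector{T}
    r::T
end
sdf(s::Sphere, x, t) = sqrt(sum(abs2, x .- s.c)) - s.r
```

On a rank whose subdomain starts at global coordinate (2, 0), evaluating the wrapped body at local point (1, 0) queries the SDF at global (3, 0) — user SDFs stay written in global coordinates, identical to serial.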
PR -/+ LOC breakdown File Added Removed Net
----------------------------------------------------------------------
ext/WaterLilyMPIExt.jl +177 +0 +177
ext/WaterLilyWriteVTKExt.jl +49 -3 +46
src/Body.jl +14 -2 +12
src/Flow.jl +6 -15 -9
src/Metrics.jl +27 -14 +13
src/MultiLevelPoisson.jl +8 -8 +0
src/Poisson.jl +36 -28 +8
src/WaterLily.jl +21 -3 +18
src/util.jl +86 -46 +40
----------------------------------------------------------------------
src/ total +198 -116 +82
ext/ total +226 -3 +223
TOTAL +424 -119 +305

The core WaterLily changes are only +82 net lines of code — mostly dispatch hooks in util.jl (+40), OffsetBody in Body.jl (+12), MPI-aware Metrics (+13), Simulation constructor offset wrapping (+18), and Poisson inslen precompute (+8). The bulk of the new code (+223) is the MPI extension itself, which is self-contained in ext/.
The N+4 layout requires 2 ghost cells per side, doubling the `@loop` calls in `BC!` and `perBC!` vs N+2. Fuse pairs of ghost-cell copies into single thick-slice kernels: both left ghosts read from the same interior cell (index 3 or N-2), so they can share one kernel launch. Adds a range-accepting `slice(dims, i::AbstractUnitRange, j)` overload. `BC!` goes from 36 → 20 kernel launches (3D), `perBC!` from 4 → 2 per dir. Profiled improvement: `BC!` ~1.7x slower → ~1.0-1.1x vs master.
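A simplified 2D stand-in for the fused thick-slice fill — the real code uses `@loop` over the range-accepting `slice` overload; indices follow the N+4 convention:

```julia
# One sweep over ghost rows 1:2 replaces two single-slice kernel launches;
# both ghost layers copy from the same interior row (index 3).
function left_ghosts!(a::AbstractMatrix)
    for I in CartesianIndices((1:2, axes(a, 2)))
        a[I] = a[3, I[2]]
    end
    return a
end
```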
The tangential Neumann BC was using constant extrapolation (`a[1]=a[3]`, `a[2]=a[3]`) instead of symmetric reflection about the wall face (`a[2]=a[3]`, `a[1]=a[4]`). This gives second-order accuracy consistent with master's single-ghost-cell Neumann (`a[1]=a[2]`).
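In 1D along one boundary the fix looks like this (sketch; `a` holds the N+4 ghost layout with the interior starting at index 3):

```julia
# Symmetric reflection about the wall face between cells 2 and 3:
# ghost 2 mirrors interior 3, ghost 1 mirrors interior 4 (second order),
# rather than constant extrapolation a[1] = a[2] = a[3].
function neumann_left!(a::AbstractVector)
    a[2] = a[3]
    a[1] = a[4]
    return a
end
```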
Serial performance comparison vs master. Results are within noise levels, roughly ±5-10% on timings. Noting that the current implementation requires …

┌────────────────────────────────┬────────────────────┬────────────────────┐
│ Function │ Master (2D) │ mpi-igg fused (2D) │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ BC! │ 10 │ 8 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ conv_diff! (×2) │ 16 │ 8 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ lowerBoundary!/upperBoundary! │ 8 (included above) │ 0 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ wallBC_L! (per restrict level) │ 0 │ 4 │
├────────────────────────────────┼────────────────────┼────────────────────┤
│ perBC! (per call) │ 2/dir │ 2/dir (same) │
└────────────────────────────────┴────────────────────┴────────────────────┘
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 4.07 │ 155.18 │ 1.09 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4675 │ 0.00 │ 4.45 │ 169.59 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1733707 │ 0.00 │ 2.81 │ 107.09 │ 1.58 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2039313 │ 0.00 │ 2.77 │ 105.79 │ 1.60 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1980707 │ 0.00 │ 0.44 │ 16.60 │ 10.21 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2258699 │ 0.00 │ 0.40 │ 15.38 │ 11.03 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 5007 │ 0.00 │ 29.42 │ 140.28 │ 0.96 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 28.33 │ 135.10 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1873507 │ 0.00 │ 17.41 │ 83.04 │ 1.63 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2032707 │ 0.00 │ 16.97 │ 80.91 │ 1.67 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 2145609 │ 0.00 │ 2.82 │ 13.43 │ 10.06 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2243109 │ 0.00 │ 2.67 │ 12.72 │ 10.62 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.51 │ 152.98 │ 0.92 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.13 │ 140.07 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1507107 │ 0.00 │ 2.50 │ 84.82 │ 1.65 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1748907 │ 0.00 │ 2.37 │ 80.43 │ 1.74 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1687107 │ 0.00 │ 0.41 │ 13.81 │ 10.14 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 1923604 │ 0.00 │ 0.38 │ 12.74 │ 11.00 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 40.63 │ 172.21 │ 0.96 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 39.15 │ 165.94 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 1646907 │ 0.00 │ 17.74 │ 75.20 │ 2.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1888707 │ 0.00 │ 17.28 │ 73.23 │ 2.27 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 1851775 │ 0.00 │ 3.04 │ 12.91 │ 12.86 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2088609 │ 0.00 │ 2.86 │ 12.13 │ 13.68 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 8259 │ 0.00 │ 4.19 │ 319.93 │ 0.84 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 5817 │ 0.00 │ 3.50 │ 267.37 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 5863903 │ 0.70 │ 2.91 │ 221.81 │ 1.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 5269745 │ 0.71 │ 2.86 │ 218.16 │ 1.23 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 3974641 │ 0.00 │ 0.54 │ 41.33 │ 6.47 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 3283039 │ 0.00 │ 0.48 │ 36.99 │ 7.23 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ mpi-igg │ 1.11.5 │ Float32 │ 6533 │ 0.00 │ 23.34 │ 222.58 │ 1.14 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 7863 │ 0.00 │ 26.49 │ 252.66 │ 1.00 │
│ CPUx04 │ mpi-igg │ 1.11.5 │ Float32 │ 11658457 │ 0.33 │ 14.38 │ 137.18 │ 1.84 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 12258738 │ 0.30 │ 15.32 │ 146.11 │ 1.73 │
│ GPU-NVIDIA │ mpi-igg │ 1.11.5 │ Float32 │ 3459079 │ 0.00 │ 1.98 │ 18.88 │ 13.38 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 4232609 │ 0.00 │ 2.03 │ 19.32 │ 13.08 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
…r GPU+SIMD mismatch

Move `check_nthreads()` into `__init__()` so it runs at module load time (not just precompilation), and guard with `jl_generating_output` to suppress the spurious single-thread warning that always fired during precompilation. Add `check_mem()` to error early when `mem=CuArray`/`ROCArray` is used with the SIMD backend.
Claude-assisted MPI extension using ImplicitGlobalGrid.jl
This branch is a clean implementation of #236 but without including the `AbstractBC` and `AbstractFlow` types. It incorporates MPI via IGG, keeping it as minimal as possible. Tests have also been adjusted to the new N+4 array sizes.

Some pending tasks are:
- `loc` offset strategy: `OffsetBody` wraps `AbstractBody`, and `loc` includes the `offset` value
- Explore hiding comms (mostly relevant on GPUs, but requires a big refactor of range functions, effectively splitting them between inner cells and halo cells)

Summary
- `AbstractParMode` dispatch pattern (`Serial`/`Parallel`) for MPI support without method overwriting, preserving precompilation
- `BC!` and MPI halo exchange operate on the same 4 indices — no save/restore logic needed
- `WaterLilyMPIExt` (triggered by `ImplicitGlobalGrid` + `MPI`): global reductions, halo exchange (IGG for fine grid, direct MPI for multigrid), `exitBC!`, `wallBC_L!`, `init_waterlily_mpi`, `global_offset`
- `save!` in `WaterLilyWriteVTKExt` dispatches through `par_mode[]` for parallel PVTI output

Design
All MPI hooks dispatch through `par_mode[]` (defaults to `Serial()`). The extension sets `par_mode[] = Parallel(comm)` and adds new methods — no method overwriting. This keeps the serial code path unchanged and lets the extension precompile normally.

The N+4 layout gives symmetric ghost/boundary cells (indices 1,2 left; M-1,M right). `BC!` sets Dirichlet/Neumann at these indices; the MPI halo overwrites the same indices at rank-internal boundaries. At physical boundaries (no MPI neighbor), `BC!` values are preserved. This eliminates the need for saving and restoring boundary values around halo exchanges.

Key internal changes:
- `comm!`/`velocity_comm!` replace bare `perBC!` in the Poisson solver — MPI rank-internal boundaries have L≠0 and need halo exchange regardless of periodicity
- `pin_pressure!` removes the global mean inside `solver!` (not `project!`), using `global_sum`/`global_length` for MPI-consistent pinning
- `wallBC_L!` dispatches on `par_mode[]` so only physical walls get L=0 (MPI-internal faces keep L=1)
- `_apply_offset` wraps the body's `map` function with the rank-local coordinate offset at `Simulation` construction — user SDFs are written in global coordinates, identical to serial
- `conv_diff!` simplified: boundary-specific stencils (`ϕuP`, `ϕuL`, `ϕuR`, `lowerBoundary!`, `upperBoundary!`) removed since N+4 ghost cells provide the extra stencil points

Dispatched hooks
- `global_dot`, `global_sum`, `global_length`, `global_min` — `MPI.Allreduce`
- `global_perdot` — `MPI.Allreduce` of the local result
- `scalar_halo!`, `velocity_halo!` — `update_halo!` or direct MPI
- `comm!`, `velocity_comm!` — `perBC!` in serial, halo exchange in parallel
- `wallBC_L!`
- `exitBC!`
- `divisible` — `N>4` in serial, `N>8` in parallel (stricter for multigrid)
- `global_offset`

Validation
Serial and parallel (4 ranks) produce quantitatively matching results. Remaining max|Δp| differences are confined to body-interior cells where the Poisson operator has `iD=0` (unconstrained pressure) — a known immersed-boundary characteristic, not a serial/parallel discrepancy.