MPI via ImplicitGlobalGrid.jl + abstract flow and BCs #236
Conversation
|
This is a first working version. Feel free to suggest edits! We should also create a Biot-Savart simulation to showcase this feature. |
Codecov Report: ✅ All modified and coverable lines are covered by tests.
|
|
Benchmarks for reference |
|
Closes #228 |
|
This branch is ready to implement |
|
I'll take a look this week. |
|
Hi @luraess @omlins ! This is a crude Claude-assisted implementation of MPI in WaterLily using ImplicitGlobalGrid. IGG is working amazingly well out of the box, so kudos for that! I wanted to ask about a particular design choice we need to make in our solver. Since we use a geometric multigrid method in the pressure solver, communication needs to happen not only on our original grid, but also at the coarser levels when halving the grid iteratively in the V-cycle of the GMG solver. Is this something that could be handled by IGG? Effectively, we need to update the halos of arrays of different sizes. Maybe we could have an IGG grid for each of the GMG grid levels? Currently, there is a manual MPI workaround for this part, but I would rather let IGG handle all MPI communications in the solver, including GMG, if possible. Thanks! |
|
That's something we had started thinking about and discussing with @omlins and should have an IGG option for it eventually. |
|
Great! Looking forward to that! |
|
Hi @b-fg : Thanks for sharing this. I'm glad to hear that you're enjoying IGG :) I've actually been thinking about this for years, exactly for this use case: GMG. The only reason it has not yet been implemented is that I had not yet restarted working with GMG, nor has anybody else asked for it... @Dvegrod had started to work on that in the scope of his Master's thesis that I supervised, two years ago, but did not get to finish it at that time as this was not the focus of his thesis: eth-cscs/ImplicitGlobalGrid.jl#88. I just asked him about it and he said that he's going to finish it now :) |
|
That's great to hear, many thanks! IGG is really the perfect fit for WaterLily, and we are fully on board with integrating it in WaterLily 2.0, which has memory-distributed parallelism as a milestone. So the fewer workarounds we have to implement (less efficiently) ourselves, the better :) Please let me know if we can help in any manner, even if just for testing! |
wallBC_L! only zeroed L at the left wall (face 3) but not the right wall
(face N[j]-1), leaving the Poisson operator coupled to ghost cells on the
right boundary. This caused serial pressure to blow up to ~68 after ~30
steps. The old L≡μ₀ sharing masked this because BC!(μ₀,...) zeroed both
walls implicitly.
Also fix pin_pressure! to sum only interior cells instead of the full
array (which includes ghost cells).
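The fix described above amounts to zeroing the conductivity on both wall faces. A hedged sketch of the idea, not WaterLily's actual implementation (the real code uses faces 3 and N[j]-1 within its ghost-cell layout; illustrative indices are used here):

```julia
# Sketch: zero the Poisson conductivity component L[..., j] on BOTH wall
# faces of every non-periodic direction j, so the stencil never couples
# interior cells to ghost cells. A 2-D grid with one ghost layer is assumed.
function wall_zero!(L::AbstractArray{T,3}; perdir=()) where T
    N = size(L)
    for j in 1:2
        j in perdir && continue         # periodic directions keep coupling
        if j == 1
            L[2, :, 1]      .= zero(T)  # left wall face
            L[N[1]-1, :, 1] .= zero(T)  # right wall face (the missing one)
        else
            L[:, 2, 2]      .= zero(T)
            L[:, N[2]-1, 2] .= zero(T)
        end
    end
end
```

With only the left face zeroed, the stencil at the right boundary still reads ghost values, which is consistent with the serial pressure blow-up reported above.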
…F can now be defined as usual
Replace generator-based reductions with LinearAlgebra.⋅ and @views via new dot_R/sum_R helpers, which work on both CPU and GPU arrays. Both accept an optional R keyword for the reduction range. MPI extension calls the serial helper locally then Allreduce, avoiding function body duplication. Simplify pin_pressure! accordingly.
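A hedged sketch of the helper pattern this commit describes (the names dot_R/sum_R come from the message; exact signatures are assumed). Views plus LinearAlgebra's ⋅ keep the reduction allocation-free and work on both CPU and GPU arrays:

```julia
using LinearAlgebra: ⋅

# Local reductions over an optional range R (sketch, signatures assumed).
dot_R(a, b; R=eachindex(a)) = @view(a[R]) ⋅ @view(b[R])
sum_R(a; R=eachindex(a)) = sum(@view(a[R]))

# The MPI extension would call the serial helper locally, then combine
# across ranks, e.g. (sketch, `comm` assumed):
#   global_dot(a, b; R=eachindex(a)) = MPI.Allreduce(dot_R(a, b; R=R), +, comm)
```

This is why no function body needs duplicating: the parallel version is just the serial reduction wrapped in an Allreduce.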
Eliminate the ParallelBC type and maybe_parallel wrapper by moving halo exchange calls into the serial code via existing no-op hooks (velocity_halo in BC and measure). The MPI extension now only overrides wallBC_L, exitBC, divisible, and the global reduction hooks — no more BC, measure, or update overrides needed.
pressureBC was a dispatch point for ParallelBC which no longer exists. Replace the sole call site in project with scalar_halo(b.x) and delete the function and its export.
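The no-op hook pattern described above can be sketched as follows (function names such as velocity_halo and scalar_halo come from the surrounding text; the bodies are assumed):

```julia
# Serial WaterLily calls these hooks wherever a distributed run must
# exchange halos; in serial they compile away to nothing.
velocity_halo(u) = nothing
scalar_halo(p) = nothing

# The MPI extension then provides the real halo exchange, e.g. (sketch):
#   velocity_halo(u) = ImplicitGlobalGrid.update_halo!(u)
```

This keeps the serial code path free of any MPI dependency while giving the extension well-defined override points.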
Brings in master changes: GaussSeidelRB! smoother, perdot for periodic PCG, measure_sdf! separation, relaxation parameter ω in multigrid, allocation-free restrictL!.
Conflict resolutions:
- AutoBody.jl: keep master's chain-rule measure refactor with global_offset
- Body.jl: fix a.σ → bc.σ (abstract_BCs moved fields to BC struct)
- BC.jl: add local_perdot/global_perdot hooks, fix positional arg call
- Poisson.jl: merge scalar_halo!/global_dot hooks with new GS-RB and ω
- MultiLevelPoisson.jl: merge wallBC_L! call with new ω and GS-RB
- util.jl: keep local_dot/local_sum hooks, drop old dot_R/sum_R
- MPI ext: qualify local_dot/local_sum, add global_perdot override
- maintests.jl: use global_perdot, relaxed pois.n assertion, bc.V
|
Current benchmarks vs WL 1.6.1 serial. In general performance is pretty similar, but there are some odd speedups/slowdowns, maybe coming from the benchmark environment.
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 3.61 │ 137.61 │ 1.25 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4675 │ 0.00 │ 4.51 │ 172.05 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2216107 │ 0.00 │ 2.84 │ 108.16 │ 1.59 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2039313 │ 0.00 │ 2.75 │ 105.05 │ 1.64 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2361907 │ 0.00 │ 0.45 │ 17.27 │ 9.97 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2264335 │ 0.00 │ 0.41 │ 15.68 │ 10.97 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 5087 │ 0.00 │ 27.61 │ 131.67 │ 1.03 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 28.38 │ 135.35 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2381177 │ 0.00 │ 17.60 │ 83.93 │ 1.61 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2032707 │ 0.00 │ 16.78 │ 80.00 │ 1.69 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2558162 │ 0.00 │ 2.89 │ 13.77 │ 9.83 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2248609 │ 0.00 │ 2.65 │ 12.62 │ 10.72 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.06 │ 137.76 │ 1.03 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.19 │ 142.15 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 1959107 │ 0.00 │ 2.52 │ 85.60 │ 1.66 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1748907 │ 0.00 │ 2.39 │ 81.12 │ 1.75 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2043505 │ 0.00 │ 0.42 │ 14.25 │ 9.97 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 1927507 │ 0.00 │ 0.37 │ 12.54 │ 11.34 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 35.95 │ 152.37 │ 1.09 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 39.14 │ 165.90 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2091355 │ 0.00 │ 17.72 │ 75.12 │ 2.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1888707 │ 0.00 │ 17.24 │ 73.07 │ 2.27 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2208107 │ 0.00 │ 3.00 │ 12.70 │ 13.06 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2093308 │ 0.00 │ 2.87 │ 12.16 │ 13.65 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 9019 │ 0.00 │ 4.17 │ 318.47 │ 0.84 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 5817 │ 0.00 │ 3.52 │ 268.26 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 7058383 │ 0.89 │ 3.24 │ 247.23 │ 1.09 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 5269745 │ 0.63 │ 2.83 │ 216.01 │ 1.24 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 5342419 │ 0.00 │ 0.72 │ 55.21 │ 4.86 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 3297948 │ 0.00 │ 0.45 │ 34.54 │ 7.77 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 5647 │ 0.00 │ 21.34 │ 203.51 │ 1.24 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 7863 │ 0.00 │ 26.53 │ 253.04 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 11217857 │ 0.34 │ 14.63 │ 139.49 │ 1.81 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 12258738 │ 0.29 │ 15.37 │ 146.60 │ 1.73 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 4291109 │ 0.00 │ 2.05 │ 19.55 │ 12.94 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 4235848 │ 0.00 │ 2.02 │ 19.31 │ 13.11 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
|
Replace method overwriting with a dispatch singleton pattern so the MPI extension can precompile normally (__precompile__(false) removed). All hooks (global_dot, scalar_halo!, wallBC_L!, etc.) now dispatch through par_mode[] — serial defines _foo(::Serial), the extension adds _foo(::Parallel). Also removes set_comm! in favor of init_waterlily_mpi as the single entry point, and fixes stale measure! docstring.
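The dispatch-singleton pattern described above can be sketched roughly as follows (type and helper names are taken from the commit message; bodies are assumed). Because the extension only *adds* a method for its own singleton type instead of overwriting an existing one, precompilation stays valid:

```julia
# Sketch of the par_mode[] dispatch pattern (not the exact WaterLily code).
struct Serial end
struct Parallel end
const par_mode = Ref{Union{Serial,Parallel}}(Serial())

# Public hook forwards to an internal method selected by the current mode.
global_dot(a, b) = _global_dot(a, b, par_mode[])
_global_dot(a, b, ::Serial) = sum(a .* b)   # serial fallback

# The MPI extension adds (sketch):
#   _global_dot(a, b, ::Parallel) = MPI.Allreduce(sum(a .* b), +, comm)
# and init_waterlily_mpi() flips the switch: par_mode[] = Parallel()
```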
Move MPI global offset from inside AutoBody.sdf/measure (not GPU-safe due to par_mode[] dispatch) to Simulation construction time. The new _apply_offset wraps the body's map function with a concrete offset SVector, keeping GPU kernels free of abstract dispatch.
- AutoBody.jl: revert sdf/measure to master signatures (plain x, no global_offset calls), add _apply_offset(::AutoBody, offset)
- Body.jl: add _apply_offset default (no-op) and SetBody method, fix d²=T(2+ϵ)^2 type conversion (was Int, now matches master)
- WaterLily.jl: apply offset in Simulation constructor, update global_offset docstring
- BC.jl: rename uBC→U in catch-all BC! method to match master
```julia
end
AutoBody(sdf, map=(x,t)->x) = AutoBody(sdf, map)
_apply_offset(body::AutoBody, offset) =
    AutoBody(body.sdf, let m=body.map, o=offset; (x,t)->m(x .+ o, t); end)
```
I'm confused why this would be needed.
The MPI grid offset is needed for each rank to evaluate the correct SDF. But this might need some better abstraction.
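A small illustration of why the offset matters (hypothetical numbers, minimal stand-in definitions; not WaterLily's API). Each rank stores a local block of the global grid, so a local coordinate x corresponds to global x .+ offset, and a globally positioned body's SDF must be evaluated there:

```julia
using StaticArrays  # concrete SVector offset keeps the closure isbits/GPU-safe

# Sphere of radius 8 centred at global (32, 32).
sdf(x, t) = √sum(abs2, x .- SA[32.0, 32.0]) - 8

offset = SA[16.0, 0.0]                 # this rank's block starts at global x = 16
local_sdf(x, t) = sdf(x .+ offset, t)  # what _apply_offset builds via body.map

# local_sdf(SA[16.0, 32.0], 0.0) evaluates sdf at the global centre → -8.0
```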
```julia
local_perdot(a,b,::Tuple{}) = a⋅b
local_perdot(a,b,perdir,R=inside(a)) = @view(a[R])⋅@view(b[R])
global_perdot(a,b,tup::Tuple{}) = _global_perdot(a, b, tup, par_mode[])
global_perdot(a,b,perdir,R=inside(a)) = _global_perdot(a, b, perdir, R, par_mode[])
```
The function names don't match the docstring.
```julia
"""
    wallBC_L!(L, perdir=())

Zero the Poisson conductivity `L` at physical (non-periodic) boundary faces.
"""
```
That's a bad docstring indeed.
```julia
# old signature (master):
function measure!(a::Flow{N,T},body::AbstractBody;t=zero(T),ϵ=1) where {N,T}
    a.V .= zero(T); a.μ₀ .= one(T); a.μ₁ .= zero(T); d²=T(2+ϵ)^2
    measure_sdf!(a.σ, body, t; fastd²=d²) # measure separately to allow specialization
# new signature (this PR):
function measure!(body::AbstractBody,bc::AbstractBC{N,T};t=zero(T),ϵ=1) where {N,T}
```
I'm confused. Why is Flow now an "AbstractBC"? The core unit is now a BC? Also, the first argument should be the modified argument, that's Flow.
Yes. This is aligned with what we discussed in #228. All the BDIM fields have been moved from Flow to BC struct, so we dispatch on BC. This is why I suggested to start the review without the MPI part. I could also split these PRs otherwise, and first we could go through abstract BCs, then MPI.
I just read that thread and I don't see anything about splitting the BDIM stuff out of Flow in there. Can you explain why this helps with "future proofing"? I don't have a problem with it - but it isn't obvious to me how this helps the organization.
It seems more natural to have all the BC-related fields (external or body) live within the same BC struct. Also, functions like measure! would otherwise need to be passed both Flow and BC.
IGG tightly couples the ghost cell treatment with the specific field, right? That is kind of the opposite, pulling all the BC information out. Bundling all the field data in one struct, and all the BC data for all of those fields in another struct.
Can you walk me through how this will work better to extend stuff in the future? I'm missing the organizational concept.
The need for an AbstractBC and specific BC structs was to dispatch to user-defined BCs such as Biot-Savart. The choice of putting the BDIM fields in there is entirely optional, but it simplifies some function calls since Flow does not need to be passed anymore. It does not impact performance, and is really not related to the IGG hooks.
The original idea of this PR was to create a ParallelBC used to dispatch to MPI (IGG). But after some iterations, I thought it was easier to add hook functions in the serial code (resolving to nothing) that would dispatch to the halo updates in the IGG extension. So this means that the BC abstraction is only useful for user-defined BCs in the end, and it is not important where the BDIM fields live... This can be reverted; it is not critical for this PR.
```julia
    b::Tb
end
_apply_offset(body::SetBody, offset) =
    SetBody(body.op, _apply_offset(body.a, offset), _apply_offset(body.b, offset))
```
Can you explain this choice? It's not obvious why this is a good idea. There are a lot of functions that use global coordinates. apply, interp. moment, etc.
It was a first attempt to let users specify the SDF without having to worry about the MPI grid offset. There might be cleaner ways to do it, though.
I have pushed some small new changes that should now take care of that. It is hard to implement something that is GPU compatible (needs isbits) and does not trigger recompilation.
Regarding testing, I would like to propose using this opportunity to create some reusable performance tests for WaterLily with PerfTest.jl. Concretely, once the new multigrid support feature is implemented and passes all tests on the IGG side, you could create a PR in WaterLily using the new feature, write functional unit tests for it, and generate some performance tests using PerfTest.jl. This way we would be sure that the new feature performs well in the context of WaterLily, and we would have a reusable performance test suite that will signal any future performance regressions. On the IGG side, that would be helpful too, as it would be a structured way to verify the performance of the new feature in a real-world application. In addition, it would give feedback for potential refinements of PerfTest.jl. Note that @Dvegrod is the main developer of PerfTest.jl (he started this project as the practical part of his Master's thesis and is now continuing it as part of his PhD: https://dl.acm.org/doi/abs/10.1145/3801098), so he is very happy to contribute to make this happen. Please let us know how you would like to proceed! |
|
Hi Sam, of course, that would be great! We have our performance benchmark suite in WaterLily-Benchmarks, which tests performance regressions/speedups on full simulations. But I see that PerfTest.jl does performance check on unit tests, so that would definitely fit in "WaterLily.jl/tests/perftest.jl". Indeed, we can set it up once this PR is a bit more mature, we add the necessary MPI-related tests, and we have the IGG support for GMG :) For the moment, maybe we could start by selecting some critical serial unit tests in test/maintests.jl and pass them to PerfTest for serial performance check? Many thanks! |
@Dvegrod ? |
|
Btw note the new clean implementation in #281. This PR was trying to do too many things at the same time! |
|
Hi! I would be very happy to assist 😄. I think it could be quite handy to track down performance regression sources and/or proper resource utilization. And feedback on the tool would be of course useful. I remain available for any inquiries or to help on future unit performance test additions. The basic idea of PerfTest.jl is that you can define performance tests in a simple way, similar to how you would do functional ones, with test sets and a set of macros. (They can even be defined in the same file with the same test sets as the functional suite, if desired). I leave here a link to the repo, and the docs. |
First implementation of abstract boundary conditions (BCs), which are now called during
project!. Needs improvements in docs and some cleanup.