MPI via ImplicitGlobalGrid.jl + abstract flow and BCs #236

Draft
b-fg wants to merge 23 commits into master from abstract_BCs

Conversation

@b-fg
Member

@b-fg b-fg commented Jul 7, 2025

First implementation of abstract boundary conditions (BCs), which are now called during project!. Needs improvements in docs and some cleanup.

@b-fg b-fg changed the title Abstract boundary conditions Abstract flow and boundary conditions Jul 7, 2025
@b-fg b-fg marked this pull request as ready for review July 7, 2025 22:38
@b-fg
Member Author

b-fg commented Jul 7, 2025

This is a first working version. Feel free to suggest edits! We should also create a Biot-Savart simulation to showcase this feature.

@codecov

codecov bot commented Jul 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/BC.jl 100.00% <100.00%> (ø)
src/Body.jl 100.00% <100.00%> (ø)
src/Flow.jl 100.00% <100.00%> (ø)
src/WaterLily.jl 85.71% <100.00%> (+0.42%) ⬆️
src/util.jl 80.31% <100.00%> (+6.46%) ⬆️

@b-fg
Member Author

b-fg commented Jul 7, 2025

Benchmarks for reference

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │  WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.3 │   Float32 │        1807 │   0.00 │     4.14 │           158.02 │     1.00 │
│     CPUx01 │       master │ 1.11.3 │   Float32 │        1807 │   0.00 │     4.13 │           157.73 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.3 │   Float32 │     2273557 │   0.00 │     2.90 │           110.45 │     1.43 │
│     CPUx04 │       master │ 1.11.3 │   Float32 │     2273557 │   0.00 │     2.96 │           113.01 │     1.40 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.3 │   Float32 │     2982617 │   0.00 │     0.58 │            22.05 │     7.15 │
│ GPU-NVIDIA │       master │ 1.11.3 │   Float32 │     2982520 │   0.00 │     0.58 │            21.98 │     7.18 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │  WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.3 │   Float32 │        1807 │   0.00 │    25.92 │           123.60 │     0.99 │
│     CPUx01 │       master │ 1.11.3 │   Float32 │        1807 │   0.00 │    25.72 │           122.63 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.3 │   Float32 │     2116285 │   0.00 │    17.63 │            84.05 │     1.46 │
│     CPUx04 │       master │ 1.11.3 │   Float32 │     2116285 │   0.00 │    17.35 │            82.73 │     1.48 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.3 │   Float32 │     2728564 │   0.00 │     3.03 │            14.44 │     8.49 │
│ GPU-NVIDIA │       master │ 1.11.3 │   Float32 │     2728455 │   0.00 │     3.03 │            14.46 │     8.48 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │  WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.3 │   Float32 │        7107 │   0.00 │     3.47 │           264.98 │     1.00 │
│     CPUx01 │       master │ 1.11.3 │   Float32 │        7107 │   0.00 │     3.47 │           264.43 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.3 │   Float32 │     4358996 │   0.66 │     3.28 │           250.55 │     1.06 │
│     CPUx04 │       master │ 1.11.3 │   Float32 │     4358808 │   0.63 │     3.26 │           248.74 │     1.06 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.3 │   Float32 │     5678335 │   0.00 │     0.89 │            68.04 │     3.89 │
│ GPU-NVIDIA │       master │ 1.11.3 │   Float32 │     5680922 │   0.00 │     0.91 │            69.59 │     3.80 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │  WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.3 │   Float32 │        8307 │   0.00 │    25.72 │           245.31 │     1.00 │
│     CPUx01 │       master │ 1.11.3 │   Float32 │        8307 │   0.00 │    25.63 │           244.39 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.3 │   Float32 │     5699444 │   0.16 │    16.83 │           160.49 │     1.52 │
│     CPUx04 │       master │ 1.11.3 │   Float32 │     5699880 │   0.16 │    16.76 │           159.85 │     1.53 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.3 │   Float32 │     7625557 │   0.59 │     3.24 │            30.89 │     7.91 │
│ GPU-NVIDIA │       master │ 1.11.3 │   Float32 │     7625469 │   0.60 │     3.20 │            30.49 │     8.02 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg
Member Author

b-fg commented Jul 8, 2025

Closes #228

@b-fg
Member Author

b-fg commented Dec 28, 2025

This branch is ready for implementing CoupledSim, @weymouth. BCs are now applied within project! instead of within mom_step!, via the new BC type.

@weymouth
Member

I'll take a look this week.

@b-fg b-fg changed the title Abstract flow and boundary conditions MPI, abstract flow, and abstract boundary conditions Apr 8, 2026
@b-fg b-fg marked this pull request as draft April 8, 2026 14:06
@b-fg b-fg changed the title MPI, abstract flow, and abstract boundary conditions MPI via ImplicitGlobalGrid.jl + abstract flow and BCs Apr 9, 2026
@b-fg
Member Author

b-fg commented Apr 10, 2026

Hi @luraess @omlins! This is a crude Claude-assisted implementation of MPI in WaterLily using ImplicitGlobalGrid. IGG works amazingly well out of the box, so kudos for that!

I wanted to ask about a particular design choice we need to make in our solver. Since we use a geometric multigrid (GMG) method in the pressure solver, communication needs to happen not only on our original grid, but also at the coarser levels produced by iteratively halving the grid in the V-cycle of the GMG solver. Is this something that could be handled by IGG? Effectively, we need to update the halos of arrays of different sizes. Maybe we could have an IGG grid for each of the GMG levels? Currently there is a manual MPI workaround for this part, but I would rather let IGG handle all MPI communication in the solver, including GMG, if possible.

Thanks!
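[Editor's note] For concreteness, the per-level idea above could look something like the following sketch. This is hypothetical usage, not current IGG API — today `update_halo!` assumes arrays sized to the single grid set up by `init_global_grid` — and `levels` and `vcycle_halos!` are illustrative names:

```julia
using ImplicitGlobalGrid

# Hypothetical: if IGG supported one grid per GMG level, halo exchange in
# the V-cycle could be a simple loop over the coarsened pressure arrays.
# levels[l] holds the field at GMG level l, each half the previous size.
function vcycle_halos!(levels)
    for p in levels
        update_halo!(p)  # currently only valid for the finest-level array size
    end
end
```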

@luraess

luraess commented Apr 10, 2026

That's something we had started thinking about and discussing with @omlins; IGG should have an option for it eventually.

@b-fg
Member Author

b-fg commented Apr 10, 2026

Great! Looking forward to that!

@omlins

omlins commented Apr 10, 2026

Hi @b-fg : Thanks for sharing this. I'm glad to hear that you're enjoying IGG :)

I've been thinking about this for years actually, exactly for this use case: GMG. The only reason it has not yet been implemented is that I had not yet restarted working with GMG, nor had anybody else asked for it...

@Dvegrod had started to work on that in the scope of his Master's thesis, which I supervised two years ago, but did not get to finish it at that time, as it was not the focus of his thesis: eth-cscs/ImplicitGlobalGrid.jl#88

I just asked him about that and he said that he's going to finish this now :)

@b-fg
Member Author

b-fg commented Apr 10, 2026

That's great to hear, many thanks! IGG is really the perfect fit for WaterLily, and we are fully on board with integrating it into WaterLily 2.0, which has memory-distributed parallelism as a milestone. So the fewer workarounds we have to (less efficiently) implement ourselves, the better :) Please let me know if we can help in any manner, even if just for testing!

b-fg added 3 commits April 11, 2026 18:16
      wallBC_L! only zeroed L at the left wall (face 3) but not the right wall
      (face N[j]-1), leaving the Poisson operator coupled to ghost cells on the
      right boundary. This caused serial pressure to blow up to ~68 after ~30
      steps. The old L≡μ₀ sharing masked this because BC!(μ₀,...) zeroed both
      walls implicitly.

      Also fix pin_pressure! to sum only interior cells instead of the full
      array (which includes ghost cells).
b-fg added 4 commits April 12, 2026 00:13
  Replace generator-based reductions with LinearAlgebra.⋅ and @views
  via new dot_R/sum_R helpers, which work on both CPU and GPU arrays.
  Both accept an optional R keyword for the reduction range. MPI
  extension calls the serial helper locally then Allreduce, avoiding
  function body duplication. Simplify pin_pressure! accordingly.
  Eliminate the ParallelBC type and maybe_parallel wrapper by moving
  halo exchange calls into the serial code via existing no-op hooks
  (velocity_halo in BC and measure). The MPI extension now only
  overrides wallBC_L, exitBC, divisible, and the global reduction
  hooks — no more BC, measure, or update overrides needed.
  pressureBC was a dispatch point for ParallelBC which no longer exists.
  Replace the sole call site in project with scalar_halo(b.x) and
  delete the function and its export.
Brings in master changes: GaussSeidelRB! smoother, perdot for periodic
PCG, measure_sdf! separation, relaxation parameter ω in multigrid,
allocation-free restrictL!.

Conflict resolutions:
- AutoBody.jl: keep master's chain-rule measure refactor with global_offset
- Body.jl: fix a.σ → bc.σ (abstract_BCs moved fields to BC struct)
- BC.jl: add local_perdot/global_perdot hooks, fix positional arg call
- Poisson.jl: merge scalar_halo!/global_dot hooks with new GS-RB and ω
- MultiLevelPoisson.jl: merge wallBC_L! call with new ω and GS-RB
- util.jl: keep local_dot/local_sum hooks, drop old dot_R/sum_R
- MPI ext: qualify local_dot/local_sum, add global_perdot override
- maintests.jl: use global_perdot, relaxed pois.n assertion, bc.V
@b-fg
Member Author

b-fg commented Apr 12, 2026

Current benchmarks vs WL 1.6.1 serial. In general performance is pretty similar, but there are some odd speedups/slowdowns, possibly coming from measure; this needs further investigation.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        4607 │   0.00 │     3.61 │           137.61 │     1.25 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        4675 │   0.00 │     4.51 │           172.05 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │     2216107 │   0.00 │     2.84 │           108.16 │     1.59 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │     2039313 │   0.00 │     2.75 │           105.05 │     1.64 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     2361907 │   0.00 │     0.45 │            17.27 │     9.97 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     2264335 │   0.00 │     0.41 │            15.68 │    10.97 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        5087 │   0.00 │    27.61 │           131.67 │     1.03 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        4607 │   0.00 │    28.38 │           135.35 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │     2381177 │   0.00 │    17.60 │            83.93 │     1.61 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │     2032707 │   0.00 │    16.78 │            80.00 │     1.69 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     2558162 │   0.00 │     2.89 │            13.77 │     9.83 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     2248609 │   0.00 │     2.65 │            12.62 │    10.72 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        3807 │   0.00 │     4.06 │           137.76 │     1.03 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        3807 │   0.00 │     4.19 │           142.15 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │     1959107 │   0.00 │     2.52 │            85.60 │     1.66 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │     1748907 │   0.00 │     2.39 │            81.12 │     1.75 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     2043505 │   0.00 │     0.42 │            14.25 │     9.97 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     1927507 │   0.00 │     0.37 │            12.54 │    11.34 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        4207 │   0.00 │    35.95 │           152.37 │     1.09 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        4207 │   0.00 │    39.14 │           165.90 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │     2091355 │   0.00 │    17.72 │            75.12 │     2.21 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │     1888707 │   0.00 │    17.24 │            73.07 │     2.27 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     2208107 │   0.00 │     3.00 │            12.70 │    13.06 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     2093308 │   0.00 │     2.87 │            12.16 │    13.65 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        9019 │   0.00 │     4.17 │           318.47 │     0.84 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        5817 │   0.00 │     3.52 │           268.26 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │     7058383 │   0.89 │     3.24 │           247.23 │     1.09 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │     5269745 │   0.63 │     2.83 │           216.01 │     1.24 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     5342419 │   0.00 │     0.72 │            55.21 │     4.86 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     3297948 │   0.00 │     0.45 │            34.54 │     7.77 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │    WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ abstract_BCs │ 1.11.5 │   Float32 │        5647 │   0.00 │    21.34 │           203.51 │     1.24 │
│     CPUx01 │       v1.6.1 │ 1.11.5 │   Float32 │        7863 │   0.00 │    26.53 │           253.04 │     1.00 │
│     CPUx04 │ abstract_BCs │ 1.11.5 │   Float32 │    11217857 │   0.34 │    14.63 │           139.49 │     1.81 │
│     CPUx04 │       v1.6.1 │ 1.11.5 │   Float32 │    12258738 │   0.29 │    15.37 │           146.60 │     1.73 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │   Float32 │     4291109 │   0.00 │     2.05 │            19.55 │    12.94 │
│ GPU-NVIDIA │       v1.6.1 │ 1.11.5 │   Float32 │     4235848 │   0.00 │     2.02 │            19.31 │    13.11 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

b-fg added 3 commits April 12, 2026 11:29
  Replace method overwriting with a dispatch singleton pattern so the MPI
  extension can precompile normally (__precompile__(false) removed). All
  hooks (global_dot, scalar_halo!, wallBC_L!, etc.) now dispatch through
  par_mode[] — serial defines _foo(::Serial), the extension adds
  _foo(::Parallel). Also removes set_comm! in favor of init_waterlily_mpi
  as the single entry point, and fixes stale measure! docstring.
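[Editor's note] A minimal sketch of this dispatch-singleton pattern. The names Serial, Parallel, par_mode, and global_dot follow the commit message; the bodies are illustrative, not WaterLily's actual code:

```julia
abstract type ParMode end
struct Serial   <: ParMode end
struct Parallel <: ParMode end

# The active mode lives in a Ref; an MPI init routine flips it to Parallel().
# Extensions add _global_dot(::Parallel) methods instead of overwriting
# existing ones, so they can precompile normally.
const par_mode = Ref{ParMode}(Serial())

global_dot(a, b) = _global_dot(a, b, par_mode[])
_global_dot(a, b, ::Serial) = sum(a .* b)  # purely local reduction in serial runs
# The MPI extension would add something like:
# _global_dot(a, b, ::Parallel) = MPI.Allreduce(sum(a .* b), +, MPI.COMM_WORLD)
```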
Move MPI global offset from inside AutoBody.sdf/measure (not GPU-safe
due to par_mode[] dispatch) to Simulation construction time. The new
_apply_offset wraps the body's map function with a concrete offset
SVector, keeping GPU kernels free of abstract dispatch.

- AutoBody.jl: revert sdf/measure to master signatures (plain x, no
  global_offset calls), add _apply_offset(::AutoBody, offset)
- Body.jl: add _apply_offset default (no-op) and SetBody method,
  fix d²=T(2+ϵ)^2 type conversion (was Int, now matches master)
- WaterLily.jl: apply offset in Simulation constructor, update
  global_offset docstring
- BC.jl: rename uBC→U in catch-all BC! method to match master
Member

@weymouth weymouth left a comment

I think I need to read a "design philosophy" document to figure out why you've made these design choices. Can you (or Claude) put that together? I'm getting lost trying to figure out what's been moved where and why. I gave up after a few files.

Comment thread src/AutoBody.jl
end
AutoBody(sdf, map=(x,t)->x) = AutoBody(sdf, map)
_apply_offset(body::AutoBody, offset) =
AutoBody(body.sdf, let m=body.map, o=offset; (x,t)->m(x .+ o, t); end)
Member

I'm confused why this would be needed.

Member Author

This applies the MPI grid offset so that each rank evaluates the SDF at the correct global coordinates. But this might need a better abstraction.

Comment thread src/BC.jl
Comment on lines +33 to +36
local_perdot(a,b,::Tuple{}) = a⋅b
local_perdot(a,b,perdir,R=inside(a)) = @view(a[R])⋅@view(b[R])
global_perdot(a,b,tup::Tuple{}) = _global_perdot(a, b, tup, par_mode[])
global_perdot(a,b,perdir,R=inside(a)) = _global_perdot(a, b, perdir, R, par_mode[])
Member

Function names don't match the docstring.

Comment thread src/BC.jl
"""
wallBC_L!(L, perdir=())

Zero the Poisson conductivity `L` at physical (non-periodic) boundary faces.
Member

?

Member Author

That's a bad docstring indeed...

Comment thread src/Body.jl
function measure!(a::Flow{N,T},body::AbstractBody;t=zero(T),ϵ=1) where {N,T}
a.V .= zero(T); a.μ₀ .= one(T); a.μ₁ .= zero(T); d²=T(2+ϵ)^2
measure_sdf!(a.σ, body, t; fastd²=d²) # measure separately to allow specialization
function measure!(body::AbstractBody,bc::AbstractBC{N,T};t=zero(T),ϵ=1) where {N,T}
Member

I'm confused. Why is Flow now an "AbstractBC"? The core unit is now a BC? Also, the first argument should be the modified argument, that's Flow.

Member Author

Yes. This is aligned with what we discussed in #228. All the BDIM fields have been moved from Flow to the BC struct, so we dispatch on BC. This is why I suggested starting the review without the MPI part. Otherwise, I could also split these into separate PRs: first we could go through abstract BCs, then MPI.

Member

I just read that thread and I don't see anything about splitting the BDIM stuff out of Flow in there. Can you explain why this helps with "future proofing"? I don't have a problem with it - but it isn't obvious to me how this helps the organization.

Member Author

It seems more natural to have all the BC-related fields (external or body) live within the same BC struct. Also, functions like measure! would otherwise need to be passed both Flow and BC.

Member

IGG tightly couples the ghost cell treatment with the specific field, right? That is kind of the opposite, pulling all the BC information out. Bundling all the field data in one struct, and all the BC data for all of those fields in another struct.

Can you walk me through how this will work better to extend stuff in the future? I'm missing the organizational concept.

Member Author

The need for an AbstractBC and specific BC structs was to dispatch to user-defined BCs such as Biot-Savart. The choice of putting the BDIM fields in there is entirely optional, but it simplifies some function calls since Flow no longer needs to be passed. It does not impact performance, and is really not related to the IGG hooks.

The original idea of this PR was to create a ParallelBC type used to dispatch to MPI (IGG). But after some iterations, I thought it was easier to add hook functions in the serial code (resolving to nothing) that dispatch to the halo updates in the IGG extension. So in the end the BC abstraction is only useful for the user-defined BCs, and it is not important where the BDIM fields live... This can be reverted; it is not critical for this PR.
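[Editor's note] A sketch of the no-op hook idea described here. Function names like velocity_halo! and scalar_halo! follow the PR discussion; the bodies are illustrative:

```julia
# Serial code calls these hooks unconditionally; in serial runs they are
# no-ops that the compiler can eliminate, so they cost nothing.
velocity_halo!(u) = nothing
scalar_halo!(p) = nothing

# The IGG/MPI extension then overrides them with real halo exchanges, e.g.:
# scalar_halo!(p) = ImplicitGlobalGrid.update_halo!(p)
```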

Comment thread src/Body.jl
b::Tb
end
_apply_offset(body::SetBody, offset) =
SetBody(body.op, _apply_offset(body.a, offset), _apply_offset(body.b, offset))
Member

Can you explain this choice? It's not obvious why this is a good idea. There are a lot of functions that use global coordinates. apply, interp. moment, etc.

Member Author

It was a first attempt to let users specify the SDF without having to worry about the MPI grid offset. There might be cleaner ways to do it, though.

Member Author

I have pushed some small new changes that should now take care of that. It is hard to implement something that is GPU-compatible (needs isbits) and does not trigger recompilation.

@omlins

omlins commented Apr 14, 2026

That's great to hear, many thanks! IGG is really the perfect fit for WaterLily, and we are fully on board with integrating it into WaterLily 2.0, which has memory-distributed parallelism as a milestone. So the fewer workarounds we have to (less efficiently) implement ourselves, the better :) Please let me know if we can help in any manner, even if just for testing!

Regarding testing, I would like to propose using this opportunity to directly create some reusable performance tests for WaterLily using PerfTest.jl. Concretely, once the new multi-grid support feature is implemented and passes all tests on the IGG side, you could create a PR in WaterLily using the new feature, write functional unit tests for it, and generate some performance tests using PerfTest.jl. This way we would be sure that the new feature performs well in the context of WaterLily, and we would have a reusable performance test suite that will flag any future performance regressions. On the IGG side it would be helpful too, as it would be a structured way to verify the performance of the new feature in a real-world application. In addition, it would give feedback for potential refinements of PerfTest.jl.

Note that @Dvegrod is the main developer of PerfTest.jl (he started this project as the practical part of his Master's thesis and is now continuing the project as part of his PhD: https://dl.acm.org/doi/abs/10.1145/3801098), so he is very happy to contribute to make this happen.

Please let us know how you would like to proceed!

@b-fg
Member Author

b-fg commented Apr 14, 2026

Hi Sam, of course, that would be great! We have our performance benchmark suite in WaterLily-Benchmarks, which tests performance regressions/speedups on full simulations. But I see that PerfTest.jl does performance checks on unit tests, so that would definitely fit in "WaterLily.jl/tests/perftest.jl". Indeed, we can set it up once this PR is a bit more mature, we have added the necessary MPI-related tests, and we have the IGG support for GMG :)

For the moment, maybe we could start by selecting some critical serial unit tests in test/maintests.jl and pass them to PerfTest for serial performance check?

Many thanks!

@omlins

omlins commented Apr 15, 2026

For the moment, maybe we could start by selecting some critical serial unit tests in test/maintests.jl and pass them to PerfTest for serial performance check?

@Dvegrod ?

@b-fg
Member Author

b-fg commented Apr 15, 2026

Btw note the new clean implementation in #281. This PR was trying to do too many things at the same time!

@Dvegrod

Dvegrod commented Apr 15, 2026

Hi! I would be very happy to assist 😄. I think it could be quite handy for tracking down performance regression sources and/or checking proper resource utilization. And feedback on the tool would of course be useful. I remain available for any inquiries, or to help with future unit performance test additions.

The basic idea of PerfTest.jl is that you can define performance tests in a simple way, similar to how you would write functional ones, with test sets and a set of macros. (They can even be defined in the same file, with the same test sets, as the functional suite, if desired.) I leave here a link to the repo, and the docs.
