MPI via ImplicitGlobalGrid.jl + abstract flow and BCs #236
Conversation
|
This is a first working version. Feel free to suggest edits! We should also create a Biot-Savart simulation to showcase this feature. |
Codecov Report: ✅ All modified and coverable lines are covered by tests.
|
|
Benchmarks for reference |
|
Closes #228 |
|
This branch is ready to implement |
|
I'll take a look this week. |
|
Hi @luraess @omlins ! This is a crude Claude-assisted implementation of MPI in WaterLily using ImplicitGlobalGrid. IGG is working amazingly well out of the box, so kudos for that! I wanted to ask about a particular design choice we need to make in our solver. Since we use a geometric multigrid method in the pressure solver, communication needs to happen not only on our original grid, but also at the coarser levels when halving the grid iteratively in the V-cycle of the GMG solver. Is this something that could be handled by IGG? Effectively, we need to update the halos of arrays of different sizes. Maybe we could have an IGG grid for each of the GMG grid levels? Currently, there is a manual MPI workaround for this part, but I would rather let IGG handle all MPI communications in the solver, including GMG, if possible. Thanks! |
|
That's something we had started thinking about and discussing with @omlins and should have an IGG option for it eventually. |
|
Great! Looking forward to that! |
|
Hi @b-fg : Thanks for sharing this. I'm glad to hear that you're enjoying IGG :) I've actually been thinking about this for years, exactly for this use case: GMG. The only reason it has not yet been implemented is that I had not yet restarted working with GMG, nor has anybody else asked for it... @Dvegrod had started to work on that in the scope of his Master's thesis that I supervised, two years ago, but did not get to finish it at that time as this was not the focus of his thesis: eth-cscs/ImplicitGlobalGrid.jl#88. I just asked him about it and he said that he's going to finish it now :) |
|
That's great to hear, many thanks! IGG is really the perfect fit for WaterLily, and we are fully on board with integrating it in WaterLily 2.0, which has memory-distributed parallelism as a milestone. So the fewer workarounds we have to implement (less efficiently) ourselves, the better :) Please let me know if we can help in any manner, even if just for testing! |
wallBC_L! only zeroed L at the left wall (face 3) but not the right wall
(face N[j]-1), leaving the Poisson operator coupled to ghost cells on the
right boundary. This caused serial pressure to blow up to ~68 after ~30
steps. The old L≡μ₀ sharing masked this because BC!(μ₀,...) zeroed both
walls implicitly.
Also fix pin_pressure! to sum only interior cells instead of the full
array (which includes ghost cells).
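The fix described above amounts to zeroing the conductivity on both wall faces. A hedged sketch of the idea, not WaterLily's actual implementation (the real code uses faces 3 and N[j]-1 within its ghost-cell layout; illustrative indices are used here):

```julia
# Sketch: zero the Poisson conductivity component L[..., j] on BOTH wall
# faces of every non-periodic direction j, so the stencil never couples
# interior cells to ghost cells. A 2-D grid with one ghost layer is assumed.
function wall_zero!(L::AbstractArray{T,3}; perdir=()) where T
    N = size(L)
    for j in 1:2
        j in perdir && continue         # periodic directions keep coupling
        if j == 1
            L[2, :, 1]      .= zero(T)  # left wall face
            L[N[1]-1, :, 1] .= zero(T)  # right wall face (the missing one)
        else
            L[:, 2, 2]      .= zero(T)
            L[:, N[2]-1, 2] .= zero(T)
        end
    end
end
```

With only the left face zeroed, the stencil at the right boundary still reads ghost values, which is consistent with the serial pressure blow-up reported above.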
…F can now be defined as usual
Replace generator-based reductions with LinearAlgebra.⋅ and @views via new dot_R/sum_R helpers, which work on both CPU and GPU arrays. Both accept an optional R keyword for the reduction range. MPI extension calls the serial helper locally then Allreduce, avoiding function body duplication. Simplify pin_pressure! accordingly.
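A hedged sketch of the helper pattern this commit describes (the names dot_R/sum_R come from the message; exact signatures are assumed). Views plus LinearAlgebra's ⋅ keep the reduction allocation-free and work on both CPU and GPU arrays:

```julia
using LinearAlgebra: ⋅

# Local reductions over an optional range R (sketch, signatures assumed).
dot_R(a, b; R=eachindex(a)) = @view(a[R]) ⋅ @view(b[R])
sum_R(a; R=eachindex(a)) = sum(@view(a[R]))

# The MPI extension would call the serial helper locally, then combine
# across ranks, e.g. (sketch, `comm` assumed):
#   global_dot(a, b; R=eachindex(a)) = MPI.Allreduce(dot_R(a, b; R=R), +, comm)
```

This is why no function body needs duplicating: the parallel version is just the serial reduction wrapped in an Allreduce.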
Eliminate the ParallelBC type and maybe_parallel wrapper by moving halo exchange calls into the serial code via existing no-op hooks (velocity_halo in BC and measure). The MPI extension now only overrides wallBC_L, exitBC, divisible, and the global reduction hooks — no more BC, measure, or update overrides needed.
pressureBC was a dispatch point for ParallelBC which no longer exists. Replace the sole call site in project with scalar_halo(b.x) and delete the function and its export.
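The no-op hook pattern described above can be sketched as follows (function names such as velocity_halo and scalar_halo come from the surrounding text; the bodies are assumed):

```julia
# Serial WaterLily calls these hooks wherever a distributed run must
# exchange halos; in serial they compile away to nothing.
velocity_halo(u) = nothing
scalar_halo(p) = nothing

# The MPI extension then provides the real halo exchange, e.g. (sketch):
#   velocity_halo(u) = ImplicitGlobalGrid.update_halo!(u)
```

This keeps the serial code path free of any MPI dependency while giving the extension well-defined override points.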
Brings in master changes: GaussSeidelRB! smoother, perdot for periodic PCG, measure_sdf! separation, relaxation parameter ω in multigrid, allocation-free restrictL!.
Conflict resolutions:
- AutoBody.jl: keep master's chain-rule measure refactor with global_offset
- Body.jl: fix a.σ → bc.σ (abstract_BCs moved fields to BC struct)
- BC.jl: add local_perdot/global_perdot hooks, fix positional arg call
- Poisson.jl: merge scalar_halo!/global_dot hooks with new GS-RB and ω
- MultiLevelPoisson.jl: merge wallBC_L! call with new ω and GS-RB
- util.jl: keep local_dot/local_sum hooks, drop old dot_R/sum_R
- MPI ext: qualify local_dot/local_sum, add global_perdot override
- maintests.jl: use global_perdot, relaxed pois.n assertion, bc.V
|
Current benchmarks vs WL 1.6.1 serial. In general performance is pretty similar, but there are some odd speedups/slowdowns, maybe coming from the benchmark environment.
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 3.61 │ 137.61 │ 1.25 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4675 │ 0.00 │ 4.51 │ 172.05 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2216107 │ 0.00 │ 2.84 │ 108.16 │ 1.59 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2039313 │ 0.00 │ 2.75 │ 105.05 │ 1.64 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2361907 │ 0.00 │ 0.45 │ 17.27 │ 9.97 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2264335 │ 0.00 │ 0.41 │ 15.68 │ 10.97 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 5087 │ 0.00 │ 27.61 │ 131.67 │ 1.03 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4607 │ 0.00 │ 28.38 │ 135.35 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2381177 │ 0.00 │ 17.60 │ 83.93 │ 1.61 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 2032707 │ 0.00 │ 16.78 │ 80.00 │ 1.69 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2558162 │ 0.00 │ 2.89 │ 13.77 │ 9.83 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2248609 │ 0.00 │ 2.65 │ 12.62 │ 10.72 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.06 │ 137.76 │ 1.03 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 3807 │ 0.00 │ 4.19 │ 142.15 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 1959107 │ 0.00 │ 2.52 │ 85.60 │ 1.66 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1748907 │ 0.00 │ 2.39 │ 81.12 │ 1.75 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2043505 │ 0.00 │ 0.42 │ 14.25 │ 9.97 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 1927507 │ 0.00 │ 0.37 │ 12.54 │ 11.34 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 35.95 │ 152.37 │ 1.09 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 4207 │ 0.00 │ 39.14 │ 165.90 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 2091355 │ 0.00 │ 17.72 │ 75.12 │ 2.21 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 1888707 │ 0.00 │ 17.24 │ 73.07 │ 2.27 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 2208107 │ 0.00 │ 3.00 │ 12.70 │ 13.06 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 2093308 │ 0.00 │ 2.87 │ 12.16 │ 13.65 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 9019 │ 0.00 │ 4.17 │ 318.47 │ 0.84 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 5817 │ 0.00 │ 3.52 │ 268.26 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 7058383 │ 0.89 │ 3.24 │ 247.23 │ 1.09 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 5269745 │ 0.63 │ 2.83 │ 216.01 │ 1.24 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 5342419 │ 0.00 │ 0.72 │ 55.21 │ 4.86 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 3297948 │ 0.00 │ 0.45 │ 34.54 │ 7.77 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬──────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼──────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ CPUx01 │ abstract_BCs │ 1.11.5 │ Float32 │ 5647 │ 0.00 │ 21.34 │ 203.51 │ 1.24 │
│ CPUx01 │ v1.6.1 │ 1.11.5 │ Float32 │ 7863 │ 0.00 │ 26.53 │ 253.04 │ 1.00 │
│ CPUx04 │ abstract_BCs │ 1.11.5 │ Float32 │ 11217857 │ 0.34 │ 14.63 │ 139.49 │ 1.81 │
│ CPUx04 │ v1.6.1 │ 1.11.5 │ Float32 │ 12258738 │ 0.29 │ 15.37 │ 146.60 │ 1.73 │
│ GPU-NVIDIA │ abstract_BCs │ 1.11.5 │ Float32 │ 4291109 │ 0.00 │ 2.05 │ 19.55 │ 12.94 │
│ GPU-NVIDIA │ v1.6.1 │ 1.11.5 │ Float32 │ 4235848 │ 0.00 │ 2.02 │ 19.31 │ 13.11 │
└────────────┴──────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
|
Replace method overwriting with a dispatch singleton pattern so the MPI extension can precompile normally (__precompile__(false) removed). All hooks (global_dot, scalar_halo!, wallBC_L!, etc.) now dispatch through par_mode[] — serial defines _foo(::Serial), the extension adds _foo(::Parallel). Also removes set_comm! in favor of init_waterlily_mpi as the single entry point, and fixes stale measure! docstring.
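The dispatch-singleton pattern described above can be sketched roughly as follows (type and helper names are taken from the commit message; bodies are assumed). Because the extension only *adds* a method for its own singleton type instead of overwriting an existing one, precompilation stays valid:

```julia
# Sketch of the par_mode[] dispatch pattern (not the exact WaterLily code).
struct Serial end
struct Parallel end
const par_mode = Ref{Union{Serial,Parallel}}(Serial())

# Public hook forwards to an internal method selected by the current mode.
global_dot(a, b) = _global_dot(a, b, par_mode[])
_global_dot(a, b, ::Serial) = sum(a .* b)   # serial fallback

# The MPI extension adds (sketch):
#   _global_dot(a, b, ::Parallel) = MPI.Allreduce(sum(a .* b), +, comm)
# and init_waterlily_mpi() flips the switch: par_mode[] = Parallel()
```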
Move MPI global offset from inside AutoBody.sdf/measure (not GPU-safe due to par_mode[] dispatch) to Simulation construction time. The new _apply_offset wraps the body's map function with a concrete offset SVector, keeping GPU kernels free of abstract dispatch.
- AutoBody.jl: revert sdf/measure to master signatures (plain x, no global_offset calls), add _apply_offset(::AutoBody, offset)
- Body.jl: add _apply_offset default (no-op) and SetBody method, fix d²=T(2+ϵ)^2 type conversion (was Int, now matches master)
- WaterLily.jl: apply offset in Simulation constructor, update global_offset docstring
- BC.jl: rename uBC→U in catch-all BC! method to match master
```julia
end
AutoBody(sdf, map=(x,t)->x) = AutoBody(sdf, map)
_apply_offset(body::AutoBody, offset) =
    AutoBody(body.sdf, let m=body.map, o=offset; (x,t)->m(x .+ o, t); end)
```
I'm confused why this would be needed.
The MPI grid offset is needed for each rank to evaluate the correct SDF. But this might need some better abstraction.
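A small illustration of why the offset matters (hypothetical numbers, minimal stand-in definitions; not WaterLily's API). Each rank stores a local block of the global grid, so a local coordinate x corresponds to global x .+ offset, and a globally positioned body's SDF must be evaluated there:

```julia
using StaticArrays  # concrete SVector offset keeps the closure isbits/GPU-safe

# Sphere of radius 8 centred at global (32, 32).
sdf(x, t) = √sum(abs2, x .- SA[32.0, 32.0]) - 8

offset = SA[16.0, 0.0]                 # this rank's block starts at global x = 16
local_sdf(x, t) = sdf(x .+ offset, t)  # what _apply_offset builds via body.map

# local_sdf(SA[16.0, 32.0], 0.0) evaluates sdf at the global centre → -8.0
```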
```julia
local_perdot(a,b,::Tuple{}) = a⋅b
local_perdot(a,b,perdir,R=inside(a)) = @view(a[R])⋅@view(b[R])
global_perdot(a,b,tup::Tuple{}) = _global_perdot(a, b, tup, par_mode[])
global_perdot(a,b,perdir,R=inside(a)) = _global_perdot(a, b, perdir, R, par_mode[])
```
The function names don't match the docstring.
```julia
"""
    wallBC_L!(L, perdir=())

Zero the Poisson conductivity `L` at physical (non-periodic) boundary faces.
"""
```
That's a bad docstring indeed.
```julia
# old signature (master):
function measure!(a::Flow{N,T},body::AbstractBody;t=zero(T),ϵ=1) where {N,T}
    a.V .= zero(T); a.μ₀ .= one(T); a.μ₁ .= zero(T); d²=T(2+ϵ)^2
    measure_sdf!(a.σ, body, t; fastd²=d²) # measure separately to allow specialization
# new signature (this PR):
function measure!(body::AbstractBody,bc::AbstractBC{N,T};t=zero(T),ϵ=1) where {N,T}
```
I'm confused. Why is Flow now an "AbstractBC"? The core unit is now a BC? Also, the first argument should be the modified argument, that's Flow.
Yes. This is aligned with what we discussed in #228. All the BDIM fields have been moved from Flow to BC struct, so we dispatch on BC. This is why I suggested to start the review without the MPI part. I could also split these PRs otherwise, and first we could go through abstract BCs, then MPI.
I just read that thread and I don't see anything about splitting the BDIM stuff out of Flow in there. Can you explain why this helps with "future proofing"? I don't have a problem with it - but it isn't obvious to me how this helps the organization.
It seems more natural to have all the BC-related fields (external or body) live within the same BC struct. Also, functions like measure! would otherwise need to be passed both Flow and BC.
IGG tightly couples the ghost cell treatment with the specific field, right? That is kind of the opposite, pulling all the BC information out. Bundling all the field data in one struct, and all the BC data for all of those fields in another struct.
Can you walk me through how this will work better to extend stuff in the future? I'm missing the organizational concept.
The need for an AbstractBC and specific BC structs was to dispatch to user-defined BCs such as Biot-Savart. The choice of putting the BDIM fields in there is entirely optional, but it simplifies some function calls since Flow does not need to be passed anymore. It does not impact performance, and is really not related to the IGG hooks.
The original idea of this PR was to create a ParallelBC used to dispatch to MPI (IGG). But after some iterations, I thought it was easier to add hook functions in the serial code (resolving to nothing) that would dispatch to the halo updates in the IGG extension. So this means that the BC abstraction is only useful for user-defined BCs in the end, and it is not important where the BDIM fields live... This can be reverted; it is not critical for this PR.
```julia
    b::Tb
end
_apply_offset(body::SetBody, offset) =
    SetBody(body.op, _apply_offset(body.a, offset), _apply_offset(body.b, offset))
```
Can you explain this choice? It's not obvious why this is a good idea. There are a lot of functions that use global coordinates. apply, interp. moment, etc.
It was a first attempt to let users specify the SDF without having to worry about the MPI grid offset. There might be cleaner ways to do it, though.
I have pushed some small new changes that should now take care of that. It is hard to implement something that is GPU compatible (needs isbits) and does not trigger recompilation.
Regarding testing, I would like to propose using this opportunity to create some reusable performance tests for WaterLily with PerfTest.jl. Concretely, once the new multigrid support feature is implemented and passes all tests on the IGG side, you could create a PR in WaterLily using the new feature, write functional unit tests for it, and generate some performance tests using PerfTest.jl. This way we would be sure that the new feature performs well in the context of WaterLily, and we would have a reusable performance test suite that will signal any future performance regressions. On the IGG side, that would be helpful too, as it would be a structured way to verify the performance of the new feature in a real-world application. In addition, it would give feedback for potential refinements of PerfTest.jl. Note that @Dvegrod is the main developer of PerfTest.jl (he started this project as the practical part of his Master's thesis and is now continuing it as part of his PhD: https://dl.acm.org/doi/abs/10.1145/3801098), so he is very happy to contribute to make this happen. Please let us know how you would like to proceed! |
|
Hi Sam, of course, that would be great! We have our performance benchmark suite in WaterLily-Benchmarks, which tests performance regressions/speedups on full simulations. But I see that PerfTest.jl does performance check on unit tests, so that would definitely fit in "WaterLily.jl/tests/perftest.jl". Indeed, we can set it up once this PR is a bit more mature, we add the necessary MPI-related tests, and we have the IGG support for GMG :) For the moment, maybe we could start by selecting some critical serial unit tests in test/maintests.jl and pass them to PerfTest for serial performance check? Many thanks! |
@Dvegrod ? |
|
Btw note the new clean implementation in #281. This PR was trying to do too many things at the same time! |
|
Hi! I would be very happy to assist 😄. I think it could be quite handy to track down performance regression sources and/or proper resource utilization. And feedback on the tool would be of course useful. I remain available for any inquiries or to help on future unit performance test additions. The basic idea of PerfTest.jl is that you can define performance tests in a simple way, similar to how you would do functional ones, with test sets and a set of macros. (They can even be defined in the same file with the same test sets as the functional suite, if desired). I leave here a link to the repo, and the docs. |
First implementation of abstract boundary conditions (BCs), which are now called during
project!. Needs improvements in docs and some cleanup.