
Non-uniform boundary conditions#165

Merged
weymouth merged 34 commits into WaterLily-jl:master from
marinlauber:add-nonuniform-BCs
Apr 19, 2025

Conversation

@marinlauber
Member

@marinlauber marinlauber commented Sep 13, 2024

This pull request adds the capability to use space- and time-varying inflow boundary conditions, as well as to prescribe a space- and time-varying body force in the momentum step.

The major change is that the u_BC::Function passed to the Flow is now of the form u_BC(i,x,t), where i is the velocity component, x the location in physical space, and t the time. The flow is then accelerated correctly, and the boundary conditions on the velocity field are applied correctly. This makes it possible to prescribe velocity profiles at the inlet and outlet.
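For illustration, a space- and time-varying inlet in the new u_BC(i,x,t) form could look like this (the parabolic profile, the channel height H, and the 10% pulsation amplitude are made-up examples, not code from the PR; WaterLily passes x as an SVector, a plain tuple is used here to keep the sketch self-contained):

```julia
# Hypothetical pulsating parabolic inlet in the new u_BC(i,x,t) form.
H = 64.0
function u_BC(i, x, t)
    i != 1 && return 0.0                      # only the streamwise component is set
    y = clamp(x[2]/H, 0, 1)                   # normalized wall distance
    4y*(1 - y)*(1 + 0.1sin(2π*t/100))         # parabola with a slow pulsation
end

u_BC(1, (0.0, 32.0, 0.0), 0.0)                # centreline velocity at t = 0 is 1.0
```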

Another feature grabbed from #174 is the ability to pass either a body_force::AbstractArray or a body_force::Function=(i,x,t)->... to mom_step! and sim_step!, which allows the user either to prescribe a custom forcing term or to accelerate the flow locally.
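The two accepted forms might be sketched as follows (illustrative names only; the exact keyword plumbing lives in mom_step!/sim_step! and is not reproduced here):

```julia
# body_force as a precomputed array (same layout as the velocity field), or
# as a function of component, position, and time. Both are sketches of the
# interface described above, not verbatim PR code.
g_array = zeros(Float32, 16, 16, 2)            # custom precomputed forcing term
g_fun(i, x, t) = i == 2 ? -0.5 : 0.0           # e.g. a constant downward acceleration
```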

@marinlauber marinlauber mentioned this pull request Dec 2, 2024
@marinlauber
Member Author

marinlauber commented Feb 27, 2025

From some basic tests: the issue was the definition of the function for the non-uniform velocity BC (it was type-unstable, which allocated a lot).
Additionally, passing an unspecified U to the BC! function was also type-unstable; here I have made two different functions, and internally one passes to the other.

The only real difference in the allocations is in accelerate!, where the @loop is a bit slower than broadcasting.

using WaterLily
using StaticArrays
using ForwardDiff
import WaterLily: accelerate!,@loop,size_u,slice
using BenchmarkTools

# classical accelerate!
# accelerate!(r,dt,g::Function,::Tuple,t=sum(dt)) = for i ∈ 1:last(size(r))
#     r[..,i] .+= g(i,t)
# end

accelerate_loop!(r,dt,g::Function,::Tuple,t=sum(dt)) = for i ∈ 1:last(size(r))
    @loop r[I,i] += g(i,t) over I ∈ CartesianIndices(Base.front(size(r)))
end

new_accelerate!(r,g::Function,t) = for i ∈ 1:last(size(r))
    @loop r[I,i] += g(i,loc(i,I,eltype(r)),t) over I ∈ CartesianIndices(Base.front(size(r)))
end

# some array to test this
N = 2^7
u = zeros(Float32,N,N,N,3);

# utils
fun(i,t)=sin(2π*t)*cos(2π*t)*cos(2π*t)
fun2(i,x,t) = sin(2π*t)*cos(2π*t)*cos(2π*t)
fun3(i,x,t) = sin(2π*x[1])*cos(2π*x[2])*cos(2π*x[3])
tu=(); ts=[1.0]; t=0

# old method
@btime accelerate!($u,$ts,$fun,$tu) # 5.265 ms (9 allocations: 24.00 MiB)
# same but loop
@btime accelerate_loop!($u,$ts,$fun,$tu) # 9.490 ms (849 allocations: 67.02 KiB)
# new method
@btime new_accelerate!($u,$fun2,$t) # 9.516 ms (849 allocations: 66.94 KiB)
@btime new_accelerate!($u,$fun3,$t) # 10.829 ms (849 allocations: 66.94 KiB)

# the new BC function
new_BC!(a,U,saveexit=false,perdir=(),t=0) = new_BC!(a,(i,x,t)->U[i],saveexit,perdir,t)
function new_BC!(a,uBC::Function,saveexit=false,perdir=(),t=0)
    N,n = size_u(a)
    for i ∈ 1:n, j ∈ 1:n
        if j in perdir
            @loop a[I,i] = a[CIj(j,I,N[j]-1),i] over I ∈ slice(N,1,j)
            @loop a[I,i] = a[CIj(j,I,2),i] over I ∈ slice(N,N[j],j)
        else
            if i==j # Normal direction, Dirichlet
                for s ∈ (1,2)
                    @loop a[I,i] = uBC(i,loc(i,I),t) over I ∈ slice(N,s,j)
                end
                (!saveexit || i>1) && (@loop a[I,i] = uBC(i,loc(i,I),t) over I ∈ slice(N,N[j],j)) # overwrite exit
            else    # Tangential directions, Neumann
                @loop a[I,i] = a[I+δ(j,I),i] over I ∈ slice(N,1,j)
                @loop a[I,i] = a[I-δ(j,I),i] over I ∈ slice(N,N[j],j)
            end
        end
    end
end

# very simple BCs
U = SA[1.0,0.,0.]
Ubc(i,I,t) = i==1 ? 1 : 0

# utils
x = SA[10.0,10.0,10.0]; t = 0.0; conv_exit=false
tu = (); i = 1

# test the bcs
@btime BC!($u,$U,$conv_exit,$tu) # 1.025 ms (5905 allocations: 477.30 KiB)
# new function with an array
@btime new_BC!($u,$U,$conv_exit,$tu,$t) # 1.038 ms (5905 allocations: 477.30 KiB)
# new function with a function
@btime new_BC!($u,$Ubc,$conv_exit,$tu,$t) # 770.551 μs (5878 allocations: 468.30 KiB)
@btime Ubc($i,$x,$t) # 1.031 ns (0 allocations: 0 bytes)

# more complex function
function u_pipe(i,x::SVector{3,T},t::T) where T
    i ≠ 1 && return zero(T)
    r = sum(abs2,SA[x[2],x[3]].-64.0)
    return 2r>125 ? zero(T) : convert(T,1-r^2/16384)
end
tf32 = 0.f0
@btime new_BC!($u,$u_pipe,$conv_exit,$tu,$tf32) # 780.826 μs (5905 allocations: 468.72 KiB)
@btime u_pipe($i,$x,$t) # 2.512 ns (0 allocations: 0 bytes)

@marinlauber marinlauber marked this pull request as ready for review February 27, 2025 14:50
@marinlauber
Member Author

@weymouth @b-fg let me know what you think. It seems we could clean up the accelerate!/body_force! interface a bit more.

@codecov

codecov bot commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 5 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/util.jl | 72.22% | 5 Missing ⚠️

Files with missing lines | Coverage Δ
src/Flow.jl | 100.00% <100.00%> (ø)
src/Metrics.jl | 93.47% <100.00%> (ø)
src/WaterLily.jl | 96.29% <100.00%> (-0.26%) ⬇️
src/util.jl | 81.98% <72.22%> (+0.79%) ⬆️

@b-fg
Member

b-fg commented Mar 3, 2025

Just had a quick look and I like it! Also, I guess this PR solves #174 too, right? Happy to close that one if so. Also I like this body_force implementation, which can be array or function. It will be useful for including turbulence models ;)

@marinlauber
Member Author

marinlauber commented Mar 3, 2025

Yes, this fixes #174 (the body_force part of it).

There is still one part of the tests failing, related to constructing the simulation with a uBC(i,t) and internally transforming it into a uBC(i,x,t). This behaviour breaks the impulsive cylinder force test.

Additionally, I am not sure if we want to keep the possibility of constructing the sim with three different velocity BC types (Tuple, Function{Int,Number}, and Function{Int,SVector,Number}), or if we should change everything over to the second function type.

@marinlauber
Member Author

Just had a quick look and I like it! Also, I guess this PR solves #174 too, right? Happy to close that one if so. Also I like this body_force implementation, which can be array or function. It will be useful for including turbulence models ;)

That was the idea: a simple body force can be implemented as g(i,x,t), but more complex ones, like turbulence models, need access to Flow variables, which is not easy to implement with just a function ;)

@b-fg
Member

b-fg commented Mar 3, 2025

The only problem now is that passing a time-varying body force that needs the instantaneous flow field data (i.e. velocity) into sim_step! is not possible. That would require an instantaneous body_force::AbstractArray implementation, but currently the body_force in sim_step! is not updated until t_end. One could use the "single-step" sim_step! though, but there needs to be some indication of how to use it.
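A toy version of the "single-step" pattern mentioned here: advance one step at a time and rebuild the forcing array from the instantaneous velocity in between (`step!` is a hypothetical stand-in for the single-step `sim_step!`, not WaterLily API):

```julia
u = ones(8); f = zeros(8)                  # toy velocity and body-force arrays
step!(u, f) = (u .+= 0.1 .* f)             # stand-in for one solver step
for _ in 1:3
    f .= -0.5 .* u                         # rebuild forcing from the current velocity…
    step!(u, f)                            # …then take a single step with it
end
u                                          # each step scales u by 0.95
```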

@b-fg
Member

b-fg commented Mar 3, 2025

I have debugged your test error. Actually, the tests already fail on GPU for the "Flow.jl with increasing body force" test. The non-isbits error arises from this new line
https://github.com/WaterLily-jl/WaterLily.jl/pull/165/files#diff-83b57561857c34d538577fdf7a3f03622f409f3dc9279d2fbfc5b9cf070daad7R72
Apparently, the GPU does not like the u_BC = uBC aliasing; just passing it directly as uλ = (i,x)->u_BC[i] works.
I have not debugged further (the error you were referring to), but I would fix that first and then move on to the next one. Remember to always run the tests on both CPU and GPU!

@marinlauber
Member Author

@b-fg I didn't like this aliasing anyway; I'll try to find a better way to accommodate backward compatibility. Do you have any ideas on how to do that?
I was debugging on my laptop this weekend, so I didn't test on the GPU ;)

@b-fg
Member

b-fg commented Mar 4, 2025

Ha, I was working on a fix too. Are your tests ok on the GPU now?

b-fg added 4 commits March 4, 2025 13:31
…type for Simulation and Flow (Flow was Float64 previously).
…m_step with arbitrary keyword arguments.

The function can modify the ::Flow object and is called both in the predictor and corrector steps.
Added tests for its implementation of an increasing body force, but using udf instead of body_force.
All tests passing on Array and CuArray.
@b-fg
Member

b-fg commented Mar 4, 2025

The latest commit implements a general user-defined function (UDF) that is called both in predictor and corrector steps. This gives users the flexibility to implement any type of instantaneous and local function in their own scripts, with arbitrary keyword arguments, and pass it into the solver.

I do not love the fact that all g, body_force, and udf can be passed into sim_step!, but the current implementation is fairly clean otherwise. Maybe we can try to unify g and body_force since currently both of these features have a very similar utility, although them being separated also has added benefits. Thoughts @marinlauber @weymouth @TzuYaoHuang ?
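A minimal sketch of what such a UDF could look like (the `ToyFlow` struct and `my_udf!` are toy stand-ins for WaterLily's Flow and the function passed to the solver, not actual API):

```julia
mutable struct ToyFlow
    u::Vector{Float64}   # stand-in for the velocity field
    t::Float64           # current time
end

# The UDF mutates the flow state and accepts arbitrary keyword arguments;
# per the PR description, it is called in both predictor and corrector steps.
my_udf!(flow; gain=0.1) = (flow.u .+= gain * flow.t; nothing)

flow = ToyFlow(zeros(3), 2.0)
my_udf!(flow; gain=0.5)      # flow.u is now [1.0, 1.0, 1.0]
```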

@marinlauber
Member Author

I think body_force can go, because if it's only an (i,x,t) function, we can make g like that and apply it when needed. Then udf is the only additional argument to mom_step!.

@weymouth
Member

weymouth commented Apr 1, 2025 via email

@marinlauber
Member Author

I'll do the performance benchmark. It was fine when we first introduced the u_BC so it should be OK.

@marinlauber
Member Author

marinlauber commented Apr 1, 2025

Running tests...
[ Info: Test backends: Array, CuArray
Test Summary: | Pass  Total   Time
util.jl       |   54     54  41.3s
Test Summary: | Pass  Total   Time
Poisson.jl    |   14     14  27.4s
Test Summary:        | Pass  Total   Time
MultiLevelPoisson.jl |   13     13  11.3s
Test Summary: | Pass  Total   Time
Flow.jl       |   30     30  22.8s
Test Summary: | Pass  Total  Time
Body.jl       |    3      3  0.0s
Test Summary: | Pass  Total   Time
AutoBody.jl   |   16     16  13.0s
Test Summary:        | Pass  Total  Time
Flow.jl periodic TGV |    2      2  8.7s
Test Summary: | Pass  Total   Time
ForwardDiff   |    2      2  39.8s
Test Summary:                      | Pass  Total   Time
Flow.jl with increasing body force |    4      4  15.5s
Test Summary:       | Pass  Total  Time
Boundary Layer Flow |    2      2  7.7s
Test Summary:            | Pass  Total  Time
Rotating reference frame |    1      1  2.9s
Test Summary:               | Pass  Total   Time
Circle in accelerating flow |    8      8  10.4s
Test Summary: | Pass  Total   Time
Metrics.jl    |   37     37  16.8s
Test Summary: | Pass  Total  Time
WaterLily.jl  |   30     30  7.8s
Test Summary: | Pass  Total   Time
VTKExt.jl     |   28     28  21.9s
     Testing WaterLily tests passed 
▶ Allocated 8 KiB
▶ Allocated 11 KiB
Test Summary:         | Pass  Total   Time
mom_step! allocations |    2      2  33.2s
     Testing WaterLily tests passed 

The results of the benchmark (I had to run them separately, since the TGV benchmark requires some changes between the two versions):

sh benchmark.sh -w "master 01d9a24" -v "1.11" -t "1 4" -b "Array CuArray" -c "tgv jelly" -p "6,7 5,6" -s "100 100" -ft "Float32 Float64"
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.4 │   Float32 │       80936 │   0.00 │     6.40 │           244.06 │     1.00 │
│     CPUx01 │             master │ 1.11.4 │   Float32 │       80021 │   0.00 │     4.30 │           163.92 │     1.49 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.4 │   Float32 │     2343862 │   0.00 │     4.04 │           154.00 │     1.58 │
│     CPUx04 │             master │ 1.11.4 │   Float32 │     2315093 │   0.00 │    10.67 │           407.08 │     0.60 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.4 │   Float32 │     2994053 │   0.00 │     0.59 │            22.59 │    10.80 │
│ GPU-NVIDIA │             master │ 1.11.4 │   Float32 │     2828270 │   0.00 │     0.54 │            20.44 │    11.94 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.4 │   Float32 │       75708 │   0.00 │    43.59 │           207.88 │     1.00 │
│     CPUx01 │             master │ 1.11.4 │   Float32 │       74816 │   0.00 │    30.43 │           145.11 │     1.43 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.4 │   Float32 │     2181049 │   0.00 │    27.45 │           130.91 │     1.59 │
│     CPUx04 │             master │ 1.11.4 │   Float32 │     2153203 │   0.00 │    47.59 │           226.90 │     0.92 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.4 │   Float32 │     2738138 │   0.00 │     2.41 │            11.48 │    18.10 │
│ GPU-NVIDIA │             master │ 1.11.4 │   Float32 │     2573981 │   0.00 │     2.11 │            10.07 │    20.64 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Figure stored in /home/marin/Workspace/WaterLily-Benchmarks/plots/tgv_cost_add-nonuniform-BCs_master_1.11.4_Float32.pdf
Figure stored in /home/marin/Workspace/WaterLily-Benchmarks/plots/tgv_benchmark_add-nonuniform-BCs_master_1.11.4_Float32.pdf
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.4 │   Float64 │      173027 │   0.00 │     7.59 │           578.72 │     1.00 │
│     CPUx01 │             master │ 1.11.4 │   Float64 │      172527 │   0.00 │    14.01 │          1069.16 │     0.54 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.4 │   Float64 │     4803475 │   0.67 │     7.56 │           576.85 │     1.00 │
│     CPUx04 │             master │ 1.11.4 │   Float64 │     4804775 │   0.41 │    13.04 │           994.86 │     0.58 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.4 │   Float64 │     7126426 │   1.61 │     1.18 │            90.11 │     6.42 │
│ GPU-NVIDIA │             master │ 1.11.4 │   Float64 │     6845780 │   1.88 │     1.30 │            99.27 │     5.83 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.4 │   Float64 │      224141 │   0.00 │    42.21 │           402.55 │     1.00 │
│     CPUx01 │             master │ 1.11.4 │   Float64 │      223641 │   0.00 │    89.73 │           855.70 │     0.47 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.4 │   Float64 │     6309610 │   0.11 │    41.80 │           398.62 │     1.01 │
│     CPUx04 │             master │ 1.11.4 │   Float64 │     6313610 │   0.11 │    59.71 │           569.47 │     0.71 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.4 │   Float64 │     9727169 │   0.41 │     4.78 │            45.58 │     8.83 │
│ GPU-NVIDIA │             master │ 1.11.4 │   Float64 │     9415538 │   0.54 │     5.22 │            49.79 │     8.09 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth
Member

weymouth commented Apr 2, 2025

Why in the world is the PR twice as fast for the jelly? There should be no changes for that test, right?

And a 50% slow down for the TGV is a problem. Is that the cost of applying (and then ignoring) the domain BC function?

@b-fg
Member

b-fg commented Apr 2, 2025

I'd focus on the GPU timings; these are typically more consistent, and the difference is not large here. I can also run them to double-check.

Also, I see you run the TGV in Float32 and the Jelly in Float64. Better run both cases with both precisions to see what's going on.

@marinlauber
Member Author

Yeah, they are weird. I can also try to run the benchmark on Nefertem or DelftBlue; it might be a fairer comparison. I ran those on my workstation, but you never know what processes are running in the background.

@b-fg I use the first example command line for the benchmark, do you have a "better" one?

@b-fg
Member

b-fg commented Apr 2, 2025

So that I run the same benchmark as you, what exact function for the TGV are you using now in this PR? (The WaterLily-Benchmarks TGV case only works for master.)

@marinlauber
Member Author

I've modified the original one to this

function tgv(p, backend; Re=1600, T=Float32)
    L = 2^p; U = T(1); κ = T(π/L); ν = T(1/(κ*Re))
    function Uλ(i,xyz,t)
        x,y,z = @. xyz*κ
        i==1 && return -U*sin(x)*cos(y)*cos(z)
        i==2 && return  U*cos(x)*sin(y)*cos(z)
        return 0*U
    end
    Simulation((L, L, L), Uλ, 1/κ; U=U, ν=ν, T=T, mem=backend)
end

And I was wondering if this case should not be triple-periodic?

@b-fg
Member

b-fg commented Apr 2, 2025

return 0*U smart haha.

@b-fg
Member

b-fg commented Apr 2, 2025

This is what I got

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       80936 │   0.00 │     6.93 │           264.33 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       80021 │   0.00 │     4.23 │           161.24 │     1.64 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2343862 │   0.00 │     4.30 │           163.89 │     1.61 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2315093 │   0.00 │     3.21 │           122.45 │     2.16 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2994144 │   0.00 │     0.71 │            27.00 │     9.79 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     2828362 │   0.00 │     0.61 │            23.10 │    11.44 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       75708 │   0.00 │    44.25 │           211.00 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       74816 │   0.00 │    26.27 │           125.28 │     1.68 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2181049 │   0.00 │    26.16 │           124.72 │     1.69 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.98 │            95.29 │     2.21 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2738125 │   0.00 │     3.68 │            17.54 │    12.03 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     2573948 │   0.00 │     3.15 │            15.04 │    14.03 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Figure stored in /home/b-fg/Workspace/tudelft1/WaterLily-Benchmarks/plots/tgv_cost_add-nonuniform-BCs_master_1.11.3_Float32.pdf
Figure stored in /home/b-fg/Workspace/tudelft1/WaterLily-Benchmarks/plots/tgv_benchmark_add-nonuniform-BCs_master_1.11.3_Float32.pdf
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │      161455 │   0.00 │     3.80 │           289.55 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │      160955 │   0.00 │     3.79 │           289.13 │     1.00 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     4456091 │   0.51 │     3.82 │           291.56 │     0.99 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     4457391 │   0.44 │     4.17 │           318.20 │     0.91 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     5697955 │   0.00 │     1.03 │            78.77 │     3.68 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     5410254 │   0.00 │     1.04 │            79.28 │     3.65 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │      208466 │   0.00 │    27.52 │           262.47 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │      207966 │   0.00 │    26.87 │           256.21 │     1.02 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     5839681 │   0.13 │    21.00 │           200.27 │     1.31 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     5843681 │   0.12 │    21.32 │           203.30 │     1.29 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     7679971 │   0.40 │     3.91 │            37.29 │     7.04 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     7369919 │   0.38 │     3.97 │            37.83 │     6.94 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Figure stored in /home/b-fg/Workspace/tudelft1/WaterLily-Benchmarks/plots/jelly_cost_add-nonuniform-BCs_master_1.11.3_Float32.pdf
Figure stored in /home/b-fg/Workspace/tudelft1/WaterLily-Benchmarks/plots/jelly_benchmark_add-nonuniform-BCs_master_1.11.3_Float32.pdf

Looking at the GPU results, there is indeed something like a 10-15% slowdown on TGV (and indeed a severe slowdown on CPU). Is it because the function is now used to set the BCs at every time step? Of course, evaluating that function (also used for setting the initial condition) is more expensive than just getting a value from a Tuple (i.e. U=(0,0,0)), as we currently have in master for this test case.

@weymouth
Member

weymouth commented Apr 3, 2025

It has to either be this or the evaluation of the acceleration term, right? Those are the only two changes.

@b-fg
Member

b-fg commented Apr 13, 2025

I am working on a hacky way to fix the performance issues, using BC buffers computed during pre-processing. I will push the changes and, if you like this, we can clean it up together. This won't fix the (unavoidable) cost of space-time-varying BC functions, but it should behave like master otherwise.

@b-fg
Member

b-fg commented Apr 13, 2025

Doing some tests with the pre-processing of BCs, I get the same behaviour as we have been observing. Complex functions yield a severe slowdown on serial CPU (multi-threaded CPU or GPU does not care much, since we work with D-1 slices). Here are some results (full MWE in the details):

T = Float32
mem = CuArray
a = rand(T,150,100,50,3) |> mem
N,n = size_u(a)
normal_buffers, tangential_buffers = get_buffers(a, Uλ; T, mem)

b, c = copy(a), copy(a)
apply_BCs!(b, Uλ)
apply_BCs!(c, normal_buffers, tangential_buffers)
@assert isapprox(b, c, atol=1e-5)

@btime apply_BCs!($b, $Uλ) # CPU -t 1: 3.571 ms (54 allocations: 864 bytes) | GPU: 84.264 μs (936 allocations: 34.69 KiB)
@btime apply_BCs!($c, $normal_buffers, $tangential_buffers) # CPU -t 1: 184.782 μs (84 allocations: 1.88 KiB) | GPU: 102.383 μs (1197 allocations: 43.55 KiB)
Details
using Revise, WaterLily, CUDA, StaticArrays, BenchmarkTools

@inline CI(a...) = CartesianIndex(a...)
CIj(j,I::CartesianIndex{d},k) where d = CI(ntuple(i -> i==j ? k : I[i], d))
δ(i,::Val{N}) where N = CI(ntuple(j -> j==i ? 1 : 0, N))
δ(i,I::CartesianIndex{N}) where N = δ(i, Val{N}())
function slice(dims::NTuple{N},i,j,low=1) where N
    CartesianIndices(ntuple( k-> k==j ? (i:i) : (low:dims[k]), N))
end
@inline loc(i,I::CartesianIndex{N},T=Float32) where N = SVector{N,T}(I.I .- 1.5 .- 0.5 .* δ(i,I).I)
@inline loc(Ii::CartesianIndex,T=Float32) = loc(last(Ii),Base.front(Ii),T)
splitn(n) = Base.front(n),last(n)
size_u(u) = splitn(size(u))

function Uλ(i,xyz,t)
    x,y,z = @. xyz
    i==1 && return -sin(x)*cos(y)*cos(z)
    i==2 && return  cos(x)*sin(y)*cos(z)
    return 0
end

function apply_BCs!(a, u_BC::Function)
    N,n = size_u(a)
    for i ∈ 1:n, j ∈ 1:n
        if i==j # Normal direction, Dirichlet
            for s ∈ (1,2)
                WaterLily.@loop a[I,i] = u_BC(i,loc(i,I),0) over I ∈ slice(N,s,j)
            end
            # if i>1
                WaterLily.@loop a[I,i] = u_BC(i,loc(i,I),0) over I ∈ slice(N,N[j],j)
            # end
        else    # Tangential directions, Neumann
            WaterLily.@loop a[I,i] = u_BC(i,loc(i,I),0)+a[I+δ(j,I),i]-u_BC(i,loc(i,I+δ(j,I)),0) over I ∈ slice(N,1,j)
            WaterLily.@loop a[I,i] = u_BC(i,loc(i,I),0)+a[I-δ(j,I),i]-u_BC(i,loc(i,I-δ(j,I)),0) over I ∈ slice(N,N[j],j)
        end
    end
end

function apply_BCs!(a, normal_buffers, tangential_buffers)
    N,n = size_u(a)
    for i ∈ 1:n, j ∈ 1:n
        if i==j # Normal direction, Dirichlet
            for s ∈ (1,2)
                b = normal_buffers[i][s]
                WaterLily.@loop a[I,i] = b[I-δ(j,I)*(s-1)] over I ∈ slice(N,s,j)
            end
            # if i>1
                b = normal_buffers[i][3]
                WaterLily.@loop a[I,i] = b[I-δ(i,I)*(N[i]-1)] over I ∈ slice(N,N[j],j)
            # end
        else    # Tangential directions, Neumann
            b = tangential_buffers[i,j][1]
            WaterLily.@loop a[I,i] = a[I+δ(j,I),i] + b[I] over I ∈ slice(N,1,j)
            b = tangential_buffers[i,j][2]
            WaterLily.@loop a[I,i] = a[I-δ(j,I),i] + b[I-δ(j,I)*(N[j]-1)] over I ∈ slice(N,N[j],j)
        end
    end
end

function get_buffers(a, u_BC; T=Float32, mem=Array)
    N,n = size_u(a)
    normal_buffers = collect(mem{T,n}[] for i ∈ 1:n)
    tangential_buffers = collect(mem{T,n}[] for i ∈ 1:n, j ∈ 1:n)
    for i ∈ 1:n, j ∈ 1:n
        b = zeros(eltype(a), size(slice(N,1,j))...) |> mem
        if i == j
            for s ∈ (1,2)
                WaterLily.@loop b[I-δ(j,I)*(s-1)] = u_BC(i,loc(i,I),0) over I ∈ slice(N,s,j)
                push!(normal_buffers[i], copy(b))
            end
            WaterLily.@loop b[I-δ(j,I)*(N[j]-1)] = u_BC(i,loc(i,I),0) over I ∈ slice(N,N[j],j)
            push!(normal_buffers[i], copy(b))
        else
            WaterLily.@loop b[I] = u_BC(i,loc(i,I),0)-u_BC(i,loc(i,I+δ(j,I)),0) over I ∈ slice(N,1,j)
            push!(tangential_buffers[i,j], copy(b))
            WaterLily.@loop b[I-δ(j,I)*(N[j]-1)] = u_BC(i,loc(i,I),0)-u_BC(i,loc(i,I-δ(j,I)),0) over I ∈ slice(N,N[j],j)
            push!(tangential_buffers[i,j], copy(b))
        end
    end
    return normal_buffers, tangential_buffers
end

T = Float32
mem = CuArray
a = rand(T,150,100,50,3) |> mem
N,n = size_u(a)
normal_buffers, tangential_buffers = get_buffers(a, Uλ; T, mem)

b, c = copy(a), copy(a)
apply_BCs!(b, Uλ)
apply_BCs!(c, normal_buffers, tangential_buffers)
@assert isapprox(b, c, atol=1e-5)

@btime apply_BCs!($b, $Uλ) # CPU -t 1: 3.571 ms (54 allocations: 864 bytes) | GPU: 84.264 μs (936 allocations: 34.69 KiB)
@btime apply_BCs!($c, $normal_buffers, $tangential_buffers) # CPU -t 1: 184.782 μs (84 allocations: 1.88 KiB) | GPU: 102.383 μs (1197 allocations: 43.55 KiB)

@marinlauber
Member Author

I think that, regardless of how well the BC kernel is launched, we will never get the same evaluation time for a Tuple and a Function. Should we type-dispatch to two different functions to keep the fast runtime for Tuple?
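The type-dispatch idea could look roughly like this (toy fill functions, not the actual BC! code): the Tuple method keeps the cheap constant fill, and only the Function method pays the per-point evaluation cost.

```julia
set_bc!(a, U::Tuple) = (a .= U[1]; a)           # fast path: constant value
set_bc!(a, U::Function) = (for k in eachindex(a)
    a[k] = U(1, k, 0.0)                         # slow path: evaluate per point
end; a)

set_bc!(zeros(4), (1.0, 0.0))                   # constant fill
set_bc!(zeros(4), (i, x, t) -> 2.0x)            # per-point evaluation
```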

@marinlauber
Member Author

What I was thinking is to steal the @vecloop from BiotSavartBCs and try to launch one or two kernels per BC call: one for the normal and one for the tangential component.

@b-fg
Member

b-fg commented Apr 14, 2025

I like the @vecloop approach. I was also already dispatching by Tuple or Function (not shown in the MWE though). So I will try to push these changes, and maybe you can try to add the @vecloop?

@weymouth
Member

weymouth commented Apr 14, 2025 via email

@b-fg
Member

b-fg commented Apr 14, 2025

I have implemented the BC buffers computed during pre-processing. All tests pass locally on CPU and GPU, and I have added an extra test in @testset "util.jl" as well. Then, when I was checking performance, I saw that we do not improve much anyway (master is still much faster). So for the TGV case I set all the BCs to be periodic, so that the whole BC shenanigans are bypassed in both master and the current PR. Even then, master is still ~30% faster on serial CPU. This indicates that there is something else going on, right? Maybe it is the new dispatch? On the other hand, the performance on the jelly test case is fine...

Below are the benchmark results for the periodic TGV on serial CPU, where 01d9a24 is the current state of this PR, add-nonuniform-BCs is the latest BC-buffer implementation (still not committed here), and then master.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌─────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │            01d9a24 │ 1.11.3 │   Float32 │      410697 │   0.00 │    67.85 │           323.52 │     1.00 │
│  CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │      410697 │   0.00 │    68.93 │           328.68 │     0.98 │
│  CPUx01 │             master │ 1.11.3 │   Float32 │      406772 │   0.00 │    51.55 │           245.80 │     1.32 │
└─────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg
Member

b-fg commented Apr 14, 2025

I think I know what's going on... With the current setup, accelerate! is called when running the TGV, since

accelerate!(r,t,::Nothing,U::Function) = accelerate!(r,t,(i,x,t)->ForwardDiff.derivative(τ->U(i,x,τ),t))

means any U::Function triggers the accelerate!(r,t,g,U) path.
-.-

@b-fg
Member

b-fg commented Apr 14, 2025

How should we handle this? Currently accelerate! uses both g and Uλ:

accelerate!(r,t,::Nothing,::Union{Nothing,Tuple}) = nothing
accelerate!(r,t,f::Function) = @loop r[Ii] += f(last(Ii),loc(Ii,eltype(r)),t) over Ii ∈ CartesianIndices(r)
accelerate!(r,t,g::Function,::Union{Nothing,Tuple}) = accelerate!(r,t,g)
accelerate!(r,t,::Nothing,U::Function) = accelerate!(r,t,(i,x,t)->ForwardDiff.derivative(τ->U(i,x,τ),t))
accelerate!(r,t,g::Function,U::Function) = accelerate!(r,t,(i,x,t)->g(i,x,t)+ForwardDiff.derivative(τ->U(i,x,τ),t))

I am not sure what combination we should actually support. Let me know your thoughts.
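For context, the ::Nothing,U::Function method above turns a time-varying background velocity into a frame acceleration by differentiating in time. A minimal stand-alone sketch of that mechanism (the tanh ramp and the names U and accel here are illustrative, not from the code base):

```julia
using ForwardDiff

# Illustrative background velocity: an x-component inflow that ramps up as tanh(t)
U(i, x, t) = i == 1 ? tanh(t) : 0.0

# The frame acceleration is ∂U/∂t, evaluated with ForwardDiff exactly as in the
# (i,x,t)->ForwardDiff.derivative(τ->U(i,x,τ),t) closure above
accel(i, x, t) = ForwardDiff.derivative(τ -> U(i, x, τ), t)

accel(1, (0.0, 0.0), 0.0)  # d/dt tanh(t) at t=0 is sech(0)^2 = 1.0
```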

@b-fg
Member

b-fg commented Apr 14, 2025

(Also, we need to think about why this was not caught by the tests.)

@weymouth
Member

weymouth commented Apr 15, 2025

The acceleration is required when the user supplies a background velocity, since it implies an accelerating reference frame. That background velocity should also be used to set the initial velocity. We have tests to make sure this happens...

However, the TGV is just an initial condition, not a background velocity. More logic is needed in the input conditions, I guess...

@b-fg
Member

b-fg commented Apr 15, 2025

I think we are trying to squeeze too much stuff into Uλ: IC, BC, and the accelerating frame. I think it makes sense to have a separate argument for the IC (as we had before), since the BC and the accelerating frame need to be in sync anyway.

@b-fg
Member

b-fg commented Apr 15, 2025

I am thinking to introduce this (breaking) change:

function Simulation(dims::NTuple{N}, U₀, Uλ;
                    L=1, U=nothing, Δt=0.25, ν=0., g=nothing, ϵ=1, perdir=(),
                    exitBC=false, body::AbstractBody=NoBody(),
                    T=Float32, mem=Array) where N
    @assert !(isnothing(U) && isa(Uλ,Function)) "`U` (velocity scale) must be specified if `Uλ` is a `Function`"
    isnothing(U) && (U = sum(abs2,Uλ))
    check_fn(g,N,T); check_fn(Uλ,N,T)
    flow = Flow(dims,U₀,Uλ;Δt,ν,g,T,f=mem,perdir,exitBC)
    measure!(flow,body;ϵ)
    new(U,L,ϵ,flow,body,MultiLevelPoisson(flow.p,flow.μ₀,flow.σ;perdir))
end

So when creating a Simulation, users always have to specify the initial condition (U₀) and the boundary conditions (Uλ), which seems reasonable from a physics standpoint. If Uλ::Function, it will also be used during accelerate! (even when g::Nothing) to apply the background velocity.

Alternatively, we could also have Uλ as a keyword argument and use U₀ to define the BC if Uλ is not defined.

If you like this I can update the code base and the tests. WaterLily-Examples should also be updated. So it would be reasonable to bump WaterLily to 1.4 as well. Let me know!
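To make the proposed split concrete, here is a small stand-alone sketch (not using WaterLily itself; the parabolic profile, L=32, and the way U₀ defaults to the BC are illustrative choices following the proposal above):

```julia
# Uλ plays the role of the boundary condition, U₀ the initial condition.
L = 32
Uλ(i, x) = i == 1 ? 4 * (x[2] / L) * (1 - x[2] / L) : 0.0  # parabolic inflow in x
U₀(i, x) = Uλ(i, x)                                         # default IC taken from the BC

U₀(1, (0.0, L / 2))  # peak velocity at mid-channel = 1.0
```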

@marinlauber
Member Author

But this is kind of what we had before: u_BC was always specified, as a Function or Tuple, and the user could decide whether or not to use it to initialise the fields.

@b-fg
Member

b-fg commented Apr 15, 2025

Yes, and I am not sure how to do this without adding another argument to Simulation(...). Why was this removed in the first place? Was it causing a conflict, or just simplifying things for the user?

…leaned up across Simulation and Flow. All tests passing locally on CPU and GPU.
@b-fg
Member

b-fg commented Apr 16, 2025

After a short conversation with Gabe, I have re-introduced the separate arguments uBC (boundary conditions and non-uniform background flow) and Uλ (initial condition) when creating a Simulation. uBC can be a Tuple or a Function of (i,x,t), and Uλ can be a Tuple or a Function of (i,x). If Uλ is not specified, it defaults to uBC(i,x,0), or just uBC::Tuple.

I took the chance to homogenize some of these arguments across the different objects (Simulation, Flow). All tests currently pass on CPU and GPU. Benchmarks are shown under Details below, and the performance matches master for the TGV and jelly cases. So this PR should be pretty much done.

I might run some performance tests to check the impact of non-uniform BCs and see whether using BC buffers, when uBC is only a function of space, is worth it. But that will be a separate PR, if any.
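The BC-buffer idea can be sketched in isolation: when uBC depends only on space, its boundary values can be evaluated once into a buffer and copied each time step, instead of re-evaluating the function per ghost cell per step. A minimal illustration (the sine profile, N=16, and the names are made up for this sketch):

```julia
# Space-only BC: evaluate the profile once into a precomputed buffer...
uBC(i, x) = i == 1 ? sin(π * x[2] / 16) : 0.0
N = 16
inlet = [uBC(1, (0.0, Float64(j))) for j in 1:N]  # inlet slice, x-component

# ...then each time step reduces to a cheap copy instead of N function calls:
u_inlet = zeros(N)
u_inlet .= inlet
```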

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.27 │           162.84 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       80021 │   0.00 │     4.27 │           163.07 │     1.00 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2329993 │   0.00 │     3.18 │           121.38 │     1.34 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2315093 │   0.00 │     3.20 │           122.19 │     1.33 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2978759 │   0.00 │     0.60 │            22.88 │     7.12 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     2828263 │   0.00 │     0.57 │            21.90 │     7.43 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       75316 │   0.00 │    26.66 │           127.10 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       74816 │   0.00 │    26.46 │           126.15 │     1.01 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2168103 │   0.00 │    18.56 │            88.48 │     1.44 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    18.81 │            89.68 │     1.42 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2724354 │   0.00 │     3.12 │            14.86 │     8.55 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     2573717 │   0.00 │     3.15 │            15.02 │     8.46 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       75436 │   0.00 │     2.06 │           156.96 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       74936 │   0.00 │     2.10 │           160.01 │     0.98 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2173964 │   0.00 │     2.20 │           167.82 │     0.94 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2159064 │   0.00 │     2.19 │           167.45 │     0.94 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2796042 │   0.00 │     0.50 │            38.13 │     4.12 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     2645540 │   0.00 │     0.50 │            38.07 │     4.12 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │     WaterLily      │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │       85065 │   0.00 │    15.01 │           143.18 │     1.00 │
│     CPUx01 │             master │ 1.11.3 │   Float32 │       84565 │   0.00 │    15.29 │           145.83 │     0.98 │
│     CPUx04 │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     2474753 │   0.00 │    12.64 │           120.51 │     1.19 │
│     CPUx04 │             master │ 1.11.3 │   Float32 │     2459853 │   0.00 │    12.32 │           117.54 │     1.22 │
│ GPU-NVIDIA │ add-nonuniform-BCs │ 1.11.3 │   Float32 │     3216477 │   0.00 │     1.95 │            18.64 │     7.68 │
│ GPU-NVIDIA │             master │ 1.11.3 │   Float32 │     3066785 │   0.00 │     1.97 │            18.79 │     7.62 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth
Member

Sounds good to me

@weymouth weymouth merged commit 6bb77c2 into WaterLily-jl:master Apr 19, 2025
12 checks passed