Commit 0ea8756

Merge branch 'main' into hagen_poiseuille_flow
2 parents d8d52db + 3a5900a
24 files changed: +461 −221 lines changed

Project.toml

Lines changed: 3 additions & 0 deletions

```diff
@@ -37,14 +37,17 @@ WriteVTK = "64499a7a-5c06-52f2-abe2-ccb03c286192"
 [weakdeps]
 OrdinaryDiffEq = "1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"
 OrdinaryDiffEqCore = "bbf590c4-e513-4bbe-9b18-05decba2e5d8"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

 [extensions]
 TrixiParticlesOrdinaryDiffEqExt = ["OrdinaryDiffEq", "OrdinaryDiffEqCore"]
+TrixiParticlesCUDAExt = "CUDA"

 [compat]
 Accessors = "0.1.43"
 Adapt = "4"
 CSV = "0.10"
+CUDA = "5.9.1"
 DataFrames = "1.6"
 DelimitedFiles = "1"
 DiffEqCallbacks = "4"
```

docs/src/development.md

Lines changed: 42 additions & 0 deletions

```diff
@@ -66,3 +66,45 @@ To create a new release for TrixiParticles.jl, perform the following steps:
    version should be `v0.3.1-dev`. If you just released `v0.2.4`, the new development
    version should be `v0.2.5-dev`.

+## [Writing GPU-compatible code](@id writing_gpu_code)
+
+When implementing new functionality that should run on both CPUs and GPUs,
+follow these rules:
+
+1. Data structures must be generic and parametric.
+   Do not hardcode concrete CPU array types like `Vector` or `Matrix` in fields.
+   Use type parameters, so the same structure can store CPU arrays and GPU arrays.
+2. Add an Adapt.jl rule in `src/general/gpu.jl`.
+   Register the new type with `Adapt.@adapt_structure ...`, so `adapt` can recursively
+   convert all arrays inside the structure to GPU arrays.
+   This conversion is then applied automatically inside `semidiscretize`.
+3. Use `@threaded` for all loops.
+   Accessing GPU arrays inside regular loops is not allowed.
+   With a GPU backend, `@threaded` loops are compiled to GPU kernels.
+4. Write type-stable code and do not allocate inside `@threaded` loops.
+   This is required for GPU kernels and is also essential for fast multithreaded CPU code.
+
+## [Writing fast GPU code](@id writing_fast_gpu_code)
+
+The following rules improve kernel performance and avoid common GPU pitfalls:
+
+1. Avoid exceptions and bounds errors inside kernels.
+   Perform all required checks before entering `@threaded` loops (that is, before GPU kernels).
+   Then use `@inbounds` directly at the loop where bounds are guaranteed.
+   In TrixiParticles.jl, we do not place `@inbounds` inside inner helper functions.
+   Instead, mark helper functions with `@propagate_inbounds` so the loop-level
+   `@inbounds` is propagated.
+2. Avoid implicit `Float64` literals in arithmetic.
+   For example, prefer `x / 2` over `0.5 * x` so `Float32` simulations stay in `Float32`.
+   Verify this with `@device_code`, or by confirming the kernel runs on an Apple GPU
+   (most Apple GPUs do not support `Float64`).
+3. Use `div_fast` in performance-critical divisions, but only after benchmarking (!).
+   It can significantly speed up kernels, but should not be applied indiscriminately.
+   When introducing `div_fast` in code, add a reference to [this section](@ref writing_fast_gpu_code)
+   to document the rationale and benchmarking context, e.g., like so:
+   ```julia
+   # Since this is one of the most performance critical functions, using fast divisions
+   # here gives a significant speedup on GPUs.
+   # See the docs page "Development" for more details on `div_fast`.
+   result = div_fast(dividend, divisor)
+   ```
```
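The first two rules of the "Writing GPU-compatible code" section can be sketched in a few lines. Everything below is a hypothetical illustration, not actual TrixiParticles code; only `Adapt.@adapt_structure` is the real Adapt.jl macro, shown here in a comment so the sketch stays self-contained:

```julia
# A parametric container: the array type is a type parameter instead of a
# hardcoded `Vector`, so the same struct can later hold GPU arrays
# (hypothetical example, not a TrixiParticles type).
struct ParticleContainer{ARRAY <: AbstractArray}
    mass    :: ARRAY
    density :: ARRAY
end

# With Adapt.jl loaded, the rule from step 2 would be a one-liner:
#     Adapt.@adapt_structure ParticleContainer
# so that `adapt` recursively converts all fields to GPU arrays.

# Generic, non-allocating, type-stable loop body. In TrixiParticles itself,
# such a loop would use `@threaded` so it compiles to a GPU kernel.
function total_mass(container)
    result = zero(eltype(container.mass))
    for i in eachindex(container.mass)
        result += container.mass[i]
    end
    return result
end

container = ParticleContainer([1.0, 2.0, 3.0], fill(1000.0, 3))
total_mass(container)  # 6.0
```

With a GPU backend loaded, `adapt` would return a `ParticleContainer` whose fields are device arrays, and the same generic `total_mass` would work unchanged.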
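The literal-promotion pitfall from rule 2 of the "Writing fast GPU code" section can be checked without a GPU, directly in a Julia REPL:

```julia
x = 1.0f0  # Float32 value, as in a Float32 simulation

typeof(0.5 * x)    # Float64: the Float64 literal 0.5 promotes the result
typeof(x / 2)      # Float32: dividing by an integer literal preserves Float32
typeof(0.5f0 * x)  # Float32: an explicit Float32 literal also works
```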

docs/src/gpu.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -137,3 +137,8 @@ On GPUs that do not support `Float64`, such as most Apple GPUs, we also need to
 the coordinates to `Float32` by passing `coordinates_eltype=Float32` to
 the setup functions that create [`InitialCondition`](@ref)s, such as
 [`RectangularTank`](@ref), [`RectangularShape`](@ref), and [`SphereShape`](@ref).
+
+## Writing GPU-compatible code
+
+Please see the [development documentation](@ref writing_gpu_code) for guidelines on
+how to write GPU-compatible code.
```

ext/TrixiParticlesCUDAExt.jl

Lines changed: 34 additions & 0 deletions (new file)

```julia
# TODO this might be integrated into CUDA.jl at some point, see
# https://github.com/JuliaGPU/CUDA.jl/pull/3077
module TrixiParticlesCUDAExt

using CUDA: CUDA
using TrixiParticles: TrixiParticles

# Use faster version of `div_fast` for `Float64` on CUDA.
# By default, `div_fast` translates to `Base.FastMath.div_fast`, but there is
# no fast division for `Float64` on CUDA, so we need to redefine it here to use the
# improved fast reciprocal defined below.
CUDA.@device_override TrixiParticles.div_fast(x, y::Float64) = x * fast_inv_cuda(y)

# Improved fast reciprocal for `Float64` by @Mikolaj-A-Kowalski, which is significantly
# more accurate than just calling "llvm.nvvm.rcp.approx.ftz.d" without the cubic iteration,
# while still being much faster than a full division.
# This is copied from Oceananigans.jl, see https://github.com/CliMA/Oceananigans.jl/pull/5140.
@inline function fast_inv_cuda(a::Float64)
    # Get the approximate reciprocal
    # https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rcp-approx-ftz-f64
    # This instruction chops off the last 32 bits of the mantissa and computes the inverse
    # while treating all subnormal numbers as 0.0.
    # If the reciprocal would be subnormal, it underflows to 0.0.
    # The 32 least significant bits of the result are filled with 0s.
    inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)

    # Approximate the missing 32 bits of the mantissa with a single cubic iteration
    e = fma(inv_a, -a, 1.0)
    e = fma(e, e, e)
    inv_a = fma(e, inv_a, inv_a)

    return inv_a
end

end # module
```
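The accuracy of the cubic iteration above can be checked on the CPU by emulating the intrinsic. The sketch below assumes `rcp.approx.ftz.d` is well modeled by zeroing the 32 least significant mantissa bits of an exact reciprocal (a simplification; the real instruction also flushes subnormals to zero):

```julia
# CPU emulation of the approximate reciprocal: compute 1/a exactly,
# then zero the 32 least significant mantissa bits, as the PTX
# instruction documentation describes.
function rcp_approx_emulated(a::Float64)
    return reinterpret(Float64, reinterpret(UInt64, 1.0 / a) & ~UInt64(0xffffffff))
end

function fast_inv_emulated(a::Float64)
    inv_a = rcp_approx_emulated(a)  # only ~20 correct mantissa bits remain

    # The single cubic iteration from the extension above: with relative
    # error e ~ 2^-21 in `inv_a`, the update reduces the error to ~e^3,
    # which is below Float64 machine epsilon.
    e = fma(inv_a, -a, 1.0)
    e = fma(e, e, e)
    return fma(e, inv_a, inv_a)
end

fast_inv_emulated(3.0)  # ≈ 1/3 to nearly full Float64 precision
```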

src/general/buffer.jl

Lines changed: 11 additions & 3 deletions

```diff
@@ -40,10 +40,10 @@ end
 # Dispatch by system type to handle systems that provide a buffer.
 @inline buffer(system) = nothing

-@inline update_system_buffer!(buffer::Nothing, semi) = buffer
+@inline update_system_buffer!(buffer::Nothing) = buffer

 # TODO `resize` allocates. Find a non-allocating version
-@inline function update_system_buffer!(buffer::SystemBuffer, semi)
+@inline function update_system_buffer!(buffer::SystemBuffer)
     (; active_particle) = buffer

     # TODO: Parallelize (see https://github.com/trixi-framework/TrixiParticles.jl/issues/810)
@@ -64,7 +64,7 @@ end
     return view(buffer.eachparticle, 1:buffer.active_particle_count[])
 end

-@inline function deactivate_particle!(system, particle, u)
+@inline function deactivate_particle!(system, particle, v, u)
     (; active_particle) = system.buffer

     # Set particle far away from simulation domain
@@ -73,6 +73,14 @@
         u[dim, particle] = eltype(system)(1e16)
     end

+    # To ensure that the velocity of an inactive particle does not dominate the time step
+    # in adaptive time integrators, set this velocity to zero.
+    # Additionally, this enables map-reduce operations for `v_max` computation
+    # without having to distinguish inactive particles.
+    for dim in 1:ndims(system)
+        v[dim, particle] = 0
+    end
+
     # `deactivate_particle!` and `active_particle[particle] = true`
     # are never called on the same buffer inside a kernel,
     # so we don't have any race conditions on this `active_particle` vector.
```
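The map-reduce rationale in the added comment can be illustrated standalone (made-up velocity data, not the actual `v_max` implementation):

```julia
using LinearAlgebra: norm

# Velocities of three 2D particles; column 2 belongs to a deactivated
# particle and has been zeroed, as `deactivate_particle!` now does.
v = [1.0 0.0 3.0;
     2.0 0.0 4.0]

# Because inactive columns are zero, they can never dominate the maximum,
# so the reduction needs no `active_particle` mask.
v_max = maximum(norm(view(v, :, particle)) for particle in axes(v, 2))  # 5.0
```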

src/general/custom_quantities.jl

Lines changed: 0 additions & 2 deletions

```diff
@@ -6,8 +6,6 @@ Returns the total kinetic energy of all particles in a system.
 function kinetic_energy(system, dv_ode, du_ode, v_ode, u_ode, semi, t)
     v = wrap_v(v_ode, system, semi)

-    # TODO: `current_velocity` should only contain active particles
-    # (see https://github.com/trixi-framework/TrixiParticles.jl/issues/850)
     velocity = reinterpret(reshape, SVector{ndims(system), eltype(v)},
                            view(current_velocity(v, system), :,
                                 each_active_particle(system)))
```

src/io/write_vtk.jl

Lines changed: 7 additions & 12 deletions

```diff
@@ -350,18 +350,13 @@ function write2vtk!(vtk, v, u, t, system::AbstractFluidSystem)
         rho_b = current_density(v, system, neighbor)
         grad_kernel = smoothing_kernel_grad(system, pos_diff, distance, particle)

-        surface_tension[1:ndims(system),
-                        particle] .+= surface_tension_force(surface_tension_a,
-                                                            surface_tension_b,
-                                                            system,
-                                                            system,
-                                                            particle,
-                                                            neighbor,
-                                                            pos_diff,
-                                                            distance,
-                                                            rho_a,
-                                                            rho_b,
-                                                            grad_kernel)
+        dv_surface_tension = Ref(zero(pos_diff))
+        surface_tension_force!(dv_surface_tension,
+                               surface_tension_a, surface_tension_b,
+                               system, system, particle, neighbor,
+                               pos_diff, distance, rho_a, rho_b, grad_kernel, 1)
+
+        surface_tension[1:ndims(system), particle] .+= dv_surface_tension[]
     end
     vtk["surface_tension"] = surface_tension
```

src/schemes/boundary/open_boundary/dynamical_pressure.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -54,7 +54,7 @@ end
 end

 @inline function density_calculator(system::OpenBoundarySystem{<:BoundaryModelDynamicalPressureZhang})
-    return system.cache.density_calculator
+    return ContinuityDensity()
 end

 @inline impose_rest_density!(v, system, particle, boundary_model) = v
```

src/schemes/boundary/open_boundary/rhs.jl

Lines changed: 6 additions & 8 deletions

```diff
@@ -3,7 +3,7 @@ function interact!(dv, v_particle_system, u_particle_system,
                    v_neighbor_system, u_neighbor_system,
                    particle_system::OpenBoundarySystem{<:BoundaryModelDynamicalPressureZhang},
                    neighbor_system, semi)
-    (; fluid_system, cache, boundary_model) = particle_system
+    (; fluid_system, cache) = particle_system

     sound_speed = system_sound_speed(fluid_system)

@@ -59,13 +59,11 @@
     v_diff = current_velocity(v_particle_system, particle_system, particle) -
              current_velocity(v_neighbor_system, neighbor_system, neighbor)

-    # Continuity equation
-    @inbounds dv[end, particle] += rho_a / rho_b * m_b * dot(v_diff, grad_kernel)
-
-    density_diffusion!(dv, density_diffusion(particle_system),
-                       v_particle_system, particle, neighbor,
-                       pos_diff, distance, m_b, rho_a, rho_b,
-                       particle_system, grad_kernel)
+    # Propagate `@inbounds` to the continuity equation, which accesses particle data
+    @inbounds continuity_equation!(dv, particle_system, neighbor_system,
+                                   v_particle_system, v_neighbor_system, particle,
+                                   neighbor, pos_diff, distance, m_b, rho_a, rho_b,
+                                   grad_kernel)

     # Open boundary pressure evolution matches the corresponding fluid system:
     # - EDAC: Compute pressure evolution like the fluid system
```

src/schemes/boundary/open_boundary/system.jl

Lines changed: 26 additions & 16 deletions

```diff
@@ -75,7 +75,10 @@ function OpenBoundarySystem(boundary_zones::Union{BoundaryZone, Nothing}...;
                            pressure_acceleration=fluid_system.pressure_acceleration_formulation,
                            shifting_technique=boundary_model isa
                                               BoundaryModelDynamicalPressureZhang ?
-                                              shifting_technique(fluid_system) : nothing)
+                                              shifting_technique(fluid_system) : nothing,
+                           density_diffusion=boundary_model isa
+                                             BoundaryModelDynamicalPressureZhang ?
+                                             density_diffusion(fluid_system) : nothing)
     boundary_zones_ = filter(bz -> !isnothing(bz), boundary_zones)

     initial_conditions = union((bz.initial_condition for bz in boundary_zones_)...)
@@ -90,7 +93,8 @@ function OpenBoundarySystem(boundary_zones::Union{BoundaryZone, Nothing}...;
     cache = (;
              create_cache_shifting(initial_conditions, shifting_technique)...,
              create_cache_open_boundary(boundary_model, fluid_system, initial_conditions,
-                                        calculate_flow_rate, boundary_zones_)...)
+                                        density_diffusion, calculate_flow_rate,
+                                        boundary_zones_)...)

     if any(pr -> isa(pr, RCRWindkesselModel), cache.pressure_reference_values)
         calculate_flow_rate = true
@@ -130,7 +134,7 @@ function initialize!(system::OpenBoundarySystem, semi)
 end

 function create_cache_open_boundary(boundary_model, fluid_system, initial_condition,
-                                    calculate_flow_rate, boundary_zones)
+                                    density_diffusion, calculate_flow_rate, boundary_zones)
     reference_values = map(bz -> bz.reference_values, boundary_zones)
     ELTYPE = eltype(initial_condition)

@@ -174,15 +178,14 @@ function create_cache_open_boundary(boundary_model, fluid_system, initial_condit
     # as it was already verified in `allocate_buffer` that the density array is constant.
     density_rest = first(initial_condition.density)

-    dd = density_diffusion(fluid_system)
-    if dd isa DensityDiffusionAntuono
-        density_diffusion_ = DensityDiffusionAntuono(initial_condition; delta=dd.delta)
+    if density_diffusion isa DensityDiffusionAntuono
+        density_diffusion_ = DensityDiffusionAntuono(initial_condition;
+                                                     delta=density_diffusion.delta)
     else
-        density_diffusion_ = dd
+        density_diffusion_ = density_diffusion
     end

-    cache = (; density_calculator=ContinuityDensity(),
-             density_diffusion=density_diffusion_,
+    cache = (; density_diffusion=density_diffusion_,
              pressure_boundary=pressure_boundary,
              density_rest=density_rest, cache...)

@@ -261,6 +264,10 @@ system_sound_speed(system::OpenBoundarySystem) = system_sound_speed(system.fluid

 @inline hydrodynamic_mass(system::OpenBoundarySystem, particle) = system.mass[particle]

+@propagate_inbounds function current_velocity(v, system::OpenBoundarySystem)
+    return view(v, 1:ndims(system), :)
+end
+
 @inline function current_density(v, system::OpenBoundarySystem)
     return system.cache.density
 end
@@ -306,6 +313,9 @@ function update_boundary_interpolation!(system::OpenBoundarySystem, v, u, v_ode,
                                         semi, t)
     update_boundary_model!(system, system.boundary_model, v, u, v_ode, u_ode, semi, t)
     update_shifting!(system, shifting_technique(system), v, u, v_ode, u_ode, semi)
+
+    @trixi_timeit timer() "update density diffusion" update!(density_diffusion(system),
+                                                             v, u, system, semi)
 end

 # This function is called by the `UpdateCallback`, as the integrator array might be modified
@@ -397,8 +407,8 @@ function check_domain!(system, v, u, v_ode, u_ode, semi)
                           v, u, v_fluid, u_fluid)
     end

-    update_system_buffer!(system.buffer, semi)
-    update_system_buffer!(fluid_system.buffer, semi)
+    update_system_buffer!(system.buffer)
+    update_system_buffer!(fluid_system.buffer)

     fluid_candidates .= false
@@ -428,8 +438,8 @@ function check_domain!(system, v, u, v_ode, u_ode, semi)
                           v, u, v_fluid, u_fluid)
     end

-    update_system_buffer!(system.buffer, semi)
-    update_system_buffer!(fluid_system.buffer, semi)
+    update_system_buffer!(system.buffer)
+    update_system_buffer!(fluid_system.buffer)

     # Since particles have been transferred, the neighborhood searches must be updated
     update_nhs!(semi, u_ode)
@@ -453,7 +463,7 @@
     # to determine if it exited the boundary zone through the free surface (outflow).
     if dot(relative_position, boundary_zone.face_normal) < 0
         # Particle is outside the fluid domain
-        deactivate_particle!(system, particle, u)
+        deactivate_particle!(system, particle, v, u)

         return system
     end
@@ -475,7 +485,7 @@
     # Verify the particle remains inside the boundary zone after the reset; deactivate it if not.
     particle_coords = current_coords(u, system, particle)
     if !is_in_boundary_zone(boundary_zone, particle_coords)
-        deactivate_particle!(system, particle, u)
+        deactivate_particle!(system, particle, v, u)

         return system
     end
@@ -494,7 +504,7 @@
     transfer_particle!(system, fluid_system, particle, particle_new, v, u, v_fluid, u_fluid)

     # Deactivate particle in interior domain
-    deactivate_particle!(fluid_system, particle, u_fluid)
+    deactivate_particle!(fluid_system, particle, v_fluid, u_fluid)

     return fluid_system
 end
```
