Currently, the `WeaklyCompressibleSPHSystem`, `TotalLagrangianSPHSystem`
and `WallBoundarySystem` support GPU execution.
We have tested GPU support on NVIDIA, AMD, and Apple GPUs.
Note that most Apple GPUs do not support `Float64`.
See [below on how to run single precision simulations](@ref single_precision).
To run a simulation on a GPU, use the `FullGridCellList`
as the cell list for the `GridNeighborhoodSearch`.
Unlike the default cell list, which assumes an unbounded domain,
this cell list requires a bounding box for the domain.
For simulations that are bounded by a closed tank, we can simply use the boundary
of the tank to obtain the bounding box as follows.
```julia
min_corner = minimum(tank.boundary.coordinates, dims=2)
max_corner = maximum(tank.boundary.coordinates, dims=2)
cell_list = FullGridCellList(; min_corner, max_corner)

# output
FullGridCellList{PointNeighbors.DynamicVectorOfVectors{...}(...)
```
We then need to pass this cell list to the neighborhood search and the neighborhood search
to the `Semidiscretization`.
```julia
semi = Semidiscretization(fluid_system, boundary_system,
                          neighborhood_search=GridNeighborhoodSearch{2}(; cell_list))

# output
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Semidiscretization │
│ ══════════════════ │
│ #spatial dimensions: ………………………… 2 │
│ #systems: ……………………………………………………… 2 │
│ neighborhood search: ………………………… GridNeighborhoodSearch │
│ total #particles: ………………………………… 636 │
│ eltype: …………………………………………………………… Float64 │
│ coordinates eltype: …………………………… Float64 │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
```
At this point, run the simulation and make sure that it still works and that the bounding box
is large enough. For simulations where particles move outside the initial tank coordinates,
for example when the tank is not closed or when the tank is moving,
an appropriately enlarged bounding box has to be specified manually.
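For such cases, the following is a minimal sketch of how the bounding box could be enlarged
manually. The padding of four particle spacings and the variable name `fluid_particle_spacing`
are assumptions for illustration, not requirements.
```julia
# Hypothetical sketch: enlarge the bounding box by a few particle spacings
# so that particles leaving the initial tank coordinates are still covered.
padding = 4 * fluid_particle_spacing
cell_list = FullGridCellList(min_corner=min_corner .- padding,
                             max_corner=max_corner .+ padding)
```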
Then, we only need to specify the parallelization backend used for the simulation.
On an NVIDIA GPU, we specify:
```julia
using CUDA
semi = Semidiscretization(fluid_system, boundary_system,
                          neighborhood_search=GridNeighborhoodSearch{2}(; cell_list),
                          parallelization_backend=CUDABackend())
```
On an AMD GPU, we use:
```julia
using AMDGPU
semi = Semidiscretization(fluid_system, boundary_system,
                          neighborhood_search=GridNeighborhoodSearch{2}(; cell_list),
                          parallelization_backend=ROCBackend())
```
Now, we can run the simulation as usual.
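For completeness, here is a minimal sketch of these remaining steps, assuming a time span
`tspan` and simplified solver settings compared to the example files:
```julia
# Sketch with simplified solver settings: the ODE problem and the solve call
# are set up exactly as in a CPU-only simulation.
using OrdinaryDiffEq

ode = semidiscretize(semi, tspan)
sol = solve(ode, RDPK3SpFSAL35(),
            save_everystep=false, callback=SolutionSavingCallback(dt=0.02))
```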
All data is transferred to the GPU during initialization and all loops over particles
and their neighbors are executed on the GPU as kernels generated by
[KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl).
Data is only copied back to the CPU for saving VTK files via the `SolutionSavingCallback`.
The example file `examples/fluid/dam_break_2d_gpu.jl` demonstrates how to run an existing
example file on a GPU.
It first loads the variables from `examples/fluid/dam_break_2d.jl` without executing
the simulation. This is achieved by overwriting the line that starts the simulation
with `trixi_include(..., sol=nothing)`.
Next, a GPU-compatible neighborhood search is defined, and the original example file
is included with the new neighborhood search.
This requires the assignments `neighborhood_search = ...` and `parallelization_backend = ...`
to be present in the original example file.
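The structure of `examples/fluid/dam_break_2d_gpu.jl` roughly corresponds to the following
simplified sketch (see the actual example file for the exact code):
```julia
# Simplified sketch of the structure of examples/fluid/dam_break_2d_gpu.jl
using TrixiParticles

# Load all variables from the CPU example without running the simulation
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d.jl"), sol=nothing)

# Define a GPU-compatible cell list based on the tank size
min_corner = minimum(tank.boundary.coordinates, dims=2)
max_corner = maximum(tank.boundary.coordinates, dims=2)
cell_list = FullGridCellList(; min_corner, max_corner)

# Include the original example again, now with the GPU-compatible neighborhood search
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d.jl"),
              neighborhood_search=GridNeighborhoodSearch{2}(; cell_list))
```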
Note that in `examples/fluid/dam_break_2d.jl`, we explicitly set
`parallelization_backend=PolyesterBackend()`, even though this is the default value,
so that we can use `trixi_include` to replace this value.
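In the original example file, this roughly corresponds to the following pattern
(simplified sketch, not the literal file contents):
```julia
# Simplified sketch: the backend is assigned explicitly so that `trixi_include`
# can replace this assignment, and then passed on to the Semidiscretization.
parallelization_backend = PolyesterBackend()

semi = Semidiscretization(fluid_system, boundary_system,
                          parallelization_backend=parallelization_backend)
```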
To run this simulation on a GPU, simply update `parallelization_backend` to the backend
of the installed GPU. We can run this simulation on an NVIDIA GPU as follows.
```julia
using CUDA
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
              parallelization_backend=CUDABackend())
```
For AMD GPUs, use
```julia
using AMDGPU
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
              parallelization_backend=ROCBackend())
```
For Apple GPUs (which do not support double precision, see below), use
```julia
using Metal
trixi_include_changeprecision(Float32,
                              joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
                              parallelization_backend=MetalBackend(),
                              coordinates_eltype=Float32)
```

## [Single precision simulations](@id single_precision)

All GPU-supported features can also be used with single precision, which is significantly
faster on most GPUs and required for many Apple GPUs.
To run a simulation with single precision, all `Float64` literals in an example file
must be converted to `Float32` (e.g. `0.0` to `0.0f0`).
TrixiParticles.jl provides a function to automate this conversion:
`trixi_include_changeprecision`.
To run the previous example with single precision, use the following:
```julia
using CUDA
trixi_include_changeprecision(Float32,
                              joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
                              parallelization_backend=CUDABackend())
```
Note that this simulation will now use `Float32` everywhere, except for the
coordinates of the particles, which are still `Float64` by default.
In simulations where the particle spacing is very small compared to the size of the domain,
floating point errors in the distance calculations can be large
relative to the particle spacing.
Using `Float64` for the coordinates and distance calculations is crucial in such cases
to avoid artifacts in the simulation.
On most GPUs, the performance impact of using `Float64` for the coordinates
is about 30% compared to using `Float32` everywhere, which is still over 10x faster
than using `Float64` everywhere.
On GPUs that do not support `Float64`, such as most Apple GPUs, we also need to set
the coordinates to `Float32` by passing `coordinates_eltype=Float32` to
the setup functions that create `InitialCondition`s, such as
`RectangularTank`, `RectangularShape`, and `SphereShape`.
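For example, a tank with single-precision coordinates could be created as follows.
The numeric values below are placeholders for illustration and not taken from a specific
example file.
```julia
# Sketch with placeholder dimensions: the coordinates of the resulting
# InitialConditions are stored as Float32.
tank = RectangularTank(0.02f0,          # particle spacing
                       (2.0f0, 1.0f0),  # initial fluid size
                       (2.0f0, 1.0f0),  # tank size
                       1000.0f0,        # fluid density
                       coordinates_eltype=Float32)
```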