Skip to content

Commit c1487e5

Browse files
committed
FluidX3D v3.5 upgrade: multi-GPU particles
1 parent a18eb82 commit c1487e5

File tree

7 files changed

+136
-32
lines changed

7 files changed

+136
-32
lines changed

README.md

Lines changed: 12 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -63,7 +63,7 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
6363
- made flag wireframe / solid surface visualization kernels toggleable with key <kbd>1</kbd>
6464
- added surface pressure visualization (key <kbd>1</kbd> when `FORCE_FIELD` is enabled and `lbm.calculate_force_on_boundaries();` is called)
6565
- added binary `.vtk` export function for meshes with `lbm.write_mesh_to_vtk(Mesh* mesh);`
66-
- added `time_step_multiplicator` for `integrate_particles()` function in PARTICLES extension
66+
- added `time_step_multiplicator` for `integrate_particles()` function in `PARTICLES` extension
6767
- made correction of wrong memory reporting on Intel Arc more robust
6868
- fixed bug in `write_file()` template functions
6969
- reverted back to separate `cl::Context` for each OpenCL device, as the shared Context otherwise would allocate extra VRAM on all other unused Nvidia GPUs
@@ -236,6 +236,14 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
236236
- fixed bug in insertion-sort in `voxelize_mesh()` kernel causing crash on AMD GPUs
237237
- fixed bug in `voxelize_mesh_on_device()` host code causing initialization corruption on AMD GPUs
238238
- fixed dual CU and IPC reporting on AMD RDNA 1-4 GPUs
239+
- [v3.5](https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.5) (01.10.2025) [changes](https://github.com/ProjectPhysX/FluidX3D/compare/v3.4...v3.5) (multi-GPU particles)
240+
- `PARTICLES` extension now also works with multi-GPU
241+
- faster force spreading if volume force is axis-aligned
242+
- added more documentation for boundary conditions
243+
- updated FAQs
244+
- improved "hydraulic jump" sample setup
245+
- updated GPU driver install instructions
246+
- disabled zero-copy on ARM iGPUs because `CL_MEM_USE_HOST_PTR` is broken there
239247

240248
</details>
241249

@@ -447,7 +455,7 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
447455
- optional [FP16S or FP16C compression](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) for thermal DDFs with [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
448456
- Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable
449457
<p align="center"><i>&Pi;<sub>&alpha;&beta;</sub></i> = &Sigma;<sub><i>i</i></sub> <i>e<sub>i&alpha;</sub></i> <i>e<sub>i&beta;</sub></i> (<i>f<sub>i</sub></i> - <i>f<sub>i</sub></i><sup>eq-shifted</sup>)<br><br>Q = &Sigma;<sub><i>&alpha;&beta;</i></sub> <i>&Pi;<sub>&alpha;&beta;</sub></i><sup>2</sup><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;______________________<br>&tau; = &frac12; (&tau;<sub>0</sub> + &radic; &tau;<sub>0</sub><sup>2</sup> + <sup>(16&radic;2)</sup>&#8725;<sub>(<i>3&pi;</i><sup>2</sup>)</sub> <sup>&radic;Q</sup>&#8725;<sub><i>&rho;</i></sub> )</p>
450-
- particles with immersed-boundary method (either passive or 2-way-coupled, single-GPU only)
458+
- particles with immersed-boundary method (either passive or 2-way-coupled)
451459

452460
</details>
453461

@@ -474,7 +482,7 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
474482

475483
## Solving the Compatibility Problem
476484

477-
- FluidX3D is written in OpenCL 1.2, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
485+
- FluidX3D is written in OpenCL, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
478486
- world's fastest datacenter GPUs: B200, MI300X, H200, H100 (NVL), A100, MI200, MI100, V100(S), GPU Max 1100, ...
479487
- gaming GPUs (desktop/laptop): Nvidia GeForce, AMD Radeon, Intel Arc
480488
- professional/workstation GPUs: Nvidia Quadro, AMD Radeon Pro / FirePro, Intel Arc Pro
@@ -1710,7 +1718,7 @@ Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly
17101718

17111719
- <details><summary>Does FluidX3D support adaptive mesh refinement?</summary><br>No, not yet. Grid cell size is the same everywhere in the simulation box.<br><br></details>
17121720

1713-
- <details><summary>Can FluidX3D model both water and air at the same time?</summary><br>No. FluidX3D can model either water or air, but not both at the same time. For free surface simulations with the <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#surface-extension">`SURFACE` extension</a>, I went with a <a href="https://doi.org/10.3390/computation10060092">volume-of-fluid</a>/<a href="https://doi.org/10.3390/computation10020021">PLIC</a> modeling approach as that provides a sharp water-air interface, so individual droplets can be resolved as small as 3 grid cells in diameter. However this model ignores the gas phase completely, and only models the fluid phase with LBM as well as the surface tension. An alternative I had explored years ago was the <a href="http://dx.doi.org/10.1016/j.jcp.2022.111753">phase-field models</a> (simplest of them is Shan-Chen model) - they model both fluid and gas phases, but struggle with the 1:1000 density contrast of air:water, and the modeled interface is diffuse over ~5 grid cells. So the smallest resolved droplets are ~10 grid cells in diameter, meaning for the same resolution you need ~37x the memory footprint - infeasible on GPUs. Coming back to VoF model, it is possible to <a href="http://dx.doi.org/10.1186/s43591-023-00053-7">extend it with a model for the gas phase</a>, but one has to manually track bubble split/merge events, which makes this approach very painful in implementation and poorly performing on the hardware.<br><br></details>
1721+
- <details><summary>Can FluidX3D model both water and air at the same time?</summary><br>No. FluidX3D can model either water or air, but not both at the same time. For free surface simulations with the <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#surface-extension">SURFACE extension</a>, I went with a <a href="https://doi.org/10.3390/computation10060092">volume-of-fluid</a>/<a href="https://doi.org/10.3390/computation10020021">PLIC</a> modeling approach as that provides a sharp water-air interface, so individual droplets can be resolved as small as 3 grid cells in diameter. However this model ignores the gas phase completely, and only models the fluid phase with LBM as well as the surface tension. An alternative I had explored years ago was the <a href="http://dx.doi.org/10.1016/j.jcp.2022.111753">phase-field models</a> (simplest of them is Shan-Chen model) - they model both fluid and gas phases, but struggle with the 1:1000 density contrast of air:water, and the modeled interface is diffuse over ~5 grid cells. So the smallest resolved droplets are ~10 grid cells in diameter, meaning for the same resolution you need ~37x the memory footprint - infeasible on GPUs. Coming back to VoF model, it is possible to <a href="http://dx.doi.org/10.1186/s43591-023-00053-7">extend it with a model for the gas phase</a>, but one has to manually track bubble split/merge events, which makes this approach very painful in implementation and poorly performing on the hardware.<br><br></details>
17141722

17151723
- <details><summary>Can FluidX3D compute lift/drag forces?</summary><br>Yes. See <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#liftdrag-forces">the relevant section in the FluidX3D Documentation</a>!<br><br></details>
17161724

src/info.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ void Info::print_logo() const {
4242
print("| "); print("\\ \\ / /", c); print(" |\n");
4343
print("| "); print("\\ ' /", c); print(" |\n");
4444
print("| "); print("\\ /", c); print(" |\n");
45-
print("| "); print("\\ /", c); print(" FluidX3D Version 3.4 |\n");
45+
print("| "); print("\\ /", c); print(" FluidX3D Version 3.5 |\n");
4646
print("| "); print( "'", c); print(" Copyright (c) Dr. Moritz Lehmann |\n");
4747
print("|-----------------------------------------------------------------------------|\n");
4848
}

src/kernel.cpp

Lines changed: 70 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -839,9 +839,9 @@ string opencl_c_container() { return R( // ########################## begin of O
839839
}
840840
)+R(float3 mirror_position(const float3 p) { // mirror position into periodic boundaries
841841
float3 r;
842-
r.x = sign(p.x)*(fmod(fabs(p.x)+0.5f*(float)def_Nx, (float)def_Nx)-0.5f*(float)def_Nx);
843-
r.y = sign(p.y)*(fmod(fabs(p.y)+0.5f*(float)def_Ny, (float)def_Ny)-0.5f*(float)def_Ny);
844-
r.z = sign(p.z)*(fmod(fabs(p.z)+0.5f*(float)def_Nz, (float)def_Nz)-0.5f*(float)def_Nz);
842+
r.x = sign(p.x)*(fmod(fabs(p.x)+0.5f*(float)def_GNx, (float)def_GNx)-0.5f*(float)def_GNx);
843+
r.y = sign(p.y)*(fmod(fabs(p.y)+0.5f*(float)def_GNy, (float)def_GNy)-0.5f*(float)def_GNy);
844+
r.z = sign(p.z)*(fmod(fabs(p.z)+0.5f*(float)def_GNz, (float)def_GNz)-0.5f*(float)def_GNz);
845845
return r;
846846
}
847847
)+R(float3 mirror_distance(const float3 d) { // mirror distance vector into periodic boundaries
@@ -1998,9 +1998,10 @@ string opencl_c_container() { return R( // ########################## begin of O
19981998
const uint x=(xb+i)%def_Nx, y=(yb+j)%def_Ny, z=(zb+k)%def_Nz; // calculate corner lattice positions
19991999
const uxx n = (uxx)x+(uxx)(y+z*def_Ny)*(uxx)def_Nx; // calculate lattice linear index
20002000
const float d = (1.0f-fabs(x1-(float)i))*(1.0f-fabs(y1-(float)j))*(1.0f-fabs(z1-(float)k)); // force spreading
2001-
atomic_add_f(&F[ n], Fn.x*d); // F[ n] += Fn.x*d;
2002-
atomic_add_f(&F[ def_N+(ulong)n], Fn.y*d); // F[ def_N+(ulong)n] += Fn.y*d;
2003-
atomic_add_f(&F[2ul*def_N+(ulong)n], Fn.z*d); // F[2ul*def_N+(ulong)n] += Fn.z*d;
2001+
const float3 Fnd = Fn*d;
2002+
if(Fnd.x!=0.0f) atomic_add_f(&F[ n], Fnd.x); // F[ n] += Fnd.x;
2003+
if(Fnd.y!=0.0f) atomic_add_f(&F[ def_N+(ulong)n], Fnd.y); // F[ def_N+(ulong)n] += Fnd.y;
2004+
if(Fnd.z!=0.0f) atomic_add_f(&F[2ul*def_N+(ulong)n], Fnd.z); // F[2ul*def_N+(ulong)n] += Fnd.z;
20042005
}
20052006
} // spread_force()
20062007
)+"#endif"+R( // FORCE_FIELD
@@ -2023,29 +2024,51 @@ string opencl_c_container() { return R( // ########################## begin of O
20232024
const float particle_radius = 0.5f; // has to be between 0.0f and 0.5f, default: 0.5f (hydrodynamic radius)
20242025
return boundary_distance-0.5f<particle_radius ? normalize(boundary_force) : (float3)(0.0f, 0.0f, 0.0f);
20252026
} // particle_boundary_force()
2026-
2027+
)+R(bool position_is_in_domain_including_halo(const float3 p) {
2028+
const float hNx = 0.5f*(float)(def_Nx-(def_Dx>1u)); // subtract half of halo still
2029+
const float hNy = 0.5f*(float)(def_Ny-(def_Dy>1u));
2030+
const float hNz = 0.5f*(float)(def_Nz-(def_Dz>1u));
2031+
return p.x>=-hNx&&p.x<hNx&&p.y>=-hNy&&p.y<hNy&&p.z>=-hNz&&p.z<hNz;
2032+
}
2033+
)+R(bool position_is_in_domain_excluding_halo(const float3 p) {
2034+
const float hNx = 0.5f*(float)(def_Nx-2u*(def_Dx>1u)); // subtract full halo
2035+
const float hNy = 0.5f*(float)(def_Ny-2u*(def_Dy>1u));
2036+
const float hNz = 0.5f*(float)(def_Nz-2u*(def_Dz>1u));
2037+
return p.x>=-hNx&&p.x<hNx&&p.y>=-hNy&&p.y<hNy&&p.z>=-hNz&&p.z<hNz;
2038+
}
20272039
)+R(kernel void integrate_particles)+"("+R(global float* particles, const global float* u, const global uchar* flags, const float time_step_multiplicator // ) {
20282040
)+"#ifdef FORCE_FIELD"+R(
20292041
, volatile global float* F, const float fx, const float fy, const float fz
20302042
)+"#endif"+R( // FORCE_FIELD
20312043
)+") {"+R( // integrate_particles()
20322044
const uxx n = get_global_id(0); // index of membrane points
20332045
if(n>=(uxx)def_particles_N) return;
2034-
const float3 p0 = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]); // cache particle position
2046+
float3 p = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]); // cache particle position
2047+
p = mirror_position(p); // mirror into global simulation box
2048+
p -= (float3)(def_domain_offset_x, def_domain_offset_y, def_domain_offset_z); // subtract domain offset, then treat point in local domain
2049+
if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_including_halo(p)) {
2050+
p.x = as_float(0xFFFFFFFFu); // invalidate x-coordinate for all particles outside of the local domain (including halo)
2051+
} else {
20352052
)+"#ifdef FORCE_FIELD"+R(
2036-
if(def_particles_rho!=1.0f) {
2037-
const float drho = def_particles_rho-1.0f; // density difference leads to particle buoyancy
2038-
float3 Fn = (float3)(fx*drho, fy*drho, fz*drho); // F = F_p+F_f = (m_p-m_f)*g = (rho_p-rho_f)*g*V
2039-
spread_force(F, p0, Fn); // do force spreading
2040-
}
2053+
if(def_particles_rho!=1.0f) { // apply volume force for all particles in local domain (including halo)
2054+
const float drho = def_particles_rho-1.0f; // density difference leads to particle buoyancy
2055+
float3 Fn = (float3)(fx*drho, fy*drho, fz*drho); // F = F_p+F_f = (m_p-m_f)*g = (rho_p-rho_f)*g*V
2056+
spread_force(F, p, Fn); // do force spreading
2057+
}
20412058
)+"#endif"+R( // FORCE_FIELD
2042-
const float3 p0_mirrored = mirror_position(p0);
2043-
float3 un = interpolate_u(p0_mirrored, u); // trilinear interpolation of velocity at point p
2044-
un = (un+length(un)*particle_boundary_force(p0_mirrored, flags))*time_step_multiplicator;
2045-
const float3 p = mirror_position(p0+un); // advect particles
2046-
particles[ n] = p.x;
2047-
particles[ def_particles_N+(ulong)n] = p.y;
2048-
particles[2ul*def_particles_N+(ulong)n] = p.z;
2059+
if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_excluding_halo(p)) { // skip remaining ghost particles in halo
2060+
p.x = as_float(0xFFFFFFFFu); // invalidate x-coordinate for all particles outside of the local domain
2061+
} else { // advect only particles in local domain (excluding halo)
2062+
float3 un = interpolate_u(p, u); // trilinear interpolation of velocity at point p
2063+
un = (un+length(un)*particle_boundary_force(p, flags))*time_step_multiplicator;
2064+
p += un; // advect particles
2065+
p += (float3)(def_domain_offset_x, def_domain_offset_y, def_domain_offset_z); // add domain offset, back to global domain
2066+
p = mirror_position(p); // mirror advected position again into global simulation box
2067+
particles[ def_particles_N+(ulong)n] = p.y; // store y/z-coordinates only for particles in domain
2068+
particles[2ul*def_particles_N+(ulong)n] = p.z;
2069+
}
2070+
}
2071+
particles[n] = p.x; // always store x-coordinate (invalidated or particles in domain)
20492072
} // integrate_particles()
20502073
)+"#endif"+R( // PARTICLES
20512074

@@ -2175,6 +2198,31 @@ string opencl_c_container() { return R( // ########################## begin of O
21752198
flags[index_insert_m(a, direction)] = transfer_buffer_m[a];
21762199
}
21772200

2201+
)+"#ifdef FORCE_FIELD"+R(
2202+
)+R(void extract_F(const uint a, const uint A, const uxx n, global float* transfer_buffer, const global float* F) {
2203+
transfer_buffer[ a] = F[ n];
2204+
transfer_buffer[ A+a] = F[ def_N+(ulong)n];
2205+
transfer_buffer[2u*A+a] = F[2ul*def_N+(ulong)n];
2206+
}
2207+
)+R(void insert_F(const uint a, const uint A, const uxx n, const global float* transfer_buffer, global float* F) {
2208+
F[ n] = transfer_buffer[ a];
2209+
F[ def_N+(ulong)n] = transfer_buffer[ A+a];
2210+
F[2ul*def_N+(ulong)n] = transfer_buffer[2u*A+a];
2211+
}
2212+
)+R(kernel void transfer_extract_F(const uint direction, const ulong t, global float* transfer_buffer_p, global float* transfer_buffer_m, const global float* F) {
2213+
const uint a=get_global_id(0), A=get_area(direction); // a = domain area index for each side, A = area of the domain boundary
2214+
if(a>=A) return; // area might not be a multiple of cl_workgroup_size, so return here to avoid writing in unallocated memory space
2215+
extract_F(a, A, index_extract_p(a, direction), transfer_buffer_p, F);
2216+
extract_F(a, A, index_extract_m(a, direction), transfer_buffer_m, F);
2217+
}
2218+
)+R(kernel void transfer__insert_F(const uint direction, const ulong t, const global float* transfer_buffer_p, const global float* transfer_buffer_m, global float* F) {
2219+
const uint a=get_global_id(0), A=get_area(direction); // a = domain area index for each side, A = area of the domain boundary
2220+
if(a>=A) return; // area might not be a multiple of cl_workgroup_size, so return here to avoid writing in unallocated memory space
2221+
insert_F(a, A, index_insert_p(a, direction), transfer_buffer_p, F);
2222+
insert_F(a, A, index_insert_m(a, direction), transfer_buffer_m, F);
2223+
}
2224+
)+"#endif"+R( // FORCE_FIELD
2225+
21782226
)+"#ifdef SURFACE"+R(
21792227
)+R(void extract_phi_massex_flags(const uint a, const uint A, const uxx n, global char* transfer_buffer, const global float* phi, const global float* massex, const global uchar* flags) {
21802228
((global float*)transfer_buffer)[ a] = phi [n];
@@ -2966,10 +3014,11 @@ string opencl_c_container() { return R( // ########################## begin of O
29663014
)+R(kernel void graphics_particles(const global float* camera, global int* bitmap, global int* zbuffer, const global float* particles) {
29673015
const uxx n = get_global_id(0);
29683016
if(n>=(uxx)def_particles_N) return;
3017+
const float3 p = (float3)(particles[n]-def_domain_offset_x, particles[def_particles_N+(ulong)n]-def_domain_offset_y, particles[2ul*def_particles_N+(ulong)n]-def_domain_offset_z);
3018+
if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_excluding_halo(p)) return;
29693019
float camera_cache[15]; // cache parameters in case the kernel draws more than one shape
29703020
for(uint i=0u; i<15u; i++) camera_cache[i] = camera[i];
29713021
const int c = COLOR_P; // coloring scheme
2972-
const float3 p = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]);
29733022
draw_point(p, c, camera_cache, bitmap, zbuffer);
29743023
//draw_circle(p, 0.5f, c, camera_cache, bitmap, zbuffer);
29753024
}

0 commit comments

Comments
 (0)