ProjectPhysX
diff --git a/README.md b/README.md
Lines changed: 12 additions & 4 deletions
diff --git a/src/info.cpp b/src/info.cpp
Lines changed: 1 addition & 1 deletion
diff --git a/src/kernel.cpp b/src/kernel.cpp
Lines changed: 70 additions & 21 deletions
@@ -63,7 +63,7 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
   - made flag wireframe / solid surface visualization kernels toggleable with key <kbd>1</kbd>
   - added surface pressure visualization (key <kbd>1</kbd> when `FORCE_FIELD` is enabled and `lbm.calculate_force_on_boundaries();` is called)
   - added binary `.vtk` export function for meshes with `lbm.write_mesh_to_vtk(Mesh* mesh);`
-  - added `time_step_multiplicator` for `integrate_particles()` function in PARTICLES extension
+  - added `time_step_multiplicator` for `integrate_particles()` function in `PARTICLES` extension
   - made correction of wrong memory reporting on Intel Arc more robust
   - fixed bug in `write_file()` template functions
   - reverted back to separate `cl::Context` for each OpenCL device, as the shared Context otherwise would allocate extra VRAM on all other unused Nvidia GPUs
@@ -236,6 +236,14 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
   - fixed bug in insertion-sort in `voxelize_mesh()` kernel causing crash on AMD GPUs
   - fixed bug in `voxelize_mesh_on_device()` host code causing initialization corruption on AMD GPUs
   - fixed dual CU and IPC reporting on AMD RDNA 1-4 GPUs
+- [v3.5](https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.5) (01.10.2025) [changes](https://github.com/ProjectPhysX/FluidX3D/compare/v3.4...v3.5) (multi-GPU particles)
+  - `PARTICLES` extension now also works with multi-GPU
+  - faster force spreading if volume force is axis-aligned
+  - added more documentation for boundary conditions
+  - updated FAQs
+  - improved "hydraulic jump" sample setup
+  - updated GPU driver install instructions
+  - disabled zero-copy on ARM iGPUs because `CL_MEM_USE_HOST_PTR` is broken there
 
 </details>
 
@@ -447,7 +455,7 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
     - optional [FP16S or FP16C compression](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) for thermal DDFs with [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
   - Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable
     <p align="center"><i>&Pi;<sub>&alpha;&beta;</sub></i> = &Sigma;<sub><i>i</i></sub> <i>e<sub>i&alpha;</sub></i> <i>e<sub>i&beta;</sub></i> (<i>f<sub>i</sub></i>   - <i>f<sub>i</sub></i><sup>eq-shifted</sup>)<br><br>Q = &Sigma;<sub><i>&alpha;&beta;</i></sub>   <i>&Pi;<sub>&alpha;&beta;</sub></i><sup>2</sup><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;______________________<br>&tau; = &frac12; (&tau;<sub>0</sub> + &radic; &tau;<sub>0</sub><sup>2</sup> + <sup>(16&radic;2)</sup>&#8725;<sub>(<i>3&pi;</i><sup>2</sup>)</sub> <sup>&radic;Q</sup>&#8725;<sub><i>&rho;</i></sub> )</p>
-  - particles with immersed-boundary method (either passive or 2-way-coupled, single-GPU only)
+  - particles with immersed-boundary method (either passive or 2-way-coupled)
 
   </details>
 
@@ -474,7 +482,7 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
 
 ## Solving the Compatibility Problem
 
-- FluidX3D is written in OpenCL 1.2, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
+- FluidX3D is written in OpenCL, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
   - world's fastest datacenter GPUs: B200, MI300X, H200, H100 (NVL), A100, MI200, MI100, V100(S), GPU Max 1100, ...
   - gaming GPUs (desktop/laptop): Nvidia GeForce, AMD Radeon, Intel Arc
   - professional/workstation GPUs: Nvidia Quadro, AMD Radeon Pro / FirePro, Intel Arc Pro
@@ -1710,7 +1718,7 @@ Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly
 
 - <details><summary>Does FluidX3D support adaptive mesh refinement?</summary><br>No, not yet. Grid cell size is the same everywhere in the simulation box.<br><br></details>
 
-- <details><summary>Can FluidX3D model both water and air at the same time?</summary><br>No. FluidX3D can model either water or air, but not both at the same time. For free surface simulations with the <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#surface-extension">`SURFACE` extension</a>, I went with a <a href="https://doi.org/10.3390/computation10060092">volume-of-fluid</a>/<a href="https://doi.org/10.3390/computation10020021">PLIC</a> modeling approach as that provides a sharp water-air interface, so individual droplets can be resolved as small as 3 grid cells in diameter. However this model ignores the gas phase completely, and only models the fluid phase with LBM as well as the surface tension. An alternative I had explored years ago was the <a href="http://dx.doi.org/10.1016/j.jcp.2022.111753">phase-field models</a> (simplest of them is Shan-Chen model) - they model both fluid and gas phases, but struggle with the 1:1000 density contrast of air:water, and the modeled interface is diffuse over ~5 grid cells. So the smallest resolved droplets are ~10 grid cells in diameter, meaning for the same resolution you need ~37x the memory footprint - infeasible on GPUs. Coming back to VoF model, it is possible to <a href="http://dx.doi.org/10.1186/s43591-023-00053-7">extend it with a model for the gas phase</a>, but one has to manually track bubble split/merge events, which makes this approach very painful in implementation and poorly performing on the hardware.<br><br></details>
+- <details><summary>Can FluidX3D model both water and air at the same time?</summary><br>No. FluidX3D can model either water or air, but not both at the same time. For free surface simulations with the <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#surface-extension">SURFACE extension</a>, I went with a <a href="https://doi.org/10.3390/computation10060092">volume-of-fluid</a>/<a href="https://doi.org/10.3390/computation10020021">PLIC</a> modeling approach as that provides a sharp water-air interface, so individual droplets can be resolved as small as 3 grid cells in diameter. However this model ignores the gas phase completely, and only models the fluid phase with LBM as well as the surface tension. An alternative I had explored years ago was the <a href="http://dx.doi.org/10.1016/j.jcp.2022.111753">phase-field models</a> (simplest of them is Shan-Chen model) - they model both fluid and gas phases, but struggle with the 1:1000 density contrast of air:water, and the modeled interface is diffuse over ~5 grid cells. So the smallest resolved droplets are ~10 grid cells in diameter, meaning for the same resolution you need ~37x the memory footprint - infeasible on GPUs. Coming back to VoF model, it is possible to <a href="http://dx.doi.org/10.1186/s43591-023-00053-7">extend it with a model for the gas phase</a>, but one has to manually track bubble split/merge events, which makes this approach very painful in implementation and poorly performing on the hardware.<br><br></details>
 
 - <details><summary>Can FluidX3D compute lift/drag forces?</summary><br>Yes. See <a href="https://github.com/ProjectPhysX/FluidX3D/blob/master/DOCUMENTATION.md#liftdrag-forces">the relevant section in the FluidX3D Documentation</a>!<br><br></details>
 
 
@@ -42,7 +42,7 @@ void Info::print_logo() const {
 	print("|                                  ");                 print("\\  \\ /  /", c);                print("                                  |\n");
 	print("|                                   ");                 print("\\  '  /", c);                 print("                                   |\n");
 	print("|                                    ");                 print("\\   /", c);                 print("                                    |\n");
-	print("|                                     ");                 print("\\ /", c);                 print("                FluidX3D Version 3.4 |\n");
+	print("|                                     ");                 print("\\ /", c);                 print("                FluidX3D Version 3.5 |\n");
 	print("|                                      ");                 print( "'", c);                 print("     Copyright (c) Dr. Moritz Lehmann |\n");
 	print("|-----------------------------------------------------------------------------|\n");
 }
 
@@ -839,9 +839,9 @@ string opencl_c_container() { return R( // ########################## begin of O
 }
 )+R(float3 mirror_position(const float3 p) { // mirror position into periodic boundaries
 	float3 r;
-	r.x = sign(p.x)*(fmod(fabs(p.x)+0.5f*(float)def_Nx, (float)def_Nx)-0.5f*(float)def_Nx);
-	r.y = sign(p.y)*(fmod(fabs(p.y)+0.5f*(float)def_Ny, (float)def_Ny)-0.5f*(float)def_Ny);
-	r.z = sign(p.z)*(fmod(fabs(p.z)+0.5f*(float)def_Nz, (float)def_Nz)-0.5f*(float)def_Nz);
+	r.x = sign(p.x)*(fmod(fabs(p.x)+0.5f*(float)def_GNx, (float)def_GNx)-0.5f*(float)def_GNx);
+	r.y = sign(p.y)*(fmod(fabs(p.y)+0.5f*(float)def_GNy, (float)def_GNy)-0.5f*(float)def_GNy);
+	r.z = sign(p.z)*(fmod(fabs(p.z)+0.5f*(float)def_GNz, (float)def_GNz)-0.5f*(float)def_GNz);
 	return r;
 }
 )+R(float3 mirror_distance(const float3 d) { // mirror distance vector into periodic boundaries
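
Review note on the `mirror_position()` hunk: the wrap formula is unchanged and only the box extent switches from `def_Nx/Ny/Nz` to `def_GNx/GNy/GNz`. Judging by the kernel comment "mirror into global simulation box", the `GN` values are presumably the global box size, so particle positions stay in global coordinates across all domains. The following host-side C++ sketch reproduces the per-axis formula for a single coordinate. The name `mirror_coordinate` and the scalar signature are illustrative assumptions, not part of FluidX3D.

```cpp
#include <cassert>
#include <cmath>

// Wrap one coordinate into a periodic box of N cells centered on the origin,
// using the same expression as mirror_position() in the diff.
// Hypothetical helper for illustration only; N stands in for def_GNx/GNy/GNz.
float mirror_coordinate(const float p, const unsigned int N) {
	const float size = (float)N;
	const float sign = (float)(p>0.0f)-(float)(p<0.0f); // -1, 0 or +1 for finite inputs
	return sign*(std::fmod(std::fabs(p)+0.5f*size, size)-0.5f*size);
}
```

With a box of 8 cells, a coordinate of `5.0f` leaves through the upper face and re-enters at `-3.0f`, while coordinates already inside the box are returned unchanged.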
@@ -1998,9 +1998,10 @@ string opencl_c_container() { return R( // ########################## begin of O
 		const uint x=(xb+i)%def_Nx, y=(yb+j)%def_Ny, z=(zb+k)%def_Nz; // calculate corner lattice positions
 		const uxx n = (uxx)x+(uxx)(y+z*def_Ny)*(uxx)def_Nx; // calculate lattice linear index
 		const float d = (1.0f-fabs(x1-(float)i))*(1.0f-fabs(y1-(float)j))*(1.0f-fabs(z1-(float)k)); // force spreading
-		atomic_add_f(&F[                 n], Fn.x*d); // F[                 n] += Fn.x*d;
-		atomic_add_f(&F[    def_N+(ulong)n], Fn.y*d); // F[    def_N+(ulong)n] += Fn.y*d;
-		atomic_add_f(&F[2ul*def_N+(ulong)n], Fn.z*d); // F[2ul*def_N+(ulong)n] += Fn.z*d;
+		const float3 Fnd = Fn*d;
+		if(Fnd.x!=0.0f) atomic_add_f(&F[                 n], Fnd.x); // F[                 n] += Fnd.x;
+		if(Fnd.y!=0.0f) atomic_add_f(&F[    def_N+(ulong)n], Fnd.y); // F[    def_N+(ulong)n] += Fnd.y;
+		if(Fnd.z!=0.0f) atomic_add_f(&F[2ul*def_N+(ulong)n], Fnd.z); // F[2ul*def_N+(ulong)n] += Fnd.z;
 	}
 } // spread_force()
 )+"#endif"+R( // FORCE_FIELD
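
Review note on the `spread_force()` hunk: the changelog entry "faster force spreading if volume force is axis-aligned" corresponds to the new `if(Fnd.x!=0.0f)` guards, which skip an atomic add whenever the weighted force component is zero. The C++ sketch below counts how many accumulations remain for a given force vector. It is a serial stand-in under stated assumptions: `Float3`, `SpreadResult` and `spread_force_sketch` are hypothetical names, plain additions replace `atomic_add_f`, and the corner ordering is arbitrary.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

struct SpreadResult {
	Float3 total = {0.0f, 0.0f, 0.0f}; // sum of all contributions over the 8 cell corners
	int adds = 0; // number of accumulations performed (stand-in for atomic_add_f calls)
};

// add value to sum only if it is non-zero, mirroring the guards added in the diff
static void accumulate(float& sum, const float value, int& adds) {
	if(value!=0.0f) {
		sum += value;
		adds++;
	}
}

// (x1, y1, z1) is the particle's fractional offset inside its lattice cell, each in [0,1)
SpreadResult spread_force_sketch(const Float3 Fn, const float x1, const float y1, const float z1) {
	SpreadResult result;
	for(int c=0; c<8; c++) { // loop over the 8 corners of the cell
		const int i=(c&4)>>2, j=(c&2)>>1, k=c&1;
		const float d = (1.0f-std::fabs(x1-(float)i))*(1.0f-std::fabs(y1-(float)j))*(1.0f-std::fabs(z1-(float)k)); // trilinear weight
		accumulate(result.total.x, Fn.x*d, result.adds);
		accumulate(result.total.y, Fn.y*d, result.adds);
		accumulate(result.total.z, Fn.z*d, result.adds);
	}
	return result;
}
```

For a force along a single axis, such as gravity in z, this leaves at most 8 accumulations per particle instead of 24, and corners with zero weight are skipped as well.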
@@ -2023,29 +2024,51 @@ string opencl_c_container() { return R( // ########################## begin of O
 	const float particle_radius = 0.5f; // has to be between 0.0f and 0.5f, default: 0.5f (hydrodynamic radius)
 	return boundary_distance-0.5f<particle_radius ? normalize(boundary_force) : (float3)(0.0f, 0.0f, 0.0f);
 } // particle_boundary_force()
-
+)+R(bool position_is_in_domain_including_halo(const float3 p) {
+	const float hNx = 0.5f*(float)(def_Nx-(def_Dx>1u)); // subtract half of halo still
+	const float hNy = 0.5f*(float)(def_Ny-(def_Dy>1u));
+	const float hNz = 0.5f*(float)(def_Nz-(def_Dz>1u));
+	return p.x>=-hNx&&p.x<hNx&&p.y>=-hNy&&p.y<hNy&&p.z>=-hNz&&p.z<hNz;
+}
+)+R(bool position_is_in_domain_excluding_halo(const float3 p) {
+	const float hNx = 0.5f*(float)(def_Nx-2u*(def_Dx>1u)); // subtract full halo
+	const float hNy = 0.5f*(float)(def_Ny-2u*(def_Dy>1u));
+	const float hNz = 0.5f*(float)(def_Nz-2u*(def_Dz>1u));
+	return p.x>=-hNx&&p.x<hNx&&p.y>=-hNy&&p.y<hNy&&p.z>=-hNz&&p.z<hNz;
+}
 )+R(kernel void integrate_particles)+"("+R(global float* particles, const global float* u, const global uchar* flags, const float time_step_multiplicator // ) {
 )+"#ifdef FORCE_FIELD"+R(
 	, volatile global float* F, const float fx, const float fy, const float fz
 )+"#endif"+R( // FORCE_FIELD
 )+") {"+R( // integrate_particles()
 	const uxx n = get_global_id(0); // index of membrane points
 	if(n>=(uxx)def_particles_N) return;
-	const float3 p0 = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]); // cache particle position
+	float3 p = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]); // cache particle position
+	p = mirror_position(p); // mirror into global simulation box
+	p -= (float3)(def_domain_offset_x, def_domain_offset_y, def_domain_offset_z); // subtract domain offset, then treat point in local domain
+	if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_including_halo(p)) {
+		p.x = as_float(0xFFFFFFFFu); // invalidate x-coordinate for all particles outside of the local domain (including halo)
+	} else {
 )+"#ifdef FORCE_FIELD"+R(
-	if(def_particles_rho!=1.0f) {
-		const float drho = def_particles_rho-1.0f; // density difference leads to particle buoyancy
-		float3 Fn = (float3)(fx*drho, fy*drho, fz*drho); // F = F_p+F_f = (m_p-m_f)*g = (rho_p-rho_f)*g*V
-		spread_force(F, p0, Fn); // do force spreading
-	}
+		if(def_particles_rho!=1.0f) { // apply volume force for all particles in local domain (including halo)
+			const float drho = def_particles_rho-1.0f; // density difference leads to particle buoyancy
+			float3 Fn = (float3)(fx*drho, fy*drho, fz*drho); // F = F_p+F_f = (m_p-m_f)*g = (rho_p-rho_f)*g*V
+			spread_force(F, p, Fn); // do force spreading
+		}
 )+"#endif"+R( // FORCE_FIELD
-	const float3 p0_mirrored = mirror_position(p0);
-	float3 un = interpolate_u(p0_mirrored, u); // trilinear interpolation of velocity at point p
-	un = (un+length(un)*particle_boundary_force(p0_mirrored, flags))*time_step_multiplicator;
-	const float3 p = mirror_position(p0+un); // advect particles
-	particles[                           n] = p.x;
-	particles[    def_particles_N+(ulong)n] = p.y;
-	particles[2ul*def_particles_N+(ulong)n] = p.z;
+		if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_excluding_halo(p)) { // skip remaining ghost particles in halo
+			p.x = as_float(0xFFFFFFFFu); // invalidate x-coordinate for all particles outside of the local domain
+		} else { // advect only particles in local domain (excluding halo)
+			float3 un = interpolate_u(p, u); // trilinear interpolation of velocity at point p
+			un = (un+length(un)*particle_boundary_force(p, flags))*time_step_multiplicator;
+			p += un; // advect particles
+			p += (float3)(def_domain_offset_x, def_domain_offset_y, def_domain_offset_z); // add domain offset, back to global domain
+			p = mirror_position(p); // mirror advected position again into global simulation box
+			particles[    def_particles_N+(ulong)n] = p.y; // store y/z-coordinates only for particles in domain
+			particles[2ul*def_particles_N+(ulong)n] = p.z;
+		}
+	}
+	particles[n] = p.x; // always store x-coordinate (invalidated or particles in domain)
 } // integrate_particles()
 )+"#endif"+R( // PARTICLES
 
@@ -2175,6 +2198,31 @@ string opencl_c_container() { return R( // ########################## begin of O
 	flags[index_insert_m(a, direction)] = transfer_buffer_m[a];
 }
 
+)+"#ifdef FORCE_FIELD"+R(
+)+R(void extract_F(const uint a, const uint A, const uxx n, global float* transfer_buffer, const global float* F) {
+	transfer_buffer[     a] = F[                 n];
+	transfer_buffer[   A+a] = F[    def_N+(ulong)n];
+	transfer_buffer[2u*A+a] = F[2ul*def_N+(ulong)n];
+}
+)+R(void insert_F(const uint a, const uint A, const uxx n, const global float* transfer_buffer, global float* F) {
+	F[                 n] = transfer_buffer[     a];
+	F[    def_N+(ulong)n] = transfer_buffer[   A+a];
+	F[2ul*def_N+(ulong)n] = transfer_buffer[2u*A+a];
+}
+)+R(kernel void transfer_extract_F(const uint direction, const ulong t, global float* transfer_buffer_p, global float* transfer_buffer_m, const global float* F) {
+	const uint a=get_global_id(0), A=get_area(direction); // a = domain area index for each side, A = area of the domain boundary
+	if(a>=A) return; // area might not be a multiple of cl_workgroup_size, so return here to avoid writing in unallocated memory space
+	extract_F(a, A, index_extract_p(a, direction), transfer_buffer_p, F);
+	extract_F(a, A, index_extract_m(a, direction), transfer_buffer_m, F);
+}
+)+R(kernel void transfer__insert_F(const uint direction, const ulong t, const global float* transfer_buffer_p, const global float* transfer_buffer_m, global float* F) {
+	const uint a=get_global_id(0), A=get_area(direction); // a = domain area index for each side, A = area of the domain boundary
+	if(a>=A) return; // area might not be a multiple of cl_workgroup_size, so return here to avoid writing in unallocated memory space
+	insert_F(a, A, index_insert_p(a, direction), transfer_buffer_p, F);
+	insert_F(a, A, index_insert_m(a, direction), transfer_buffer_m, F);
+}
+)+"#endif"+R( // FORCE_FIELD
+
 )+"#ifdef SURFACE"+R(
 )+R(void extract_phi_massex_flags(const uint a, const uint A, const uxx n, global char* transfer_buffer, const global float* phi, const global float* massex, const global uchar* flags) {
 	((global float*)transfer_buffer)[     a] = phi   [n];
@@ -2966,10 +3014,11 @@ string opencl_c_container() { return R( // ########################## begin of O
 )+R(kernel void graphics_particles(const global float* camera, global int* bitmap, global int* zbuffer, const global float* particles) {
 	const uxx n = get_global_id(0);
 	if(n>=(uxx)def_particles_N) return;
+	const float3 p = (float3)(particles[n]-def_domain_offset_x, particles[def_particles_N+(ulong)n]-def_domain_offset_y, particles[2ul*def_particles_N+(ulong)n]-def_domain_offset_z);
+	if(def_Dx*def_Dy*def_Dz>1u&&!position_is_in_domain_excluding_halo(p)) return;
 	float camera_cache[15]; // cache parameters in case the kernel draws more than one shape
 	for(uint i=0u; i<15u; i++) camera_cache[i] = camera[i];
 	const int c = COLOR_P; // coloring scheme
-	const float3 p = (float3)(particles[n], particles[def_particles_N+(ulong)n], particles[2ul*def_particles_N+(ulong)n]);
 	draw_point(p, c, camera_cache, bitmap, zbuffer);
 	//draw_circle(p, 0.5f, c, camera_cache, bitmap, zbuffer);
 }
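
Review note on the two halo tests used by `integrate_particles()` and `graphics_particles()`: both compare a domain-local position against a half extent that shrinks when there is more than one domain along an axis, and particles that fail the test get their x-coordinate overwritten with the bit pattern `0xFFFFFFFF`, which is a NaN in IEEE 754 single precision. The C++ sketch below works through one axis. The helper names are hypothetical, and the reading of `def_Nx` as the local extent including halo cells is an assumption based on the "subtract full halo" comment. How the host code consumes the NaN marker is not visible in this diff.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Half extent of the local domain along one axis.
// N = local cell count (assumed to include halo cells), D = number of domains along that axis.
float half_extent_including_halo(const unsigned int N, const unsigned int D) {
	return 0.5f*(float)(N-(D>1u ? 1u : 0u)); // "subtract half of halo still"
}
float half_extent_excluding_halo(const unsigned int N, const unsigned int D) {
	return 0.5f*(float)(N-(D>1u ? 2u : 0u)); // "subtract full halo"
}

// same comparison as in position_is_in_domain_*(), for a single axis
bool in_range(const float p, const float half_extent) {
	return p>=-half_extent&&p<half_extent;
}

// host-side equivalent of as_float(0xFFFFFFFFu) in OpenCL C
float invalid_coordinate() {
	const std::uint32_t bits = 0xFFFFFFFFu;
	float value;
	std::memcpy(&value, &bits, sizeof(value));
	return value;
}
```

With 10 local cells and 2 domains along the axis, a particle at local coordinate `4.2f` passes the including-halo test (half extent 4.5) but fails the excluding-halo test (half extent 4.0), so in the kernel it would receive the volume force but would not be advected by this domain. Because every comparison with NaN is false, an invalidated coordinate fails both tests.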