Skip to content

Commit 9f92972

Browse files
committed
src/cuda/: scaffold native CUDA backend (skeleton + work plan)
Lays out the structure for a native CUDA backend that mirrors PR #222's native Metal port. The actual kernel ports (translating sphFluid.cl's 1515 lines to sphFluid.cu CUDA C++) are deferred — they're ~2 weeks of focused work and depend on PR #222's owSolver abstract base landing first to avoid a refactor. What this commit provides: src/cuda/README.md Implementation plan (5 phases, ~2 weeks estimated, with file-by-file mapping to PR #222's Metal port for structural cribbing) src/cuda/sphFluid.cu Skeleton with __global__ kernel signatures (TODO bodies). One per OpenCL kernel in sphFluid.cl. Includes an explicit reminder that pcisphComputeElasticForces must keep elastic forces in WORLD coordinates (this is where the Taichi pancake bug originates; CUDA port should match OpenCL exactly). inc/owCudaSolver.h Public C++ interface mirroring owOpenCLSolver.h. Method signatures commented out so this header doesn't try to declare functions whose definitions don't exist yet. What this commit does NOT do: - Touch makefile, makefile.OSX — they don't reference these files, existing builds are unaffected. - Add backend=cuda to owConfigProperty — would require a real implementation to dispatch to. - Attempt actual CUDA kernel implementation — that's Phase 1 of the work plan in src/cuda/README.md. Reasoning: this gives the next developer (or a future LLM session) a clear target architecture without committing to a half-baked port. When PR #222 lands, the CUDA work can proceed against the same owSolver abstract base + src/kernels/ descriptor pattern that PR #222 introduces, rather than being structurally divergent from the Metal half.
1 parent 92183bd commit 9f92972

3 files changed

Lines changed: 287 additions & 0 deletions

File tree

inc/owCudaSolver.h

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
// owCudaSolver — native CUDA backend for Sibernetic's PCISPH solver.
2+
//
3+
// STATUS: skeleton only. Public interface defined; method bodies are
4+
// declared but not implemented. See ../src/cuda/README.md for the
5+
// implementation plan.
6+
//
7+
// Mirrors owOpenCLSolver.h's interface exactly so owPhysicsFluidSimulator
8+
// can dispatch between OpenCL, Metal (PR #222), and CUDA via the same
9+
// base class. Once PR #222's owSolver.h lands, this should inherit from
10+
// owSolver and override the virtual methods rather than duplicating the
11+
// signatures.
12+
13+
#ifndef OWCUDASOLVER_H_
14+
#define OWCUDASOLVER_H_
15+
16+
#include "owConfigProperty.h"
17+
18+
class owCudaSolver {
19+
public:
20+
// TODO: constructors mirroring owOpenCLSolver
21+
// owCudaSolver(float *positions, float *velocities, owConfigProperty *config,
22+
// float *connections = nullptr, float *membranes = nullptr,
23+
// int *particleMembranes = nullptr);
24+
// ~owCudaSolver();
25+
26+
// ── Neighbor search ──
27+
// unsigned int _runClearBuffers(owConfigProperty *config);
28+
// unsigned int _runHashParticles(owConfigProperty *config);
29+
// void _runSort(owConfigProperty *config);
30+
// unsigned int _runSortPostPass(owConfigProperty *config);
31+
// unsigned int _runIndexx(owConfigProperty *config);
32+
// void _runIndexPostPass(owConfigProperty *config);
33+
// unsigned int _runFindNeighbors(owConfigProperty *config);
34+
35+
// ── PCISPH physics ──
36+
// unsigned int _run_pcisph_computeDensity(owConfigProperty *config);
37+
// unsigned int _run_pcisph_computeForcesAndInitPressure(owConfigProperty *config);
38+
// unsigned int _run_pcisph_computeElasticForces(owConfigProperty *config);
39+
// unsigned int _run_pcisph_predictPositions(owConfigProperty *config);
40+
// unsigned int _run_pcisph_predictDensity(owConfigProperty *config);
41+
// unsigned int _run_pcisph_correctPressure(owConfigProperty *config);
42+
// unsigned int _run_pcisph_computePressureForceAcceleration(owConfigProperty *config);
43+
// unsigned int _run_pcisph_integrate(int iterationCount,
44+
// int pcisph_integrate_mode,
45+
// owConfigProperty *config);
46+
47+
// ── Membrane interaction ──
48+
// unsigned int _run_clearMembraneBuffers(owConfigProperty *config);
49+
// unsigned int _run_computeInteractionWithMembranes(owConfigProperty *config);
50+
51+
// ── Buffer access (mirrors owOpenCLSolver pattern) ──
52+
// void read_position_buffer(float *positions, owConfigProperty *config);
53+
// void read_velocity_buffer(float *velocities, owConfigProperty *config);
54+
// void read_density_buffer(float *density, owConfigProperty *config);
55+
56+
private:
57+
// TODO: CUDA stream, device pointers for each buffer (positions,
58+
// velocities, accelerations, density, pressure, neighbor map, grid
59+
// index, etc.). See PR #222's owMetalSolver.cpp for the equivalent
60+
// Metal implementation as a structural template.
61+
};
62+
63+
#endif // OWCUDASOLVER_H_

src/cuda/README.md

Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
# Sibernetic CUDA backend (skeleton)
2+
3+
> **Status: scaffolding only.** Files in this directory and `inc/owCudaSolver.h` lay out the structure for a native CUDA backend that mirrors PR #222's native Metal port. The actual CUDA kernel ports (translating `src/sphFluid.cl` to `src/cuda/sphFluid.cu`) are deferred — they are weeks of focused work and depend on PR #222's `owSolver` abstract base landing first.
4+
5+
## Why this exists
6+
7+
The `ow-native-gpu-0.1.0` line is built on the strategic decision (see `DEVELOPMENT_LOG.md`) to use **two vendor-backed GPU backends** instead of relying on cross-platform Python compilers like Taichi (whose maintenance has slowed) or PyTorch (whose per-kernel-launch overhead makes it 21× slower than OpenCL on the same hardware):
8+
9+
| Platform | Backend | Status |
10+
|---|---|---|
11+
| Apple Silicon | native Metal (PR #222 by Wei Weng) | merged-pending in `weiweng/modernize-makefile-osx` |
12+
| NVIDIA | **native CUDA (this directory)** | **scaffolded; not implemented** |
13+
| Linux server | OpenCL via NVIDIA runtime (existing) | parity baseline; do not invest |
14+
| Cross-platform fallback | Taichi-CUDA / Taichi-Metal (existing) | bug-blocked; needs `taichi_solver.py` fix |
15+
16+
## Structure (mirrors PR #222 / Metal)
17+
18+
```
19+
src/cuda/
20+
├── README.md ← you are here
21+
├── sphFluid.cu ← all CUDA __global__ kernels (port of src/sphFluid.cl)
22+
├── CudaContext.cpp/h ← CUDA device init, stream, memory pools (TODO)
23+
└── kernels/ ← one .cuh per kernel descriptor (mirrors PR #222's src/kernels/) (TODO)
24+
25+
inc/
26+
└── owCudaSolver.h ← public C++ interface, mirrors owOpenCLSolver
27+
28+
src/
29+
├── owCudaSolver.cpp ← bridge from owSolver virtual interface to .cu kernels (TODO)
30+
└── backend/
31+
└── CudaBackend.cpp/h ← (TODO) CUDA runtime API wrapper, equivalent to PR #222's MetalBackend
32+
```
33+
34+
## Implementation plan
35+
36+
Sequenced so each step produces a working artifact:
37+
38+
### Phase 0: Wait for PR #222 to land
39+
Reason: PR #222 introduces `inc/owSolver.h` (the abstract base both backends implement) and the `src/kernels/` descriptor pattern. Building the CUDA backend on the pre-PR-#222 OpenCL-only structure means refactoring once #222 lands. Wait until #222 merges.
40+
41+
### Phase 1: Port `sphFluid.cl``sphFluid.cu` (literal translation)
42+
Translate every OpenCL kernel in `src/sphFluid.cl` (1515 lines) to CUDA. Mostly mechanical:
43+
- `__kernel void``__global__ void`
44+
- `__global float4 *buf``float4 *buf` with explicit pointer args
45+
- `get_global_id(0)``blockIdx.x * blockDim.x + threadIdx.x`
46+
- `barrier(CLK_LOCAL_MEM_FENCE)``__syncthreads()`
47+
- `__local``__shared__`
48+
- OpenCL math intrinsics → CUDA intrinsics (`fabs``fabsf`, etc.)
49+
50+
Estimated: 2-3 days for a careful translation, with a parity test against the OpenCL output at each kernel.
51+
52+
### Phase 2: Implement `CudaBackend.cpp` (host-side dispatch)
53+
Wraps cuBLAS-style CUDA runtime calls:
54+
- `cudaMalloc` / `cudaFree` for buffers
55+
- `cudaMemcpy` for host-device transfers
56+
- `<<<grid, block>>>` kernel launches
57+
- `cudaStreamSynchronize` for ordering
58+
59+
Estimated: 2-3 days.
60+
61+
### Phase 3: Implement `owCudaSolver.cpp`
62+
Bridge between `owSolver` virtual interface (PR #222's abstraction) and `CudaBackend.cpp`'s kernel launches. Mirrors `owMetalSolver.cpp` from PR #222 line-by-line.
63+
64+
Estimated: 1-2 days.
65+
66+
### Phase 4: Wire into the build
67+
- Add `nvcc` to the Linux makefile path
68+
- Add CUDA backend selection to `owConfigProperty.cpp` (`backend=cuda`)
69+
- Update `Dockerfile` for sibernetic-runner to install CUDA toolkit (already has CUDA runtime via nvidia/cuda image)
70+
- Add `backend=cuda` to the cross-backend regression script
71+
72+
Estimated: 1 day.
73+
74+
### Phase 5: Cross-backend parity validation
75+
Run `scripts/cross_backend_regression.py --backend cuda --backend opencl --local-binary <PR222-Metal>`. All three should produce demo1 cube-stability metrics within the existing tolerance bands (extent retention ≥ 80%, mean_y fell ≥ 50%).
76+
77+
Estimated: 1 day of measurement + tuning.
78+
79+
### Total estimated effort: ~2 weeks of focused work for a competent CUDA developer.
80+
81+
## Reference files in PR #222 to model from
82+
83+
When PR #222 lands, the matching CUDA files would mirror these structurally:
84+
85+
| Metal file | CUDA equivalent |
86+
|---|---|
87+
| `inc/owMetalSolver.h` | `inc/owCudaSolver.h` |
88+
| `src/owMetalSolver.cpp` | `src/owCudaSolver.cpp` |
89+
| `src/owMetalPrivateImpl.cpp` | `src/owCudaPrivateImpl.cpp` (if needed) |
90+
| `src/backend/MetalBackend.{cpp,h}` | `src/backend/CudaBackend.{cpp,h}` |
91+
| `src/metal/sphFluid.metal` | `src/cuda/sphFluid.cu` |
92+
| `src/kernels/*.h` | `src/kernels/*.h` (already shared with Metal — same descriptors) |
93+
94+
The Metal/CUDA divergence is **only** in the kernel language (MSL vs CUDA C++) and the host-side runtime API (Metal C++ vs CUDA Runtime). The algorithm specification, kernel descriptors, and abstract solver interface are shared.
95+
96+
## Why not just use Taichi-CUDA?
97+
98+
We tested it on 2026-05-03:
99+
- 1-sec demo1 sim: cube didn't move at all (mean_y unchanged 44.42 → 44.42)
100+
- 5-sec demo1 sim on Apple Silicon Taichi-Metal: cube pancaked (extent retention 7.6%)
101+
- Both manifest the same algorithmic bug in `taichi_solver.py` (the documented coordinate-scale issue)
102+
103+
Fixing `taichi_solver.py` would unblock Taichi as a backend for both Metal and CUDA. Until then, native CUDA is the path forward for NVIDIA hardware.
104+
105+
## Why not just use OpenCL on NVIDIA?
106+
107+
It actually works fine — Cloud Run + L4 + NVIDIA's OpenCL runtime measures at 86 sec for a 1-sec demo1 sim, with cube physics intact. However:
108+
- Apple killed OpenCL on Apple Silicon, so it's not a path forward for cross-platform dev
109+
- The 2015 AMD APP SDK we historically link against is abandoned
110+
- NVIDIA's OpenCL is still maintained but not actively invested in
111+
- For long-term maintainability we want vendor-backed APIs (CUDA on NVIDIA, Metal on Apple)
112+
113+
OpenCL on NVIDIA stays as the **parity baseline** in the cross-backend regression: when we add the native CUDA backend, its outputs must match OpenCL within tolerance.

src/cuda/sphFluid.cu

Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,111 @@
1+
// CUDA kernel scaffolding for Sibernetic's PCISPH solver.
2+
//
3+
// STATUS: skeleton only. Function signatures and a high-level work plan
4+
// are present; the actual kernel bodies are TODO. See ./README.md for
5+
// the implementation phases.
6+
//
7+
// Each kernel here should be a literal port of its OpenCL counterpart in
8+
// src/sphFluid.cl. Algorithm-level changes are out of scope — the goal
9+
// is for `backend=cuda` to produce position/velocity output matching
10+
// `backend=opencl` within fp32 noise on the cross-backend regression.
11+
12+
#ifndef SIBERNETIC_CUDA_SPHFLUID_CU
13+
#define SIBERNETIC_CUDA_SPHFLUID_CU
14+
15+
#include <cuda_runtime.h>
16+
17+
// -----------------------------------------------------------------------
18+
// Neighbor search
19+
// -----------------------------------------------------------------------
20+
21+
__global__ void clearBuffers(/* TODO: signature mirroring sphFluid.cl */) {
22+
// TODO: port from src/sphFluid.cl ::clearBuffers
23+
}
24+
25+
__global__ void hashParticles(/* TODO */) {
26+
// TODO: port from src/sphFluid.cl ::hashParticles
27+
// OpenCL: global_id → particle index, compute cell, write to gridCellIndex
28+
}
29+
30+
// Sort: in CUDA we'd use thrust::sort_by_key on (cellId, particleIndex).
31+
// No __global__ kernel needed; called from the host in CudaBackend.cpp.
32+
33+
__global__ void sortPostPass(/* TODO */) {
34+
// TODO: port from src/sphFluid.cl ::sortPostPass
35+
}
36+
37+
__global__ void indexx(/* TODO */) {
38+
// TODO: port from src/sphFluid.cl ::indexx
39+
}
40+
41+
__global__ void indexPostPass(/* TODO */) {
42+
// TODO: port from src/sphFluid.cl ::indexPostPass
43+
}
44+
45+
__global__ void findNeighbors(/* TODO */) {
46+
// TODO: port from src/sphFluid.cl ::findNeighbors
47+
// 27-cell neighborhood walk; build per-particle neighbor list.
48+
}
49+
50+
// -----------------------------------------------------------------------
51+
// PCISPH physics
52+
// -----------------------------------------------------------------------
53+
54+
__global__ void pcisphComputeDensity(/* TODO */) {
55+
// TODO: port from src/sphFluid.cl ::pcisph_computeDensity
56+
// Wpoly6 kernel, density estimation per particle.
57+
}
58+
59+
__global__ void pcisphComputeForcesAndInitPressure(/* TODO */) {
60+
// TODO: port from src/sphFluid.cl ::pcisph_computeForcesAndInitPressure
61+
// Viscosity, surface tension, gravity, body forces.
62+
}
63+
64+
__global__ void pcisphComputeElasticForces(/* TODO */) {
65+
// TODO: port from src/sphFluid.cl ::pcisph_computeElasticForces
66+
// Spring forces between elastic-bonded particles.
67+
// CRITICAL: this is the kernel where Taichi's coordinate-scale bug
68+
// manifests. CUDA port should match OpenCL's coordinate handling
69+
// exactly — keep elastic forces in WORLD coordinates, not scaled.
70+
}
71+
72+
__global__ void pcisphPredictPositions(/* TODO */) {
73+
// TODO: port from src/sphFluid.cl ::pcisph_predictPositions
74+
// PCISPH iteration step 1: predict positions under current acceleration.
75+
}
76+
77+
__global__ void pcisphPredictDensity(/* TODO */) {
78+
// TODO: port from src/sphFluid.cl ::pcisph_predictDensity
79+
// PCISPH iteration step 2: re-evaluate density at predicted positions.
80+
}
81+
82+
__global__ void pcisphCorrectPressure(/* TODO */) {
83+
// TODO: port from src/sphFluid.cl ::pcisph_correctPressure
84+
// PCISPH iteration step 3: update pressure field to enforce
85+
// incompressibility (density deviation < 1%).
86+
}
87+
88+
__global__ void pcisphComputePressureForceAcceleration(/* TODO */) {
89+
// TODO: port from src/sphFluid.cl ::pcisph_computePressureForceAcceleration
90+
// gradWspiky; symmetric pressure force.
91+
}
92+
93+
__global__ void pcisphIntegrate(/* TODO */) {
94+
// TODO: port from src/sphFluid.cl ::pcisph_integrate
95+
// Leapfrog time integration; mode 0 = position update,
96+
// mode 1 = velocity update.
97+
}
98+
99+
// -----------------------------------------------------------------------
100+
// Membrane interaction
101+
// -----------------------------------------------------------------------
102+
103+
__global__ void clearMembraneBuffers(/* TODO */) {
104+
// TODO
105+
}
106+
107+
__global__ void computeInteractionWithMembranes(/* TODO */) {
108+
// TODO
109+
}
110+
111+
#endif // SIBERNETIC_CUDA_SPHFLUID_CU

0 commit comments

Comments
 (0)