Initial commit of a simple GPU-friendly Gauss-Seidel Red-Black smoother #254

Closed
weymouth wants to merge 3 commits into master from GaussSidelRedBlack

Conversation

@weymouth
Member

@weymouth weymouth commented Sep 29, 2025

This is an initial implementation of a Gauss-Seidel smoother which runs on GPUs by alternating the update in a checkerboard pattern.

  • This smoother has no accumulations, so it should be less sensitive than pcg! to things like reduced precision. It might also improve automatic differentiation (AD) with respect to body position, but I haven't tested that.
  • The Gauss-Seidel kernel is launched on the whole array twice, and LinearIndex[I]%2 is used to alternate between even/odd cells. Alternatively, we could use @vecloop from BiotBCs and save a vector of the red/black cells, which could be faster.
  • The smoother iterates a fixed it=6 times without residual checks. This is fast, but might be wasting effort or require more V-cycles.
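
For readers unfamiliar with the checkerboard ordering described above, here is a minimal sketch in plain Python (not the WaterLily implementation). It assumes a 2D five-point Laplacian with unit spacing and fixed zero boundary values, and it colors cells by coordinate parity `(i + j) % 2`; the PR itself uses the parity of a linear index, and its boundary handling (`perBC!`) is not reproduced here. The function names are hypothetical.

```python
def rb_sweep(x, b, color):
    """Update, in place, every interior cell whose parity (i + j) % 2 equals color."""
    n, m = len(x), len(x[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            if (i + j) % 2 == color:
                # All four neighbors have the opposite color, so this update
                # never reads a value written during the same sweep. That is
                # what lets each color be updated in parallel on a GPU.
                x[i][j] = 0.25 * (x[i - 1][j] + x[i + 1][j]
                                  + x[i][j - 1] + x[i][j + 1] - b[i][j])


def gauss_seidel_rb(x, b, it=6):
    """Run a fixed number of red/black sweep pairs, with no residual check."""
    for _ in range(it):
        rb_sweep(x, b, 0)  # "red" cells
        rb_sweep(x, b, 1)  # "black" cells
    return x


def max_residual(x, b):
    """Largest |b - laplacian(x)| over the interior cells."""
    n, m = len(x), len(x[0])
    return max(abs(b[i][j] - (x[i - 1][j] + x[i + 1][j] + x[i][j - 1]
                              + x[i][j + 1] - 4.0 * x[i][j]))
               for i in range(1, n - 1) for j in range(1, m - 1))


if __name__ == "__main__":
    x = [[0.0] * 10 for _ in range(10)]
    b = [[1.0] * 10 for _ in range(10)]
    gauss_seidel_rb(x, b, it=6)
    print("max residual after 6 sweep pairs:", max_residual(x, b))
```

Launching the same kernel twice over the whole array and masking by color, as in this sketch, leaves half of the threads idle on each pass; precomputed lists of red and black cells would avoid that at the cost of extra storage.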

Added perBC! and slightly adjusted the function call for the GPU. With GaussSeidelRB! as the MultiLevelPoisson smoother, all tests now pass on CPU and GPU.
GaussSeidelRB! doesn't pass the Poisson tests because, as a stand-alone solver, it takes more iterations to converge than pcg!, but that doesn't matter for its use as a smoother.
@b-fg
Member

b-fg commented Oct 10, 2025

To do

  • Check performance for a given tolerance in the benchmark suite

@b-fg
Member

b-fg commented Oct 10, 2025

Performance with red-black Gauss-Seidel (RBGS) is similar to PCG, at least for mid-size cases. I haven't touched the solver's tolerance or number of iterations, so that may be worth investigating.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │          WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │        1807 │   0.00 │     4.17 │           159.03 │     1.00 │
│     CPUx01 │             v1.5.2 │ 1.11.5 │   Float32 │        1807 │   0.00 │     4.17 │           159.08 │     1.00 │
│     CPUx02 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     1687271 │   0.00 │     4.83 │           184.44 │     0.86 │
│     CPUx02 │             v1.5.2 │ 1.11.5 │   Float32 │     1687271 │   0.00 │     4.83 │           184.38 │     0.86 │
│     CPUx04 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     2334995 │   0.00 │     2.90 │           110.49 │     1.44 │
│     CPUx04 │             v1.5.2 │ 1.11.5 │   Float32 │     2334995 │   0.00 │     2.92 │           111.50 │     1.43 │
│ GPU-NVIDIA │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     3041059 │   0.00 │     0.59 │            22.35 │     7.12 │
│ GPU-NVIDIA │             v1.5.2 │ 1.11.5 │   Float32 │     3042259 │   0.00 │     0.58 │            22.23 │     7.16 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │          WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │        1807 │   0.00 │    25.67 │           122.42 │     1.00 │
│     CPUx01 │             v1.5.2 │ 1.11.5 │   Float32 │        1807 │   0.00 │    25.74 │           122.72 │     1.00 │
│     CPUx02 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     1571048 │   0.00 │    29.48 │           140.57 │     0.87 │
│     CPUx02 │             v1.5.2 │ 1.11.5 │   Float32 │     1571048 │   0.00 │    29.04 │           138.48 │     0.89 │
│     CPUx04 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     2175214 │   0.00 │    16.87 │            80.42 │     1.53 │
│     CPUx04 │             v1.5.2 │ 1.11.5 │   Float32 │     2175214 │   0.00 │    16.85 │            80.35 │     1.53 │
│ GPU-NVIDIA │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     2784740 │   0.00 │     3.02 │            14.41 │     8.51 │
│ GPU-NVIDIA │             v1.5.2 │ 1.11.5 │   Float32 │     2786213 │   0.00 │     3.00 │            14.32 │     8.57 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │          WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │        7107 │   0.00 │     3.44 │           262.08 │     1.00 │
│     CPUx01 │             v1.5.2 │ 1.11.5 │   Float32 │        7107 │   0.00 │     3.44 │           262.74 │     1.00 │
│     CPUx02 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     3244853 │   0.00 │     5.34 │           407.54 │     0.64 │
│     CPUx02 │             v1.5.2 │ 1.11.5 │   Float32 │     3244853 │   0.00 │     5.31 │           405.18 │     0.65 │
│     CPUx04 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     4491947 │   0.68 │     3.20 │           244.25 │     1.08 │
│     CPUx04 │             v1.5.2 │ 1.11.5 │   Float32 │     4491947 │   0.65 │     3.20 │           244.00 │     1.08 │
│ GPU-NVIDIA │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     5805176 │   0.00 │     0.97 │            73.65 │     3.57 │
│ GPU-NVIDIA │             v1.5.2 │ 1.11.5 │   Float32 │     5806646 │   0.00 │     0.95 │            72.38 │     3.63 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬────────────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │          WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼────────────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │        8307 │   0.00 │    25.54 │           243.60 │     0.99 │
│     CPUx01 │             v1.5.2 │ 1.11.5 │   Float32 │        8307 │   0.00 │    25.33 │           241.55 │     1.00 │
│     CPUx02 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     4229004 │   0.00 │    28.76 │           274.28 │     0.88 │
│     CPUx02 │             v1.5.2 │ 1.11.5 │   Float32 │     4229004 │   0.00 │    28.96 │           276.19 │     0.87 │
│     CPUx04 │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     5860776 │   0.18 │    16.78 │           160.07 │     1.51 │
│     CPUx04 │             v1.5.2 │ 1.11.5 │   Float32 │     5860776 │   0.18 │    16.76 │           159.86 │     1.51 │
│ GPU-NVIDIA │ GaussSidelRedBlack │ 1.11.5 │   Float32 │     7780056 │   0.00 │     3.15 │            30.04 │     8.04 │
│ GPU-NVIDIA │             v1.5.2 │ 1.11.5 │   Float32 │     7785226 │   0.72 │     3.20 │            30.51 │     7.92 │
└────────────┴────────────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
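
As a reading aid for the tables above, the derived columns can be reproduced approximately from the reported times. This is a sketch under two assumptions that are not stated in the thread: the speed-up reference appears to be the v1.5.2 CPUx01 run of the same case, and the tgv grid is assumed to have (2^log2p)^3 degrees of freedom. Small mismatches are expected because the listed times are rounded to two decimals.

```python
def speed_up(t_ref, t):
    """Ratio of the reference run time to this run time (both in seconds)."""
    return t_ref / t


def cost_ns_per_dof_dt(time_s, dof, steps):
    """Wall time in nanoseconds per degree of freedom per time step."""
    return time_s * 1e9 / (dof * steps)


if __name__ == "__main__":
    # jelly, log2p = 6: v1.5.2 CPUx01 (25.33 s) against the branch on GPU (3.15 s)
    print(round(speed_up(25.33, 3.15), 2))
    # tgv, log2p = 7, CPUx01 on the branch: 25.67 s over 100 steps,
    # assuming a cubic grid with (2**7)**3 cells
    print(round(cost_ns_per_dof_dt(25.67, (2**7) ** 3, 100), 2))
```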

@weymouth
Member Author

Thanks. That seems promising. I'll add a tolerance check and do some fine tuning.
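
A tolerance check of the kind mentioned here could look roughly like the sketch below. It is plain Python on a 1D model problem with fixed zero end values, not the code that was eventually written for WaterLily; the names `tol` and `itmx` and the choice of the max-norm residual as the stopping criterion are assumptions made for illustration.

```python
def smooth_with_tol(x, b, tol=1e-6, itmx=1000):
    """Red-black Gauss-Seidel for x[i-1] - 2*x[i] + x[i+1] = b[i] on interior cells.

    Stops once the largest interior residual drops below tol or after itmx
    sweep pairs. Returns (iterations used, final max residual).
    """
    n = len(x)
    r = 0.0
    for it in range(1, itmx + 1):
        for color in (0, 1):
            for i in range(1, n - 1):
                if i % 2 == color:
                    x[i] = 0.5 * (x[i - 1] + x[i + 1] - b[i])
        # The residual check is a reduction over the whole array, which is
        # the extra cost a fixed iteration count avoids.
        r = max((abs(b[i] - (x[i - 1] - 2.0 * x[i] + x[i + 1]))
                 for i in range(1, n - 1)), default=0.0)
        if r < tol:
            return it, r
    return itmx, r


if __name__ == "__main__":
    x = [0.0] * 10
    b = [1.0] * 10
    its, r = smooth_with_tol(x, b)
    print("sweep pairs:", its, "max residual:", r)
```

Checking the residual only every few sweeps would be one way to reduce how often the reduction runs.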

@weymouth
Member Author

Replaced by #276

@weymouth weymouth closed this Mar 25, 2026