Optimized Red-Black Gauss-Seidel on GPU by TzuYaoHuang · Pull Request #276 · WaterLily-jl/WaterLily.jl

TzuYaoHuang · 2026-03-22T14:21:20Z

This is a continuation of PR #254, with a more optimized kernel.

The kernel determines red and black cells without an `if` statement, reducing the overhead on GPU by 50%.
A custom `mult` kernel removes the redundant diagonal multiplication, for another 10% performance gain.
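
One possible reading of the "redundant diagonal multiplication", sketched in plain Python rather than WaterLily's Julia: if the smoother update is built from a full matrix-vector product, the diagonal term `D[i]*x[i]` is computed only to be cancelled again. The tridiagonal 1D system and every function name below are hypothetical illustrations, not the PR's actual `mult` kernel.

```python
import math

def mult(lower, diag, upper, x):
    """Full product A*x for a tridiagonal A (hypothetical stand-in for a mult kernel)."""
    n = len(x)
    y = []
    for i in range(n):
        s = diag[i] * x[i]              # diagonal contribution
        if i > 0:
            s += lower[i] * x[i - 1]    # left neighbour
        if i < n - 1:
            s += upper[i] * x[i + 1]    # right neighbour
        y.append(s)
    return y

def offdiag(lower, upper, x, i):
    """Only the neighbour contributions of row i, skipping the diagonal."""
    s = 0.0
    if i > 0:
        s += lower[i] * x[i - 1]
    if i < len(x) - 1:
        s += upper[i] * x[i + 1]
    return s

def update_via_mult(lower, diag, upper, x, b, i):
    # x_i + (b_i - (A x)_i) / D_i: multiplies by D_i, then divides it back out
    residual = b[i] - mult(lower, diag, upper, x)[i]
    return x[i] + residual / diag[i]

def update_direct(lower, diag, upper, x, b, i):
    # (b_i - sum of off-diagonal terms) / D_i: same value, no diagonal multiply
    return (b[i] - offdiag(lower, upper, x, i)) / diag[i]

n = 6
lower, upper, diag = [1.0] * n, [1.0] * n, [-2.0] * n
x = [0.5 * i for i in range(n)]
b = [1.0] * n

same = all(
    math.isclose(update_via_mult(lower, diag, upper, x, b, i),
                 update_direct(lower, diag, upper, x, b, i),
                 abs_tol=1e-12)
    for i in range(n)
)
```

Both forms give the same new value for each cell; the direct form simply does one multiply and one add fewer per cell, which is the kind of saving the description attributes to the custom kernel.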

However, the kernel still seems expensive compared to the original pcg. This needs diagnosis.

- Remove the usage of `mult`
- Use half of the grid to shift the grid vertically according to RB.
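
A minimal sketch of the half-grid idea, in plain Python rather than the PR's Julia kernel: loop over half of the cells and shift the index by one on alternate rows, so that every visited cell already has the requested color and no parity branch is needed. The function names, the 2D layout, and the choice of which index gets shifted are assumptions for illustration only.

```python
def rb_cells_branching(nx, ny, color):
    """Reference version: visit every cell and filter by parity with an if."""
    cells = []
    for j in range(ny):
        for i in range(nx):
            if (i + j) % 2 == color:
                cells.append((i, j))
    return cells

def rb_cells_half_grid(nx, ny, color):
    """Branch-free version: iterate over nx/2 cells per row.

    The offset (j + color) & 1 shifts the start of each row by zero or one
    cell, so i + j always has the parity given by color.
    """
    assert nx % 2 == 0, "sketch assumes an even number of cells per row"
    cells = []
    for j in range(ny):
        for ih in range(nx // 2):
            i = 2 * ih + ((j + color) & 1)
            cells.append((i, j))
    return cells

red = rb_cells_half_grid(4, 4, 0)
black = rb_cells_half_grid(4, 4, 1)
```

On a GPU the benefit would be that each launched thread does useful work, instead of half of the threads exiting early on the parity test.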

TzuYaoHuang · 2026-03-22T14:35:19Z

Some initial benchmarking on a CUDA GPU.

1ccaf76: master branch
dfdc965: current PR

Benchmark environment: tgv sim_step! (max_steps=400)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │     7629129 │   1.33 │     2.11 │            20.16 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │     6956449 │   1.49 │     1.40 │            13.34 │     1.51 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    11439482 │   0.44 │     7.05 │             8.41 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │     7950303 │   0.51 │     4.99 │             5.94 │     1.42 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 8
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    10742862 │   0.11 │    28.48 │             4.24 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │     8625259 │   0.11 │    27.63 │             4.12 │     1.03 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 9
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │     9500646 │   0.02 │   177.19 │             3.30 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │     9352332 │   0.02 │   228.90 │             4.26 │     0.77 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=400)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    18432069 │   7.95 │    12.13 │             8.57 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    13084382 │   9.40 │     9.73 │             6.87 │     1.25 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    22919551 │   2.82 │    64.88 │             5.73 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    14906204 │   1.10 │    55.53 │             4.90 │     1.17 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    22981795 │   0.51 │   445.85 │             4.92 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    16237211 │   0.20 │   437.07 │             4.82 │     1.02 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=400)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    20982480 │   0.65 │     7.69 │            18.34 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    12607730 │   0.54 │     4.53 │            10.79 │     1.70 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    24345593 │   0.40 │    34.32 │            10.23 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    14110259 │   0.37 │    21.60 │             6.44 │     1.59 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 8
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│    Backend │ WaterLily │  Julia │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│ GPU-NVIDIA │   1ccaf76 │ 1.12.5 │   Float32 │    37423319 │   0.07 │   220.30 │             8.21 │     1.00 │
│ GPU-NVIDIA │   dfdc965 │ 1.12.5 │   Float32 │    17390891 │   0.07 │   166.98 │             6.22 │     1.32 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
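
To read the last two columns, the following Python sketch recomputes them from the Time column. It assumes that the TGV case uses a cubic grid of `(2^log2p)^3` cells and that all 400 steps are counted; this matches the TGV rows to within rounding of the displayed times, but it is an inference from the numbers, not something stated in the benchmark output. The cylinder and jelly cases evidently use different domain shapes, so the DOF formula does not carry over to them. The function names are hypothetical.

```python
def cost_ns_per_dof_dt(time_s, log2p, dims=3, steps=400):
    """Wall time divided by (degrees of freedom * time steps), in nanoseconds."""
    dof = (2 ** log2p) ** dims      # assumed cubic TGV grid
    return time_s * 1e9 / (dof * steps)

def speed_up(time_base_s, time_new_s):
    """Baseline time over new time; values below 1 mean a slowdown."""
    return time_base_s / time_new_s

# TGV, log2p = 8, master (1ccaf76): 28.48 s
cost_master_p8 = round(cost_ns_per_dof_dt(28.48, 8), 2)

# TGV, log2p = 6 and log2p = 9: master time vs. PR time
speedup_p6 = round(speed_up(2.11, 1.40), 2)
speedup_p9 = round(speed_up(177.19, 228.90), 2)
```

With these inputs the sketch gives 4.24 ns/DOF/dt for master at log2p = 8, a speed-up of 1.51 at log2p = 6, and 0.77 at log2p = 9, in line with the table: the PR is faster on small TGV grids and slower on the largest one.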

I found that the code is faster using only 4 total GSRB smoother iterations (so 2 red 2 black). This is small enough that we can also move the perBC! out of the loop. Since we don't have perBCs for the benchmark cases, I'm not sure if this is optimal or not.

The optimized working example

dfdc965

TzuYaoHuang and others added 2 commits March 23, 2026 22:28

ignore nsight report

925b15b

Small optimizations

eae8d1b

weymouth mentioned this pull request Mar 25, 2026

Initial commit of a simple GPU-friendly Gauss-Seidel Red-Black smoother #254

Closed

TzuYaoHuang wants to merge 3 commits into WaterLily-jl:master from TzuYaoHuang:OptmizeRBGaussSidel
