
Fix FP64 operations on conv_diff #199

Merged
b-fg merged 6 commits into master from fix_conv_diff on Apr 21, 2025
Conversation

@b-fg (Member) commented Mar 11, 2025

In FP32 (T=Float32) GPU simulations, FP64 operations were detected in the conv_diff! routine with Nsight Compute (this should also happen on CPU). This fix closes #197. I have tracked it down to the flux function

@inline ϕ(a,I,f) = @inbounds (f[I]+f[I-δ(a,I)])*0.5

where the 0.5 literal always promotes the operation to FP64. The fix is to use /2 instead, which preserves the floating-point type of the operands.
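The promotion is easy to reproduce in plain Julia (a minimal sketch, not WaterLily code): multiplying a Float32 by the Float64 literal 0.5 yields a Float64, while dividing by the integer 2 keeps the element type.

```julia
# Float64 literals promote Float32 arithmetic to Float64;
# dividing by the integer 2 (or using a typed literal 0.5f0) does not.
f = Float32(1.0)

typeof(f * 0.5)    # Float64: the 0.5 literal forces promotion
typeof(f / 2)      # Float32: Int promotes to the float's type
typeof(f * 0.5f0)  # Float32: a Float32 literal also works
```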

I have also done some additional type cleaning and separated the @loop calls in conv_diff! into their own kernels. Benchmarks need to be run to see whether the fix impacts performance.

@b-fg b-fg added the bug Something isn't working label Mar 11, 2025
@b-fg (Member, Author) commented Mar 11, 2025

Benchmarks do not show any speedup currently. I need to try with the original merged @loop version.

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg (Member, Author) commented Mar 11, 2025

It seems that merging the loops actually helps performance a bit, see below (note that 32c3661 is the version without loop merging). Compared to master, the /2 fix yields similar results on GPU and is a bit faster on CPU (shouldn't it be the opposite?).

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2153203 │   0.00 │    17.03 │            81.21 │     1.14 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2573880 │   0.00 │     3.06 │            14.59 │     6.34 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5276539 │   0.00 │    36.93 │           104.36 │     1.05 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5703067 │   0.00 │     7.53 │            21.28 │     5.15 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth (Member) commented:
I don't know enough about GPUs to tell you what to expect. Might need to call in an expert...

@b-fg (Member, Author) commented Mar 12, 2025

I have verified with Nsight Compute that the /2 fix removes the FP64 operations from conv_diff! when running with T=Float32, while overall solver performance stays similar. Maybe @vchuravy or @maleadt can give us some quick feedback on how to correctly implement functions, executed in GPU kernels, that can return either FP32 or FP64 (as selected by the user)?

@vchuravy (Contributor) commented:

Sadly, there is no magic here. T(0.5) is basically the only thing you can do.
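Following that suggestion, a type-stable flux average can derive the literal from the array's element type. The sketch below is illustrative only; the names `half` and `ϕ_demo` are hypothetical and not WaterLily's actual API:

```julia
# Illustrative sketch (hypothetical names, not WaterLily source):
# build the constant from the array's element type T, so the kernel
# stays in whatever precision the user selected.
@inline half(f::AbstractArray{T}) where {T} = T(0.5)

# 1D stand-in for the flux average ϕ; returns eltype(f).
@inline ϕ_demo(f, i) = @inbounds (f[i] + f[i-1]) * half(f)

typeof(ϕ_demo(ones(Float32, 4), 2))  # Float32
typeof(ϕ_demo(ones(Float64, 4), 2))  # Float64
```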

@b-fg (Member, Author) commented Apr 21, 2025

More benchmarks (see details) show around a 5-10% speedup on multi-threaded CPU and roughly unchanged performance on GPU. Still, on GPU I can no longer see FP64 operations in FP32 simulations, so that's an improvement in any case. Thus, we can merge this.

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.19 │           159.88 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.04 │           153.99 │     1.04 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2329993 │   0.00 │     3.15 │           120.22 │     1.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2329993 │   0.00 │     2.88 │           109.76 │     1.46 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3620161 │   0.47 │     4.06 │           154.95 │     1.03 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3620161 │   0.45 │     3.82 │           145.55 │     1.10 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2978861 │   0.00 │     0.59 │            22.32 │     7.16 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2978757 │   0.00 │     0.60 │            23.02 │     6.95 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       75316 │   0.00 │    26.10 │           124.46 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       75316 │   0.00 │    24.93 │           118.88 │     1.05 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2168103 │   0.00 │    18.59 │            88.66 │     1.40 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2168103 │   0.00 │    16.96 │            80.86 │     1.54 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3367947 │   0.00 │    22.35 │           106.59 │     1.17 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3367947 │   0.00 │    21.17 │           100.94 │     1.23 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2724730 │   0.00 │     3.02 │            14.41 │     8.64 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2724448 │   0.00 │     3.02 │            14.40 │     8.64 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       65686 │   0.00 │     4.03 │           136.49 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       64431 │   0.00 │     3.78 │           128.08 │     1.07 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     1868187 │   0.00 │     2.72 │            92.12 │     1.48 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     1870551 │   0.00 │     2.53 │            85.64 │     1.59 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     2906803 │   0.00 │     3.48 │           117.98 │     1.16 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     2910499 │   0.00 │     3.25 │           110.23 │     1.24 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2281132 │   0.00 │     0.50 │            16.80 │     8.13 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2291762 │   0.00 │     0.49 │            16.60 │     8.22 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       69559 │   0.00 │    40.93 │           173.49 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       69441 │   0.00 │    39.29 │           166.55 │     1.04 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     1973299 │   0.00 │    20.36 │            86.29 │     2.01 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     1968733 │   0.00 │    19.08 │            80.86 │     2.15 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3074899 │   0.00 │    23.90 │           101.32 │     1.71 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3067813 │   0.00 │    24.02 │           101.79 │     1.70 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2452175 │   0.00 │     3.70 │            15.68 │    11.06 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2449370 │   0.00 │     3.69 │            15.63 │    11.10 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │      161455 │   0.00 │     3.78 │           288.36 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │      161463 │   0.00 │     3.75 │           286.46 │     1.01 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     4456091 │   0.59 │     3.71 │           282.80 │     1.02 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     4456091 │   0.64 │     3.40 │           259.63 │     1.11 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     6930647 │   0.78 │     4.53 │           345.45 │     0.83 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     6930647 │   0.80 │     4.41 │           336.15 │     0.86 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5697955 │   0.00 │     1.02 │            78.14 │     3.69 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5697955 │   0.00 │     1.02 │            78.13 │     3.69 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │      208466 │   0.00 │    26.47 │           252.40 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │      208459 │   0.00 │    26.08 │           248.76 │     1.01 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5839681 │   0.14 │    18.83 │           179.54 │     1.41 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5839681 │   0.16 │    17.70 │           168.79 │     1.50 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     9095737 │   0.27 │    20.96 │           199.87 │     1.26 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     9095737 │   0.26 │    20.45 │           195.07 │     1.29 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     7680342 │   0.40 │     3.88 │            36.96 │     6.83 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     7680447 │   0.39 │     3.86 │            36.80 │     6.86 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg b-fg merged commit 1118a07 into master Apr 21, 2025
22 checks passed

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

FP64 operations on FP32 simulations in conv_diff!

3 participants