
Fix FP64 operations on conv_diff #199

Merged
b-fg merged 6 commits into master from fix_conv_diff on Apr 21, 2025
Conversation

@b-fg (Member) commented Mar 11, 2025

In FP32 (T=Float32) GPU simulations, FP64 operations were detected in the conv_diff! routine with Nsight Compute (this should also happen on CPU). This fix closes #197. I have tracked it down to the flux function

@inline ϕ(a,I,f) = @inbounds (f[I]+f[I-δ(a,I)])*0.5

where the 0.5 literal always promotes the operation to FP64. The fix is to use /2 instead, which preserves the floating-point type of the operands.
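The promotion is easy to reproduce in plain Julia (a minimal sketch, not WaterLily code): multiplying a Float32 by the Float64 literal 0.5 yields a Float64, while dividing by the integer 2 keeps the element type.

```julia
# Float64 literals promote Float32 arithmetic to Float64;
# dividing by the integer 2 (or using a typed literal 0.5f0) does not.
f = Float32(1.0)

typeof(f * 0.5)    # Float64: the 0.5 literal forces promotion
typeof(f / 2)      # Float32: Int promotes to the float's type
typeof(f * 0.5f0)  # Float32: a Float32 literal also works
```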

I have also done some additional type cleaning and separated the @loop calls in conv_diff! into their own kernels. Benchmarks need to be run to see whether the fix impacts performance.

@b-fg b-fg added the bug Something isn't working label Mar 11, 2025
@b-fg (Member, Author) commented Mar 11, 2025

Benchmarks do not show any speedup currently. I need to try with the original merged @loop version.

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg (Member, Author) commented Mar 11, 2025

It seems that merging the loops actually helps performance a bit, see below (note that 32c3661 is the version without loop merging). Compared to master, the /2 fix yields similar results on GPU and is a bit faster on CPU (shouldn't it be the opposite?).

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2153203 │   0.00 │    17.03 │            81.21 │     1.14 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2573880 │   0.00 │     3.06 │            14.59 │     6.34 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5276539 │   0.00 │    36.93 │           104.36 │     1.05 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5703067 │   0.00 │     7.53 │            21.28 │     5.15 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth (Member) commented:
I don't know enough about GPUs to tell you what to expect. Might need to call in an expert...

@b-fg (Member, Author) commented Mar 12, 2025

I have verified with Nsight Compute that the /2 fix removes the FP64 operations from conv_diff! when running with T=Float32, while overall solver performance stays similar. Maybe @vchuravy or @maleadt can give us some quick feedback on how to correctly implement functions, executed in GPU kernels, that can return either FP32 or FP64 (as selected by the user)?

@vchuravy (Contributor) commented:

Sadly, there is no magic here. T(0.5) is basically the only thing you can do.
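Following that suggestion, a type-stable flux average can derive the literal from the array's element type. The sketch below is illustrative only; the names `half` and `ϕ_demo` are hypothetical and not WaterLily's actual API:

```julia
# Illustrative sketch (hypothetical names, not WaterLily source):
# build the constant from the array's element type T, so the kernel
# stays in whatever precision the user selected.
@inline half(f::AbstractArray{T}) where {T} = T(0.5)

# 1D stand-in for the flux average ϕ; returns eltype(f).
@inline ϕ_demo(f, i) = @inbounds (f[i] + f[i-1]) * half(f)

typeof(ϕ_demo(ones(Float32, 4), 2))  # Float32
typeof(ϕ_demo(ones(Float64, 4), 2))  # Float64
```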

@b-fg (Member, Author) commented Apr 21, 2025

More benchmarks (see details) show around a 5-10% speedup on multi-threaded CPU and roughly unchanged performance on GPU. Still, on GPU I can no longer see FP64 operations in FP32 simulations, so that's an improvement in any case. Thus, we can merge this.

Details
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.19 │           159.88 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.04 │           153.99 │     1.04 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2329993 │   0.00 │     3.15 │           120.22 │     1.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2329993 │   0.00 │     2.88 │           109.76 │     1.46 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3620161 │   0.47 │     4.06 │           154.95 │     1.03 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3620161 │   0.45 │     3.82 │           145.55 │     1.10 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2978861 │   0.00 │     0.59 │            22.32 │     7.16 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2978757 │   0.00 │     0.60 │            23.02 │     6.95 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       75316 │   0.00 │    26.10 │           124.46 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       75316 │   0.00 │    24.93 │           118.88 │     1.05 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2168103 │   0.00 │    18.59 │            88.66 │     1.40 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2168103 │   0.00 │    16.96 │            80.86 │     1.54 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3367947 │   0.00 │    22.35 │           106.59 │     1.17 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3367947 │   0.00 │    21.17 │           100.94 │     1.23 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2724730 │   0.00 │     3.02 │            14.41 │     8.64 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2724448 │   0.00 │     3.02 │            14.40 │     8.64 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: sphere sim_step! (max_steps=100)
▶ log2p = 3
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       65686 │   0.00 │     4.03 │           136.49 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       64431 │   0.00 │     3.78 │           128.08 │     1.07 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     1868187 │   0.00 │     2.72 │            92.12 │     1.48 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     1870551 │   0.00 │     2.53 │            85.64 │     1.59 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     2906803 │   0.00 │     3.48 │           117.98 │     1.16 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     2910499 │   0.00 │     3.25 │           110.23 │     1.24 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2281132 │   0.00 │     0.50 │            16.80 │     8.13 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2291762 │   0.00 │     0.49 │            16.60 │     8.22 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 4
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │       69559 │   0.00 │    40.93 │           173.49 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │       69441 │   0.00 │    39.29 │           166.55 │     1.04 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     1973299 │   0.00 │    20.36 │            86.29 │     2.01 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     1968733 │   0.00 │    19.08 │            80.86 │     2.15 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     3074899 │   0.00 │    23.90 │           101.32 │     1.71 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     3067813 │   0.00 │    24.02 │           101.79 │     1.70 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2452175 │   0.00 │     3.70 │            15.68 │    11.06 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2449370 │   0.00 │     3.69 │            15.63 │    11.10 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │      161455 │   0.00 │     3.78 │           288.36 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │      161463 │   0.00 │     3.75 │           286.46 │     1.01 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     4456091 │   0.59 │     3.71 │           282.80 │     1.02 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     4456091 │   0.64 │     3.40 │           259.63 │     1.11 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     6930647 │   0.78 │     4.53 │           345.45 │     0.83 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     6930647 │   0.80 │     4.41 │           336.15 │     0.86 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5697955 │   0.00 │     1.02 │            78.14 │     3.69 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5697955 │   0.00 │     1.02 │            78.13 │     3.69 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │        master │ 1.11.3 │   Float32 │      208466 │   0.00 │    26.47 │           252.40 │     1.00 │
│     CPUx01 │ fix_conv_diff │ 1.11.3 │   Float32 │      208459 │   0.00 │    26.08 │           248.76 │     1.01 │
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5839681 │   0.14 │    18.83 │           179.54 │     1.41 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5839681 │   0.16 │    17.70 │           168.79 │     1.50 │
│     CPUx08 │        master │ 1.11.3 │   Float32 │     9095737 │   0.27 │    20.96 │           199.87 │     1.26 │
│     CPUx08 │ fix_conv_diff │ 1.11.3 │   Float32 │     9095737 │   0.26 │    20.45 │           195.07 │     1.29 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     7680342 │   0.40 │     3.88 │            36.96 │     6.83 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     7680447 │   0.39 │     3.86 │            36.80 │     6.86 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg b-fg merged commit 1118a07 into master Apr 21, 2025
22 checks passed

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

FP64 operations on FP32 simulations in conv_diff!

3 participants