Use preferences to switch between SIMD and KernelAbstractions #133

Merged
b-fg merged 22 commits into WaterLily-jl:master from vchuravy:vc/backends on May 24, 2025

Conversation

@vchuravy
Contributor

I was experimenting with using PrecompileTools on WaterLily, and the choice to dispatch to the SIMD backend depending on the nthreads variable caused issues:

  1. In current versions of Julia, nthreads is no longer a constant.
  2. If someone precompiles code, nthreads == 1 in the precompilation process, thus exercising the wrong code path.

Opening this as a draft for now to solicit feedback. One would probably need to change the tests so that both code paths are tested.
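
For context, the dispatch pattern in master is roughly the following (my paraphrase, not the exact code):

if Threads.nthreads() == 1
    # serial @simd code path
else
    # KernelAbstractions code path
end

Since the precompilation process runs single-threaded, a precompile workload only ever exercises the serial branch, even for users who later start Julia with many threads.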

@b-fg
Member

b-fg commented Jun 21, 2024

Thanks for catching that. I was aware that nthreads == 1 during precompilation was problematic, but during execution it was working as intended. Using Preferences seems like a nice workaround. I will do some tests and integrate it.
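
Something like this minimal sketch is what I have in mind (the preference key "backend" and its default value are illustrative, not the final API):

using Preferences, WaterLily

# Sketch only: inside the package one would use @load_preference, so the
# value is baked in at precompile time and the compiled code path no
# longer depends on Threads.nthreads().
backend = load_preference(WaterLily, "backend", "KernelAbstractions")

if backend == "SIMD"
    @info "Using the serial @simd code path"
else
    @info "Using the KernelAbstractions code path"
end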

Also, not specifying the workgroup size did not yield a noticeable performance increase compared to 64 in the past (iirc). Has something changed in KA related to this? Is this now the recommended way to set up kernels?

@vchuravy
Contributor Author

Is this now the recommended way to set up kernels?

It is a bit tricky between CPU and GPU. Right now the KA backend on the CPU is rather slow since the basecase size is small. The CPU does much better with larger basecases. We don't have a way to calculate that basecase automatically, so we use 1024 on the CPU as a default.

On the GPU a static basecase is nice since it allows some of the integer index operations to be optimized away.

@b-fg
Member

b-fg commented Jun 21, 2024

I did some preliminary benchmarks with different mesh sizes N=2^(3*p) using this PR. Overall, it seems that the current PR is a bit slower than master on GPU. The only main difference is that the workgroup size is now not specified. Results are below, where the commits (which are wrongly tagged) refer to 33933fd == PR and a8a2506 == master:

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     3028166 │   1.41 │     0.58 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2672719 │   2.11 │     0.55 │     1.05 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2671525 │   1.42 │     0.79 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2339494 │   1.41 │     0.78 │     1.01 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 8
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2085611 │   0.38 │     2.98 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1816307 │   0.25 │     2.79 │     1.07 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 9
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2160883 │   0.08 │    21.20 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1798143 │   0.05 │    19.42 │     1.09 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

@vchuravy
Contributor Author

It would be interesting to use CUDA.@profile to see if the kernel slowed down or if the "auto-tuning" adds that overhead.
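
For instance, something along these lines (a sketch; make_gpu_sim is a placeholder for however the benchmark builds the GPU Simulation):

using CUDA, WaterLily

sim = make_gpu_sim()          # hypothetical setup
sim_step!(sim)                # warm up so compilation is excluded
CUDA.@profile sim_step!(sim)  # inspect per-kernel timings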

@weymouth
Member

On my laptop GPU, I found no regression with this PR. In fact, a very small speed-up:

TGV (b01cdce is this PR, 5c78c37 is this PR with 64 workgroup size, f38bea4 is master)

▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166654 │   0.38 │     2.28 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2745665 │   0.58 │     2.87 │     0.80 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2799117 │   0.66 │     2.37 │     0.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2787354 │   0.12 │     7.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2394736 │   0.19 │     7.87 │     0.99 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2442026 │   0.15 │     7.80 │     1.00 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

Jelly

▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2976119 │   0.53 │     1.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2602224 │   0.46 │     2.01 │     0.91 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2652446 │   0.47 │     1.97 │     0.93 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166982 │   0.24 │     5.45 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2747379 │   0.17 │     5.74 │     0.95 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2801011 │   0.15 │     5.75 │     0.95 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

@b-fg b-fg mentioned this pull request Jul 23, 2024
@b-fg
Member

b-fg commented Aug 1, 2024

I did some more benchmarks after a local merge of master with this PR. All looks good except for removing the workgroup size as we had it before (64). Here 9b6ca77 is this PR merged with master and no workgroup size, and backends is this PR merged with master with workgroup size 64. Something is going on in the CPU backend of KA when the workgroup size is not specified, making it slower than the serial SIMD version. This is with the latest KA version (0.9.22).

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.37 │           395.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.26 │           391.32 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.33 │           394.04 │     1.00 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.31 │           126.28 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     2187731 │   0.00 │    17.90 │           682.65 │     0.58 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.14 │           119.75 │     3.30 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.22 │           122.65 │     3.23 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3465887 │   0.00 │    16.89 │           644.44 │     0.61 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.37 │           128.56 │     3.08 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2619999 │   0.00 │     0.66 │            25.09 │    15.77 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3030802 │   0.00 │     0.65 │            24.62 │    16.07 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671213 │   0.00 │     0.63 │            24.02 │    16.47 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.01 │           276.59 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.53 │           274.34 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    73.38 │           349.91 │     0.79 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    18.52 │            88.29 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1945182 │   0.00 │    66.66 │           317.85 │     0.87 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    18.50 │            88.21 │     3.14 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    20.32 │            96.90 │     2.85 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3082782 │   0.00 │    63.37 │           302.19 │     0.92 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    19.00 │            90.61 │     3.05 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2301906 │   0.00 │     3.11 │            14.82 │    18.66 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2683290 │   0.00 │     3.24 │            15.46 │    17.89 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347006 │   0.00 │     3.06 │            14.59 │    18.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.76 │           591.96 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.74 │           590.32 │     1.00 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.69 │           586.70 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.55 │     4.46 │           340.46 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5341074 │   0.27 │    26.74 │          2040.17 │     0.29 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.54 │     4.52 │           344.74 │     1.72 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   2.39 │     4.92 │           375.75 │     1.58 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     8437098 │   0.21 │    26.96 │          2057.04 │     0.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   0.00 │     4.89 │           372.75 │     1.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416700 │   0.00 │     1.46 │           111.76 │     5.30 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253464 │   0.00 │     1.48 │           112.95 │     5.24 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6542128 │   0.00 │     1.45 │           110.63 │     5.35 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.23 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.88 │           561.49 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.05 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.38 │    20.22 │           192.87 │     2.93 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     6305699 │   0.07 │   121.32 │          1156.96 │     0.49 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.38 │    20.15 │           192.15 │     2.94 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.69 │    21.92 │           209.08 │     2.70 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     9975647 │   0.12 │   121.08 │          1154.70 │     0.49 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.88 │    21.55 │           205.50 │     2.75 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.77 │    12.63 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8713607 │   0.00 │     4.73 │            45.10 │    12.53 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7792785 │   0.00 │     4.70 │            44.78 │    12.62 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@vchuravy
Contributor Author

vchuravy commented Aug 1, 2024

So the default workgroup size for KA is 1024. With 64 you create a lot of small tasks; what is the typical ndrange you use?

@b-fg
Member

b-fg commented Aug 1, 2024

For example, the TGV case is a 3D case for which I tested domain sizes of 64^3 and 128^3. The arrays we use are then (64,64,64) and (64,64,64,3) (analogously for the 128^3 grid), which is the ndrange we typically pass into the kernel. Also, I am not sure I tested this PR before with multi-threading on the CPU backend... I think it was just on the GPU (as reported previously).

@vchuravy
Contributor Author

vchuravy commented Aug 2, 2024

Ah, so you are getting perfectly sized blocks by accident xD

You may want to use (64, 64) instead as the workgroup size.
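
For reference, the static workgroup size is passed when instantiating the kernel, along these lines (a generic KA sketch, not WaterLily's @loop):

using KernelAbstractions

@kernel function add!(a, @Const(b))
    I = @index(Global, Cartesian)
    @inbounds a[I] += b[I]
end

a = rand(Float32, 64, 64, 64); b = rand(Float32, 64, 64, 64)
backend = CPU()

add!(backend, (64, 64))(a, b, ndrange=size(a))  # static (64, 64) workgroup
add!(backend)(a, b, ndrange=size(a))            # let KA pick the size
KernelAbstractions.synchronize(backend)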

@b-fg
Member

b-fg commented Aug 4, 2024

Sure, I will do some tests after my summer break. But does this mean that we cannot use the default workgroup size (as in this PR)? Could this be something to improve in KA, where it would try to automatically determine it based on ndrange?

@vchuravy
Contributor Author

vchuravy commented Aug 5, 2024

Yeah, I will need to improve this on the KA side.

@vchuravy
Contributor Author

vchuravy commented Aug 7, 2024

I just tagged a new KA version with the fix. This might remove the need for the SIMD variant entirely.

@b-fg
Member

b-fg commented Aug 21, 2024

I have tested the changes and, while the results improve, they are still not there (again, 9b6ca77 is this PR). There might be something else going on, but I am unsure what at the moment...

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.42 │           397.53 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.24 │           390.64 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.29 │           392.71 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.34 │           127.43 │     3.12 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1993389 │   0.00 │     4.25 │           162.06 │     2.45 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.20 │           121.89 │     3.26 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.24 │           123.48 │     3.22 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2647077 │   0.00 │     4.41 │           168.25 │     2.36 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.32 │           126.53 │     3.14 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2621768 │   0.00 │     0.65 │            24.76 │    16.05 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3026963 │   0.00 │     0.63 │            23.91 │    16.62 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671140 │   0.00 │     0.68 │            25.79 │    15.42 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.85 │           280.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.75 │           275.38 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.62 │           274.74 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    20.92 │            99.76 │     2.81 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1819590 │   0.00 │    24.61 │           117.34 │     2.39 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    21.37 │           101.89 │     2.75 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    19.24 │            91.74 │     3.06 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2737306 │   0.00 │    25.67 │           122.38 │     2.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    22.54 │           107.47 │     2.61 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2303706 │   0.00 │     3.09 │            14.71 │    19.07 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2680490 │   0.00 │     3.25 │            15.48 │    18.12 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347008 │   0.00 │     3.16 │            15.05 │    18.65 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.89 │           601.60 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.41 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.59 │     4.54 │           346.53 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     3954012 │   0.00 │     4.35 │           331.67 │     1.81 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.49 │     4.73 │           361.22 │     1.67 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   0.00 │     4.85 │           369.75 │     1.63 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     5237268 │   0.00 │     5.04 │           384.33 │     1.57 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   2.37 │     5.15 │           392.91 │     1.53 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416699 │   0.00 │     1.45 │           110.72 │     5.43 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253470 │   0.00 │     1.47 │           112.14 │     5.36 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6538380 │   0.00 │     1.48 │           112.78 │     5.33 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.22 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.71 │           559.87 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.37 │    20.44 │           194.94 │     2.90 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5129556 │   0.00 │    29.92 │           285.32 │     1.98 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.39 │    21.95 │           209.37 │     2.70 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.81 │    21.45 │           204.58 │     2.76 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     7343892 │   0.00 │    30.12 │           287.21 │     1.97 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.61 │    22.87 │           218.09 │     2.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.69 │    12.65 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8717354 │   0.00 │     4.77 │            45.46 │    12.43 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7787222 │   0.00 │     4.70 │            44.86 │    12.60 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@marinlauber
Member

marinlauber commented Sep 13, 2024

@b-fg Something I picked up today: currently, the main tests will run on CuArray only if you have nvcc installed. If you use the Julia CUDA compiler, it doesn't install nvcc (at least not on my system). Same goes for AMD GPUs, I suppose.

It's kind of related to this, I suppose; that's why I added it here.

@b-fg
Member

b-fg commented Sep 13, 2024

Ah, but this is not a problem of this PR, but of WaterLily-Benchmarks, right? If you open an issue there, we can iterate on it.

You mean these test lines, right?

_cuda = check_compiler("nvcc","release")

This is not related to this PR though. The problem is how to automatically detect that CUDA is available without loading CUDA.jl first, and to come up with something that works on all OSes.
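
One cheap heuristic, as a sketch (an assumption on my side, not what the benchmark scripts currently do), is to look for the toolkit compiler on the PATH without loading CUDA.jl:

# Note: this misses CUDA.jl's artifact-provided toolkit, which is exactly
# the case reported above.
has_nvcc() = !isnothing(Sys.which("nvcc"))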

b-fg added 2 commits May 22, 2025 00:26
When the SIMD backend is selected, the loop macro generates only the for loop, without a function wrapper.
Also, dispatch based on the number of threads has been removed, and now only the backend-specific kernel is compiled.
The CI needs to be fixed for the allocations tests, which first need to set the SIMD backend and then re-run the tests.
@b-fg
Member

b-fg commented May 21, 2025

As a result of our conversation in #198, I thought it was about time to put this to use... So I have cleaned up the Preferences.jl routines a bit with the new API, and now @loop only compiles the kernel specific to the selected backend (dynamic dispatch based on the number of threads has been removed).

The only thing left to figure out is the allocation tests: I currently do not know how to update the CI so that the -t 1 tests are launched "twice", once to set the backend and once to compile and run with it. Maybe we have to use a fabricated test command for this, instead of julia-actions/julia-runtest@v1.
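
For the record, setting the backend from a script could look like this (a sketch; the "backend" preference key is the same assumption as above):

using Preferences, WaterLily

# Persist the choice into LocalPreferences.toml; a restart of Julia is
# then needed so that WaterLily recompiles with the SIMD code path.
set_preferences!(WaterLily, "backend" => "SIMD"; force=true)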

@b-fg b-fg marked this pull request as ready for review May 21, 2025 23:19
@codecov

codecov bot commented May 21, 2025

Codecov Report

Attention: Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.

Files with missing lines   Patch %   Lines
src/WaterLily.jl           0.00%     3 Missing ⚠️
src/util.jl                85.00%    3 Missing ⚠️

Files with missing lines   Coverage Δ
src/WaterLily.jl           65.90% <0.00%> (-24.10%) ⬇️
src/util.jl                80.34% <85.00%> (-1.24%) ⬇️

... and 2 files with indirect coverage changes


@b-fg
Member

b-fg commented May 21, 2025

To do:

  • Automate CI with single thread and Preferences
  • Check performance again: unspecified workgroup size vs 64 (master) vs (64,64)
  • Get rid of the warning when compiling for the KA backend
  • Implement function specialization for @loop kernels

@b-fg
Member

b-fg commented May 22, 2025

I have been experimenting again with the workgroup size, and the best results are almost always with a constant 64. Also, something that I do not understand is that the single-thread benchmarks are ~40% slower when the @simd for ... loop is not wrapped within a function. That is, the following implementation

@simd for $I ∈ $R
    @fastmath @inbounds $ex
end

is 40% slower than this one:

function $kern($(rep.(sym)...))
    @simd for $I ∈ $R
        @fastmath @inbounds $ex
    end
end
$kern($(sym...))

I do not understand why. In any case, I have reverted to the wrapped version, which is similar to what we have in master but without the dynamic dispatch based on the number of threads. And now I think we should really try the specialization for each argument, as we discussed, @weymouth.

With the current PR state, these are the benchmarks:

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        1807 │   0.00 │     4.18 │           159.45 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.20 │           160.40 │     0.99 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     2273557 │   0.00 │     3.44 │           131.21 │     1.22 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     2329993 │   0.00 │     2.98 │           113.74 │     1.40 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     2933922 │   0.00 │     0.59 │            22.55 │     7.07 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     2978161 │   0.00 │     0.58 │            22.12 │     7.21 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        1807 │   0.00 │    25.42 │           121.19 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │       75316 │   0.00 │    25.94 │           123.70 │     0.98 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     2116285 │   0.00 │    18.10 │            86.31 │     1.40 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     2168103 │   0.00 │    17.17 │            81.88 │     1.48 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     2680201 │   0.00 │     3.01 │            14.37 │     8.43 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     2723973 │   0.00 │     3.01 │            14.35 │     8.45 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        7107 │   0.00 │     3.77 │           287.55 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │      161463 │   0.00 │     3.78 │           288.53 │     1.00 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     4357685 │   0.55 │     3.50 │           267.29 │     1.08 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     4456091 │   0.57 │     3.46 │           264.08 │     1.09 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     5627919 │   1.23 │     1.04 │            79.02 │     3.64 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     5705257 │   1.31 │     1.09 │            82.92 │     3.47 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        8307 │   0.00 │    26.63 │           253.92 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │      208459 │   0.00 │    26.98 │           257.32 │     0.99 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     5712303 │   0.16 │    17.73 │           169.11 │     1.50 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     5839681 │   0.16 │    18.12 │           172.81 │     1.47 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     7596622 │   0.38 │     3.82 │            36.46 │     6.96 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     7680551 │   0.40 │     3.81 │            36.31 │     6.99 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg
Member

b-fg commented May 22, 2025

New CI working nicely with LocalPreferences!

@b-fg
Member

b-fg commented May 22, 2025

@vchuravy any ideas on how to bypass the warning about using KA with a single thread (instead of SIMD) during precompilation? Otherwise, since we use KA as the default backend and precompilation typically happens with a single thread, the warning always pops up, which is not ideal. I am not sure how this can be addressed.

@b-fg
Member

b-fg commented May 22, 2025

The @loop macro now implements automatic specialization for the function wrapping the kernels, so that all arguments get their own type parameters, such as

function kern(a::A,b::B,c::C,...) where {A,B,C,...}
    ...
end

Below is an example:

Details
@macroexpand WaterLily.@loop a[I] += b[I] over I in CartesianIndices(a)
quote
    #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:155 =#
    function var"##kern#242"(a::DBGS, b::FHGH) where {DBGS, FHGH}
        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:155 =#
        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:156 =#
        begin
            #= simdloop.jl:69 =#
            let var"##r#244" = CartesianIndices(a)
                #= simdloop.jl:70 =#
                for var"##i#245" = Base.simd_outer_range(var"##r#244")
                    #= simdloop.jl:71 =#
                    let var"##n#246" = Base.simd_inner_length(var"##r#244", var"##i#245")
                        #= simdloop.jl:72 =#
                        if zero(var"##n#246") < var"##n#246"
                            #= simdloop.jl:74 =#
                            let var"##i#247" = zero(var"##n#246")
                                #= simdloop.jl:75 =#
                                while var"##i#247" < var"##n#246"
                                    #= simdloop.jl:76 =#
                                    local I = Base.simd_index(var"##r#244", var"##i#245", var"##i#247")
                                    #= simdloop.jl:77 =#
                                    begin
                                        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:157 =#
                                        begin
                                            $(Expr(:inbounds, true))
                                            local var"#4#val" = (a[I] += b[I])
                                            $(Expr(:inbounds, :pop))
                                            var"#4#val"
                                        end
                                        #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:158 =#
                                    end
                                    #= simdloop.jl:78 =#
                                    var"##i#247" += 1
                                    #= simdloop.jl:79 =#
                                    $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))
                                    #= simdloop.jl:80 =#
                                end
                            end
                        end
                    end
                    #= simdloop.jl:84 =#
                end
            end
            #= simdloop.jl:86 =#
            nothing
        end
    end
    #= /home/b-fg/Workspace/tudelft1/WaterLily.jl/src/util.jl:160 =#
    var"##kern#242"(a, b)
end

@b-fg
Member

b-fg commented May 22, 2025

Consistent 1-2% speedup on all backends, and we will (hopefully) be able to specialize kernels that take a function argument. So this ticks all the boxes :)

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        1807 │   0.00 │     4.11 │           156.91 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │       80521 │   0.00 │     4.20 │           160.40 │     0.98 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     2273557 │   0.00 │     2.94 │           112.31 │     1.40 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     2329993 │   0.00 │     2.98 │           113.74 │     1.38 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     2933921 │   0.00 │     0.58 │            22.19 │     7.07 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     2978161 │   0.00 │     0.58 │            22.12 │     7.09 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        1807 │   0.00 │    25.42 │           121.23 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │       75316 │   0.00 │    25.94 │           123.70 │     0.98 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     2116285 │   0.00 │    17.01 │            81.13 │     1.49 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     2168103 │   0.00 │    17.17 │            81.88 │     1.48 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     2680200 │   0.00 │     2.98 │            14.19 │     8.54 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     2723973 │   0.00 │     3.01 │            14.35 │     8.45 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        7107 │   0.00 │     3.75 │           285.84 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │      161463 │   0.00 │     3.78 │           288.53 │     0.99 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     4357685 │   0.60 │     3.42 │           261.09 │     1.09 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     4456091 │   0.57 │     3.46 │           264.08 │     1.08 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     5627921 │   0.00 │     1.02 │            77.65 │     3.68 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     5705257 │   1.31 │     1.09 │            82.92 │     3.45 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │  backends │ 1.11.3 │   Float32 │        8307 │   0.00 │    26.49 │           252.60 │     1.00 │
│     CPUx01 │    master │ 1.11.3 │   Float32 │      208459 │   0.00 │    26.98 │           257.32 │     0.98 │
│     CPUx04 │  backends │ 1.11.3 │   Float32 │     5712303 │   0.15 │    17.84 │           170.11 │     1.48 │
│     CPUx04 │    master │ 1.11.3 │   Float32 │     5839681 │   0.16 │    18.12 │           172.81 │     1.46 │
│ GPU-NVIDIA │  backends │ 1.11.3 │   Float32 │     7604281 │   0.52 │     3.80 │            36.19 │     6.98 │
│ GPU-NVIDIA │    master │ 1.11.3 │   Float32 │     7680551 │   0.40 │     3.81 │            36.31 │     6.96 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg
Member

b-fg commented May 22, 2025

Once we address the warning issue, this PR is good to go!

@marinlauber
Member

Consistent 1-2% speedup on all backends, and we will (hopefully) be able to specialize kernels that take a function argument. So this ticks all the boxes :)

Nice! It's interesting that allocations are down for a single thread but not for multiple threads.

@b-fg
Member

b-fg commented May 23, 2025

Yes! Removing the dynamic dispatch based on the number of threads, and instead just compiling the SIMD kernel based on LocalPreferences, brought allocations down significantly for a single thread. The general small speedup then resulted from specializing the wrapper function.

@b-fg b-fg merged commit 3b304d0 into WaterLily-jl:master May 24, 2025
7 of 8 checks passed