Conversation
Benchmarks do not show any speedup currently. I need to try with the original (merged) version.
It seems that the loop merging actually helps a bit with performance, see the benchmarks below.
I don't know enough about GPUs to tell you what to expect. Might need to call in an expert... |
I have verified with Nsight Compute that, with the fix, no FP64 operations appear in FP32 simulations.
Sadly, there is no magic here. |
More benchmarks (see details) show around a 5-10% speedup on multi-threaded CPU and the same on GPU. Moreover, on GPU I can no longer see FP64 operations in FP32 simulations, so that is an improvement anyway. Thus, we can merge this.
In FP32 (`T=Float32`) GPU simulations, FP64 operations were detected in the `conv_diff!` routine with Nsight Compute (this should also happen on CPU). This fix closes #197. I have tracked the issue down to the flux function (`WaterLily.jl/src/Flow.jl`, line 3 at `0c05f4d`), where the `0.5` literal always promotes the operation to FP64. The fix is to use `/2` instead, which preserves the floating-point type. I have also done some additional type cleaning and separated the `@loop`s in `conv_diff!` into their own kernels. Benchmarks need to be conducted to see whether the fix impacts performance.
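The promotion behavior can be reproduced in a few lines of plain Julia (a minimal sketch, not WaterLily's actual flux function): multiplying a `Float32` value by the `Float64` literal `0.5` promotes the result to `Float64`, while dividing by the integer literal `2` keeps it in `Float32`.

```julia
u = 1.0f0           # Float32 input, as in a T=Float32 simulation

promoted  = 0.5 * u # 0.5 is a Float64 literal, so Julia promotes the result to Float64
preserved = u / 2   # the Int literal 2 is converted to Float32, so the result stays Float32

println(typeof(promoted))   # Float64
println(typeof(preserved))  # Float32
```

An alternative that also avoids promotion is the typed literal `0.5f0`, but that hard-codes `Float32`; `/2` (or `T(0.5)`) works for any element type `T`.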