Use combined vector loads on GPUs #1147
efaulhaber wants to merge 40 commits into trixi-framework:main from
Conversation
Codecov Report

@@ Coverage Diff @@
## main #1147 +/- ##
==========================================
- Coverage 89.17% 88.68% -0.50%
==========================================
Files 128 129 +1
Lines 9925 10011 +86
==========================================
+ Hits 8851 8878 +27
- Misses 1074 1133 +59
Flags with carried forward coverage won't be shown.
/run-gpu-tests

/run-gpu-tests

/run-gpu-tests
# which can significantly improve performance on GPUs.
block_size = div(64, sizeof(ELTYPE))
else
# There is no performance benefit to aligning ranges for CPU backends.
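The idea behind this block size can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: the names `use_gpu` and `aligned_length` are invented for this sketch, and only the `div(64, sizeof(ELTYPE))` expression comes from the diff above.

```julia
# Hypothetical sketch of the alignment scheme discussed here: pad each
# system's particle count up to a multiple of the block size, so that
# every system's range starts on a 64-byte boundary (enabling combined
# vector loads on GPUs).
ELTYPE = Float32

# On GPU backends, choose the block size so that one block spans a full
# 64-byte cache line / vector-load width. On the CPU, use no padding.
use_gpu = true
block_size = use_gpu ? div(64, sizeof(ELTYPE)) : 1

# Round `n` up to the next multiple of `block_size`.
aligned_length(n, block_size) = cld(n, block_size) * block_size

println(block_size)                      # 16 values for Float32
println(aligned_length(1000, block_size))  # 1008
```

With `Float32`, one 64-byte block holds 16 values, so a system of 1000 particles would be padded to 1008 in this sketch.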
Orthogonal to the question above: Why not also do the alignment on the CPU? Unless there is a significant benefit from not doing this on the CPU, I'd recommend to always use the same alignment everywhere, every time. This makes it much easier to reason about differences between CPU and GPU code, and ensures that you do not accidentally screw this up on the GPU (but miss it, because your development happens on the CPU).
Adding padding here means we add dummy values to the time integration that have to be computed as well. If there is no benefit on the CPU, why would I increase the workload?
Given that you add at most 15 additional FP32 values to the time integrator per system, I'd say that's negligible.
If there is no benefit on the CPU, why would I increase the workload?
Maybe I was not clear above: from experience, I'd prefer a consistent memory layout across nearly all backends (CPU, GPU, whatever) over a potential (but very likely negligible) increase in the overall workload. I've spent too much time on OOB errors/Valgrind runs/Heisenbug chases, so making the code simpler and harder to use wrongly is very high on my priority list 😅
Adding at most 15 FP32 values (for 64-byte alignment), which corresponds to at most 5 additional particles in the time integrator, seems like a pretty good deal to me. But I'll leave the decision up to you.
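The worst-case padding cost quoted here can be checked with a quick calculation. This is a hypothetical worked example, assuming 64-byte alignment, `Float32` data, and 3 values per particle (consistent with "15 FP32 values ≙ 5 particles" above):

```julia
# Worked example of the worst-case padding cost mentioned above.
bytes_per_value  = sizeof(Float32)           # 4 bytes per FP32 value
values_per_block = div(64, bytes_per_value)  # 16 values fill one 64-byte block

# Padding adds at most one block minus one value per system.
max_padding_values = values_per_block - 1    # at most 15 dummy FP32 values

# Assuming 3 values per particle (e.g. a 3D coordinate), 15 dummy values
# correspond to at most 5 extra particles per system in the integrator.
values_per_particle = 3
max_extra_particles = div(max_padding_values, values_per_particle)

println(max_padding_values)   # 15
println(max_extra_particles)  # 5
```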
|
@sloede I now ran a benchmark with the TLSPH kernel, which reads two 2x2 matrices per particle-neighbor pair that can be aligned, so we expect a larger performance difference here. Unlike the padded integration array, these are individual arrays that are always aligned on the GPU, but not guaranteed to be on the CPU (although in practice they are always aligned for large enough arrays). Note that for the non-aligned version I simply added an offset of 8 bytes at the beginning.
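The aligned-vs-offset comparison described here can be sketched like this. This is a hypothetical illustration, not the PR's actual benchmark: the array size, the streamed operation (`sum`), and all names are invented; only the 8-byte offset at the beginning matches the description above.

```julia
# Hypothetical sketch: read the same data once from a freshly allocated
# array (64-byte aligned in practice for large allocations) and once
# through a view that starts 8 bytes (two Float32 values) into a larger
# buffer, so loads through the view are no longer 64-byte aligned.
n = 4_000_000
aligned = rand(Float32, n)

buffer = zeros(Float32, n + 2)
offset = view(buffer, 3:n + 2)   # skips the first 8 bytes
copyto!(offset, aligned)         # same values, shifted layout

# Warm up to compile, then time a simple streaming read over each layout.
sum(aligned); sum(offset)
t_aligned = @elapsed sum(aligned)
t_offset  = @elapsed sum(offset)
println("aligned: $(t_aligned) s, offset: $(t_offset) s")
```

A real measurement would use repeated timings (e.g. BenchmarkTools.jl) rather than a single `@elapsed`, and the effect only shows up once the working set exceeds the cache sizes, as noted below.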
Since
Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you don't exceed the L1/L2 cache sizes.
Yes, that's the question. Especially in performance-critical sections, if I can get by with just a single implementation for everything, I'd prefer that. You always need to (or should) keep in mind that you're not writing this code just for yourself, but also for the next generation of researchers, who might have much less experience in performance engineering and probably highly value simplicity in these "infrastructure" regions of the code. But as I said, it's a question of prioritizing different objectives 🤷‍♂️
I agree, this would be interesting to know.
The same 4M particles benchmark that I ran on the GPUs.
Hmm, then modern CPUs are just so optimized that it doesn't matter.
Based on #1116.