
Use combined vector loads on GPUs#1147

Draft
efaulhaber wants to merge 40 commits into trixi-framework:main from efaulhaber:vload

Conversation

@efaulhaber
Member

@efaulhaber efaulhaber commented Apr 17, 2026

Based on #1116.

@efaulhaber efaulhaber self-assigned this Apr 17, 2026
@efaulhaber efaulhaber added the breaking changes This change will break the public API and requires a new major release label Apr 17, 2026
@efaulhaber efaulhaber mentioned this pull request Apr 17, 2026
6 tasks
@codecov

codecov bot commented Apr 19, 2026

Codecov Report

❌ Patch coverage is 60.56338% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.68%. Comparing base (a3f1139) to head (ba48cc1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/general/gpu.jl | 0.00% | 25 Missing ⚠️ |
| src/schemes/fluid/weakly_compressible_sph/rhs.jl | 72.91% | 13 Missing ⚠️ |
| src/schemes/structure/total_lagrangian_sph/rhs.jl | 50.00% | 11 Missing ⚠️ |
| src/general/abstract_system.jl | 85.71% | 2 Missing ⚠️ |
| src/general/neighborhood_search.jl | 0.00% | 2 Missing ⚠️ |
| test/examples/gpu.jl | 0.00% | 2 Missing ⚠️ |
| src/general/semidiscretization.jl | 95.45% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   89.17%   88.68%   -0.50%     
==========================================
  Files         128      129       +1     
  Lines        9925    10011      +86     
==========================================
+ Hits         8851     8878      +27     
- Misses       1074     1133      +59     
| Flag | Coverage Δ |
|---|---|
| total | 88.68% <60.56%> (-0.50%) ⬇️ |
| unit | 67.36% <56.33%> (-0.26%) ⬇️ |


@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber
Member Author

/run-gpu-tests

Comment thread src/general/semidiscretization.jl Outdated
        # which can significantly improve performance on GPUs.
        block_size = div(64, sizeof(ELTYPE))
    else
        # There is no performance benefit to aligning ranges for CPU backends.
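To illustrate the padding arithmetic in the hunk above: a minimal Python sketch (the PR itself is Julia; `aligned_length` and the particle counts are hypothetical names and numbers for illustration, only `div(64, sizeof(ELTYPE))` comes from the diff):

```python
def aligned_length(n_particles: int, eltype_bytes: int) -> int:
    """Round a particle count up so each system's range fills whole
    64-byte blocks, mirroring `block_size = div(64, sizeof(ELTYPE))`."""
    block_size = 64 // eltype_bytes  # values per 64-byte block
    # Round up to the next multiple of block_size
    return ((n_particles + block_size - 1) // block_size) * block_size

# Float32 (4 bytes): block_size = 16, so 1000 particles pad to 1008
print(aligned_length(1000, 4))  # → 1008
# Float64 (8 bytes): block_size = 8, and 1000 is already a multiple
print(aligned_length(1000, 8))  # → 1000
```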
Member


Orthogonal to the question above: Why not also do the alignment on the CPU? Unless there is a significant benefit from not doing this on the CPU, I'd recommend to always use the same alignment everywhere, every time. This makes it much easier to reason about differences between CPU and GPU code, and ensures that you do not accidentally screw this up on the GPU (but miss it, because your development happens on the CPU).

Member Author


Adding padding here means we add dummy values to the time integration that have to be computed as well. If there is no benefit on the CPU, why would I increase the workload?

Member


Given that you add at most 15 additional FP32 values to the time integrator per system, I'd say that's negligible.

> If there is no benefit on the CPU, why would I increase the workload?

Maybe I was not clear above: From experience, I'd prefer a consistent memory layout among nearly all backends (CPU, GPU, whatever) over a potential (but very likely negligible) increase in the overall workload. I've spent too much time on OOB errors/Valgrind runs/Heisenbug chases, so making the code simpler and harder to use wrongly is very high on my priority list 😅

Adding at most 15 FP32 values (for 64-byte alignment), which corresponds to at most 5 additional particles for the time integrator, I'd say that's a pretty good deal. But I'll leave the decision up to you.
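For reference, the bound mentioned here works out as follows (a quick sketch; the 3-values-per-particle layout is an assumption for illustration, not taken from the PR):

```python
ALIGNMENT_BYTES = 64
FLOAT32_BYTES = 4

# A 64-byte block holds 64 // 4 = 16 Float32 values, so padding a
# system's range adds at most block_size - 1 = 15 dummy values.
block_size = ALIGNMENT_BYTES // FLOAT32_BYTES
max_padding_values = block_size - 1

# Assuming (hypothetically) 3 integrated Float32 values per particle,
# e.g. a 2D velocity plus density, 15 values are at most 5 dummy particles.
values_per_particle = 3
max_padding_particles = max_padding_values // values_per_particle

print(max_padding_values, max_padding_particles)  # → 15 5
```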

Member Author


@LasNikas @svchb what are your thoughts?

@efaulhaber
Member Author

efaulhaber commented Apr 20, 2026

@sloede I now ran a benchmark with the TLSPH kernel, which reads two 2x2 matrices per particle-neighbor pair that can be aligned, so we expect a larger difference in performance here. As opposed to the integration array with the padding, these are individual arrays that are always aligned on the GPU, but not guaranteed on the CPU (although in practice always aligned for large enough arrays). Note that for the non-aligned version I just added an offset of 8 bytes at the beginning.

| Method | aligned memory (ARM) | non-aligned memory (ARM) | aligned memory (RAMSES) | non-aligned memory (RAMSES) |
|---|---|---|---|---|
| linear indexing | 16.467 ms | 16.480 ms | 15.325 ms | 15.244 ms |
| Cartesian indexing | 15.631 ms | 15.623 ms | 15.543 ms | 15.528 ms |
| vload | 14.845 ms | 14.871 ms | 14.274 ms | 14.263 ms |
| vloada | 14.837 ms | n/a | 14.272 ms | n/a |
  • "linear indexing" is
    SMatrix{N, N}(ntuple(@inline(j->@inbounds A[(i - 1) * N^2 + j]), Val(N^2)))
  • "Cartesian indexing" is
    SMatrix{N, N}(ntuple(@inline(j->@inbounds A[mod(j - 1, N) + 1, div(j - 1, N) + 1, i]), Val(N^2)))
    This is slower on the GPU (it adds two integer add instructions) and in micro benchmarks on the ARM CPU, but for some reason it is faster than linear indexing in the full interactions benchmark on the ARM CPU. I don't understand why, and the Julia Slack wasn't helpful yet. On RAMSES, it is slower, as expected.
  • "vload" uses SIMD.vload for a vector load. This is noticeably faster here on the CPU, but on the GPU it produces the same instructions as linear indexing.
  • "vloada" uses SIMD.vloada for aligned vector loads (and hence requires alignment). On the GPU, this is the only version that produces the faster combined load instructions. It has the same performance as vload on the CPU.
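As a sanity check on the two indexing variants above: both address the same flat positions of the N × N × n_particles array. Here is the index arithmetic transcribed to Python (1-based, column-major, mirroring the Julia expressions; the helper names are mine, not from the PR):

```python
def linear_index(i, j, n):
    """Flat 1-based position matching A[(i - 1) * N^2 + j]."""
    return (i - 1) * n * n + j

def cartesian_index(i, j, n):
    """(row, col, slice) matching A[mod(j-1, N) + 1, div(j-1, N) + 1, i]."""
    return ((j - 1) % n + 1, (j - 1) // n + 1, i)

def flatten(row, col, slice_, n):
    """Column-major, 1-based flattening: the row index varies fastest."""
    return row + (col - 1) * n + (slice_ - 1) * n * n

# The two variants touch identical memory locations for every (i, j):
n = 2
for i in range(1, 5):
    for j in range(1, n * n + 1):
        assert flatten(*cartesian_index(i, j, n), n) == linear_index(i, j, n)
```

The Cartesian variant simply recomputes the row/column pair that the linear variant leaves implicit, which is where the two extra integer instructions on the GPU come from.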

Since vload is just as fast as vloada and alignment doesn't make a difference in performance, I conclude that we don't need to worry about alignment on the CPU and should use vload instead. The only question is if we want to use the same padding on CPUs for consistency reasons only.

@svchb
Collaborator

svchb commented Apr 20, 2026

> [quotes @efaulhaber's benchmark comment above in full]

Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

@sloede
Member

sloede commented Apr 20, 2026

> Since vload is just as fast as vloada and alignment doesn't make a difference in performance, I conclude that we don't need to worry about alignment on the CPU and should use vload instead. The only question is if we want to use the same padding on CPUs for consistency reasons only.

Yes, that's the question. Especially in performance-critical sections, if I can get by with just a single implementation for everything, I'd prefer that. You always need to (or should) keep in mind that you're not writing this code just for yourself, but also the next generation of researchers who might have much less experience in performance engineering and probably highly value simplicity in these "infrastructure" regions of the code. But as I said, it's a question of prioritizing different objectives 🤷‍♂️

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

I agree, this would be interesting to know.

@efaulhaber
Member Author

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

The same 4M particles benchmark that I run on the GPUs.

@efaulhaber efaulhaber mentioned this pull request Apr 20, 2026
8 tasks
@svchb
Collaborator

svchb commented Apr 20, 2026

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

> The same 4M particles benchmark that I run on the GPUs.

Hmm, then modern CPUs are just so well optimized that it doesn't matter.


Labels

breaking changes This change will break the public API and requires a new major release


3 participants