
Use combined vector loads on GPUs#1147

Draft
efaulhaber wants to merge 40 commits into trixi-framework:main from efaulhaber:vload

Conversation

@efaulhaber
Member

@efaulhaber efaulhaber commented Apr 17, 2026

Based on #1116.

@efaulhaber efaulhaber self-assigned this Apr 17, 2026
@efaulhaber efaulhaber added the breaking changes This change will break the public API and requires a new major release label Apr 17, 2026
@efaulhaber efaulhaber mentioned this pull request Apr 17, 2026
6 tasks
@codecov

codecov bot commented Apr 19, 2026

Codecov Report

❌ Patch coverage is 60.56338% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.68%. Comparing base (a3f1139) to head (ba48cc1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/general/gpu.jl | 0.00% | 25 Missing ⚠️ |
| src/schemes/fluid/weakly_compressible_sph/rhs.jl | 72.91% | 13 Missing ⚠️ |
| src/schemes/structure/total_lagrangian_sph/rhs.jl | 50.00% | 11 Missing ⚠️ |
| src/general/abstract_system.jl | 85.71% | 2 Missing ⚠️ |
| src/general/neighborhood_search.jl | 0.00% | 2 Missing ⚠️ |
| test/examples/gpu.jl | 0.00% | 2 Missing ⚠️ |
| src/general/semidiscretization.jl | 95.45% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   89.17%   88.68%   -0.50%     
==========================================
  Files         128      129       +1     
  Lines        9925    10011      +86     
==========================================
+ Hits         8851     8878      +27     
- Misses       1074     1133      +59     
| Flag | Coverage Δ |
|---|---|
| total | 88.68% <60.56%> (-0.50%) ⬇️ |
| unit | 67.36% <56.33%> (-0.26%) ⬇️ |


@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber
Member Author

/run-gpu-tests

Comment thread src/general/semidiscretization.jl Outdated
        # which can significantly improve performance on GPUs.
        block_size = div(64, sizeof(ELTYPE))
    else
        # There is no performance benefit to aligning ranges for CPU backends.
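To illustrate the padding arithmetic in the hunk above: a minimal Python sketch (the PR itself is Julia; `aligned_length` and the particle counts are hypothetical names and numbers for illustration, only `div(64, sizeof(ELTYPE))` comes from the diff):

```python
def aligned_length(n_particles: int, eltype_bytes: int) -> int:
    """Round a particle count up so each system's range fills whole
    64-byte blocks, mirroring `block_size = div(64, sizeof(ELTYPE))`."""
    block_size = 64 // eltype_bytes  # values per 64-byte block
    # Round up to the next multiple of block_size
    return ((n_particles + block_size - 1) // block_size) * block_size

# Float32 (4 bytes): block_size = 16, so 1000 particles pad to 1008
print(aligned_length(1000, 4))  # → 1008
# Float64 (8 bytes): block_size = 8, and 1000 is already a multiple
print(aligned_length(1000, 8))  # → 1000
```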
Member


Orthogonal to the question above: Why not also do the alignment on the CPU? Unless there is a significant benefit from not doing this on the CPU, I'd recommend to always use the same alignment everywhere, every time. This makes it much easier to reason about differences between CPU and GPU code, and ensures that you do not accidentally screw this up on the GPU (but miss it, because your development happens on the CPU).

Member Author


Adding padding here means we add dummy values to the time integration that have to be computed as well. If there is no benefit on the CPU, why would I increase the workload?

Member


Given that you add at most 15 additional FP32 values to the time integrator per system, I'd say that's negligible.

> If there is no benefit on the CPU, why would I increase the workload?

Maybe I was not clear above: From experience, I'd prefer a consistent memory layout among nearly all backends (CPU, GPU, whatever) over a potential (but very likely negligible) increase in the overall workload. I've spent too much time on OOB errors/Valgrind runs/Heisenbug chases, so making the code simpler and harder to use wrongly is very high on my priority list 😅

Adding at most 15 FP32 values (for 64-byte alignment), which corresponds to at most 5 additional particles for the time integrator, I'd say that's a pretty good deal. But I'll leave the decision up to you.
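For reference, the bound mentioned here works out as follows (a quick sketch; the 3-values-per-particle layout is an assumption for illustration, not taken from the PR):

```python
ALIGNMENT_BYTES = 64
FLOAT32_BYTES = 4

# A 64-byte block holds 64 // 4 = 16 Float32 values, so padding a
# system's range adds at most block_size - 1 = 15 dummy values.
block_size = ALIGNMENT_BYTES // FLOAT32_BYTES
max_padding_values = block_size - 1

# Assuming (hypothetically) 3 integrated Float32 values per particle,
# e.g. a 2D velocity plus density, 15 values are at most 5 dummy particles.
values_per_particle = 3
max_padding_particles = max_padding_values // values_per_particle

print(max_padding_values, max_padding_particles)  # → 15 5
```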

Member Author


@LasNikas @svchb what are your thoughts?

@efaulhaber
Member Author

efaulhaber commented Apr 20, 2026

@sloede I now ran a benchmark with the TLSPH kernel, which reads two 2x2 matrices per particle-neighbor pair that can be aligned, so we expect a larger difference in performance here. As opposed to the integration array with the padding, these are individual arrays that are always aligned on the GPU, but not guaranteed on the CPU (although in practice always aligned for large enough arrays). Note that for the non-aligned version I just added an offset of 8 bytes at the beginning.

| Method | aligned memory (ARM) | non-aligned memory (ARM) | aligned memory (RAMSES) | non-aligned memory (RAMSES) |
|---|---|---|---|---|
| linear indexing | 16.467 ms | 16.480 ms | 15.325 ms | 15.244 ms |
| Cartesian indexing | 15.631 ms | 15.623 ms | 15.543 ms | 15.528 ms |
| vload | 14.845 ms | 14.871 ms | 14.274 ms | 14.263 ms |
| vloada | 14.837 ms | n/a | 14.272 ms | n/a |
  • "linear indexing" is
    SMatrix{N, N}(ntuple(@inline(j->@inbounds A[(i - 1) * N^2 + j]), Val(N^2)))
  • "Cartesian indexing" is
    SMatrix{N, N}(ntuple(@inline(j->@inbounds A[mod(j - 1, N) + 1, div(j - 1, N) + 1, i]), Val(N^2)))
    This is slower on the GPU (it adds two integer add instructions) and in micro benchmarks on the ARM CPU, but for some reason it is faster than linear indexing in the full interactions benchmark on the ARM CPU. I don't understand why, and the Julia Slack wasn't helpful yet. On RAMSES, it is slower, as expected.
  • "vload" uses SIMD.vload for a vector load. This is noticeably faster here on the CPU, but on the GPU it produces the same instructions as linear indexing.
  • "vloada" uses SIMD.vloada for aligned vector loads (and hence requires alignment). On the GPU, this is the only version that produces the faster combined load instructions. It has the same performance as vload on the CPU.
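As a sanity check on the two indexing variants above: both address the same flat positions of the N × N × n_particles array. Here is the index arithmetic transcribed to Python (1-based, column-major, mirroring the Julia expressions; the helper names are mine, not from the PR):

```python
def linear_index(i, j, n):
    """Flat 1-based position matching A[(i - 1) * N^2 + j]."""
    return (i - 1) * n * n + j

def cartesian_index(i, j, n):
    """(row, col, slice) matching A[mod(j-1, N) + 1, div(j-1, N) + 1, i]."""
    return ((j - 1) % n + 1, (j - 1) // n + 1, i)

def flatten(row, col, slice_, n):
    """Column-major, 1-based flattening: the row index varies fastest."""
    return row + (col - 1) * n + (slice_ - 1) * n * n

# The two variants touch identical memory locations for every (i, j):
n = 2
for i in range(1, 5):
    for j in range(1, n * n + 1):
        assert flatten(*cartesian_index(i, j, n), n) == linear_index(i, j, n)
```

The Cartesian variant simply recomputes the row/column pair that the linear variant leaves implicit, which is where the two extra integer instructions on the GPU come from.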

Since vload is just as fast as vloada and alignment doesn't make a difference in performance, I conclude that we don't need to worry about alignment on the CPU and should use vload instead. The only question is if we want to use the same padding on CPUs for consistency reasons only.

@svchb
Collaborator

svchb commented Apr 20, 2026

> [quotes @efaulhaber's benchmark comment above in full]

Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

@sloede
Member

sloede commented Apr 20, 2026

> Since vload is just as fast as vloada and alignment doesn't make a difference in performance, I conclude that we don't need to worry about alignment on the CPU and should use vload instead. The only question is if we want to use the same padding on CPUs for consistency reasons only.

Yes, that's the question. Especially in performance-critical sections, if I can get by with just a single implementation for everything, I'd prefer that. You always need to (or should) keep in mind that you're not writing this code just for yourself, but also the next generation of researchers who might have much less experience in performance engineering and probably highly value simplicity in these "infrastructure" regions of the code. But as I said, it's a question of prioritizing different objectives 🤷‍♂️

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

I agree, this would be interesting to know.

@efaulhaber
Member Author

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

The same 4M particles benchmark that I run on the GPUs.

@efaulhaber efaulhaber mentioned this pull request Apr 20, 2026
8 tasks
@svchb
Collaborator

svchb commented Apr 20, 2026

> Have you tested this on the CPU with a sufficiently large benchmark? CPUs do a lot of latency hiding as long as you are not exceeding the L1/L2 cache sizes.

> The same 4M particles benchmark that I run on the GPUs.

Hmm, then modern CPUs are just so well optimized that it doesn't matter.


Labels

breaking changes This change will break the public API and requires a new major release


3 participants