Improve memory usage in track finding postamble #908


Open
stephenswat wants to merge 1 commit into main from impr/track_finding_mem_usage

Conversation

stephenswat (Member)

The code which turns the tips of our track finding into actual tracks uses an excessive amount of memory, as it massively overallocates. Indeed, it allocates memory as though all tips have the maximum number of track states, which is unrealistic. This commit makes it so that the number of valid track states is counted instead, making the allocation more precise. In my measurements, this more than halves the memory usage of traccc on $\langle\mu\rangle = 200$ ttbar events in the ODD.
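To make the allocation change concrete, here is a simplified host-side sketch of the sizing logic. The `tip` struct and both helper functions are hypothetical and only model the idea described above; the actual traccc code works on device buffers and kernels.

```cpp
// Simplified host-side model of the allocation change: size the track-state
// buffer from a count of valid states instead of assuming the maximum.
#include <cstddef>
#include <numeric>
#include <vector>

struct tip {
    std::size_t n_valid_states;  // valid track states reachable from this tip
};

// Old behaviour: every tip gets a slot for the maximum possible track length.
std::size_t overallocated_size(const std::vector<tip>& tips,
                               std::size_t max_states_per_track) {
    return tips.size() * max_states_per_track;
}

// New behaviour: count the valid states and allocate exactly that much.
std::size_t exact_size(const std::vector<tip>& tips) {
    return std::accumulate(
        tips.begin(), tips.end(), std::size_t{0},
        [](std::size_t sum, const tip& t) { return sum + t.n_valid_states; });
}
```

Since real tracks are typically much shorter than the configured maximum, the exact sizing is far smaller, which is where the reported memory saving comes from.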

Before:

6fee5ca3

Reconstructed track parameters: 884796
Time totals:
                  File reading  5202 ms
              Event processing  4388 ms
Throughput:
              Event processing  438.829 ms/event, 2.27879 events/s

After:

a91e94d0

Reconstructed track parameters: 884784
Time totals:
                  File reading  5128 ms
              Event processing  4352 ms
Throughput:
              Event processing  435.297 ms/event, 2.29728 events/s

Note: The SYCL changes are best-effort but I haven't tested them.

@stephenswat added the cuda, improvement, sycl, and shared labels on Mar 12, 2025
@krasznaa (Member) left a comment

So, count+fill it ended up being after all. 🤔 Good! I'm very on board with doing this consistently with what we found to be the most efficient during seed finding.

But I'd really like to reduce the number of individual memory allocations that we need for this. 🤔 With a single (SoA?) container for this new payload we could get away with allocating all needed GPU memory for the counting in one go.
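One way to picture that suggestion (purely illustrative; `tip_counting_payload` and its members are hypothetical, not existing traccc or vecmem types) is a payload whose per-tip buffers are all carved out of a single backing allocation:

```cpp
// Hypothetical SoA-style payload: one allocation backs all per-tip buffers
// needed during counting, instead of one allocation per buffer.
#include <cstddef>
#include <vector>

struct tip_counting_payload {
    std::vector<unsigned int> storage;  // single backing allocation
    unsigned int* n_states = nullptr;   // valid track states per tip
    unsigned int* offsets = nullptr;    // exclusive prefix sum over n_states

    explicit tip_counting_payload(std::size_t n_tips) : storage(2 * n_tips) {
        n_states = storage.data();
        offsets = storage.data() + n_tips;
    }
};
```

On the device the backing store would be a single GPU buffer rather than a `std::vector`, but the layout idea is the same.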

@beomki-yeo (Contributor) left a comment

I am happy with the changes as long as the CI passes

@stephenswat (Member Author)

I was able to one-up myself, removing the counting kernel entirely, reducing the memory allocations, keeping the memory usage low, and improving performance:

Reconstructed track parameters: 884809
Time totals:
                  File reading  5144 ms
              Event processing  4230 ms
Throughput:
              Event processing  423.022 ms/event, 2.36394 events/s

@stephenswat force-pushed the impr/track_finding_mem_usage branch 2 times, most recently from 3aa466b to ac3cc15 on March 13, 2025, 11:35
@krasznaa (Member) left a comment

I'm still very much on board with the direction.

Could you post a bit of a description of what you did though? I have to admit that it's a bit too much code for me to interpret from scratch. 😦 Just some main bullet points about how the algorithm finds the correct sizes now.

@stephenswat force-pushed the impr/track_finding_mem_usage branch 3 times, most recently from 6b22d43 to 629470a on March 17, 2025, 19:04
@stephenswat (Member Author)

So the responsibilities of the kernels in the old version of the algorithm were as follows:

  • propagate_to_next_surface propagates the tracks and creates tips, i.e. points where a track might have ended. Each tip corresponds 1:1 with a surface link, which describes the head of a reversed linked list of candidates.
  • build_tracks follows the linked list of candidates starting at each tip, computes some properties about that track, and places all tracks (regardless of whether they make sense or not) in a very large jagged array
  • prune_tracks looks at the reified tracks in the big jagged vector and copies the good ones (based on the properties determined by the build_tracks kernel) into the final jagged vector.

Now, the core insight is that the checks that build_tracks performs are extremely rudimentary. It's literally just that the number of non-hole candidates is between a lower and an upper bound. All other cuts are already performed by the propagate_to_next_surface. Thus, the core of this PR is to move the checks that build_tracks does to propagate_to_next_surface, which has all the necessary information already. The core idea is that we create only tips for valid tracks, rather than tips for all possible tracks. This, of course, also makes it trivial to count the number of valid tracks and how long those tracks will be. So in the new model, the tasks are distributed as follows:

  • propagate_to_next_surface propagates the tracks and creates tips only for valid tracks, where validity is defined by a very simple set of metrics. The definition of tips remains the same, but we create far fewer of them, and we record the length of each track.
  • build_tracks no longer does any filtering; it just follows the linked list and places track states into a much smaller jagged vector.

Much simpler, and it has far fewer kernels. Note that in this new model build_tracks basically does what prune_tracks did in the previous model.
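For readers following along, here is a rough host-side sketch of that division of labour. Everything in it is illustrative (`candidate_link`, `tip`, `make_tip`, `build_track` and the measurement bounds are hypothetical), and the real propagation kernel keeps its counts as it propagates rather than walking the list after the fact:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct candidate_link {
    int previous;  // index of the previous link, or -1 at the start of a track
    bool is_hole;  // true if this step had no measurement attached
};

struct tip {
    int link;               // head of the reversed linked list of candidates
    unsigned int n_states;  // recorded length of the track behind this tip
};

// "propagate_to_next_surface" side: emit a tip only if the track passes the
// simple validity cut, and record its length at the same time.
std::optional<tip> make_tip(const std::vector<candidate_link>& links, int head,
                            unsigned int min_meas, unsigned int max_meas) {
    unsigned int n_states = 0;
    unsigned int n_measurements = 0;
    for (int i = head; i >= 0; i = links[i].previous) {
        ++n_states;
        if (!links[i].is_hole) {
            ++n_measurements;
        }
    }
    if (n_measurements < min_meas || n_measurements > max_meas) {
        return std::nullopt;  // invalid track: no tip, nothing allocated later
    }
    return tip{head, n_states};
}

// "build_tracks" side: no filtering any more, just follow the linked list and
// fill an output that was sized exactly from the recorded lengths.
std::vector<int> build_track(const std::vector<candidate_link>& links,
                             const tip& t) {
    std::vector<int> states;
    states.reserve(t.n_states);
    for (int i = t.link; i >= 0; i = links[i].previous) {
        states.push_back(i);
    }
    return states;
}
```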

@stephenswat force-pushed the impr/track_finding_mem_usage branch from 629470a to 6148d6f on March 17, 2025, 19:22
@beomki-yeo (Contributor)

The core idea is that we create only tips for valid tracks, rather than tips for all possible tracks.

Yeah, I think this is the right way to improve. 🤔

@krasznaa (Member) left a comment

Thanks for the description!

Let's just make the internal vector use the correct memory resource, and this one can go ahead as far as I'm concerned.
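For context on the memory-resource remark: the point is simply that the temporary vector should be allocated from the resource handed to the algorithm rather than the default one. A generic C++17 `std::pmr` sketch of that pattern (traccc itself uses vecmem types, which are not shown here; `count_states` is hypothetical):

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical helper: the scratch vector is allocated from the memory
// resource the caller provides, not from the global default resource.
void count_states(std::pmr::memory_resource* mr, std::size_t n_tips) {
    std::pmr::vector<unsigned int> counts(mr);
    counts.resize(n_tips);
    // ... fill `counts` on this resource and hand it to the next step ...
}
```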

@stephenswat force-pushed the impr/track_finding_mem_usage branch from 29b4083 to 921c876 on March 31, 2025, 13:42
@stephenswat (Member Author)

This PR also fixes a bug in the efficiency measurements by the way! Wink wink nudge nudge 😉

@stephenswat (Member Author)

Performance summary

Here is a summary of the performance effects of this PR:

Graphical: (kernel timing chart not reproduced in this text export)

Tabular:

| Kernel | 348de1a (baseline) | 921c876 (this PR) | Delta |
| --- | --- | --- | --- |
| fit | 277.88 ms | 256.77 ms | -7.6% |
| propagate_to_next_surface | 118.17 ms | 117.00 ms | -1.0% |
| find_tracks | 26.65 ms | 23.00 ms | -13.7% |
| count_triplets | 14.16 ms | 14.16 ms | -0.0% |
| find_triplets | 5.98 ms | 5.98 ms | 0.0% |
| find_doublets | 911.31 μs | 893.99 μs | -1.9% |
| Thrust::sort | 757.86 μs | 758.34 μs | 0.1% |
| ccl_kernel | 682.50 μs | 682.54 μs | 0.0% |
| count_doublets | 627.27 μs | 628.01 μs | 0.1% |
| build_tracks | 1.10 ms | 445.35 μs | -59.6% |
| prune_tracks | 1.23 ms | (removed) | |
| select_seeds | 362.74 μs | 358.26 μs | -1.2% |
| apply_interaction | 132.45 μs | 131.94 μs | -0.4% |
| update_triplet_weights | 96.16 μs | 96.23 μs | 0.1% |
| fill_sort_keys | 68.25 μs | 67.72 μs | -0.8% |
| estimate_track_params | 34.31 μs | 34.31 μs | 0.0% |
| populate_grid | 30.40 μs | 30.45 μs | 0.2% |
| count_grid_capacities | 29.21 μs | 29.18 μs | -0.1% |
| form_spacepoints | 14.24 μs | 14.20 μs | -0.3% |
| reduce_triplet_counts | 6.71 μs | 6.71 μs | 0.0% |
| static_kernel | 1.78 μs | 1.78 μs | 0.1% |
| make_barcode_sequence | 992.11 ns | 1.00 μs | 1.3% |
| fill_prefix_sum | 165.57 ns | 165.51 ns | -0.0% |
| Total | 448.93 ms | 421.08 ms | -6.2% |

Note

This is an automated message produced on the explicit request of a human being.

@stephenswat (Member Author)

Reminder that this important PR has been blocked for almost a month now. 😉

@beomki-yeo (Contributor)

@krasznaa There are indeed many PRs pending. 🤔

@krasznaa (Member) left a comment

After quite some delay (sorry about that 😦), let's indeed finally get this in.

After resolving the (hopefully trivial... 🤞) merge conflicts.
