Improve memory usage in track finding postamble #908


Open
stephenswat wants to merge 1 commit into main from impr/track_finding_mem_usage

Conversation

stephenswat (Member)

The code which turns the tips of our track finding into actual tracks uses an excessive amount of memory, as it massively overallocates. Indeed, it allocates memory as though all tips have the maximum number of track states, which is unrealistic. This commit makes it so that the number of valid track states is counted instead, making the allocation more precise. In my measurements, this more than halves the memory usage of traccc on $\langle\mu\rangle = 200$ ttbar events in the ODD.
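To make the allocation change concrete, here is a simplified host-side sketch of the sizing logic. The `tip` struct and both helper functions are hypothetical and only model the idea described above; the actual traccc code works on device buffers and kernels.

```cpp
// Simplified host-side model of the allocation change: size the track-state
// buffer from a count of valid states instead of assuming the maximum.
#include <cstddef>
#include <numeric>
#include <vector>

struct tip {
    std::size_t n_valid_states;  // valid track states reachable from this tip
};

// Old behaviour: every tip gets a slot for the maximum possible track length.
std::size_t overallocated_size(const std::vector<tip>& tips,
                               std::size_t max_states_per_track) {
    return tips.size() * max_states_per_track;
}

// New behaviour: count the valid states and allocate exactly that much.
std::size_t exact_size(const std::vector<tip>& tips) {
    return std::accumulate(
        tips.begin(), tips.end(), std::size_t{0},
        [](std::size_t sum, const tip& t) { return sum + t.n_valid_states; });
}
```

Since real tracks are typically much shorter than the configured maximum, the exact sizing is far smaller, which is where the reported memory saving comes from.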

Before:

6fee5ca3

Reconstructed track parameters: 884796
Time totals:
                  File reading  5202 ms
              Event processing  4388 ms
Throughput:
              Event processing  438.829 ms/event, 2.27879 events/s

After:

a91e94d0

Reconstructed track parameters: 884784
Time totals:
                  File reading  5128 ms
              Event processing  4352 ms
Throughput:
              Event processing  435.297 ms/event, 2.29728 events/s

Note: The SYCL changes are best-effort but I haven't tested them.

@stephenswat added the cuda, improvement, sycl, and shared labels on Mar 12, 2025
@krasznaa (Member) left a comment

So, count+fill it ended up being after all. 🤔 Good! I'm very on board with doing this consistently with what we found to be the most efficient during seed finding.

But I'd really like to reduce the number of individual memory allocations that we need for this. 🤔 With a single (SoA?) container for this new payload we could get away with allocating all needed GPU memory for the counting in one go.
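One way to picture that suggestion (purely illustrative; `tip_counting_payload` and its members are hypothetical, not existing traccc or vecmem types) is a payload whose per-tip buffers are all carved out of a single backing allocation:

```cpp
// Hypothetical SoA-style payload: one allocation backs all per-tip buffers
// needed during counting, instead of one allocation per buffer.
#include <cstddef>
#include <vector>

struct tip_counting_payload {
    std::vector<unsigned int> storage;  // single backing allocation
    unsigned int* n_states = nullptr;   // valid track states per tip
    unsigned int* offsets = nullptr;    // exclusive prefix sum over n_states

    explicit tip_counting_payload(std::size_t n_tips) : storage(2 * n_tips) {
        n_states = storage.data();
        offsets = storage.data() + n_tips;
    }
};
```

On the device the backing store would be a single GPU buffer rather than a `std::vector`, but the layout idea is the same.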

@beomki-yeo (Contributor) left a comment

I am happy with the changes as long as the CI passes

@stephenswat (Member Author)

I was able to one-up myself, removing the counting kernel entirely, reducing the memory allocations, keeping the memory usage low, and improving performance:

Reconstructed track parameters: 884809
Time totals:
                  File reading  5144 ms
              Event processing  4230 ms
Throughput:
              Event processing  423.022 ms/event, 2.36394 events/s

@stephenswat force-pushed the impr/track_finding_mem_usage branch 2 times, most recently from 3aa466b to ac3cc15 on March 13, 2025, 11:35
@krasznaa (Member) left a comment

I'm still very much on board with the direction.

Could you post a bit of a description of what you did though? I have to admit that it's a bit too much code for me to interpret from scratch. 😦 Just some main bullet points about how the algorithm finds the correct sizes now.

@stephenswat force-pushed the impr/track_finding_mem_usage branch 3 times, most recently from 6b22d43 to 629470a on March 17, 2025, 19:04
@stephenswat (Member Author)

So the responsibilities of the kernels in the old version of the algorithm were as follows:

  • propagate_to_next_surface propagates the tracks and creates tips, i.e. points where a track might have ended. Each tip corresponds 1:1 with a surface link, which describes the head of a reversed linked list of candidates.
  • build_tracks follows the linked list of candidates starting at each tip, computes some properties about that track, and places all tracks (regardless of whether they make sense or not) in a very large jagged array
  • prune_tracks looks at the reified tracks in the big jagged vector and copies the good ones (based on the properties determined by the build_tracks kernel) into the final jagged vector.

Now, the core insight is that the checks that build_tracks performs are extremely rudimentary. It's literally just that the number of non-hole candidates is between a lower and an upper bound. All other cuts are already performed by the propagate_to_next_surface. Thus, the core of this PR is to move the checks that build_tracks does to propagate_to_next_surface, which has all the necessary information already. The core idea is that we create only tips for valid tracks, rather than tips for all possible tracks. This, of course, also makes it trivial to count the number of valid tracks and how long those tracks will be. So in the new model, the tasks are distributed as follows:

  • propagate_to_next_surface propagates the tracks and creates tips only for valid tracks, where validity is defined by a very simple set of metrics. The definition of tips remains the same, but we create far fewer of them, and we record the length of each track.
  • build_tracks no longer does any filtering; it just follows the linked list and places track states into a much smaller jagged vector.

Much simpler, and it has far fewer kernels. Note that in this new model build_tracks basically does what prune_tracks did in the previous model.
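For readers following along, here is a rough host-side sketch of that division of labour. Everything in it is illustrative (`candidate_link`, `tip`, `make_tip`, `build_track` and the measurement bounds are hypothetical), and the real propagation kernel keeps its counts as it propagates rather than walking the list after the fact:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct candidate_link {
    int previous;  // index of the previous link, or -1 at the start of a track
    bool is_hole;  // true if this step had no measurement attached
};

struct tip {
    int link;               // head of the reversed linked list of candidates
    unsigned int n_states;  // recorded length of the track behind this tip
};

// "propagate_to_next_surface" side: emit a tip only if the track passes the
// simple validity cut, and record its length at the same time.
std::optional<tip> make_tip(const std::vector<candidate_link>& links, int head,
                            unsigned int min_meas, unsigned int max_meas) {
    unsigned int n_states = 0;
    unsigned int n_measurements = 0;
    for (int i = head; i >= 0; i = links[i].previous) {
        ++n_states;
        if (!links[i].is_hole) {
            ++n_measurements;
        }
    }
    if (n_measurements < min_meas || n_measurements > max_meas) {
        return std::nullopt;  // invalid track: no tip, nothing allocated later
    }
    return tip{head, n_states};
}

// "build_tracks" side: no filtering any more, just follow the linked list and
// fill an output that was sized exactly from the recorded lengths.
std::vector<int> build_track(const std::vector<candidate_link>& links,
                             const tip& t) {
    std::vector<int> states;
    states.reserve(t.n_states);
    for (int i = t.link; i >= 0; i = links[i].previous) {
        states.push_back(i);
    }
    return states;
}
```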

@stephenswat force-pushed the impr/track_finding_mem_usage branch from 629470a to 6148d6f on March 17, 2025, 19:22
@beomki-yeo (Contributor)

The core idea is that we create only tips for valid tracks, rather than tips for all possible tracks.

Yeah, I think this is the right way to improve. 🤔

@krasznaa (Member) left a comment

Thanks for the description!

Let's just make the internal vector use the correct memory resource, and this one can go ahead as far as I'm concerned.
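For context on the memory-resource remark: the point is simply that the temporary vector should be allocated from the resource handed to the algorithm rather than the default one. A generic C++17 `std::pmr` sketch of that pattern (traccc itself uses vecmem types, which are not shown here; `count_states` is hypothetical):

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical helper: the scratch vector is allocated from the memory
// resource the caller provides, not from the global default resource.
void count_states(std::pmr::memory_resource* mr, std::size_t n_tips) {
    std::pmr::vector<unsigned int> counts(mr);
    counts.resize(n_tips);
    // ... fill `counts` on this resource and hand it to the next step ...
}
```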

@stephenswat force-pushed the impr/track_finding_mem_usage branch from 29b4083 to 921c876 on March 31, 2025, 13:42
@stephenswat (Member Author)

This PR also fixes a bug in the efficiency measurements by the way! Wink wink nudge nudge 😉

@stephenswat (Member Author)

Performance summary

Here is a summary of the performance effects of this PR:

Graphical: (kernel timing chart not reproduced in this text export)

Tabular:

| Kernel | 348de1a (baseline) | 921c876 (this PR) | Delta |
| --- | --- | --- | --- |
| fit | 277.88 ms | 256.77 ms | -7.6% |
| propagate_to_next_surface | 118.17 ms | 117.00 ms | -1.0% |
| find_tracks | 26.65 ms | 23.00 ms | -13.7% |
| count_triplets | 14.16 ms | 14.16 ms | -0.0% |
| find_triplets | 5.98 ms | 5.98 ms | 0.0% |
| find_doublets | 911.31 μs | 893.99 μs | -1.9% |
| Thrust::sort | 757.86 μs | 758.34 μs | 0.1% |
| ccl_kernel | 682.50 μs | 682.54 μs | 0.0% |
| count_doublets | 627.27 μs | 628.01 μs | 0.1% |
| build_tracks | 1.10 ms | 445.35 μs | -59.6% |
| prune_tracks | 1.23 ms | (removed) | |
| select_seeds | 362.74 μs | 358.26 μs | -1.2% |
| apply_interaction | 132.45 μs | 131.94 μs | -0.4% |
| update_triplet_weights | 96.16 μs | 96.23 μs | 0.1% |
| fill_sort_keys | 68.25 μs | 67.72 μs | -0.8% |
| estimate_track_params | 34.31 μs | 34.31 μs | 0.0% |
| populate_grid | 30.40 μs | 30.45 μs | 0.2% |
| count_grid_capacities | 29.21 μs | 29.18 μs | -0.1% |
| form_spacepoints | 14.24 μs | 14.20 μs | -0.3% |
| reduce_triplet_counts | 6.71 μs | 6.71 μs | 0.0% |
| static_kernel | 1.78 μs | 1.78 μs | 0.1% |
| make_barcode_sequence | 992.11 ns | 1.00 μs | 1.3% |
| fill_prefix_sum | 165.57 ns | 165.51 ns | -0.0% |
| Total | 448.93 ms | 421.08 ms | -6.2% |

Note

This is an automated message produced on the explicit request of a human being.

@stephenswat (Member Author)

Reminder that this important PR has been blocked for almost a month now. 😉

@beomki-yeo (Contributor)

@krasznaa There are indeed many PRs pending. 🤔

@krasznaa (Member) left a comment

After quite some delay (sorry about that 😦), let's indeed finally get this in.

After resolving the (hopefully trivial... 🤞) merge conflicts.
