
Use vector-of-structs of preds/semi for Lengauer-Tarjan #408

Merged

Conversation

samolisov (Contributor)

Closes #383

samolisov commented Dec 21, 2024

I use the following benchmark: dominator_tree_benchmark.cpp

On my machine (32 × 1792.7 MHz CPUs with hyper-threading and an almost-zero load average, Ubuntu 20.04), the report is the following (we may use the state after merging #407 as a baseline):

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Tarjan's paper (vertex list)                   934 ns          934 ns       748347
Tarjan's paper  (vertex vector)                845 ns          845 ns       830574
Appel. fig. 19.8 (vertex list)                 960 ns          959 ns       731191
Appel. fig. 19.8  (vertex vector)              860 ns          860 ns       813827
Muchnick. fig. 8.18 (vertex list)              561 ns          560 ns      1248586
Muchnick. fig. 8.18  (vertex vector)           538 ns          538 ns      1302725
Cytron's paper, fig. 9 (vertex list)          1145 ns         1145 ns       613263
Cytron's paper, fig. 9  (vertex vector)       1046 ns         1046 ns       674659
From a code, 186 BBs (vertex list)           12938 ns        12937 ns        54742
From a code, 186 BBs (vertex vector)         11528 ns        11527 ns        62319

After implementing a "vector-of-structs" solution, the numbers are the following:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Tarjan's paper (vertex list)                   919 ns          919 ns       768302
Tarjan's paper  (vertex vector)                835 ns          835 ns       838532
Appel. fig. 19.8 (vertex list)                 944 ns          944 ns       739354
Appel. fig. 19.8  (vertex vector)              854 ns          854 ns       825316
Muchnick. fig. 8.18 (vertex list)              527 ns          527 ns      1285818
Muchnick. fig. 8.18  (vertex vector)           488 ns          488 ns      1433765
Cytron's paper, fig. 9 (vertex list)          1101 ns         1101 ns       636063
Cytron's paper, fig. 9  (vertex vector)       1024 ns         1024 ns       685137
From a code, 186 BBs (vertex list)           12754 ns        12753 ns        54584
From a code, 186 BBs (vertex vector)         11623 ns        11622 ns        61169

Here we can see about a 1% speedup for the "large" cases (CFGs with 186 basic blocks) and about 10% for the small ones (Muchnick fig. 8.18, 8 vertices).

I'm still thinking about what to do with the samedom_ vector: should we put the samedoms into the struct as well? The access pattern is a little different, so some more experiments are required.

@samolisov

Maybe a check on a larger graph (up to 1000 or 2000-3000 nodes) is needed to ensure there is no regression for large inputs.

@jeremy-murphy

Thanks for trying this change, pity it didn't yield anything significant. I still think it's a better logical design, so I'm happy to proceed with it, although I'd like to make a few style changes.
For starters, I think we can just drop the set functions on the struct. More later.

@samolisov

@jeremy-murphy Thank you for the suggestion. I've replaced every set_ method with a direct write to the corresponding field and removed the methods.

Also, I added a benchmark for a huge (3000+ node) graph; on such a graph I see the following. The baseline (code from the develop branch):

Huge Inlined Function (vertex list)         275707 ns       275683 ns         2531
Huge Inlined Function (vertex vector)       236892 ns       236878 ns         2969

With the "cache-friendly" solution:

Huge Inlined Function (vertex list)         284871 ns       284855 ns         2495
Huge Inlined Function (vertex vector)       251233 ns       251218 ns         2783

So we even see some performance degradation, of roughly 3-6%.

@jeremy-murphy left a comment

Thanks for your patience!
I'd love some changes around naming, etc., and one change that might improve performance.
Thank you!

@samolisov samolisov force-pushed the dominator-tree-vector-of-structs branch from ac49c66 to d7d4f42 Compare February 27, 2025 12:50
@samolisov

I've compared the baseline (the current develop branch) and the PR again on the Huge Inlined Function (vertex vector) benchmark and gathered the cache-references and cache-misses counts for my CPU (AMD EPYC 7502P 32-Core Processor, 1793.628 MHz, 512 KB cache). The benchmark was compiled with clang 20 rc2 at the -O3 optimization level.

The command:

$ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./dominator_tree_benchmark_20_O3_baseline --benchmark_filter="Huge Inlined Function \(vertex vector\)"

Baseline:

       356,911,462      cache-references
        78,879,574      cache-misses              #   22.101 % of all cache refs
     1,501,651,461      branches

       359,839,124      cache-references
        78,829,953      cache-misses              #   21.907 % of all cache refs
     1,500,906,629      branches

       365,086,044      cache-references
        82,171,256      cache-misses              #   22.507 % of all cache refs
     1,521,432,528      branches

The final variant:

       343,820,294      cache-references
        78,653,556      cache-misses              #   22.876 % of all cache refs
     1,475,058,434      branches

       338,131,001      cache-references
        76,760,089      cache-misses              #   22.701 % of all cache refs
     1,462,350,259      branches

       332,844,432      cache-references
        77,863,111      cache-misses              #   23.393 % of all cache refs
     1,467,184,206      branches

The final variant leads to about 1 percentage point more cache misses; on the other hand, and I have no idea why, it executes about 3% fewer branches.

I have no answer yet as to why the final variant, using a vector of structs, leads to slightly more cache misses on our workload (the implementation of the algorithm). My hypothesis is some irregularity in the algorithm itself: we do not scan every vertex with its triple one by one, but jump from one to another across half of the array. That is just a guess, though, not the result of any investigation. The question is interesting on its own: as we can see, not every workload gets faster simply by switching between a vector of structs and a struct of vectors. Anyway, thank you for the initial hypothesis to use the vector of structs: it gave me a good task to play with.

Also, the average of three subsequent runs of the benchmark: 239734 ns (the final version) vs. 236870 ns (baseline); the baseline is slightly better, by about 1.2%. Interestingly, the Huge Inlined Function (vertex list) benchmark, where a list is used to store the vertices, demonstrates exactly the same ratio.

@jeremy-murphy left a comment

I'm happy with the code change even with the apparent cost to performance. I'm curious whether the same performance change happens across compilers and CPUs; if you have time, please try GCC as well.
I'll wait for your adjacency_matrix changes, or a comment from you that you're done, before I finalize the review.

@samolisov samolisov force-pushed the dominator-tree-vector-of-structs branch from d7d4f42 to 57058f0 Compare March 1, 2025 14:05
@samolisov

I believe I'm done from my side (once the CI finds no errors). @jeremy-murphy, could you have a look again?

@samolisov samolisov requested a review from jeremy-murphy March 1, 2025 14:09
@jeremy-murphy left a comment

Just requested a small change in variable name and then it's good to merge.

@samolisov

> Just requested a small change in variable name and then it's good to merge.

I've renamed preds_ -> pred_, predsMap_ -> predMap_, and all of preds_of_... -> pred_of_.... Thank you for the suggestion.

@samolisov samolisov requested a review from jeremy-murphy March 4, 2025 11:35
@jeremy-murphy left a comment

Thanks so much for making this improvement to your previous change. I think it's a good improvement in general, even if it's not perfect.

@jeremy-murphy jeremy-murphy merged commit 4792e04 into boostorg:develop Mar 4, 2025
22 checks passed
Development

Successfully merging this pull request may close these issues.

Why the implementation of Lengauer-Tarjan uses std::deque for a bucket?