
Use vector-of-structs of preds/semi for Lengauer-Tarjan #408

Merged

Conversation

samolisov (Contributor)

Closes #383

samolisov commented Dec 21, 2024

I use the following benchmark: dominator_tree_benchmark.cpp

On my machine (32 × 1792.7 MHz CPUs with hyper-threading and an almost-zero load average, Ubuntu 20.04), the report is the following (we may use the state after merging #407 as a baseline):

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Tarjan's paper (vertex list)                   934 ns          934 ns       748347
Tarjan's paper  (vertex vector)                845 ns          845 ns       830574
Appel. fig. 19.8 (vertex list)                 960 ns          959 ns       731191
Appel. fig. 19.8  (vertex vector)              860 ns          860 ns       813827
Muchnick. fig. 8.18 (vertex list)              561 ns          560 ns      1248586
Muchnick. fig. 8.18  (vertex vector)           538 ns          538 ns      1302725
Cytron's paper, fig. 9 (vertex list)          1145 ns         1145 ns       613263
Cytron's paper, fig. 9  (vertex vector)       1046 ns         1046 ns       674659
From a code, 186 BBs (vertex list)           12938 ns        12937 ns        54742
From a code, 186 BBs (vertex vector)         11528 ns        11527 ns        62319

After implementing a "vector-of-structs" solution, the numbers are the following:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Tarjan's paper (vertex list)                   919 ns          919 ns       768302
Tarjan's paper  (vertex vector)                835 ns          835 ns       838532
Appel. fig. 19.8 (vertex list)                 944 ns          944 ns       739354
Appel. fig. 19.8  (vertex vector)              854 ns          854 ns       825316
Muchnick. fig. 8.18 (vertex list)              527 ns          527 ns      1285818
Muchnick. fig. 8.18  (vertex vector)           488 ns          488 ns      1433765
Cytron's paper, fig. 9 (vertex list)          1101 ns         1101 ns       636063
Cytron's paper, fig. 9  (vertex vector)       1024 ns         1024 ns       685137
From a code, 186 BBs (vertex list)           12754 ns        12753 ns        54584
From a code, 186 BBs (vertex vector)         11623 ns        11622 ns        61169

Here we can see about a 1% speedup for the "large" cases (CFGs with 186 basic blocks) and about 10% for the small ones (Muchnick fig. 8.18, 8 vertices).

I'm still thinking about what to do with the samedom_ vector: should we put the samedoms into the struct as well? The access pattern is a little different, so some more experiments are required.

@samolisov

Maybe a check on a larger graph (up to 1000 or 2000-3000 nodes) is needed to ensure there is no regression for large inputs.

@jeremy-murphy

Thanks for trying this change, pity it didn't yield anything significant. I still think it's a better logical design, so I'm happy to proceed with it, although I'd like to make a few style changes.
For starters, I think we can just drop the set functions on the struct. More later.

@samolisov

@jeremy-murphy Thank you for the suggestion. I've replaced every set_ method with a direct write to the corresponding field and removed the methods.

Also, I added a benchmark for a huge (3000+ node) graph; on such a graph I see the following. The baseline (code from the develop branch):

Huge Inlined Function (vertex list)         275707 ns       275683 ns         2531
Huge Inlined Function (vertex vector)       236892 ns       236878 ns         2969

With the "cache-friendly" solution:

Huge Inlined Function (vertex list)         284871 ns       284855 ns         2495
Huge Inlined Function (vertex vector)       251233 ns       251218 ns         2783

So we even see some performance degradation, of roughly 3-6%.

@jeremy-murphy left a comment

Thanks for your patience!
I'd love some changes around naming, etc., and one change that might improve performance.
Thank you!

@samolisov samolisov force-pushed the dominator-tree-vector-of-structs branch from ac49c66 to d7d4f42 Compare February 27, 2025 12:50
@samolisov

I've compared the baseline (the current develop branch) and the PR again on the Huge Inlined Function (vertex vector) benchmark and gathered the cache-references and cache-misses counts for my CPU (AMD EPYC 7502P 32-Core Processor, 1793.628 MHz, 512 KB cache). The benchmark was compiled with clang 20 rc2 at the -O3 optimization level.

The command:

$ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./dominator_tree_benchmark_20_O3_baseline --benchmark_filter="Huge Inlined Function \(vertex vector\)"

Baseline:

       356,911,462      cache-references
        78,879,574      cache-misses              #   22.101 % of all cache refs
     1,501,651,461      branches

       359,839,124      cache-references
        78,829,953      cache-misses              #   21.907 % of all cache refs
     1,500,906,629      branches

       365,086,044      cache-references
        82,171,256      cache-misses              #   22.507 % of all cache refs
     1,521,432,528      branches

The final variant:

       343,820,294      cache-references
        78,653,556      cache-misses              #   22.876 % of all cache refs
     1,475,058,434      branches

       338,131,001      cache-references
        76,760,089      cache-misses              #   22.701 % of all cache refs
     1,462,350,259      branches

       332,844,432      cache-references
        77,863,111      cache-misses              #   23.393 % of all cache refs
     1,467,184,206      branches

The final variant leads to about 1 percentage point more cache misses; on the other hand, and I have no idea why, it executes about 3% fewer branches.

I have no answer yet as to why the final variant, using a vector of structs, leads to slightly more cache misses on our workload (the implementation of the algorithm). My hypothesis is some irregularity in the algorithm itself: we do not scan every vertex with its triple one by one, but jump from one to another across half of the array. That is just a guess, though, not the result of any investigation. The question is interesting on its own: as we can see, not every workload gets faster simply by switching between a vector of structs and a struct of vectors. Anyway, thank you for the initial hypothesis to use the vector of structs: it gave me a good task to play with.

Also, the average of three subsequent runs of the benchmark: 239734 ns (the final version) vs. 236870 ns (baseline); the baseline is slightly better, by about 1.2%. Interestingly, the Huge Inlined Function (vertex list) benchmark, where a list is used to store the vertices, demonstrates exactly the same ratio.

@jeremy-murphy left a comment

I'm happy with the code change even with the apparent cost to performance. I'm curious whether the same performance change happens across compilers and CPUs; if you have time, please try GCC as well.
I'll wait for your adjacency_matrix changes, or a comment from you that you're done, before I finalize the review.

@samolisov samolisov force-pushed the dominator-tree-vector-of-structs branch from d7d4f42 to 57058f0 Compare March 1, 2025 14:05
@samolisov

I believe I'm done from my side (once the CI finds no errors). @jeremy-murphy, could you have a look again?

@samolisov samolisov requested a review from jeremy-murphy March 1, 2025 14:09
@jeremy-murphy left a comment

Just requested a small change in variable name and then it's good to merge.

@samolisov

> Just requested a small change in variable name and then it's good to merge.

I've renamed preds_ -> pred_, predsMap_ -> predMap_, and all of preds_of_... -> pred_of_.... Thank you for the suggestion.

@samolisov samolisov requested a review from jeremy-murphy March 4, 2025 11:35
@jeremy-murphy left a comment

Thanks so much for making this improvement to your previous change. I think it's a good improvement in general, even if it's not perfect.

@jeremy-murphy jeremy-murphy merged commit 4792e04 into boostorg:develop Mar 4, 2025
22 checks passed
Development

Successfully merging this pull request may close these issues.

Why the implementation of Lengauer-Tarjan uses std::deque for a bucket?