[llvm-project] Backport upstream version of lazy template loading #17722

Merged
hahnjo merged 5 commits into root-project:master from devajithvs:dev.lazytemplate
Apr 8, 2025

Conversation

devajithvs
Contributor

@devajithvs devajithvs commented Feb 14, 2025

This Pull request:

Uses the upstream version of our downstream patch: root-project/llvm-project@a11e943

I ran a few benchmarks locally; the ctest runs are slightly slower on my machine (possibly just noise, nothing concerning).

I will update the LLVM tags later, once everything is running well.

This allows the patch to be tested more thoroughly and decouples it from future LLVM 20 updates.

Changes or fixes:

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

This PR fixes #

@devajithvs devajithvs self-assigned this Feb 14, 2025
@devajithvs devajithvs added the clean build Ask CI to do non-incremental build on PR label Feb 14, 2025

github-actions bot commented Feb 14, 2025

Test Results

    19 files, 19 suites, total duration 5d 6h 15m 50s ⏱️
 2 715 tests:  2 714 ✅ passed, 0 💤 skipped, 1 ❌ failed
49 872 runs:  49 869 ✅ passed, 0 💤 skipped, 3 ❌ failed

For more details on these failures, see this check.

Results for commit deb3e7a.

♻️ This comment has been updated with latest results.

@devajithvs
Contributor Author

devajithvs commented Feb 24, 2025

Results of
/usr/bin/time --verbose bin/root -l -b -q -e 'std::vector<int> vec = {1, 2, 3, 4, 5};'

This PR:

Command exited with non-zero status 255
        Command being timed: "bin/root -l -b -q -e std::vector<int> vec = {1, 2, 3, 4, 5};"
        User time (seconds): 0.12
        System time (seconds): 0.07
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 321168
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 17848
        Voluntary context switches: 118
        Involuntary context switches: 4
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 255

Master:

Command exited with non-zero status 255
        Command being timed: "bin/root -l -b -q -e std::vector<int> vec = {1, 2, 3, 4, 5};"
        User time (seconds): 0.12
        System time (seconds): 0.06
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.18
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 228812
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 15272
        Voluntary context switches: 119
        Involuntary context switches: 4
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 255

Without upstream/downstream patch:

Command exited with non-zero status 255
        Command being timed: "bin/root -l -b -q -e std::vector<int> vec = {1, 2, 3, 4, 5};"
        User time (seconds): 0.18
        System time (seconds): 0.08
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.27
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 375760
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 22146
        Voluntary context switches: 117
        Involuntary context switches: 4
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 255
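
The three reports above differ mainly in "Maximum resident set size". A minimal sketch of how that field could be pulled out of a `/usr/bin/time --verbose` report and compared follows; the helper names `maxRssKb` and `relativeChangePercent` are hypothetical and not part of ROOT or of this PR.

```cpp
#include <cassert>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: extract the "Maximum resident set size (kbytes)" value
// from the text written by `/usr/bin/time --verbose`.
// Returns std::nullopt if the field is missing or not a number.
std::optional<long> maxRssKb(const std::string &report) {
  static const std::string key = "Maximum resident set size (kbytes):";
  std::istringstream in(report);
  std::string line;
  while (std::getline(in, line)) {
    const auto pos = line.find(key);
    if (pos == std::string::npos)
      continue;
    try {
      // std::stol skips the whitespace that follows the colon.
      return std::stol(line.substr(pos + key.size()));
    } catch (const std::exception &) {
      return std::nullopt;
    }
  }
  return std::nullopt;
}

// Hypothetical helper: change of `candidateKb` relative to `baselineKb`, in percent.
double relativeChangePercent(long baselineKb, long candidateKb) {
  assert(baselineKb > 0);
  return 100.0 * static_cast<double>(candidateKb - baselineKb) /
         static_cast<double>(baselineKb);
}
```

With the numbers reported above, `relativeChangePercent(228812, 321168)` puts this PR roughly 40% above master in peak resident memory for this command, an absolute difference of 92356 kB.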

@devajithvs devajithvs marked this pull request as ready for review February 24, 2025 13:17
@devajithvs devajithvs requested a review from hahnjo February 24, 2025 13:17
@vgvassilev
Member

That’s a bummer, we screwed up something badly upstream…

@vgvassilev
Member

@ChuanqiXu9, do you see anything obvious that went wrong here? In our current workflows we deserialize 100 MB more with the upstream patch...

@hahnjo
Member

hahnjo commented Feb 25, 2025

Repeating the test of #14495 (comment) with hsimple.C shows that (on my machine) master uses 290 MB, reverting our downstream version of D41416 uses 425 MB, and this PR with the upstream patch uses 414 MB. Unfortunately, the performance impact of the patch on the ROOT side was not tracked during the upstream review and the changes made there...

@hahnjo hahnjo force-pushed the dev.lazytemplate branch 2 times, most recently from 6ab17c5 to de0d97c Compare March 6, 2025 21:14
@hahnjo
Member

hahnjo commented Mar 7, 2025

Hi @ChuanqiXu9 @vgvassilev, I had another look at the patch and found four places where things seem wrong. At least two of them need to be fixed for ROOT, and with the last commit I pushed to this PR, memory usage returns to the same level as master in my measurements.

  1. Handling of inner template arguments in TemplateArgumentHasher: When debugging, I noticed that all template arguments that were instantiations of std::pair hashed to the same value, regardless of their own template arguments. This is quite bad for the STL, where I think maps, for example, are internally implemented using nodes of std::pair<Key, Value>. This can be fixed in the same way ODRHash handles it:
diff --git a/interpreter/llvm-project/clang/lib/Serialization/TemplateArgumentHasher.cpp b/interpreter/llvm-project/clang/lib/Serialization/TemplateArgumentHasher.cpp
index fb62454643..52c3e5ed1d 100644
--- a/interpreter/llvm-project/clang/lib/Serialization/TemplateArgumentHasher.cpp
+++ b/interpreter/llvm-project/clang/lib/Serialization/TemplateArgumentHasher.cpp
@@ -196,6 +196,21 @@ void TemplateArgumentHasher::AddDecl(const Decl *D) {
   }
 
   AddDeclarationName(ND->getDeclName());
+
+  // If this was a specialization we should take into account its template
+  // arguments. This helps to reduce collisions coming when visiting template
+  // specialization types (eg. when processing type template arguments).
+  ArrayRef<TemplateArgument> Args;
+  if (auto *CTSD = dyn_cast<ClassTemplateSpecializationDecl>(D))
+    Args = CTSD->getTemplateArgs().asArray();
+  else if (auto *VTSD = dyn_cast<VarTemplateSpecializationDecl>(D))
+    Args = VTSD->getTemplateArgs().asArray();
+  else if (auto *FD = dyn_cast<FunctionDecl>(D))
+    if (FD->getTemplateSpecializationArgs())
+      Args = FD->getTemplateSpecializationArgs()->asArray();
+
+  for (auto &TA : Args)
+    AddTemplateArgument(TA);
 }
 
 void TemplateArgumentHasher::AddQualType(QualType T) {
  2. In ASTReader::CompleteRedeclChain there is a comment saying "For partitial specialization, load all the specializations for safety." which, after fixing 1., is the remaining reason why the upstream version loads many more templates. The fix I applied is:
diff --git a/interpreter/llvm-project/clang/lib/Serialization/ASTReader.cpp b/interpreter/llvm-project/clang/lib/Serialization/ASTReader.cpp
index 3d325774ba..d65f329ae8 100644
--- a/interpreter/llvm-project/clang/lib/Serialization/ASTReader.cpp
+++ b/interpreter/llvm-project/clang/lib/Serialization/ASTReader.cpp
@@ -7723,14 +7723,8 @@ void ASTReader::CompleteRedeclChain(const Decl *D) {
     }
   }
 
-  if (Template) {
-    // For partitial specialization, load all the specializations for safety.
-    if (isa<ClassTemplatePartialSpecializationDecl,
-            VarTemplatePartialSpecializationDecl>(D))
-      Template->loadLazySpecializationsImpl();
-    else
-      Template->loadLazySpecializationsImpl(Args);
-  }
+  if (Template)
+    Template->loadLazySpecializationsImpl(Args);
 }
 
 CXXCtorInitializer **

basically reverting to what the original D41416 did. Do you remember why that change was made during the upstream review process?

  3. There's a similar comment in RedeclarableTemplateDecl::loadLazySpecializationsImpl about loading all specializations, which also seems dubious to me.
  4. TemplateArgumentHasher has logic to "bail out". While I agree that this is sound to do, it doesn't make sense to me to return a fixed value if one of the internal parts cannot be hashed. Shouldn't we just be able to skip that part and carry on with the hashing process? If that causes a collision then so be it, but at least we are not causing collisions for any template argument list that has a part we cannot hash.
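
The collision described in the first point can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyType`, `hashNameOnly` and `hashWithArgs` are hypothetical names, the mixing function is plain FNV-1a, and none of it is Clang's actual TemplateArgumentHasher code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a (possibly specialized) type; NOT a Clang class.
struct ToyType {
  std::string name;           // e.g. "std::pair"
  std::vector<ToyType> args;  // template arguments, empty for non-templates
};

// FNV-1a mixing of a string into a running 64-bit hash.
inline void mix(std::uint64_t &h, const std::string &s) {
  for (unsigned char c : s) {
    h ^= c;
    h *= 0x100000001b3ULL;
  }
}

// Hashes only the declaration name, so every specialization of one
// template receives the same value.
std::uint64_t hashNameOnly(const ToyType &t) {
  std::uint64_t h = 0xcbf29ce484222325ULL;
  mix(h, t.name);
  return h;
}

// Recursively mixes the name and all nested template arguments.
inline void hashInto(std::uint64_t &h, const ToyType &t) {
  mix(h, t.name);
  mix(h, "<");
  for (const ToyType &arg : t.args) {
    hashInto(h, arg);
    mix(h, ",");
  }
  mix(h, ">");
}

std::uint64_t hashWithArgs(const ToyType &t) {
  std::uint64_t h = 0xcbf29ce484222325ULL;
  hashInto(h, t);
  return h;
}

// Small self-check: two different specializations of the same toy template.
void checkToyHashes() {
  ToyType a{"std::pair", {ToyType{"int", {}}, ToyType{"A", {}}}};
  ToyType b{"std::pair", {ToyType{"int", {}}, ToyType{"B", {}}}};
  assert(hashNameOnly(a) == hashNameOnly(b));  // collide: only the name is hashed
  assert(hashWithArgs(a) != hashWithArgs(b));  // differ once arguments are mixed in
}
```

In this model, a lookup keyed on `hashNameOnly` cannot tell `std::pair<int, A>` from `std::pair<int, B>` and would have to treat both as candidates, which mirrors why colliding hashes lead to more specializations being loaded.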

I pushed my (suspected) solutions for the first two points to this PR, and all ROOT tests seem to pass. I can submit patches for any of the above points upstream if you want, in whichever way is most convenient (perhaps as a single PR, which may be easier to test?). One caveat: the changes cannot be backported to LLVM 20 because the hashing changes break existing on-disk modules.
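
The "bail out" question raised in the list above can also be sketched in a toy model. The assumptions here are loud ones: `Part`, `kBailOutValue`, `hashBailOut` and `hashSkipping` are hypothetical, the fixed value 0 is chosen only for illustration, and a `std::nullopt` entry stands for a part of the argument list that cannot be hashed.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One part of a template argument list; std::nullopt models a part the
// hasher cannot handle. Purely illustrative, not a Clang type.
using Part = std::optional<std::string>;

// Hypothetical fixed value returned when bailing out.
constexpr std::uint64_t kBailOutValue = 0;

// FNV-1a mixing of a string into a running 64-bit hash.
inline void mixPart(std::uint64_t &h, const std::string &s) {
  for (unsigned char c : s) {
    h ^= c;
    h *= 0x100000001b3ULL;
  }
}

// Strategy A: give up on the whole list as soon as one part is unhashable.
std::uint64_t hashBailOut(const std::vector<Part> &parts) {
  std::uint64_t h = 0xcbf29ce484222325ULL;
  for (const Part &p : parts) {
    if (!p)
      return kBailOutValue;
    mixPart(h, *p);
    mixPart(h, ",");
  }
  return h;
}

// Strategy B: skip the unhashable part and keep mixing the rest.
std::uint64_t hashSkipping(const std::vector<Part> &parts) {
  std::uint64_t h = 0xcbf29ce484222325ULL;
  for (const Part &p : parts) {
    if (!p)
      continue;
    mixPart(h, *p);
    mixPart(h, ",");
  }
  return h;
}

// Small self-check: two lists that differ only in their hashable part.
void checkBailOutVersusSkip() {
  std::vector<Part> a{Part{"A"}, std::nullopt};
  std::vector<Part> b{Part{"B"}, std::nullopt};
  assert(hashBailOut(a) == hashBailOut(b));    // both collapse to the fixed value
  assert(hashSkipping(a) != hashSkipping(b));  // hashable parts still distinguish them
}
```

Both strategies keep equal lists hashing equally in this model, as long as the same parts are skipped for equivalent lists; the difference is only in how many unrelated lists end up sharing one hash value.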

@vgvassilev
Member

@hahnjo, great job. That part was not clear to me either but since the patch departed quite far from the original state I decided I must be missing something. If we get to the same levels of memory and runtime consumption I can propose the next two steps:

  1. Pipe this through cmssw (cc: @smuzaffar)
  2. Open a PR against LLVM and we can ask Google to run it on their infrastructure. If that works we land it and you become famous ;)

@hahnjo
Member

hahnjo commented Mar 7, 2025

Ok, but with which changes? The PR currently has 1&2, but I think we need to get agreement on which modifications we actually want. And then we should first test ourselves (in llvm-project) before inviting external tests, otherwise that's just a waste of resources.

@vgvassilev
Member

> Ok, but with which changes? The PR currently has 1&2, but I think we need to get agreement on which modifications we actually want. And then we should first test ourselves (in llvm-project) before inviting external tests, otherwise that's just a waste of resources.

Personally, I want the patch as is if it gets us the same memory footprint as master. I do not see a reason to split it, which would only cause more confusion upstream.

@hahnjo hahnjo force-pushed the dev.lazytemplate branch 2 times, most recently from 77dcc83 to f4e61d3 Compare March 26, 2025 12:49
@hahnjo hahnjo closed this Mar 26, 2025
@hahnjo hahnjo reopened this Mar 26, 2025
@hahnjo hahnjo closed this Mar 27, 2025
@hahnjo hahnjo reopened this Mar 27, 2025
@hahnjo hahnjo force-pushed the dev.lazytemplate branch 2 times, most recently from 599e306 to 1b65dcd Compare March 28, 2025 10:20
@hahnjo
Member

hahnjo commented Mar 28, 2025

This seems to look good on our side. @smuzaffar would it be possible to run this through CMSSW testing? Thanks in advance!

@smuzaffar
Contributor

> This seems to look good on our side. @smuzaffar would it be possible to run this through CMSSW testing? Thanks in advance!

cmssw tests are running via cms-sw#221

@smuzaffar
Contributor

cmssw tests passed cms-sw#221 (comment)

@hahnjo hahnjo force-pushed the dev.lazytemplate branch from 1b65dcd to deb3e7a Compare April 7, 2025 12:27
@hahnjo
Member

hahnjo commented Apr 7, 2025

Latest numbers:

master

$ /usr/bin/time --verbose bin/root -l -b -q -e 'std::vector<int> vec = {1, 2, 3, 4, 5};'

Command exited with non-zero status 255
        Command being timed: "bin/root -l -b -q -e std::vector<int> vec = {1, 2, 3, 4, 5};"
        User time (seconds): 0.10
        System time (seconds): 0.06
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
        Maximum resident set size (kbytes): 212928

devajithvs:dev.lazytemplate

$ /usr/bin/time --verbose bin/root -l -b -q -e 'std::vector<int> vec = {1, 2, 3, 4, 5};'

Command exited with non-zero status 255
        Command being timed: "bin/root -l -b -q -e std::vector<int> vec = {1, 2, 3, 4, 5};"
        User time (seconds): 0.09
        System time (seconds): 0.06
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.15
        Maximum resident set size (kbytes): 214500

@vgvassilev
Member

Slightly worse in memory but tolerable.

Member

@vgvassilev vgvassilev left a comment

LGTM!

@hahnjo hahnjo merged commit 7f390ce into root-project:master Apr 8, 2025
19 of 24 checks passed