Implement the CheckClassInfo module #47567

Draft
wants to merge 1 commit into
base: master

fwyzard
Contributor

@fwyzard fwyzard commented Mar 12, 2025

PR description:

Implement the CheckClassInfo module. This module will query the TClass, ClassProperty and ClassInfo of all the persistent products specified in its configuration.

Add an automated test that checks the ClassInfo of all persistent HLT products.

PR validation:

The new test is expected to fail until the underlying issue is understood and fixed.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

May be backported to 15.0.x to ensure the proper behaviour in that release cycle.

@cmsbuild
Contributor

cmsbuild commented Mar 12, 2025

cms-bot internal usage

@fwyzard fwyzard force-pushed the implement_CheckClassInfo branch from f488b3a to 6206a1b Compare March 12, 2025 00:35
@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47567/44054

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • FWCore/TestModules (core)

@Dr15Jones, @makortel, @smuzaffar can you please review it and eventually sign? Thanks.
@makortel, @missirol, @wddgit this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5a85b5/44931/summary.html
COMMIT: 6206a1b
CMSSW: CMSSW_15_1_X_2025-03-11-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47567/44931/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test TestFWCoreModulesCheckClassInfo had ERRORS

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3920300
  • DQMHistoTests: Total failures: 77
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3920203
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 214 log files, 184 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

The new unit test fails as expected: testing.log.

process = cms.Process("TEST")

# load the latest HLT configuration in order to exercise a large number of event products
process.load('HLTrigger.Configuration.HLT_GRun_cff')
Contributor

If this is targeting specifically the HLT menu event products, would it make sense to move the test to HLTrigger/Configuration, or do you reckon this would be executed anyway if we change the list of those in the menu?

Contributor Author

It's not targeted specifically at HLT, though I ran into the problem with the HLT configuration.
It is just a simple way to construct a large number of modules in a complex enough configuration.

Ideally we could run this test with all CMSSW configurations (online, offline, legacy, Run 3, Phase 2, etc.) any time there are changes to data formats, dictionaries, or ROOT itself.

But I'm not sure what would be a way to set that up, short of adding this check to all cmsDriver jobs.

Contributor

From my framework developer perspective, I really would not want any dependencies outside of the framework packages themselves to be in FWCore/TestModules; that includes python dependencies. When I do work, I only check out exactly the group of packages needed to build and test the framework. Having this test here would vastly extend what would have to be checked out.

So I'm all for the module existing in FWCore/TestModules, but it would be better to test with just the modules defined within the FWCore subsystem.

Contributor Author

> So I'm all for the module existing in FWCore/TestModules, but it would be better to test with just the modules defined within the FWCore subsystem.

The problem is that restricting the test to the small number of framework-only data formats and producers will very likely not reproduce the current issue, making the test much less useful.

Contributor Author

@cms-sw/operations-l2 @cms-sw/pdmv-l2 (assuming cmsDriver is under either of those groups of L2s), would you find it acceptable to add this module to at least some of the cmsDriver workflows?

I'm looking for a way to test that ROOT is able to properly load the dictionaries for all the persistent data formats used in CMSSW, even in the presence of multiple threads and streams.

Contributor

In my mind there are two sets of tests. One is a unit test which uses a controlled situation to prove that the module does what it is supposed to do, i.e. raise an 'alarm' when it encounters the problem. Such a test I would see living in FWCore/TestModules/test. The second test is an integration test that makes sure CMSSW as a whole (or at least important parts) does not have the problem. That test should not be in FWCore/TestModules/test.

Contributor Author

> The second test is an integration test that makes sure CMSSW as a whole (or at least important parts) does not have the problem.

Actually, why does the framework not make this check for all products in all jobs ?

Isn't it a requirement to be able to store the non-transient collections ?

Contributor

> Actually, why does the framework not make this check for all products in all jobs ?
>
> Isn't it a requirement to be able to store the non-transient collections ?

The dictionary existence checks have largely relied on TClass::GetDict() and TClass::GetMissingDictionaries() (and maybe something else I don't remember right now). We had no idea TClass::ClassProperty() or TClass::GetClassInfo() would make a difference, but apparently with -Wl,--as-needed they do (#47470).

Perhaps we should consider adding the TClass::ClassProperty() and TClass::GetClassInfo() calls to the dictionary existence checks.

Contributor Author

> The dictionary existence checks have largely relied on TClass::GetDict() and TClass::GetMissingDictionaries() (and maybe something else I don't remember right now). We had no idea TClass::ClassProperty() or TClass::GetClassInfo() would make a difference, but apparently with -Wl,--as-needed they do (#47470).

Mhm... I didn't know about #47470, thanks for the pointer.

I've backported this test to CMSSW 14.2.0, which should not have the -Wl,--as-needed flags, and I still see the same failures:

  • a crash when checking the products from the full HLT menu
  • an exception when checking a smaller HCAL+PF workflow

So this error seems unrelated to -Wl,--as-needed.

> Perhaps we should consider adding the TClass::ClassProperty() and TClass::GetClassInfo() calls to the dictionary existence checks.

By the way, TClass::ClassProperty() seems fine, in the sense that I haven't encountered any problems with it. TClass::GetClassInfo() is the one that fails.

Contributor

> I've backported this test to CMSSW 14.2.0, which should not have the -Wl,--as-needed flags, and I still see the same failures:

Thanks for the test. So TClass::GetClassInfo() does not seem to be relevant for discovering the memory use increase discussed in #47470.

> Perhaps we should consider adding the TClass::ClassProperty() and TClass::GetClassInfo() calls to the dictionary existence checks.
>
> By the way, TClass::ClassProperty() seems fine, in the sense that I haven't encountered any problems with it. TClass::GetClassInfo() is the one that fails.

I took a look at what TClass::GetClassInfo() does. It calls LoadClassInfo() if it hasn't been called yet (https://github.com/root-project/root/blob/28a45b686066b309e753c4a919d1c21f10aed528/core/meta/inc/TClass.h#L433-L437), and LoadClassInfo() auto-parses the class header unless auto-parsing is disabled (https://github.com/root-project/root/blob/28a45b686066b309e753c4a919d1c21f10aed528/core/meta/src/TClass.cxx#L5824-L5855). So I don't think we want to call that function from the framework (given that auto-parsing typically leads to higher memory use). Its information does not seem to be necessary for the I/O.

@fwyzard Could you describe the context that led you to call TClass::GetClassInfo() and experience the failures?

@fwyzard fwyzard force-pushed the implement_CheckClassInfo branch from 6206a1b to 22ba0a9 Compare March 12, 2025 20:25
@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47567/44067

@cmsbuild
Contributor

Pull request #47567 was updated. @Dr15Jones, @makortel, @smuzaffar can you please check and sign again.

@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

Using std::cerr we get the message about what collection is causing the issue (from the previous test):

product 128falserecoPFRecHitSoALayoutPortableHostCollection_hltParticleFlowRecHitHBHESoASerialSync__TEST :
  wrapper type edm::Wrapper<PortableHostCollection<reco::PFRecHitSoALayout<128,false> > >
  TClass pointer 0x15390773c500
  TClass property 4561
  ClassInfo pointer0x1538d81d4b00

product 128falserecoZVertexLayout128falserecoZVertexTracksLayoutPortableHostMultiCollection_hltPixelVerticesSoA__TEST :
  wrapper type edm::Wrapper<PortableHostMultiCollection<reco::ZVertexLayout<128,false>,reco::ZVertexTracksLayout<128,false> > >
  TClass pointer 0x15390459de80
  TClass property 4561
In file included from DataFormatsVertexSoA_xr dictionary payload:75:
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02880/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-03-11-2300/src/DataFormats/VertexSoA/interface/ZVertexSoA.h:12:3: error: pasting formed 'BOOST_PP_NOT_EQUAL_<U+0000>', an invalid preprocessing token
  GENERATE_SOA_LAYOUT(ZVertexLayout,
  ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02880/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-03-11-2300/src/DataFormats/SoATemplate/interface/SoALayout.h: note: expanded from macro 'GENERATE_SOA_LAYOUT'
...

Using the MessageLogger that message is lost (from the latest test):

product 128falserecoPFRecHitSoALayoutPortableHostCollection_hltParticleFlowRecHitHBHESoASerialSync__TEST :
  wrapper type edm::Wrapper<PortableHostCollection<reco::PFRecHitSoALayout<128,false> > >
  TClass pointer 0x14c4cd0cce00
  TClass property 4561
  ClassInfo pointer0x14c49dc8c200

In file included from DataFormatsVertexSoA_xr dictionary payload:75:
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02880/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-03-12-1100/src/DataFormats/VertexSoA/interface/ZVertexSoA.h:12:3: error: pasting formed 'BOOST_PP_NOT_EQUAL_<U+0000>', an invalid preprocessing token
  GENERATE_SOA_LAYOUT(ZVertexLayout,
  ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02880/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-03-12-1100/src/DataFormats/SoATemplate/interface/SoALayout.h: note: expanded from macro 'GENERATE_SOA_LAYOUT'
...

Throwing an exception instead of using an assertion does not help, because the job fails before reaching the check.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47567/44069

@cmsbuild
Contributor

Pull request #47567 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar can you please check and sign again.

Add an automated test that checks the ClassInfo of all persistent HLT products.
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47567/44070

@cmsbuild
Contributor

Pull request #47567 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar can you please check and sign again.

@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

please test

@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

@pcanal running the same test on a different configuration, step3.py, I get a different error:

cmsRun step3.py
...
The product 128falserecoPFRecHitSoALayoutPortableHostCollection_pfRecHitSoAProducerHBHEOnly__RECO has:
  wrapper type edm::Wrapper<PortableHostCollection<reco::PFRecHitSoALayout<128,false> > >
  TClass pointer 0x7f4b580a2300
  TClass property 4561
----- Begin Fatal Exception 12-Mar-2025 22:53:48 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PFRecHitSoAProducerHCAL@alpaka label='pfRecHitSoAProducerHBHEOnly'
   Additional Info:
      [a] Fatal Root Error: @SUB=TClass::LoadClassInfo
no interpreter information for class edm::Wrapper<PortableHostCollection<reco::PFRecHitSoALayout<128,false> > > is available even though it has a TClass initialization routine.

----- End Fatal Exception -------------------------------------------------

I don't know if this helps isolate the issue.

@makortel
Contributor

> Using std::cerr we get the message about what collection is causing the issue (from the previous test):
> ...
> Using the MessageLogger that message is lost (from the latest test):

Ah, the point is that the JIT (via TClass::GetClassInfo()) can fail by crashing rather than by throwing an exception, right? In that case something that flushes the buffer after each line would indeed make sense.

> Would it maybe work better to use many individual edm::LogVerbatim instead of a single one ?

It could. At least the object destructor (that inserts the message into the queue) would be run before the line that could trigger the crash.

@fwyzard
Contributor Author

fwyzard commented Mar 12, 2025

> Would it maybe work better to use many individual edm::LogVerbatim instead of a single one ?
>
> It could. At least the object destructor (that inserts the message into the queue) would be run before the line that could trigger the crash.

Yes, it works:

The product 128falserecoZVertexLayout128falserecoZVertexTracksLayoutPortableHostMultiCollection_hltPixelVerticesSoA__TEST has:
  wrapper type edm::Wrapper<PortableHostMultiCollection<reco::ZVertexLayout<128,false>,reco::ZVertexTracksLayout<128,false> > >
  TClass pointer 0x14faf1d6b780
  TClass property 4561
In file included from DataFormatsVertexSoA_xr dictionary payload:75:
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02880/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-03-12-1100/src/DataFormats/VertexSoA/interface/ZVertexSoA.h:12:3: error: pasting formed 'BOOST_PP_NOT_EQUAL_<U+0000>', an invalid preprocessing token
  GENERATE_SOA_LAYOUT(ZVertexLayout,
  ^

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5a85b5/44950/summary.html
COMMIT: 5d7b664
CMSSW: CMSSW_15_1_X_2025-03-12-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47567/44950/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 2 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS
---> test TestFWCoreModulesCheckClassInfo had ERRORS

Comparison Summary

Summary:

@pcanal
Contributor

pcanal commented Mar 13, 2025

> Perhaps we should consider adding the TClass::ClassProperty() and TClass::GetClassInfo() calls to the dictionary existence check

Calling TClass::[Class]Property is fine if you want, but unnecessary. Do NOT call TClass::GetClassInfo(): it is an explicit request to load the interpreter information for the class and will certainly lead to header parsing ... well, unless that is the goal :)

@pcanal
Contributor

pcanal commented Mar 13, 2025

> no interpreter information for class edm::Wrapper<PortableHostCollection<reco::PFRecHitSoALayout<128,false> > > is available even though it has a TClass initialization routine.

This usually indicates that the autoparsing for that class failed. The 3 main reasons:

  • header files for that class (and/or its dependencies) are not registered (unlikely case)
  • header files are not found (e.g. they were not installed)
  • header parsing failed (if that happens, it might be that things are set up differently in the runtime environment (different versions of header files, different macro sets) than they were during the dictionary generation).

@fwyzard
Contributor Author

fwyzard commented Mar 13, 2025

No, the goal is not to force the header parsing, it's to use the dictionaries without parsing the headers.

Then I need to reproduce the original errors without calling GetClassInfo().

@fwyzard fwyzard marked this pull request as draft March 28, 2025 18:24