Skip to content

Phase-2 mkFit: propagation to plane / Kalman operations on plane / Matriplex with support for scalar operations and VDT #46325

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Oct 17, 2024

Conversation

mmasciov
Copy link
Contributor

@mmasciov mmasciov commented Oct 9, 2024

PR description:

This PR is focusing on mkFit developments for phase-2.
While originally meant to be purely technical (mkFit is not yet enabled by default in phase-2), a small bug was fixed that is slightly affecting the phase-1/Run-3 performance. This was reported at Tracking POG on October 7, 2024. Changes for Run-3 are anyway negligible.

For phase-2 mkFit, this PR:

  • introduces propagation to plane and Kalman operations on plane during pattern recognition;
  • introduces support for scalar operations and VDT in Matriplex.

Propagation to plane is used as a mkFit-default choice for phase-2 (where mkFit is not deployed yet), being required to select hits on tilted layers, while phase-1 configuration is left unchanged (no significant gain wrt. current default).

In addition, standalone mkFit functionalities are also extended/improved.

PR validation:

Please, refer to presentation at Tracking POG on October 7, 2024 (including MTV results).

--> For phase-1 (Run-3), (only physics) performance is only slightly affected, due to a small bug-fix in application of material effects (commit 4a6088d).

--> For phase-2, mkFit physics performance (mkFit is not used by default so far) is largely improved.

FYI: @kskovpen

@mmasciov
Copy link
Contributor Author

mmasciov commented Oct 9, 2024

type tracking

@cmsbuild cmsbuild added this to the CMSSW_14_2_X milestone Oct 9, 2024
@mmasciov mmasciov changed the title hase-2 mkFit: propagation to plane / Kalman operations on plane / Matriplex with support for scalar operations and VDT Phase-2 mkFit: propagation to plane / Kalman operations on plane / Matriplex with support for scalar operations and VDT Oct 9, 2024
@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 9, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 9, 2024

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 9, 2024

A new Pull Request was created by @mmasciov for master.

It involves the following packages:

  • RecoTracker/MkFit (reconstruction)
  • RecoTracker/MkFitCMS (reconstruction)
  • RecoTracker/MkFitCore (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @felicepantaleo, @gpetruc, @makortel, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmasciov
Copy link
Contributor Author

mmasciov commented Oct 9, 2024

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 9, 2024

-1

Failed Tests: Build ClangBuild
Size: This PR adds an extra 472KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42084/summary.html
COMMIT: 03f36f3
CMSSW: CMSSW_14_2_X_2024-10-09-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46325/42084/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42084/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42084/git-merge-result

Build

I found compilation warning when building: See details on the summary page.

Clang Build

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #46325 was updated. @cmsbuild, @jfernan2, @mandrenguyen can you please check and sign again.

@cmsbuild cmsbuild removed the hold label Oct 15, 2024
@slava77
Copy link
Contributor

slava77 commented Oct 15, 2024

@mmasciov can you squash commits please?

@mandrenguyen, can you please clarify if you'd like me to squash the 31 commits into a single one, or to try and reduce to a given number?

A reasonable reduction would be sufficient for me. Thanks!

is there some clear quantifiable rationale/guideline for the number of commits per PR?
In this case we have 8K (+6/–2K) lines changed in 71 files over 30 commits; this is coming for developments that accumulated over about a year.
I'm pretty sure most other updates in CMSSW are smaller in terms of lines of change per commit. (we just had a few hundred tens PRs with just a few lines changed for LLVM etc).

What @mmasciov has done looks reasonable to me.

is it possible to know in advance to have some guideline?
We/Mario did squash some commits before making this PR and it seemed reasonable for the mkFit team already.

@mmasciov
Copy link
Contributor Author

@jfernan2, I suspect that profiling tests got stuck. Should I restart the tests once more?

@mandrenguyen
Copy link
Contributor

@mmasciov can you squash commits please?

@mandrenguyen, can you please clarify if you'd like me to squash the 31 commits into a single one, or to try and reduce to a given number?

A reasonable reduction would be sufficient for me. Thanks!

is there some clear quantifiable rationale/guideline for the number of commits per PR?
In this case we have 8K (+6/–2K) lines changed in 71 files over 30 commits; this is coming for developments that accumulated over about a year.
I'm pretty sure most other updates in CMSSW are smaller in terms of lines of change per commit. (we just had a few hundred tens PRs with just a few lines changed for LLVM etc).

What @mmasciov has done looks reasonable to me.

is it possible to know in advance to have some guideline? We/Mario did squash some commits before making this PR and it seemed reasonable for the mkFit team already.

It was just a suggestion to clean things up a bit. I don't have any guideline besides common sense.

@mmasciov
Copy link
Contributor Author

@cmsbuild please abort

@mmasciov
Copy link
Contributor Author

@cmsbuild please test

@slava77
Copy link
Contributor

slava77 commented Oct 16, 2024

is it possible to know in advance to have some guideline? We/Mario did squash some commits before making this PR and it seemed reasonable for the mkFit team already.

It was just a suggestion to clean things up a bit. I don't have any guideline besides common sense.

Thanks for clarifying. It still feels more like a non-specific personal taste requirement.

It may help to prepare some guideline, in particular for larger PRs.
Perhaps also to take in consideration the costs (loss of useful history detail; rerunning the lengthy tests; possibly days of delay to re-collect signatures).

Clearly, commits like "fix", "fix" deserve squashing or change in comments; I'm not sure there were any like that in this case.

@mmasciov
Copy link
Contributor Author

@jfernan2, I suspect that profiling tests got stuck. Should I restart the tests once more?

I have restarted the tests. However, please note that we did measure timing performance.
In phase-1, there's no visible difference (black is target; blue is reference; red is a variation of target):
PR46325-timing

In phase-2, and only for initialStep, mkFit would allow for a reduction of the building time (as shown in https://indico.cern.ch/event/1461703/contributions/6168739/attachments/2942096/5169307/mkFit-Phase2-7October2024.pdf#page=8). However, this is not enabled by default, therefore this PR will have no impact on the default phase-2 timing.

@makortel
Copy link
Contributor

possibly days of delay to re-collect signatures

Signature recollection is no longer necessary when the diff itself doesn't change (see #43845).

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42237/summary.html
COMMIT: 62bccea
CMSSW: CMSSW_14_2_X_2024-10-15-2300/el8_amd64_gcc12
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46325/42237/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 33 lines to the logs
  • Reco comparison results: 24955 differences found in the comparisons
  • DQMHistoTests: Total files compared: 45
  • DQMHistoTests: Total histograms compared: 3441825
  • DQMHistoTests: Total failures: 30409
  • DQMHistoTests: Total nulls: 16
  • DQMHistoTests: Total successes: 3411380
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.06999999999999998 KiB( 44 files compared)
  • DQMHistoSizes: changed ( 140.023 ): 0.016 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 140.043 ): -0.016 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 140.063 ): 0.031 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 141.042 ): 0.051 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 141.044 ): -0.008 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 141.046 ): -0.004 KiB JetMET/SUSYDQM
  • Checked 197 log files, 168 edm output root files, 45 DQM output files
  • TriggerResults: found differences in 3 / 43 workflows

@mmasciov
Copy link
Contributor Author

@mandrenguyen, this should also be ready to be merged, since there's no requirement for signature recollection.

@jfernan2
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@jfernan2
Copy link
Contributor

@jfernan2, I suspect that profiling tests got stuck. Should I restart the tests once more?

I have restarted the tests. However, please note that we did measure timing performance. In phase-1, there's no visible difference (black is target; blue is reference; red is a variation of target): PR46325-timing

In phase-2, and only for initialStep, mkFit would allow for a reduction of the building time (as shown in https://indico.cern.ch/event/1461703/contributions/6168739/attachments/2942096/5169307/mkFit-Phase2-7October2024.pdf#page=8). However, this is not enabled by default, therefore this PR will have no impact on the default phase-2 timing.

Thanks @mmasciov
I have signed the PR, however the change in Run3 wfs from mkFit does not look negligible:

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42237/profiling/12634.21/diff-step3_cpu.resources.json.html

@mandrenguyen
Copy link
Contributor

@mandrenguyen, this should also be ready to be merged, since there's no requirement for signature recollection.

Sure. Can you just respond to the question from Javier please? Then I'll merge it.

@mmasciov
Copy link
Contributor Author

@jfernan2, I suspect that profiling tests got stuck. Should I restart the tests once more?

I have restarted the tests. However, please note that we did measure timing performance. In phase-1, there's no visible difference (black is target; blue is reference; red is a variation of target): PR46325-timing
In phase-2, and only for initialStep, mkFit would allow for a reduction of the building time (as shown in https://indico.cern.ch/event/1461703/contributions/6168739/attachments/2942096/5169307/mkFit-Phase2-7October2024.pdf#page=8). However, this is not enabled by default, therefore this PR will have no impact on the default phase-2 timing.

Thanks @mmasciov I have signed the PR, however the change in Run3 wfs from mkFit does not look negligible:

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42237/profiling/12634.21/diff-step3_cpu.resources.json.html

Thanks. The positive differences that are related to mkFit that I see are mostly related to MkFitOutputConverter. This may be explained by a very slight increase of the hit multiplicity after the bug fix, and sums up to O(10 ms). On the other hand, I see that 28 ms are saved in pixelLessStepTrackCandidatesMkFit.
If one looks at the relative size of all these differences, it is similar or even smaller than differences found in other modules that have nothing to do with mkFit, which would indicate, together with our independent timing measurements, that the overall difference is not statistically significant.

@mmasciov
Copy link
Contributor Author

@mandrenguyen, this should also be ready to be merged, since there's no requirement for signature recollection.

Sure. Can you just respond to the question from Javier please? Then I'll merge it.

Sure, done: #46325 (comment)

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit f97584b into cms-sw:master Oct 17, 2024
13 checks passed
@slava77
Copy link
Contributor

slava77 commented Oct 17, 2024

I have signed the PR, however the change in Run3 wfs from mkFit does not look negligible:

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ed71cc/42237/profiling/12634.21/diff-step3_cpu.resources.json.html

is there a way to sort this thing by e.g. cost for wrt total job time (or anything else)?
alpha-ordered by the class name is hard to read, considering significant unrelated fluctuations are also present

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants