Updates for runTheMatrix.py: input checks, GPUs repartition, input recycling #47377

Merged (8 commits) on Mar 25, 2025

Conversation

AdrianoDee
Contributor

@AdrianoDee AdrianoDee commented Feb 17, 2025

PR description:

This PR proposes a few modifications to runTheMatrix.py and related packages. It adds the ability to:

  1. check whether the default samples for the requested workflows are actually defined. This is done via the -c/--checkInputs flag. It should solve #46910 ("[RFC] Minimal test of Configuration/PyReleaseValidation/python/relval_steps.py validity") if runTheMatrix.py -n -c is run in the routine PR tests;

  2. have a workflow start from a specific step (GEN, SIM, DIGI, ...) with the option --startFrom STEP. This removes all the steps preceding the one whose cmsDriver.py command has -s STEP, [...];

  3. use a different file as input with --recycle. This is intended to be used either together with --startFrom or on workflows whose first step uses a pre-existing input;

  4. allow duplicate workflows in the input list (-l WF, WF, WF [...]) with --allowDuplicates. Each workflow runs in a different job (if specified) and _jobX is appended to the work area name so that duplicates do not share a folder.
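The step trimming behind --startFrom can be pictured with a minimal sketch. It is not the code in this PR: the helper names `first_step` and `trim_to_step` are hypothetical, the command lines are made up for illustration, and it assumes that a step "starts from" STEP when STEP is the first entry of the command's -s list.

```python
import shlex


def first_step(command):
    """Return the first step named after -s/--step in a cmsDriver.py command.

    Hypothetical helper: returns None when the command has no step option.
    """
    tokens = shlex.split(command)
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-s", "--step"):
            # "DIGI:pdigi_valid,L1,DIGI2RAW" -> "DIGI"
            return value.split(",")[0].split(":")[0]
    return None


def trim_to_step(commands, start_from):
    """Drop every command that precedes the first one starting at start_from."""
    for index, command in enumerate(commands):
        if first_step(command) == start_from:
            return commands[index:]
    raise ValueError(f"no step starts from {start_from!r}")


# Made-up command lines, for illustration only.
workflow = [
    "cmsDriver.py TTbar_13TeV_cfi -s GEN,SIM -n 10",
    "cmsDriver.py step2 -s DIGI:pdigi_valid,L1,DIGI2RAW -n 10",
    "cmsDriver.py step3 -s RAW2DIGI,RECO -n 10",
]
for command in trim_to_step(workflow, "DIGI"):
    print(command)
```

With this input the GEN,SIM command is dropped and the remaining two are kept in order. A trimmed workflow has no step producing its input any more, which is presumably why --recycle is meant to be used alongside --startFrom.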

In addition, when running with the --gpu option and multiple jobs (-j N), each job is now assigned to a different GPU. The available GPUs can also be selected by compute capability (NVIDIA only) with the existing --cuda-capabilities option, or by name with the existing --force-gpu-name option. If more jobs than available GPUs are requested, the job-to-GPU assignment wraps around, starting again from the first available GPU, until all jobs are assigned. For example, with 8 jobs and 3 GPUs:

  • GPU 0 -> jobs [0, 3, 6]
  • GPU 1 -> jobs [1, 4, 7]
  • GPU 2 -> jobs [2, 5]

This should solve #47337
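The job-to-GPU assignment described above is a plain round-robin, which can be sketched in a few lines. This is only an illustration of the scheme, not the code added by the PR; the function name `assign_jobs_to_gpus` is hypothetical.

```python
from collections import defaultdict


def assign_jobs_to_gpus(n_jobs, gpu_ids):
    """Map each GPU id to the list of job indices it receives (round-robin)."""
    if not gpu_ids:
        raise ValueError("no GPU available")
    assignment = defaultdict(list)
    for job in range(n_jobs):
        # Wrap around to the first GPU once every GPU has received a job.
        assignment[gpu_ids[job % len(gpu_ids)]].append(job)
    return dict(assignment)


for gpu, jobs in assign_jobs_to_gpus(8, [0, 1, 2]).items():
    print(f"GPU {gpu} -> jobs {jobs}")
# GPU 0 -> jobs [0, 3, 6]
# GPU 1 -> jobs [1, 4, 7]
# GPU 2 -> jobs [2, 5]
```

The modulo keeps the load difference between any two GPUs to at most one job, whatever the ratio of jobs to GPUs.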

@AdrianoDee AdrianoDee changed the title from "Add SimpleTrackValidation Analyzer" to "Updates for runTheMatrix.py" Feb 17, 2025
@cmsbuild cmsbuild added this to the CMSSW_15_1_X milestone Feb 17, 2025
@cmsbuild
Contributor

cmsbuild commented Feb 17, 2025

cms-bot internal usage

@AdrianoDee AdrianoDee marked this pull request as draft February 17, 2025 16:28
@AdrianoDee AdrianoDee changed the title from "Updates for runTheMatrix.py" to "Updates for runTheMatrix.py: input checks, GPUs repartition, input recycling" Feb 17, 2025
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47377/43734

@AdrianoDee AdrianoDee marked this pull request as ready for review February 17, 2025 16:29
@cmsbuild
Contributor

A new Pull Request was created by @AdrianoDee for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (upgrade, pdmv)

@AdrianoDee, @Moanwar, @cmsbuild, @DickyChant, @miquork, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @fabiocos, @makortel, @missirol, @slomeo this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@AdrianoDee
Contributor Author

test parameters:

  • relval_opts = -c

@AdrianoDee
Contributor Author

enable gpu

@AdrianoDee
Contributor Author

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals RelVals-INPUT
Size: This PR adds an extra 108KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6d8ed7/44443/summary.html
COMMIT: 37dbec2
CMSSW: CMSSW_15_1_X_2025-02-17-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47377/44443/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

ERROR Running runTheMatrix for '-s -l 9.0,101.0,1306.0,10224.0,25202.0,250202.181'

RelVals-INPUT

ERROR Running runTheMatrix for '-l 4.17,4.22,4.23,4.24,4.25,4.26,4.27,4.28,4.29,4.34,4.36,4.37,4.4,4.41,4.42,4.43,4.44,4.45,4.51,4.52,4.53,4.54,4.55,4.57,4.58,4.6,4.61,4.62,4.63,4.64,4.65,4.67,4.68,4.71,4.72,4.73,4.74,4.75,4.76,4.77,4.78,134.701,134.702,134.703,134.704,134.705,134.706,134.707,134.708,134.709,134.71,134.801,134.802,134.803,134.804,134.805,134.806,134.807,134.808,134.809,134.81,134.811,134.812,134.813,134.901,134.902,134.903,134.904,134.905,134.906,134.907,134.908,134.909,134.91,134.911,134.912,136.721,136.722,136.723,136.724,136.725,136.726,136.727,136.728,136.729,136.73,136.731,136.732,136.733,136.734,136.735,136.736,136.737,136.738,136.739,136.74,136.741,136.742,136.743,136.744,136.745,136.746,136.747,136.748,136.749,136.75,136.751,136.752,136.753,136.754,136.755,136.756,136.757,136.758,136.759,136.76,136.761,136.762,136.763,136.764,136.765,136.766,136.767,136.768,136.769,136.77,136.771,136.772,136.773,136.774,136.775,136.776,136.777,136.778,136.779,136.78,136.7801,136.7802,136.7803,136.781,136.782,136.783,136.784,136.785,136.786,136.787,136.788,136.789,136.79,136.791,136.792,136.793,136.794,136.795,136.796,136.797,136.798,136.799,136.8,136.801,136.802,136.803,136.804,136.805,136.806,136.807,136.808,136.809,136.81,136.811,136.812,136.813,136.814,136.815,136.816,136.817,136.818,136.819,136.82,136.821,136.822,136.823,136.824,136.825,136.826,136.827,136.828,136.829,136.83,136.831,136.832,136.833,136.834,136.835,136.836,136.837,136.838,136.839,136.8391,136.84,136.841,136.842,136.843,136.844,136.845,136.846,136.847,136.848,136.849,136.85,136.8501,136.851,136.852,136.853,136.854,136.855,136.856,136.8561,136.8562,136.857,136.858,136.859,136.86,136.861,136.862,136.863,136.864,136.8642,136.865,136.866,136.867,136.868,136.869,136.87,136.871,136.872,136.873,136.874,136.875,136.876,136.877,136.878,136.879,136.88,136.881,136.882,136.883,136.884,136.885,136.8855,136.886,136.8861,136.8862,136.887,136.888,136.8885,136.889,136.89,136.891,136.892,136.893,136.894,136.895,136.896,136.897,136.898,136.899,136.901,136.902,136.903,136.904,137.8,138.1,138.2,138.3,138.4,138.5,139.001,139.002,139.003,139.004,139.005,140.001,140.002,140.003,140.004,140.005,140.006,140.007,140.008,140.009,140.01,140.011,140.021,140.022,140.023,140.024,140.025,140.026,140.027,140.028,140.029,140.03,140.031,140.042,140.043,140.044,140.045,140.046,140.047,140.048,140.049,140.05,140.051,140.062,140.063,140.064,140.065,140.066,140.067,140.068,140.069,140.071,140.072,140.073,140.074,140.075,140.076,140.077,140.078,140.101,140.102,140.103,140.104,140.105,140.106,140.107,140.108,140.109,140.11,140.111,140.112,140.113,140.56,140.5611,140.57,140.58,140.6,140.61,141.001,141.002,141.003,141.004,141.005,141.006,141.007,141.008,141.008405,141.008411,141.008421,141.009,141.01,141.011,141.012,141.013,141.031,141.032,141.033,141.034,141.035,141.036,141.037,141.038,141.039,141.041,141.042,141.043,141.044,141.045,141.046,141.047,141.048,141.049,141.101,141.102,141.103,141.104,141.105,141.106,141.107,141.108,141.109,141.11,141.111,141.112,141.113,141.114,141.901,141.902,142.0,142.901,142.902,142.903,143.901,143.902,143.911,145.0,145.001,145.002,145.003,145.004,145.005,145.006,145.007,145.008,145.009,145.01,145.011,145.012,145.013,145.014,145.1,145.101,145.102,145.103,145.104,145.105,145.106,145.107,145.108,145.109,145.11,145.111,145.112,145.113,145.114,145.2,145.201,145.202,145.203,145.204,145.205,145.206,145.207,145.208,145.209,145.21,145.211,145.212,145.213,145.214,145.3,145.301,145.302,145.303,145.304,145.305,145.306,145.307,145.308,145.309,145.31,145.311,145.312,145.313,145.314,145.4,145.401,145.402,145.403,145.404,145.405,145.406,145.407,145.408,145.409,145.41,145.411,145.412,145.413,145.414,145.5,145.501,145.502,145.503,145.504,145.505,145.506,145.507,145.508,145.509,145.51,145.511,145.512,145.513,145.514,145.6,145.601,145.602,145.603,145.604,145.605,145.606,145.607,145.608,145.609,145.61,145.611,145.612,145.613,145.614,145.7,145.701,145.702,145.703,145.704,145.705,145.706,145.707,145.708,145.709,145.71,145.711,145.712,145.713,145.714,159.01,134.0,134.99601,134.99602,134.99603,134.99901,144.6,11024.2,1000.0,1001.0,1001.2,1001.3,1001.4,1002.0,1002.3,1002.4,1002.5,1003.0,1005.0,1010.0,1020.0,1030.0,1040.0,1040.1,1041.0,1042.0,1046.0,1047.0,1048.0,1049.0,1052.0,1052.1,2500.001,2500.002,2500.003,2500.011,2500.012,2500.013,2500.021,2500.022,2500.023,2500.024,2500.031,2500.032,2500.033,2500.034,2500.101,2500.111,2500.112,2500.131,2500.201,2500.211,2500.212,2500.221,2500.222,2500.223,2500.224,2500.225,2500.226,2500.227,2500.228,2500.231,2500.232,2500.233,2500.234,2500.235,2500.236,2500.237,2500.238,2500.241,2500.242,2500.243,2500.244,2500.245,2500.251,2500.301,2500.311,2500.901,2500.902'

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 867
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52204
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor

fwyzard commented Mar 22, 2025

There seems to be an unrelated problem with the CUDA drivers on the worker node.

@fwyzard
Contributor

fwyzard commented Mar 23, 2025

please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6d8ed7/45158/summary.html
COMMIT: 0f9e1bb
CMSSW: CMSSW_15_1_X_2025-03-23-0000/el8_amd64_gcc12
Additional Tests: ROCM,CUDA
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47377/45158/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 2 lines to the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3909207
  • DQMHistoTests: Total failures: 67
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3909120
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 38
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53033
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • You potentially removed 80 lines from the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 37
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53034
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files

@AdrianoDee
Contributor Author

+pdmv

@Moanwar
Contributor

Moanwar commented Mar 24, 2025

+Upgrade

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

REMINDER @mandrenguyen, @sextonkennedy, @rappoccio, @antoniovilela: This PR was tested with #47669, please check if they should be merged together

@fwyzard
Contributor

fwyzard commented Mar 24, 2025

This PR and #47669 can be merged independently.

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 56e8707 into cms-sw:master Mar 25, 2025
18 checks passed
@AdrianoDee AdrianoDee deleted the recycle_and_checks_runthematrix branch March 26, 2025 10:56

8 participants