Skip to content

Conversation

@wddgit
Copy link
Contributor

@wddgit wddgit commented Mar 21, 2025

PR description:

Remove from SubProcess from EventProcessor related code.

I've intentionally left uses of SubProcess related to the EventSetup, Services, and module deletion for separate later PRs, but this one gets everything else.

This is the second PR in a series. The first was PR #47565. That PR removed support for SubProcess from the CMS configuration system and made it impossible to run a job with a SubProcess configured. It also removed all SubProcess tests.

Resolves cms-sw/framework-team#1282

PR validation:

Relies on existing tests. There is no new functionality. This is cleaning up code that served no purpose anymore.

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 21, 2025

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47659/44194

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit for master.

It involves the following packages:

  • DataFormats/Provenance (core)
  • FWCore/Common (core)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/ParameterSet (core)
  • FWCore/TestProcessor (core)
  • IOPool/Common (core)
  • IOPool/Output (core)

@Dr15Jones, @cmsbuild, @makortel, @smuzaffar can you please review it and eventually sign? Thanks.
@makortel, @missirol, @mmusich, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Mar 21, 2025

enable threading

@wddgit
Copy link
Contributor Author

wddgit commented Mar 21, 2025

please test

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 168KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ccc39/45144/summary.html
COMMIT: 7dc0b16
CMSSW: CMSSW_15_1_X_2025-03-21-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47659/45144/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 2 lines to the logs
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3909207
  • DQMHistoTests: Total failures: 51
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3909136
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

Comparison differences are related to #47071

PathsAndConsumesOfModules pathsAndConsumesOfModules;
pathsAndConsumesOfModules.initialize(schedule_.get(), preg());

std::vector<ModuleProcessName> consumedBySubProcesses;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the consumedBySubProcess something to be cleaned up in a subsequent PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is part of the module deletion code. I was going to tackle that next.

}) | chain::runLast(std::move(task));
ServiceRegistry::Operate op(serviceToken_);
schedule_->writeProcessBlockAsync(
task, principalCache_.processBlockPrincipal(processBlockType), &processContext_, actReg_.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task could be moved

Suggested change
task, principalCache_.processBlockPrincipal(processBlockType), &processContext_, actReg_.get());
std::move(task), principalCache_.processBlockPrincipal(processBlockType), &processContext_, actReg_.get());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to not move so that the ServiceRegistry::Operate goes out of scope before the task goes out of scope. I believe there is the extremely unlikely possibility the Services stay on a very long time and cause problems during shutdown, because once we call doneWaiting everything else continues to completion and we are off to the races. IIRC, we actually saw that problem in the past in some rare cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Could you then add a comment here that the lifetime of the task must be longer than the lifetime of op for that reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There 10 or 20 places in the code like this in EventProcessor. Should I mark them all? Maybe a good idea. This is something easy to forget about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the two comments you suggested. If there is more related to this I would suggest deferring it until I submit the PR dealing with SubProcess and Services. Let me know about if there is something more you want me to do and I'll get it into the first version of that PR. (That is assuming it isn't something that is going to cause an immediate problem.)

}
}) | chain::runLast(std::move(task));
ServiceRegistry::Operate op(serviceToken_);
schedule_->writeRunAsync(task, runPrincipal, &processContext_, actReg_.get(), mergeableRunProductMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task could be moved

Suggested change
schedule_->writeRunAsync(task, runPrincipal, &processContext_, actReg_.get(), mergeableRunProductMetadata);
schedule_->writeRunAsync(std::move(task), runPrincipal, &processContext_, actReg_.get(), mergeableRunProductMetadata);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as previous comment.

for (unsigned int i = 0; i < preallocations_.numberOfStreams(); ++i) {
schedule_->processOneStreamAsync<Traits>(
WaitingTaskHolder(taskGroup_, &streamLoopWaitTask), i, transitionInfo, serviceToken_);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the replacement of beginGlobalTransitionAsync() with schedule_->processOneGlobalAsync() fine as such, although one could make an argument against along "if we ever want to have a customization point there, we need to add the beginGlobalTransitionAsync() back".

I'm less certain about the replacement of beginStreamsTransitionAsync() with the loop over streams calling schedule_->processOneStreamAsync(). Or in other words, I'd find some value in keeping the loop over streams abstracted in beginStreamsTransitionAsync().

Further thoughts? @wddgit @Dr15Jones

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What ever we do we should keep EventProcessor and TestProcessor using the same code, preferable by sharing the same routines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop over streams is already handled very differently in EventProcessor and TestProcessor. The one in EventProcessor involves complex concurrency code. In the loop over streams in EventProcessor, tasks are inserted into the streamQueues_. For example, this is the loop over streams in EventProcessor::beginRunAsync. beginStreamsTransitionAsync is not called at all. The call to beginStreamTransitionAsync (almost the same but no "s" in the function name) is inside streamBeginRunAsync.

                            streamQueuesInserter_.push(*holder.group(), [this, status, holder]() mutable {
                              for (unsigned int i = 0; i < preallocations_.numberOfStreams(); ++i) {
                                CMS_SA_ALLOW try {
                                  streamQueues_[i].push(*holder.group(), [this, i, status, holder]() mutable {
                                    streamBeginRunAsync(i, std::move(status), std::move(holder));
                                  });
                                } catch (...) {
                                  if (status->streamFinishedBeginRun()) {
                                    WaitingTaskHolder copyHolder(holder);
                                    copyHolder.doneWaiting(std::current_exception());
                                    status->resetBeginResources();
                                    queueWhichWaitsForIOVsToFinish_.resume();
                                    exceptionRunStatus_ = status;
                                  }
                                }
                              }
                            });

TestProcessor does not have support for this level of concurrency. Even before this PR, TestProcessor is the only thing in all of CMSSW using beginStreamsTransitionAsync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add the beginStreamTransitionAsync back, but it will simply call directly through to the other function and add no value at all. It seems to me, it is just as likely to get in the way as help if a new version of SubProcess is implemented. And in the meantime removing it simplifies the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two new functions in TestProcessor, processTransitionForAllStreams and processGlobalTransition. They serve to get rid of some of the code duplication in this part of the code as we discussed.

emptyList,
false);
for (unsigned int i = 0; i < preallocations_.numberOfStreams(); ++i) {
schedule_->processOneStreamAsync<Traits>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loop has a problem. Each iteration creates a new WaitingTaskHolder, while the old code made 1 which lived for the lifetime of the loop. If the thread running the loop was stalled, then the already started scheduled could finish their jobs an cause all the previously created WaitingTaskHolders to go away thereby starting the held task BEFORE this loop finishes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would agree that in our code this pattern would cause problems. But in this case, it is in a FinalWaitingTask section of code. It will finish the loop before it gets to the wait. I think it will be OK in this case.

          FinalWaitingTask streamLoopWaitTask{taskGroup_};

          using Traits = OccurrenceTraits<LuminosityBlockPrincipal, BranchActionStreamEnd>;

          for (unsigned int i = 0; i < preallocations_.numberOfStreams(); ++i) {
            schedule_->processOneStreamAsync<Traits>(
                WaitingTaskHolder(taskGroup_, &streamLoopWaitTask), i, transitionInfo, serviceToken_, false);
          }

          streamLoopWaitTask.wait();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chris was right about this. Fixed. Thanks.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 9, 2025

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47659/44447

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #47659 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar can you please check and sign again.

@wddgit
Copy link
Contributor Author

wddgit commented Apr 14, 2025

please test

Only change is a rebase on top of the other PR (which was merged) to resolve the conflict. We should remove the hold if the tests pass.

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 76KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ccc39/45544/summary.html
COMMIT: f2757a0
CMSSW: CMSSW_15_1_X_2025-04-14-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47659/45544/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@makortel
Copy link
Contributor

Comparison differences are related to #47071

@makortel
Copy link
Contributor

+core

@wddgit
Copy link
Contributor Author

wddgit commented Apr 17, 2025

@mandrenguyen This PR is not high priority, but I just wanted to make sure it was still on your list of things that need to be done. The other PR was merged and seems OK. I rebased this one already on top of it and this one is ready to merge. I think the hold can be released.

@makortel
Copy link
Contributor

unhold

Although I don't think I have the power to do that

@mandrenguyen
Copy link
Contributor

mandrenguyen commented Apr 21, 2025

unhold
apologies @wddgit I was off for a few days

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit b656ba0 into cms-sw:master Apr 21, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove SubProcess from EventProcessor and Run/Lumi/Event related code

5 participants