add exit code dependent retry policy #9276

Closed
aspiringmind-code wants to merge 37 commits into dmwm:master from aspiringmind-code:improve_resubmit

@aspiringmind-code aspiringmind-code commented Feb 24, 2026

Contributor

Fix #9264

Salient Features of the New ExitCode Dependent Resubmission and Retry Policy:

  • Automatic calculation of a job-id-specific resubmit_counter, based on the lastExitCode-dependent max_retries and crab_retry
  • retries_consumed_for_ec resets to 0 when the exit code changes
  • We start with a limit of 100 retries, but it increases with every resubmit. By how much depends on the exit code. The most common case would be the limit becoming 103, 106, ... with each resubmit. So effectively the user does not hit an upper bound as long as they wish to resubmit. This preserves current behaviour
  • If no parameters are specified with the resubmission and the exit code suggests that increasing maxmemory or maxjobruntime can help, an automatic increase of 30% is tried until hitting the upper bounds of 7500 MB and 47 hours respectively
  • A retry delay of 900 s is added for each recoverable error exit code. This delay can be changed per exit code
  • If the exit code shows that changing site will help, the site of the last retry is dropped from the availableSet for the current retry. Note that the site is not blacklisted and will be available for subsequent retries. Also note that this applies to the automatic retries of a plain crab submit as well, not only to crab resubmit
  • The source of truth is the EXIT_RETRY_POLICY dictionary in RetryJob. It has an entry for every recoverable exit code, with keys like type, max_retries, delay, msg, change_site, increase_memory, memory_factor, runtime_factor, increase_runtime and handler
  • effective_max_retries is exit code dependent and calculated by the formula (base_max + 1)*(resubmit_counter + 1) - 1, where base_max is specified in the policy dict. So for an exit code like 8020 every resubmit will give 3 additional retries. For exit code 8028 every resubmit will give 10 additional retries.
  • When effective_max_retries for a recoverable error is reached, a FatalError is raised. The user can then resubmit.
  • Only recoverable errors specified in the policy are retried. Fatal errors will pass through the default route and won't be retried.
  • ToDos for subsequent improvement:
    • Store the files inside resubmit_info as JSON instead of txt
    • Avoid the short and long exit code redundancy in EXIT_RETRY_POLICY
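
To make the arithmetic in the list above concrete, here is a minimal sketch; it is not the PR's actual code. The function names are hypothetical, the base_max values 2 and 9 are only inferred from the "3 additional" and "10 additional" retries quoted for 8020 and 8028, and the 30% increase is written in integer percent to avoid floating-point rounding.

```python
def effective_max_retries(base_max, resubmit_counter):
    """Formula from the description: (base_max + 1) * (resubmit_counter + 1) - 1."""
    return (base_max + 1) * (resubmit_counter + 1) - 1


def increase_with_cap(current, cap, percent=30):
    """Raise a resource request by `percent`, never exceeding `cap`."""
    return min(current * (100 + percent) // 100, cap)


# Hypothetical base_max values, inferred from the text (not read from EXIT_RETRY_POLICY).
for exit_code, base_max in ((8020, 2), (8028, 9)):
    limits = [effective_max_retries(base_max, n) for n in range(3)]
    print(exit_code, limits)

print(increase_with_cap(2000, cap=7500))  # memory request in MB
print(increase_with_cap(6000, cap=7500))  # hits the 7500 MB upper bound
```

Under this formula each resubmit adds base_max + 1 retries to the limit for that exit code.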

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2738/artifact/artifacts/PullRequestReport.html

Comment thread src/python/TaskWorker/Actions/RetryJob.py Outdated
Comment thread src/python/TaskWorker/Actions/RetryJob.py

@belforte belforte left a comment

Member

see a couple inline comments

@belforte

Member

More on the "substance": it is not good to use sleep inside RetryJob, i.e. inside PostJob (which calls it).
We are limited by how many PostJobs can run concurrently, due to memory constraints. The preferred implementation would be what is done when waiting for ASO: exit with a proper exit code which tells Dagman to rerun the Post (or Pre ?) step after a delay (an example of a delay in PreJob is the use of deferTime there).

Notice that delaying the PostJob also delays the status reporting: the DAG node is still not completed. Rather, once we introduce re-submission delays of several hours (days ?) we should worry about properly reporting this to the user.
I think that currently jobs are reported as "toRetry" or "cooloff" (unfortunately there is some inconsistency) when the DAG node is completed with error but not resubmitted yet. At least that's a status that appears at times, but I have not done a careful study of the current implementation.
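
One way to picture the non-blocking alternative described above: instead of sleeping, the script checks whether the delay has elapsed and, if not, exits at once with a code that asks DAGMan to run it again later. This is a minimal sketch, assuming DAGMan is configured to re-run the script on that exit code. DEFER_EXIT_CODE, the function name and its arguments are hypothetical and are not taken from PreJob or PostJob.

```python
import time

# Hypothetical exit code that DAGMan would be told to treat as "run me again later".
DEFER_EXIT_CODE = 4


def script_exit_code(last_failure_time, delay_seconds, now=None):
    """Return DEFER_EXIT_CODE while the cool-off is running, 0 once it has elapsed.

    Nothing sleeps here: the process exits immediately, so it does not occupy
    one of the limited concurrent script slots while waiting.
    """
    now = time.time() if now is None else now
    if now - last_failure_time < delay_seconds:
        return DEFER_EXIT_CODE
    return 0
```

With the 900 s delay from the PR description, a script started 500 s after the failure would return the defer code, and one started 900 s or more after it would return 0.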

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 105 comments to review
  • Pycodestyle check: succeeded
    • 185 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2739/artifact/artifacts/PullRequestReport.html

Comment thread src/python/TaskWorker/Actions/PreJob.py
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 110 comments to review
  • Pycodestyle check: succeeded
    • 191 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2740/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 223 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2766/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 222 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2769/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 221 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2770/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 122 comments to review
  • Pycodestyle check: succeeded
    • 264 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2771/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 114 comments to review
  • Pycodestyle check: succeeded
    • 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2772/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 115 comments to review
  • Pycodestyle check: succeeded
    • 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2773/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 117 comments to review
  • Pycodestyle check: succeeded
    • 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2776/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 117 comments to review
  • Pycodestyle check: succeeded
    • 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2777/artifact/artifacts/PullRequestReport.html

@aspiringmind-code aspiringmind-code marked this pull request as ready for review April 1, 2026 14:25
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 167 comments to review
  • Pycodestyle check: succeeded
    • 250 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2810/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 167 comments to review
  • Pycodestyle check: succeeded
    • 250 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2811/artifact/artifacts/PullRequestReport.html

@belforte belforte left a comment

Member

partial review. I still need to look at last 300 lines of RetryJob

try:
    with open(retry_info_file, "r", encoding="utf-8") as fd:
        retry_info = literal_eval(fd.read())
except Exception:

Member

except FileNotFoundError maybe ? which also pleases pylint :-)
same in other places where you want to say that it is OK that this file is not there yet.

Contributor Author

added
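
For readers following this thread, the suggested narrowing could look roughly like the sketch below. It assumes that a missing file simply means "no retries recorded yet"; the helper name and the empty-dict default are hypothetical and are not taken from RetryJob.

```python
from ast import literal_eval


def load_retry_info(retry_info_file):
    """Read a literal_eval-formatted info file; a missing file is not an error."""
    try:
        with open(retry_info_file, "r", encoding="utf-8") as fd:
            return literal_eval(fd.read())
    except FileNotFoundError:
        # Assumption: the file has not been written yet for this job, so start empty.
        return {}
```

Catching only FileNotFoundError lets other problems, such as a corrupted file that literal_eval cannot parse, surface instead of being silently swallowed.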

Comment thread src/python/TaskWorker/Actions/RetryJob.py
"""
Handle exit codes related to file open/read/root failures (8020, 8021, 8022, 8028, 84, 85, 86, 92).
Checks for corrupted input files; if found, creates a fake FJR with code 8022
and allows a retry. Otherwise raises RecoverableError with the policy message.

Member

Otherwise ? It looks to me that a RecoverableError is always raised

Member

hmm.. maybe Otherwise refers to the message, not the action (raise). Or maybe I do not really understand what Otherwise means !

Contributor Author

Sorry for the confusion, removed.

Comment thread src/python/TaskWorker/Actions/PreJob.py Outdated
with open(file_name, 'r', encoding='utf-8') as fd:
    self.resubmit_info = literal_eval(fd.read())

def get_resubmit_counter(self, exit_code, crab_retry):

Member

hmm... do this calculation here and calculate_effective_max_retries() in RetryJob assume that the exit code, and hence max_retries, are constant across all resubmissions ?

Exit codes can and do change. But of course in the current code max_retries is constant so things are easy.

Why don't you store a resubmit_counter in the resubmit_info file ? PreJob knows whether it is being run after a crab resubmit and which job_ids were in that command (lines 265--> )

The fix for calculate_effective_max_retries() is not as trivial. And we need to make a design decision first. I am starting to think that there is no good way to handle different max_retries for different exit codes. E.g. a job may fail to read, followed by get-stuck-in-read-and-timing-out, then now-it-reads-but-goes-out-of-memory, and then memory-increase-was-not-detected-by-condor-and-got-a-SIGTERM,
and in between any number of landed-on-bad-node-and-got-a-cvmfs-error...

Please, convince me that I have no reason to worry

Member

Overall, I would really like to avoid these calculations, hard to understand and a bit fragile

Contributor Author

Thank you for this comment! I have now changed this function to just read the resubmit_counter from resubmit_info. We now calculate retries_consumed_for_ec in store_retry_actions, and it resets to 0 when the exit code changes. This way we become dynamic and don't risk running fewer or more retries, just as many as the last exit code indicated.
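
The reset-on-change bookkeeping described in this reply can be sketched as follows. This is an illustration only: the function name, the last_exit_code key and the choice to start counting at 0 on the first occurrence of an exit code are assumptions, not the behaviour of store_retry_actions.

```python
def update_retries_consumed(info, exit_code):
    """Return a copy of `info` with retries_consumed_for_ec updated for `exit_code`.

    The counter grows while the same exit code repeats and resets to 0 as soon
    as a different exit code is seen.
    """
    if info.get("last_exit_code") == exit_code:
        consumed = info.get("retries_consumed_for_ec", 0) + 1
    else:
        consumed = 0
    return {**info, "last_exit_code": exit_code, "retries_consumed_for_ec": consumed}


# Two read failures in a row, then a different exit code.
info = {}
history = []
for code in (8028, 8028, 50660):
    info = update_retries_consumed(info, code)
    history.append(info["retries_consumed_for_ec"])
print(history)
```

The sketch also shows the concern raised in this thread: a job that alternates between exit codes keeps resetting its counter and may never reach any per-code limit.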

Contributor Author

Ok this might not be the right approach...let me think for a solution

Member

I am not sure that connecting max_retries to the exit code makes much sense. It is only meaningful if the error "sticks" (e.g. increase max-memory a bit at a time). But if we expect different errors to pop up.. which max_retry should win ?
Digging through history records for examples to learn from feels like too much work for no definitive conclusion anyhow. The best we could do is to record the last exit code so we can deal with increasing memory/time. Another thing one may want to increase is the waiting time, a sort of exponential backoff from bad sites.

Member

My mind is currently favoring a fixed max_retry (10 ?) as long as error stays recoverable.

Contributor Author

Yeah that sounds better to me too. So remove max_retries from the EXIT_RETRY_POLICY altogether and make a global self.max_retries set at 9?

Member

that would solve, right ? Anyhow let's keep thinking for a bit.

Member

all in all the most frequent error is 8028 and it is not so clear how we want to deal with that

Comment thread src/python/TaskWorker/Actions/PreJob.py Outdated
numcores = None
priority = None
if not use_resubmit_info: # means thad we resubmit with new params from crab resubmit
    inkey = str(crab_retry) if crab_retry == 0 else str(crab_retry - 1)

Member

inkey = "0" if not crab_retry else str(crab_retry - 1) seems more pythonic, at least "0" right after the =

Contributor Author

did the "at least" version for my clarity
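
The three spellings discussed in this thread give the same key for any non-negative integer crab_retry, which the small check below illustrates. The wrapper function names are hypothetical; only the expressions come from the thread.

```python
def inkey_original(crab_retry):
    return str(crab_retry) if crab_retry == 0 else str(crab_retry - 1)


def inkey_suggested(crab_retry):
    return "0" if not crab_retry else str(crab_retry - 1)


def inkey_adopted(crab_retry):
    return "0" if crab_retry == 0 else str(crab_retry - 1)


# Both the first attempt (0) and the first retry (1) read key "0".
for n in range(5):
    assert inkey_original(n) == inkey_suggested(n) == inkey_adopted(n)
```

One difference worth keeping in mind: `not crab_retry` is also true for None, so the suggested form would return "0" there, while the two `== 0` forms would raise a TypeError on `None - 1`.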

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 167 comments to review
  • Pycodestyle check: succeeded
    • 250 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2812/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 179 comments to review
  • Pycodestyle check: succeeded
    • 253 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2813/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 179 comments to review
  • Pycodestyle check: succeeded
    • 253 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2814/artifact/artifacts/PullRequestReport.html

@belforte belforte left a comment

Member

review concluded

numcores = None
priority = None
if not use_resubmit_info: # means thad we resubmit with new params from crab resubmit
    inkey = "0" if crab_retry == 0 else str(crab_retry - 1)

Member

I am always confused by inkey and outkey, which predate both your and my work on this code. Please add a comment like
# read information about last retry (if any)
Hopefully the combination of that with the comment about outkey in lines 350-351 will help next time I read this.

@belforte

Member

I have read all the code now and completed my review. It looks to me that all is clear, with the exception of the ongoing discussion on how to handle max_retry: error-dependent or fixed ? That is a sort of major design decision and deserves some thinking.

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 179 comments to review
  • Pycodestyle check: succeeded
    • 253 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2816/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 161 comments to review
  • Pycodestyle check: succeeded
    • 243 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2817/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 161 comments to review
  • Pycodestyle check: succeeded
    • 243 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2819/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 161 comments to review
  • Pycodestyle check: succeeded
    • 243 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2820/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 161 comments to review
  • Pycodestyle check: succeeded
    • 243 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2821/artifact/artifacts/PullRequestReport.html

@belforte

belforte commented Apr 29, 2026

Member

In today's (29/04/2026) meeting we discussed how to count/limit retries and resubmissions.
Here's a summary:

  • We agree that maxRetry cannot be exit-code dependent. Each job which completes with a recoverable error is retried, up to a maximum. A fatal error terminates automatic retries. We start with maxRetry=10.
  • There are 3 things which one may want to count (and keep track of):
    • retryNumber: like now, the number of times a job is submitted for a given crabId, up to the max set in DagMan (100 currently). This is needed to uniquely identify the condor job, pre/post-job logs, records in OpenSearch etc., like now.
      • Its semantics are not changing.
    • resubmitNumber: the number of times a user has issued a crab resubmit for this crabId. Easy to define. Hard to compute and track.
      • Its usefulness is not clear, other than possibly having a role in computing the next counter, retryInThisResubmissionCycleNumber.
    • retryInThisResubmissionCycleNumber, let's shorten it to RRN: Retry in (Re)submission Number.
      • Semantics: this counter is what must be compared with maxRetry (i.e. 10) to decide whether a new automatic retry (i.e. a new condor_submit) should be done or the job set to failed because the max was reached.
      • It is not immediately obvious how to determine this. We discussed two ways:
        1. Guess the resubmission cycle by using mod(retryNumber,10) and set RRN as retryNumber-resubmitNumber*10 (maybe +1 or -1 ... a detail). Pros: easy and simple. Cons: in case of fatal errors the next resubmit may have <10 retries, possibly down to just one; such an undefined max_retry may be confusing for ops and users.
        2. Increment RRN at every job submission, but reset it to 0 when crab resubmit is processed. Pros: this is exactly what we want. Cons: more code is needed and it is difficult (impossible) to avoid that a Dagman restart (schedd failure or whatever) looks like a resubmit.
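
The two ways of obtaining RRN listed above can be contrasted with a short sketch. All names here are hypothetical, the "+1 or -1" detail is ignored, and reading "guess the resubmission cycle" as an integer division of retryNumber by maxRetry is an interpretation of the summary, not something it states.

```python
MAX_RETRY = 10  # starting value agreed in the summary above


def rrn_from_retry_number(retry_number, max_retry=MAX_RETRY):
    """Way 1: derive (guessed resubmission cycle, RRN) from retryNumber alone."""
    resubmit_cycle = retry_number // max_retry
    rrn = retry_number - resubmit_cycle * max_retry  # same as retry_number % max_retry
    return resubmit_cycle, rrn


def submissions_left_in_cycle(retry_number, max_retry=MAX_RETRY):
    """Way 1: how many submissions remain before RRN wraps around."""
    _, rrn = rrn_from_retry_number(retry_number, max_retry)
    return max_retry - rrn


class RRNCounter:
    """Way 2: an explicit counter, reset only when crab resubmit is processed."""

    def __init__(self):
        self.value = 0

    def on_job_submission(self):
        self.value += 1

    def on_crab_resubmit(self):
        self.value = 0

    def may_retry(self, max_retry=MAX_RETRY):
        return self.value < max_retry
```

With way 1, a job that hits a fatal error at retryNumber 17 and is then resubmitted starts again at retryNumber 18, so only 2 submissions remain in that guessed cycle. With way 2, the same resubmit would reset the counter and give the full 10.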

Of course we need to track (i.e. persist on disk) the counters which we will be using. Currently this is done via the DAG retry counter inside the dagman log and node_status files, and via crabId-specific files in SPOOL_DIR/retry_info/job.<crabId>.txt. New code in this PR also introduces resubmit_info/job.<crabId>.txt. We discussed changing to JSON. If RRN is not computed in PreJob (1. above) it will need to be persisted on disk too.
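
As a rough idea of what the move from literal_eval-style txt files to JSON could involve, here is a minimal sketch. The helper name is hypothetical, and it assumes the stored dictionaries use string keys and only JSON-compatible values; tuples would come back as lists and non-string keys would be turned into strings.

```python
import json
from ast import literal_eval
from pathlib import Path


def convert_txt_to_json(txt_path):
    """Read a literal_eval-formatted file and write the same data next to it as JSON."""
    txt_path = Path(txt_path)
    data = literal_eval(txt_path.read_text(encoding="utf-8"))
    json_path = txt_path.with_suffix(".json")  # e.g. job.7.txt -> job.7.json
    json_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return json_path
```

JSON files can then be read back with json.load, without evaluating Python literals from disk.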

Stefano (who proposed 2.) has some ideas on its implementation:

Idea1. not good: Use the classAd CRAB_ResubmitList in PreJob.py to know that the current crabId is being resubmitted, along the lines of how it is used in PreJob to set use_resubmit_info. It does not work because that classAd "sticks" and the PreJob has no way to know if this is a resubmission or not. Also the code will get confused if another resubmit arrives while dagman is still starting, possibly with a different jobId list, so the previous list gets lost before it was fully used. We saw during tests that some jobs could "not be resubmitted".

Idea2. not clearly wrong: Leverage adjustSites.py inside dag_bootstrap_startup.sh. It runs when dagman is not running, so we know that nothing changes in the DAG status. Persist RRN for each job in some file (one per task will suffice). Increment the jobId value in each PreJob (this needs a lock due to concurrency, or go for thousands of small files again) and compare with maxRetry in PostJob (like now) to decide whether to submit again or not. Inside adjustSites.py, i.e. at each dagman restart, use the classAd CRAB_ResubmitList to reset RRN for those jobs.

  • In a way this reproduces the current (Brian's) implementation, where RRN is the Dagman retry counter and adjustSites hacks the Dagman log and status files to do the resetting. By letting the DAG node retry counter run free and moving the RRN counter into our scope, we gain robustness, clarity and flexibility. Hopefully all the places in the code where DAG_RETRY was used have been changed already in this PR.
  • About resubmission vs. dagman restarts: since CRAB_ResubmitList sticks, every Dagman restart will look to the code like a new crab resubmit with the same list of jobs to resubmit as the last one. This should have no effect on crabIds which have completed by "now", but for those which were being retried it will reset the RRN counter, basically giving them a higher max_retry. Having more retries is not as bad as having fewer (it does not decrease the chances of success!) and, since such unintended dagman restarts are rare, the chances of confusion are small. An idea to tell resubmits from restarts could be to introduce a 'thisIsResumit' classAd to be used as a boolean flag: DagmanResubmitter sets it to True, adjustSites checks it in order to decide whether to reset the RRNs, then sets it to False after the reset is done. So in case of a restart, the RRNs will not be touched.
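
The reset step of Idea2, combined with the boolean flag proposed in the last bullet, could be sketched like this. It is a pure-function illustration under stated assumptions: the function and argument names are hypothetical, the per-task RRN file is represented by a plain dict, and no classAd or file access is shown.

```python
def reset_rrn_at_dagman_start(rrn_by_job, resubmit_list, is_resubmit):
    """Return (updated RRN mapping, new value of the resubmit flag).

    rrn_by_job:    crabId -> RRN, as it would be persisted in one file per task
    resubmit_list: crabIds named in the last crab resubmit (this list "sticks")
    is_resubmit:   True only if this start was triggered by crab resubmit
    """
    if not is_resubmit:
        # Plain Dagman restart: leave every counter untouched.
        return dict(rrn_by_job), False
    to_reset = set(resubmit_list)
    updated = {
        job_id: (0 if job_id in to_reset else rrn)
        for job_id, rrn in rrn_by_job.items()
    }
    # Clear the flag so that a later restart is not mistaken for a resubmit.
    return updated, False
```

Without the flag (i.e. always passing True), a restart would reset the counters of the jobs in the stale list again, which is the "more retries than intended" effect described above.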

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 161 comments to review
  • Pycodestyle check: succeeded
    • 243 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2825/artifact/artifacts/PullRequestReport.html

@aspiringmind-code

Contributor Author

Closing with follow up in #9318

Development

Successfully merging this pull request may close these issues.

improve resubmission policies

3 participants