add exit code dependent retry policy#9276
Conversation
|
Jenkins results:
|
belforte
left a comment
There was a problem hiding this comment.
see a couple inline comments
|
more on the "substance", it is not good to use Notice that delaying the PostJob also delays the status reporting, the DAG node is still not completed. Rather once we introduce re-submission delays of several hours (days ?) we should worry about properly reporting this to user. |
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
belforte
left a comment
There was a problem hiding this comment.
partial review. I still need to look at last 300 lines of RetryJob
| try: | ||
| with open(retry_info_file, "r", encoding="utf-8") as fd: | ||
| retry_info = literal_eval(fd.read()) | ||
| except Exception: |
There was a problem hiding this comment.
except FileNotFoundError maybe ? which also pleases pylint :-)
same in other places where you want to say that it is OK that this file is not there yet.
| """ | ||
| Handle exit codes related to file open/read/root failures (8020, 8021, 8022, 8028, 84, 85, 86, 92). | ||
| Checks for corrupted input files; if found, creates a fake FJR with code 8022 | ||
| and allows a retry. Otherwise raises RecoverableError with the policy message. |
There was a problem hiding this comment.
Otherwise ? It looks to me that a RecoverableError is always raised
There was a problem hiding this comment.
hmm.. maybe Otherwise refers to the message, not the action (raise). Or maybe I do no really understand what Otherwise means !
There was a problem hiding this comment.
Sorry for the confusion, removed.
| with open(file_name, 'r', encoding='utf-8') as fd: | ||
| self.resubmit_info = literal_eval(fd.read()) | ||
|
|
||
| def get_resubmit_counter(self, exit_code, crab_retry): |
There was a problem hiding this comment.
hmm... do this calculation here and calculate_effective_max_retries() in RetryJob assume that exit code and hence max_retries are constant across all resubmissions ?
Exit code can and do change. But of course in current code max_retries is constant so things are easy.
Why don't you store a resubmit_counter in the resubmit_info file ? PreJob knows whether it is being run after a crab resumit and which job_ids where in that command (lines 265--> )
The fix for calculate_effective_max_retries() is not as trivial. And we need to make a design decision first. I am starting to think that there is no good way to handle different max_retries for different exit codes. E.g. a job may have fail to read, followed by get-stuck-in-read-and-timing-out then now-it-reads-but-goes-out-of-memory and then memory-increase-was-not-detected--by-condor-and-got-a-SIGTERM
and in between any number of landed-on-bad-node-and-got-a-cvmfs-error...
Please, convince me that I have no reason to worry
There was a problem hiding this comment.
Overall, I would really like to avoid these calculations, hard to understand and a bit fragile
There was a problem hiding this comment.
Thank you for this Comment! I have now made this function to just grab the resubmit_counter from resubmit_info. And now we calculate retries_consumed_for_ec in store_retry_actions which resets to 0 when exit code changes. This way we become dynamic and don't risk running less or more retries. Just as much as the last exitcode indicated.
There was a problem hiding this comment.
Ok this might not be the right approach...let me think for a solution
There was a problem hiding this comment.
I am not sure that connecting max_retried to exit code makes much sense. It is only meaningful if error "sticks" (e.g. increase max-memory a bit at a time). But if we expect different errors to pop up.. which max_retry should win ?
Digging history records for examples to learn from, feels like too much work for no definitive conclusion anyhow. Best we could do is to record last exit code so we can deal with increase memory/time. Another thing one may want to increase is waiting time, sort of exponential backoff from bad sites.
There was a problem hiding this comment.
My mind is currently favoring a fixed max_retry (10 ?) as long as error stays recoverable.
There was a problem hiding this comment.
Yeah that sounds better to me too. So remove max_retries from the EXIT_RETRY_POLICY altogether and make a global self.max_retries set at 9?
There was a problem hiding this comment.
that would solve, right ? Anyhow let's keep thinking for a bit.
There was a problem hiding this comment.
all in all the most frequent error is 8028 and it is not so clear how we want to deal with that
| numcores = None | ||
| priority = None | ||
| if not use_resubmit_info: # means thad we resubmit with new params from crab resubmit | ||
| inkey = str(crab_retry) if crab_retry == 0 else str(crab_retry - 1) |
There was a problem hiding this comment.
inkey = "0" if not crab_retry else str(crab_retry - 1) seems more pythonic, at least "0" right after the =
There was a problem hiding this comment.
did the "at least" version for my clarity
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
| numcores = None | ||
| priority = None | ||
| if not use_resubmit_info: # means thad we resubmit with new params from crab resubmit | ||
| inkey = "0" if crab_retry == 0 else str(crab_retry - 1) |
There was a problem hiding this comment.
I am always confused by inkey and outkey, which predates your and mine putting hands in this. Please add a comment like
# read information about last retry (if any)
hopefully combination of that with the comment about outkey in lines 350-351 will help next time I read
| with open(file_name, 'r', encoding='utf-8') as fd: | ||
| self.resubmit_info = literal_eval(fd.read()) | ||
|
|
||
| def get_resubmit_counter(self, exit_code, crab_retry): |
There was a problem hiding this comment.
that would solve, right ? Anyhow let's keep thinking for a bit.
| with open(file_name, 'r', encoding='utf-8') as fd: | ||
| self.resubmit_info = literal_eval(fd.read()) | ||
|
|
||
| def get_resubmit_counter(self, exit_code, crab_retry): |
There was a problem hiding this comment.
all in all the most frequent error is 8028 and it is not so clear how we want to deal with that
|
I have read all the code now and completed my review. Looks to me that all is clear, with exception of ongoing discussion on how to handle max_retry: error-dependent or fixed ? Which is a sort of major design decision and deserves some thinking. |
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
Jenkins results:
|
|
In today's (29/04/2026) meeting we discussed how to count/limit retries and resubmissions.
Of course we need to track (i.e. persist on disk) the counters which we will be using, currently this is being done via DAG retry counter inside dagman log and node_status files, and crabId specific files in Stefano (who proposed 2.) has some ideas on its implementation: Idea1. not good: Use classAd Idea2. not clearly wrong: Leverage
|
|
Jenkins results:
|
|
Closing with follow up in #9318 |
Fix #9264
Salient Features of the New ExitCode Dependent Resubmission and Retry Policy:
resubmit_counterbased on lastExitCode dependent max_retries and crab_retryretries_consumed_for_ecresets to 0 when exitCode changestype, max_retries, delay, msg, change_site, increase_memory, memory_factor, runtime_factor, increase_runtime and handler(base_max + 1)*(resubmit_counter + 1) - 1wherebase_maxis specified in the policy dict. So for an exit code like 8020 every resubmit will give additional 3 retries. For exitcode 8028 every resubmit will give additional 10 retries.Make files inside resubmit_info json instead of txt
Avoid short and long exit code redundancy in the EXIT_RETRY_POLICY