Use RRN and Exitcodes to overhaul resubmission and retry policy#9319
Use RRN and Exitcodes to overhaul resubmission and retry policy#9319aspiringmind-code wants to merge 7 commits into
Conversation
|
Jenkins results:
|
|
Jenkins results:
|
|
Please add in the top description what RRN stands for. Also a note in the code may help ! |
There was a problem hiding this comment.
Intermediate review before I dive into PreJob and RetryJob.
I flagged places where I failed to understand the logic.
I believe that it would be important to write down somewhere the "design", and avoid that the reader has to reconstruct it by reading and testing the code as we had to do with current one.
New code surely is elegantly written and clear, but the overall interplay of DagmanResubmitter, AdjustSites, and Post/Pre/RetryJob deserves some text.
| This classAd is written atomically by DagmanResubmitter as a JSON string | ||
| containing both epoch and job_ids, so we never see a mismatched pair. | ||
| """ | ||
| if 'CRAB_ResubmitRecord' not in ad: |
There was a problem hiding this comment.
do you expect this to be missing at times ? Or is it there only to catch code bugs ? If the latter, why not making it fatal ? Same for the folllwing try/except on JSON parsing. E.g. if json format is OK but one key is missing , code will crash.
There was a problem hiding this comment.
CRAB_ResubmitRecord is intentionally absent on first task submission since DagmanResubmitter has never run. This is the normal case and not an error condition. The classAd only exists after the first crab resubmit command is issued. In the following try except you are right, I could make it fatal.
There was a problem hiding this comment.
Thanks. Please add an inline comment to line 228: # this is first crab resubmit for this task
| data = json.load(fd) | ||
|
|
||
| last_reset_epoch = data.get('last_reset_epoch', 0) | ||
| if last_reset_epoch >= current_epoch: |
There was a problem hiding this comment.
I fail to understand how the epoch written previously in the file can be larger than the one which was just sent from DagmanResubmitter (if I understood correctly the naming).
Can you please check and clarify in the comments ?
There was a problem hiding this comment.
The > case should never happen - it means either resubmit_record.json was corrupted/rolled back or last_reset_epoch was written incorrectly. It's safer to skip than to incorrectly zero out retries. The == case is the idempotency case: rare DAGMan restarts (schedd hiccup, no new resubmit). In both cases we continue/skip without resetting rrn.
There was a problem hiding this comment.
FOr things that should never happen do you mean to skip the resubmit, or skip the reset and still take actions ? The latter seems dangerous.
| # Write epoch separately too so DagmanResubmitter can read it back next time | ||
| schedd.edit(rootConst, "CRAB_ResubmitEpoch", str(newEpoch)) | ||
| # Write the full record atomically | ||
| schedd.edit(rootConst, "CRAB_ResubmitRecord", classad.quote(resubmitRecord)) |
There was a problem hiding this comment.
so this Epoch is basically a counter of how many times DagmanResubmitter successfully managed to edit the bootstrap job classAds. Correct ?
There was a problem hiding this comment.
Yes, correct. Its sole purpose is to give resetRRN() a way to answer "has this job's RRN already been reset for the current resubmit request?" in an idempotent way. Each successful DagmanResubmitter execution produces a unique epoch value, and resetRRN() stamps that epoch onto each job file it resets so that subsequent runs of AdjustSites.py (from DAGMan restarts) can recognise the reset was already done and skip it.
There was a problem hiding this comment.
that's nice, but naming is not ideal !
| Read RRN from the individual job's classAd where PreJob stamped it. | ||
| This is completely independent of dag_retry, crab_retry, and max_retries. | ||
| Falls back to rrn_info file if classAd not available (e.g. job never ran | ||
| because PreJob itself failed). |
There was a problem hiding this comment.
if PreJob failed and job never run, how can the PostJob be running ? Should this be a fatal error rather than a condition on which to fall back and keep going ?
There was a problem hiding this comment.
Yes this should be fatal too.
I have added the design write up at the top |
|
WOW, your have been amazingly fast in adding an extensive documentation !!! I am afraid I will not be as fast in digesting ! |
|
Hi @aspiringmind-code ! I have finally made it to digest the new documentation and read all code. It all looks fine to me. Just let me recap here all my suggestions (very few !):
Thanks for this, |
Fix #9318
For larger discussion on how we reached here, see #9264 and (few last comments in) #9276
Design Write-Up: CRABServer Resubmission Policy Refactor
This PR introduces three interlocking concepts:
RetryJob.pythat maps exit codes to retry actions (delay, memory boost, runtime boost, site change).resubmit_record.jsonand epoch tracking : a mechanism to allow user-triggered resubmissions to reset RRN counters for targeted jobs, enabling a clean slate while preserving history. Note that epoch is a counter, not time.The Three Counters:
dag_retry,crab_retry, andrrnUnderstanding the difference between these is essential.
dag_retry($RETRY)crab_retryretry_info/job.<id>.txtrrnrrn_info/job.<id>.txtcrab_retryis used as a key/index intoresubmit_info/job.<id>.txtto store per-attempt parameters (memory, runtime, site lists, exit code info). It is strictly increasing and never resets, it serves as a stable record key.rrnis used exclusively to answer: "Has this job been tried enough times overall?" It resets to 0 when the user explicitly resubmits, giving the job a fresh budget ofCRAB_NumAutomJobRetries(now 10, up from 2) attempts. PreJob stampsCRAB_RRNinto the job's classAd so PostJob can read it without touching the filesystem.dag_retrystill controls the DAGMan machinery, but PostJob no longer uses it as the gate for "too many retries" , it usesrrninstead.The Lifecycle of a Single Job Attempt
Step 1 — PreJob runs:
get_rrn(): readsrrn_info/job.<id>.txt, increments therrnfield, writes it back (atomic rename, file-locked).My.CRAB_RRN = <rrn>into the job submit classAd.resubmit_info/job.<id>.txtat keycrab_retry - 1(the previous attempt's data).increase_memoryis set, multiplies current memory bymemory_factor(capped at 7500 MB).increase_runtimeis set, multiplies current walltime byruntime_factor(capped at 47 h).change_sitefrom the previous attempt. If set, and there is more than one available site, removes the previous failing site from the candidate set.retry_delay_untilfromresubmit_infofor the currentdag_retrykey. If now < that timestamp, returns True fromneedsDefer()and the job is deferred (held and re-released after the delay).resubmit_info/job.<id>.txtat keystr(crab_retry).Step 2 — Job runs on the worker node.
Step 3 — RetryJob runs (if the job finished with non-zero exit code):
apply_retry_policy(exitCode).exitCodeinEXIT_RETRY_POLICY. Falls back to"default"(neutral, no adjustments) if not found.store_retry_actions(policy, exitCode): writes intoresubmit_info/job.<id>.txtat keystr(crab_retry):retry_delay_until = time.time() + policy["delay"](typically 900 s)increase_memory,increase_runtime,change_sitebooleans and their factorssite= the site where this attempt ran (so PreJob can discard it next time if needed)exitCode"recoverable", dispatches to a handler (if registered), then raisesRecoverableError."neutral", returns without raising, falls through to the existingFatalErrorraise at the bottom ofcheck_exit_code.Step 4 — PostJob runs:
get_rrn(): readsCRAB_RRNfrom the job classAd (stamped by PreJob). Falls back torrn_info/job.<id>.txtif classAd unavailable (e.g., PreJob itself crashed before the job ran).get_max_retry(): readsCRAB_NumAutomJobRetriesfrom the task classAd (default 10).retryjob_retval == RECOVERABLE_ERROR: checksrrn >= max_retry. If yes → fatal. Otherwise → recoverable (DAGMan will retry).The EXIT_RETRY_POLICY Dictionary
The old
RetryJob.pyhad a linearif/elifchain. The new code replaces it with a data-driven dictionary:Key additions:
increase_memory/memory_factor: triggers a 1.3× memory increase on the next attempt.increase_runtime/runtime_factor: triggers a 1.3× walltime increase on the next attempt.change_site: triggers removal of the failing site from the candidate set on the next attempt.handler: name of a method onRetryJobto call before raisingRecoverableError(for exit-code-specific logic like checking corrupted files or CVMFS issues)."neutral"type: the exit code is not recognized as explicitly recoverable; falls through to the existing fatal-error path rather than either retrying blindly or failing definitively.The three handler methods (
handle_file_open_or_root_error,handle_sigabrt,handle_cvmfs_or_cms_exception) contain the same logic as the old inlineifblocks, they are just extracted into named methods and dispatched via the policy table.User-Triggered Resubmission and RRN Reset
When a user calls
crab resubmit,DagmanResubmitter.pyruns. The new code adds:Both
epochandjob_idsare written atomically as a single JSON classAd value (CRAB_ResubmitRecord). This prevents AdjustSites from ever seeing a mismatched (new epoch, old job list) pair.When DAGMan restarts and
AdjustSites.pyruns:writeResubmitRecord(ad)readsCRAB_ResubmitRecordfrom the task classAd and writes it torrn_info/resubmit_record.json. Idempotent: skips if the on-disk epoch already matches.resetRRN()readsrrn_info/resubmit_record.json. For each targeted job (or all jobs ifjob_idsis null), opensrrn_info/job.<id>.txt, checkslast_reset_epoch. Iflast_reset_epoch < current_epoch, setsrrn = 0. Iflast_reset_epoch >= current_epoch, we skip with continue. This idempotency in == means DAGMan can restart AdjustSites multiple times safely. The > case should not happen until things have gone very wrong.After this reset, when PreJob runs for a resubmitted job,
get_rrn()findsrrn = 0and increments it to 1, effectively giving the job a new budget ofmax_retryattempts.reuse_rrn: Handling Interrupted PreJobThere is a subtle race condition: PreJob can crash or be killed after incrementing RRN but before the job actually runs. This also happens when a resubmission of a different job id happens while another is running. When DAGMan retries the node, PreJob runs again. At this point:
retry_infoshowspre > post(PreJob ran more times than PostJob)job_out.<id>.<retry>does not exist (job never ran)Originally, this case set
prejob_exit_code = 1(error) only if a job_out existed, implying the job had run. In the new code anelsebranch is added: whenpre > postand nojob_out, it setsself.reuse_rrn = True. Thenget_rrn(), instead of incrementing, just returns the currentrrnvalue unchanged. This prevents RRN from being double-incremented for a single logical job attempt where PreJob happened to restart.Summary of Key Changes
dag_retry(DAGMan internal)rrn(CRAB-owned, per job)$MAX_RETRIESbaked into DAGCRAB_NumAutomJobRetriesin task classAdif/elifchain inRetryJobEXIT_RETRY_POLICYdictretry_delay_untilchecked in PreJobadjustMaxRetriesincreased max_retries by 1 + numAutomJobRetrieslast_reset_epochper jobreuse_rrn=True— no double-increment