zephyr: exempt preemption requeues from MAX_SHARD_FAILURES #4990
Merged
ravwojdyla merged 4 commits into main on Apr 21, 2026
Previously all three requeue paths (register_worker, heartbeat timeout, report_error) fed the same counter, so three preemptions of a clean shard aborted the whole pipeline. Split the counter into a generation (bumps on every requeue, used for stale-result rejection) and a TASK-only error budget. INFRA requeues (preemption) retry unbounded; _check_worker_group still backstops if workers permanently crash.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
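The split described in this commit message can be pictured with a minimal sketch. It is not the actual Zephyr code: `ShardState`, `ShardAborted`, and the `requeue` method are hypothetical names used for illustration, and the abort condition (abort once TASK errors reach `MAX_SHARD_FAILURES`) is an assumption based on the PR description.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_SHARD_FAILURES = 3  # value quoted in the PR description


class ShardFailureKind(Enum):
    TASK = auto()   # the task itself reported an error
    INFRA = auto()  # the worker went away (preemption, heartbeat timeout)


class ShardAborted(RuntimeError):
    """Hypothetical exception: the TASK error budget is exhausted."""


@dataclass
class ShardState:
    """Hypothetical per-shard bookkeeping, not Zephyr's real data structure."""

    generation: int = 0           # bumps on every requeue, whatever the cause
    task_error_attempts: int = 0  # bumps only on TASK failures

    def requeue(self, kind: ShardFailureKind) -> None:
        # Every requeue starts a new generation so stale results can be rejected.
        self.generation += 1
        if kind is ShardFailureKind.TASK:
            # Only real task errors consume the bounded budget.
            self.task_error_attempts += 1
            if self.task_error_attempts >= MAX_SHARD_FAILURES:
                raise ShardAborted(
                    f"shard failed {self.task_error_attempts} times with task errors"
                )
```

Under these assumptions, any number of `INFRA` requeues leaves `task_error_attempts` at zero, while the third `TASK` requeue raises `ShardAborted` regardless of how many preemptions came before it.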
Collaborator
@claude fix all of the local imports in the test file, these violate our standards.
rjpower approved these changes on Apr 21, 2026
Contributor
Claude finished @rjpower's task in 5m 13s: fixing local imports in test file.
Summary: Removed local imports from ~20 test functions and consolidated them at the module top.
Per repo import standards (AGENTS.md): all imports at the top of the file. Remove local imports from ~20 test functions and consolidate at the module level.
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
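One way to locate function-local imports before moving them to the module top is a small scan with the standard-library `ast` module. This is only an illustrative sketch: `find_local_imports` is a hypothetical helper, not something from this repository or the tooling used in the PR.

```python
import ast


def find_local_imports(source: str) -> list[tuple[str, int]]:
    """Return (function name, line number) for each import nested in a function.

    An import inside nested functions is reported once per enclosing function.
    """
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Walk the function body looking for import statements.
            for child in ast.walk(node):
                if isinstance(child, (ast.Import, ast.ImportFrom)):
                    hits.append((node.name, child.lineno))
    return hits
```

Calling it on the text of a test module yields the functions that still import locally; module-level imports are ignored because they are not nested inside a function definition.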
yonromai approved these changes on Apr 21, 2026
- Preemption requeues (`register_worker`, heartbeat timeout) no longer consume the 3-attempt `MAX_SHARD_FAILURES` budget; previously three clean preemptions aborted the pipeline
- Split `_task_attempts` into two counters: generation (bumps on every requeue, drives stale-result rejection in `report_result`) and `_task_error_attempts` (TASK-only, still capped at `MAX_SHARD_FAILURES=3`)
- `ShardFailureKind.{TASK,INFRA}`: `report_error` → `TASK`, `_maybe_requeue_worker_task` → `INFRA`
- `INFRA` requeues retry unbounded [1]

- Replace `test_heartbeat_death_aborts_at_max_shard_failures` (codified the bug) with `test_heartbeat_timeouts_do_not_count_toward_shard_failures`: 15 timeouts + final success
- `test_worker_reregistration_does_not_count_toward_shard_failures`
- `test_report_error_still_aborts_at_max_shard_failures_after_preemptions`: pins that real task errors still abort at 3 even after prior preemptions

Footnotes
1. `_check_worker_group` already aborts if the worker job permanently exhausts Iris' own retry budget, so unbounded infra retries cannot loop forever
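The generation counter's role in stale-result rejection can be sketched as follows. This is an assumption-laden illustration rather than the real `report_result`: the `Coordinator` class, its `assign` and `requeue` methods, and the `(shard, generation, result)` signature are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Hypothetical coordinator that tracks one generation number per shard."""

    generations: dict[int, int] = field(default_factory=dict)
    results: dict[int, str] = field(default_factory=dict)

    def assign(self, shard: int) -> int:
        # Hand out the shard together with its current generation.
        return self.generations.setdefault(shard, 0)

    def requeue(self, shard: int) -> None:
        # Any requeue (TASK or INFRA) invalidates results from earlier attempts.
        self.generations[shard] = self.generations.get(shard, 0) + 1

    def report_result(self, shard: int, generation: int, result: str) -> bool:
        # Reject results produced by an attempt that has since been requeued.
        if generation != self.generations.get(shard, 0):
            return False
        self.results[shard] = result
        return True
```

In this sketch a worker that was presumed dead and later reports a result for generation 0 is ignored once the shard has moved on to generation 1, which is why the generation must bump on every requeue even though only TASK errors count against the budget.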