Skip to content

zephyr: exempt preemption requeues from MAX_SHARD_FAILURES#4990

Merged
ravwojdyla merged 4 commits intomainfrom
worktree-rav-peemption-shard-retires
Apr 21, 2026
Merged

zephyr: exempt preemption requeues from MAX_SHARD_FAILURES#4990
ravwojdyla merged 4 commits intomainfrom
worktree-rav-peemption-shard-retires

Conversation

@ravwojdyla-agent
Copy link
Copy Markdown
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 21, 2026

  • preemption-driven requeues (register_worker, heartbeat timeout) no longer consume the 3-attempt MAX_SHARD_FAILURES budget — previously three clean preemptions aborted the pipeline
  • split _task_attempts into two counters: generation (bumps on every requeue, drives stale-result rejection in report_result) and _task_error_attempts (TASK-only, still capped at MAX_SHARD_FAILURES=3)
  • add ShardFailureKind.{TASK,INFRA}; report_errorTASK, _maybe_requeue_worker_taskINFRA
  • INFRA requeues retry unbounded 1
  • tests
    • replace test_heartbeat_death_aborts_at_max_shard_failures (codified the bug) with test_heartbeat_timeouts_do_not_count_toward_shard_failures — 15 timeouts + final success
    • add test_worker_reregistration_does_not_count_toward_shard_failures
    • add test_report_error_still_aborts_at_max_shard_failures_after_preemptions — pins that real task errors still abort at 3 even after prior preemptions

Footnotes

  1. _check_worker_group already aborts if the worker job permanently exhausts Iris' own retry budget, so unbounded infra retries cannot loop forever

Previously all three requeue paths (register_worker, heartbeat timeout,
report_error) fed the same counter, so three preemptions of a clean
shard aborted the whole pipeline. Split the counter into a generation
(bumps on every requeue, used for stale-result rejection) and a
TASK-only error budget. INFRA requeues (preemption) retry unbounded;
_check_worker_group still backstops if workers permanently crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 21, 2026
ravwojdyla and others added 2 commits April 20, 2026 20:43
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla ravwojdyla requested review from rjpower and yonromai April 21, 2026 03:45
@rjpower
Copy link
Copy Markdown
Collaborator

rjpower commented Apr 21, 2026

@claude fix all of the local imports in the test file, these violate our standards.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 21, 2026

Claude finished @rjpower's task in 5m 13s —— View job


Fixing local imports in test file

  • Read lib/zephyr/tests/test_execution.py to identify local imports
  • Move local imports to top of file
  • Run ./infra/pre-commit.py --all-files --fix
  • Run uv run pytest on affected tests
  • Commit and push

Summary

Removed local imports from ~20 test functions and consolidated them at the module top. The added top-level imports are:

  • stdlib: logging, os, from unittest.mock import MagicMock
  • third-party: from fray.v2.local_backend import LocalClient
  • project: from zephyr import counters, from zephyr.plan import compute_plan, and the combined zephyr.execution import now covers MAX_SHARD_FAILURES, ListShard, PickleDiskChunk, ShardTask, TaskResult, WorkerState, ZephyrCoordinator, ZephyrWorkerError (previously imported ad-hoc inside tests).

Net: +19 / -87 lines in lib/zephyr/tests/test_execution.py.

Verification

  • ./infra/pre-commit.py --all-files --fix — OK (ruff, black, pyrefly all pass)
  • uv run --with pytest pytest lib/zephyr/tests/test_execution.py -x42 passed in 77.31s (one unrelated Threads leaked from test warning on test_shared_data[iris], not introduced by this change)

Commit: a560ff75b on branch worktree-rav-peemption-shard-retires.

Per repo import standards (AGENTS.md): all imports at the top of the
file. Remove local imports from ~20 test functions and consolidate at
the module level.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@ravwojdyla ravwojdyla merged commit 8786714 into main Apr 21, 2026
40 checks passed
@ravwojdyla ravwojdyla deleted the worktree-rav-peemption-shard-retires branch April 21, 2026 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants