[zephyr] Fix _check_worker_group false abort after completed stage #4140

Merged
rjpower merged 2 commits into main from repro/zephyr-worker-group-race-4117
Mar 25, 2026

@rjpower
Collaborator

@rjpower rjpower commented Mar 25, 2026

_check_worker_group unconditionally treated worker_group.is_done()==True as a
crash. After the last stage, workers exit cleanly via SHUTDOWN, Iris marks the
job finished, and the coordinator background loop aborts with "Worker job
terminated permanently" even though all shards completed. Only affects Iris
(Local/Ray hardcode is_done to False). Adds a completed-shards guard to
_check_worker_group and three regression tests.

Fixes #4117
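
A minimal sketch of the completed-shards guard described above, for illustration only. The `Coordinator` and `FakeWorkerGroup` classes, the `completed_shards` and `total_shards` fields, and the `WorkerJobTerminated` exception are hypothetical stand-ins and are not taken from the zephyr source; only `_check_worker_group`, `is_done()`, and the error message come from this PR's description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class WorkerJobTerminated(RuntimeError):
    """Hypothetical error type standing in for the coordinator's abort."""


@dataclass
class FakeWorkerGroup:
    """Hypothetical worker group whose is_done() reflects real job status."""

    done: bool = False

    def is_done(self) -> bool:
        return self.done


@dataclass
class Coordinator:
    """Hypothetical coordinator; field names are assumptions."""

    worker_group: FakeWorkerGroup
    total_shards: int
    completed_shards: set[int] = field(default_factory=set)

    def _all_shards_completed(self) -> bool:
        return len(self.completed_shards) >= self.total_shards

    def _check_worker_group(self) -> None:
        # Guard: if every shard finished, a finished worker job is a clean
        # shutdown, not a crash, so there is nothing to check.
        if self._all_shards_completed():
            return
        # Work is still outstanding, so a finished job means the workers
        # are gone for good.
        if self.worker_group.is_done():
            raise WorkerJobTerminated("Worker job terminated permanently")
```

Under these assumptions, a coordinator with all shards completed returns quietly even when the job is reported as finished, while one with outstanding shards still aborts.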

@rjpower rjpower added the agent-generated Created by automation/agent label Mar 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09219d04d8

Comment thread lib/zephyr/tests/test_worker_group_race.py Outdated
Comment thread lib/zephyr/tests/test_worker_group_race.py
@rjpower rjpower changed the title from "[zephyr] Reproduce _check_worker_group race causing flaky Iris test" to "[zephyr] Fix _check_worker_group false abort after completed stage" Mar 25, 2026
@rjpower rjpower requested a review from ravwojdyla March 25, 2026 17:36
@rjpower rjpower force-pushed the repro/zephyr-worker-group-race-4117 branch from 0db407b to ccd6fdd Compare March 25, 2026 18:05
@ravwojdyla
Contributor

Looks like this PR is mixing multiple changes.

rjpower added 2 commits March 25, 2026 11:58
_check_worker_group unconditionally treats worker_group.is_done()==True
as a crash, even when all shards completed successfully. After the last
stage, workers exit cleanly via SHUTDOWN but the coordinator background
loop sees the Iris job as finished and aborts. Only affects Iris backend
(Local/Ray hardcode is_done to False).

_check_worker_group now skips when all shards are completed. Previously
it unconditionally treated worker_group.is_done()==True as a crash, even
when workers exited cleanly after receiving SHUTDOWN on the last stage.
This caused flaky failures on the Iris backend where is_done() checks
real job status (Local/Ray hardcode it to False).
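
To illustrate why only the Iris backend was affected, here is a rough sketch under stated assumptions. The class names `LocalLikeWorkerGroup` and `IrisLikeWorkerGroup`, the `shutdown_cleanly` method, and the two `should_abort_*` helpers are hypothetical and do not appear in the zephyr code; they only model the behavior the commit messages describe.

```python
class LocalLikeWorkerGroup:
    """Stand-in for a backend that hardcodes is_done() to False."""

    def is_done(self) -> bool:
        return False


class IrisLikeWorkerGroup:
    """Stand-in for a backend whose is_done() reflects real job status."""

    def __init__(self) -> None:
        self._job_finished = False

    def shutdown_cleanly(self) -> None:
        # Workers exit after SHUTDOWN, so the job is reported as finished.
        self._job_finished = True

    def is_done(self) -> bool:
        return self._job_finished


def should_abort_before_fix(group) -> bool:
    # Old behavior: any finished job counts as a crash.
    return group.is_done()


def should_abort_after_fix(group, completed: int, total: int) -> bool:
    # New behavior: a finished job is a crash only while shards remain.
    return completed < total and group.is_done()
```

With these stand-ins, the local-like group can never trigger an abort, while the Iris-like group triggers the old check after a clean shutdown and passes the new one once `completed == total`.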
@rjpower rjpower force-pushed the repro/zephyr-worker-group-race-4117 branch from ccd6fdd to 0fd80fd Compare March 25, 2026 18:58
@rjpower rjpower enabled auto-merge (squash) March 25, 2026 18:59
Contributor

@ravwojdyla ravwojdyla left a comment

:shipit:

@rjpower rjpower merged commit 8a1f6ca into main Mar 25, 2026
39 checks passed
@rjpower rjpower deleted the repro/zephyr-worker-group-race-4117 branch March 25, 2026 19:04
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…4140)

_check_worker_group unconditionally treated worker_group.is_done()==True as a
crash. After the last stage, workers exit cleanly via SHUTDOWN, Iris marks the
job finished, and the coordinator background loop aborts with "Worker job
terminated permanently" even though all shards completed. Only affects Iris
(Local/Ray hardcode is_done to False). Adds a completed-shards guard to
_check_worker_group and three regression tests.

Fixes #4117

Development

Successfully merging this pull request may close these issues.

[iris] Flaky test_marin_pipeline_on_iris: ZephyrWorkerError in dedup_fuzzy_document
