[zephyr] Fix _check_worker_group false abort after completed stage #4140
Merged
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09219d04d8
0db407b to ccd6fdd
Contributor
Looks like this PR is mixing multiple changes.
_check_worker_group unconditionally treats worker_group.is_done()==True as a crash, even when all shards completed successfully. After the last stage, workers exit cleanly via SHUTDOWN, but the coordinator background loop then sees the Iris job as finished and aborts. This only affects the Iris backend (Local and Ray hardcode is_done to False).
_check_worker_group now skips the check when all shards have completed. Previously it unconditionally treated worker_group.is_done()==True as a crash, even when workers had exited cleanly after receiving SHUTDOWN on the last stage. This caused flaky failures on the Iris backend, where is_done() reports the real job status (Local and Ray hardcode it to False).
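For illustration, the guard described above might look roughly like the sketch below. This is not the actual zephyr code: the Coordinator and FakeWorkerGroup classes, the completed_shards set, the _all_shards_completed helper, and the WorkerGroupTerminated exception are hypothetical stand-ins. Only the names _check_worker_group and is_done, and the "Worker job terminated permanently" message, come from this PR.

```python
from dataclasses import dataclass, field


class WorkerGroupTerminated(RuntimeError):
    """Hypothetical error for workers that die before the work is finished."""


@dataclass
class FakeWorkerGroup:
    """Hypothetical stand-in for a backend job handle.

    `done=True` mimics a backend such as Iris reporting the job as finished.
    """

    done: bool = False

    def is_done(self) -> bool:
        return self.done


@dataclass
class Coordinator:
    """Hypothetical coordinator that tracks which shards have completed."""

    worker_group: FakeWorkerGroup
    total_shards: int
    completed_shards: set = field(default_factory=set)

    def _all_shards_completed(self) -> bool:
        return len(self.completed_shards) >= self.total_shards

    def _check_worker_group(self) -> None:
        # Guard: if every shard has completed, a finished job is a clean
        # exit after SHUTDOWN, not a crash, so there is nothing to abort.
        if self._all_shards_completed():
            return
        # Work is still outstanding, so a finished job means the workers
        # are gone for good.
        if self.worker_group.is_done():
            raise WorkerGroupTerminated("Worker job terminated permanently")
```

With this shape, a finished job only triggers the abort while shards are still outstanding, which matches the behavior the description asks for.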
ccd6fdd to 0fd80fd
Helw150 pushed a commit that referenced this pull request on Apr 8, 2026
…4140) _check_worker_group unconditionally treated worker_group.is_done()==True as a crash. After the last stage, workers exit cleanly via SHUTDOWN, Iris marks the job finished, and the coordinator background loop aborts with "Worker job terminated permanently" even though all shards completed. Only affects Iris (Local/Ray hardcode is_done to False). Adds a completed-shards guard to _check_worker_group and three regression tests. Fixes #4117