Ensure actors set erred state properly in case of worker failure #9067
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
27 files ±0, 27 suites ±0, 10h 23m 20s ⏱️ +1m 7s
For more details on these failures and errors, see this check.
Results for commit ba4a9ce. ± Comparison against base commit 358402d.
Pull Request Overview
This PR addresses an issue where actor error transitions did not correctly set the erred state when a worker fails. Key changes include:
- Updating exception types from ValueError to RuntimeError in actor methods and test assertions.
- Adjusting the scheduler’s state transition functions to accept an optional worker argument and introducing a new transition for tasks that move from memory to erred.
- Modifying test cases to simulate both graceful and abrupt worker exits.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| distributed/tests/test_actor.py | Updates to test cases to expect RuntimeError and simulate actor failure scenarios |
| distributed/scheduler.py | Updates to state transition functions including signature changes and new helper for memory erred state |
| distributed/actor.py | Consistent exception type change to RuntimeError for actor attribute access |
Comments suppressed due to low confidence (2)
distributed/scheduler.py:2776
- The updated signature of _transition_processing_erred now allows a None value for the worker parameter. Please ensure that all call sites handle this possibility appropriately and update documentation accordingly.
`worker: str | None = None,`
distributed/tests/test_actor.py:294
- [nitpick] The test now expects a RuntimeError instead of a ValueError for a lost Actor. Ensure that any other exception handling in the codebase is updated consistently to reflect this change.
`with pytest.raises(RuntimeError, match="Worker holding Actor was lost"):`
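To make the exception change concrete, here is a minimal, standard-library-only sketch of the behaviour the updated test checks. `ToyActorProxy` and `mark_worker_lost` are hypothetical names invented for illustration and are not part of `distributed`; only the `RuntimeError` type and the "Worker holding Actor was lost" message come from this PR.

```python
class ToyActorProxy:
    """Hypothetical stand-in for an actor handle; not the real distributed.Actor."""

    def __init__(self, name):
        self._name = name
        self._worker_alive = True

    def mark_worker_lost(self):
        # Assumption: something outside the proxy notices the worker is gone.
        self._worker_alive = False

    def __getattr__(self, attr):
        # __getattr__ only runs for names not found by normal lookup.
        if attr.startswith("_"):
            raise AttributeError(attr)
        if not self._worker_alive:
            # RuntimeError rather than ValueError, matching the change in this PR.
            raise RuntimeError("Worker holding Actor was lost")
        return lambda *args, **kwargs: f"{self._name}.{attr} called"


def call_after_loss():
    """Return the error message raised when using a proxy whose worker is lost."""
    proxy = ToyActorProxy("counter")
    proxy.mark_worker_lost()
    try:
        proxy.increment()
    except RuntimeError as exc:
        return str(exc)
    return None
```

Any caller that previously caught `ValueError` for a lost actor would miss this error, which is the consistency concern raised in the comment above.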
If the worker the actor is living on closes, this can corrupt the state machine with errors like
This implements an appropriate transition for these cases and ensures that the result is properly set to erred.
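For readers unfamiliar with the state machine, the idea can be sketched with a toy transition table. This is an illustrative model under simplifying assumptions, not the code in `distributed/scheduler.py`: `ToyScheduler`, `ToyTask`, and `remove_worker` here are hypothetical, and the real transition functions take and return considerably more state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyTask:
    key: str
    state: str = "processing"
    actor: bool = False
    worker: str | None = None
    exception: str | None = None


class ToyScheduler:
    """Hypothetical, heavily simplified model of a task state machine."""

    def __init__(self) -> None:
        self.tasks: dict[str, ToyTask] = {}
        # Transitions are looked up by (start, finish); a missing pair is an error.
        self._transitions = {
            ("processing", "memory"): self._processing_memory,
            ("processing", "erred"): self._processing_erred,
            ("memory", "erred"): self._memory_erred,
        }

    def transition(self, key, finish, *, worker=None, exception=None):
        ts = self.tasks[key]
        func = self._transitions.get((ts.state, finish))
        if func is None:
            raise RuntimeError(
                f"Impossible transition from {ts.state!r} to {finish!r} for {key!r}"
            )
        func(ts, worker=worker, exception=exception)

    def _processing_memory(self, ts, *, worker=None, exception=None):
        if worker is None:
            raise RuntimeError("a finished task must name the worker holding it")
        ts.state, ts.worker = "memory", worker

    def _processing_erred(self, ts, *, worker=None, exception=None):
        # worker is optional: the failure may not be reported by any worker.
        ts.state, ts.worker = "erred", None
        ts.exception = exception or "unknown error"

    def _memory_erred(self, ts, *, worker=None, exception=None):
        # Without this entry, losing an actor's worker would hit the
        # "Impossible transition" branch above.
        ts.state, ts.worker = "erred", None
        ts.exception = exception or "unknown error"

    def remove_worker(self, address):
        # Assumption: only actor tasks held in memory on that worker are erred.
        for ts in self.tasks.values():
            if ts.actor and ts.state == "memory" and ts.worker == address:
                self.transition(
                    ts.key, "erred", exception="Worker holding Actor was lost"
                )
```

A short usage example: an actor task reaches memory on one worker, then that worker goes away.

```python
sched = ToyScheduler()
sched.tasks["actor-1"] = ToyTask("actor-1", actor=True)
sched.transition("actor-1", "memory", worker="tcp://worker-1")
sched.remove_worker("tcp://worker-1")
print(sched.tasks["actor-1"].state, sched.tasks["actor-1"].exception)
```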