Skip to content

Ensure actors set erred state properly in case of worker failure#9067

Merged
fjetter merged 1 commit intodask:mainfrom
fjetter:actor_transition_errors
May 6, 2025
Merged

Ensure actors set erred state properly in case of worker failure#9067
fjetter merged 1 commit intodask:mainfrom
fjetter:actor_transition_errors

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented May 6, 2025

If the worker the actor is living on closes, this can corrupt the state machine with errors like


distributed.scheduler - ERROR - SchedulerState._transition_processing_erred() missing 1 required positional argument: 'worker'
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/utils.py", line 805, in wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 5539, in remove_worker
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 8231, in transitions
    self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 2124, in _transitions
    new_recs, new_cmsgs, new_wmsgs = self._transition(key, finish, stimulus_id)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 2007, in _transition
    recommendations, client_msgs, worker_msgs = func(
                                                ^^^^^
TypeError: SchedulerState._transition_processing_erred() missing 1 required positional argument: 'worker'
2025-05-02 06:02:49.6990
scheduler

distributed.scheduler - ERROR - Error transitioning 'train_fold-ea3e90ffa6cd2aefdd048600a1929a31' from 'processing' to 'erred'
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 2007, in _transition
    recommendations, client_msgs, worker_msgs = func(
                                                ^^^^^
TypeError: SchedulerState._transition_processing_erred() missing 1 required positional argument: 'worker'
2025-05-02 06:02:49.6980

This implements an appropriate transition for these cases and ensures that the result is properly set to erred.

@github-actions
Copy link
Contributor

github-actions bot commented May 6, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ± 0      27 suites  ±0   10h 23m 20s ⏱️ + 1m 7s
 4 109 tests + 2   3 991 ✅ + 1    105 💤 ±0  3 ❌ +1  10 🔥 ±0 
47 471 runs  +24  45 263 ✅ +23  2 185 💤 ±0  3 ❌ +1  20 🔥 ±0 

For more details on these failures and errors, see this check.

Results for commit ba4a9ce. ± Comparison against base commit 358402d.

@fjetter fjetter requested a review from Copilot May 6, 2025 13:21
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses an issue where actor error transitions did not correctly set the erred state when a worker fails. Key changes include:

  • Updating exception types from ValueError to RuntimeError in actor methods and test assertions.
  • Adjusting the scheduler’s state transition functions to support an optional worker argument and introducing a new transition for in‐memory erred tasks.
  • Modifying test cases to simulate both graceful and abrupt worker exits.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
distributed/tests/test_actor.py Updates to test cases to expect RuntimeError and simulate actor failure scenarios
distributed/scheduler.py Updates to state transition functions including signature changes and new helper for memory erred state
distributed/actor.py Consistent exception type change to RuntimeError for actor attribute access
Comments suppressed due to low confidence (2)

distributed/scheduler.py:2776

  • The updated signature of _transition_processing_erred now allows a None value for the worker parameter. Please ensure that all call sites handle this possibility appropriately and update documentation accordingly.
worker: str | None = None,

distributed/tests/test_actor.py:294

  • [nitpick] The test now expects a RuntimeError instead of a ValueError for a lost Actor. Ensure that any other exception handling in the codebase is updated consistently to reflect this change.
with pytest.raises(RuntimeError, match="Worker holding Actor was lost"):

@fjetter fjetter merged commit beb511b into dask:main May 6, 2025
26 of 34 checks passed
@fjetter fjetter deleted the actor_transition_errors branch May 6, 2025 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant