Avoid handling stale long-running messages on scheduler #8991
Conversation
if worker not in self.workers:
    logger.debug(
        "Received long-running signal from unknown worker %s. Ignoring.", worker
    )
    return
This is mostly for good measure; I think the code should also work without this.
steal = self.extensions.get("stealing")
if steal is not None:
    steal.remove_key_from_stealable(ts)
I haven't tested the move of this code, but I'm certain that we should deal with staleness before taking any meaningful actions.
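For illustration, here is a minimal sketch of the ordering being discussed; the structure and names are assumed from the snippets in this diff rather than copied from distributed/scheduler.py, so treat the details as approximations. The point is that all staleness checks run first, and side effects such as removing the key from work stealing only happen once the signal is known to be current.

import logging

logger = logging.getLogger("distributed.scheduler")

# Sketch only; the real handler differs in details.
def handle_long_running(self, key, worker, compute_duration, stimulus_id):
    # 1) Staleness checks, with no side effects yet.
    if worker not in self.workers:
        logger.debug(
            "Received long-running signal from unknown worker %s. Ignoring.", worker
        )
        return
    ts = self.tasks.get(key)
    if ts is None or ts.processing_on is None:
        # Task was released or the signal is a duplicate; drop it.
        logger.debug("Received long-running signal from duplicate task. Ignoring.")
        return
    if ts.processing_on.address != worker:
        # The task has since been (re)assigned elsewhere; the signal is stale.
        return

    # 2) Only now take meaningful actions.
    steal = self.extensions.get("stealing")
    if steal is not None:
        steal.remove_key_from_stealable(ts)
    # ... occupancy / duration bookkeeping would follow here ...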
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0   27 suites ±0   11h 34m 41s ⏱️ +35s

For more details on these failures and errors, see this check.

Results for commit 7065f02. ± Comparison against base commit 49f5e74.

This pull request removes 2 and adds 6 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
distributed/scheduler.py (Outdated)
logger.debug("Received long-running signal from duplicate task. Ignoring.")
return
if ws.address != worker:
Ideally there would be a more reliable way to verify the request's integrity.
A chain like this

processing -> long running -> released -> processing (without a long-running transition)

happening on the same worker would still let a stale event be recognized as valid. However, I doubt this is a relevant scenario in practice.
# Assert that the handler did not fail and no state was corrupted
logs = caplog.getvalue()
assert not logs
assert not wsB.task_prefix_count
I would prefer a test that does not rely on logging. Is this corruption detectable with validate? (If not, can validate be extended to detect it?)
Good point, let me check.
This doesn't seem to work out of the box. We'd either have to log (or hard-fail) on errors in the stimulus handler or validate that scheduler and worker state don't drift apart.
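To make the second option concrete, here is a rough sketch of a validate-style invariant that could catch this kind of corruption. The helper itself is hypothetical, and the attribute names (WorkerState.processing, task_prefix_count, TaskPrefix.name) reflect my understanding of the scheduler state rather than anything in this PR, so treat the details as assumptions.

from collections import Counter

# Hypothetical helper on the Scheduler; not an existing method.
def validate_task_prefix_counts(self) -> None:
    for ws in self.workers.values():
        # Recompute per-prefix counts from the tasks the scheduler believes are
        # processing on this worker and compare with the cached counters.
        expected = Counter(ts.prefix.name for ts in ws.processing)
        actual = {name: count for name, count in ws.task_prefix_count.items() if count}
        assert actual == dict(expected), (
            f"task_prefix_count drifted on {ws.address}: {actual} != {dict(expected)}"
        )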
# Submit task and wait until it executes on a
x = c.submit(
    f,
    block_secede,
    block_long_running,
    key="x",
    workers=[a.address],
)
await wait_for_state("x", "executing", a)
with captured_logger("distributed.scheduler", logging.ERROR) as caplog:
    with freeze_batched_send(a.batched_stream):
For review (and future maintainability) it might be helpful to briefly document in a sentence or two what the code below is constructing and asserting.
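Something along these lines might do, based on my reading of the snippets in this PR; the exact sequence is a guess and would need to be checked against the full test.

# (Suggested wording only, inferred from the PR; may not match the test exactly.)
# Hold back worker a's batched stream so its long-running (secede) message is
# delayed, let the scheduler reassign/rerun "x" in the meantime, then unfreeze
# so the now-stale message arrives. The assertions check that the scheduler
# handles it without logging an error and without corrupting bookkeeping such
# as wsB.task_prefix_count.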
| key="x", | ||
| workers=[a.address], | ||
| ) | ||
| await wait_for_state("x", "executing", a) |
Since you're already dealing with so many events above, why not use an event for this as well? Is it important to interrupt as soon as the task is in this state, i.e. before it's executed on the TPE?
Works for me, no strong preference between polling and adding yet another event. I felt that this was a bit easier to read, but I guess YMMV.
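For reference, the two variants under discussion would look roughly like this inside the async test body. This is a sketch only; ev_executing is a hypothetical extra distributed.Event and not part of the actual test.

# Variant 1: poll the worker's state machine, as the test currently does.
await wait_for_state("x", "executing", a)

# Variant 2: have the task signal explicitly once its body is running.
def f(ev_executing, block_secede, block_long_running):
    ev_executing.set()  # reached the task body on the thread pool
    block_secede.wait()
    distributed.secede()
    block_long_running.wait()

await ev_executing.wait()

The first fires as soon as the worker marks the task as executing; the second only once the function has actually started on the thread pool, which is the distinction behind the question above.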
def f(block_secede, block_long_running):
    block_secede.wait()
    distributed.secede()
Does this also trigger when using worker_client? secede is an API I typically discourage using, mostly because its counterpart rejoin is quite broken.
I strongly suspect it does. The original workload where this popped up had many clients connected to the scheduler.
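For context, worker_client secedes from the worker's thread pool when entered (and rejoins on exit), so a workload like the sketch below should exercise the same long-running path. This is an illustration, not the original workload.

from distributed import worker_client

def combine(parts):
    # Entering worker_client calls secede() under the hood, which emits the
    # same long-running message to the scheduler as an explicit secede().
    with worker_client() as client:
        futures = client.map(lambda p: p + 1, parts)
        return sum(client.gather(futures))

# e.g. client.submit(combine, [1, 2, 3]) runs this inside a task on a worker.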
fjetter left a comment
LGTM. If we can rewrite the test to use validate (or extend validate), that'd be great, but it's not a blocker.
pre-commit run --all-files