Avoid handling stale long-running messages on scheduler #8991
Conversation
if worker not in self.workers:
    logger.debug(
        "Received long-running signal from unknown worker %s. Ignoring.", worker
    )
    return
This is mostly for good measure; I think the code should also work without this.
steal = self.extensions.get("stealing")
if steal is not None:
    steal.remove_key_from_stealable(ts)
I haven't tested the move of this code, but I'm certain that we should deal with staleness before taking any meaningful actions.
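For illustration, here is a minimal sketch of the ordering being discussed; the structure and names are assumed from the snippets in this diff rather than copied from distributed/scheduler.py, so treat the details as approximations. The point is that all staleness checks run first, and side effects such as removing the key from work stealing only happen once the signal is known to be current.

import logging

logger = logging.getLogger("distributed.scheduler")

# Sketch only; the real handler differs in details.
def handle_long_running(self, key, worker, compute_duration, stimulus_id):
    # 1) Staleness checks, with no side effects yet.
    if worker not in self.workers:
        logger.debug(
            "Received long-running signal from unknown worker %s. Ignoring.", worker
        )
        return
    ts = self.tasks.get(key)
    if ts is None or ts.processing_on is None:
        # Task was released or the signal is a duplicate; drop it.
        logger.debug("Received long-running signal from duplicate task. Ignoring.")
        return
    if ts.processing_on.address != worker:
        # The task has since been (re)assigned elsewhere; the signal is stale.
        return

    # 2) Only now take meaningful actions.
    steal = self.extensions.get("stealing")
    if steal is not None:
        steal.remove_key_from_stealable(ts)
    # ... occupancy / duration bookkeeping would follow here ...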
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0   27 suites ±0   11h 34m 41s ⏱️ +35s

For more details on these failures and errors, see this check.

Results for commit 7065f02. ± Comparison against base commit 49f5e74.

This pull request removes 2 and adds 6 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
distributed/scheduler.py (Outdated)
logger.debug("Received long-running signal from duplicate task. Ignoring.")
return
if ws.address != worker:
Ideally there would be a more reliable way to verify the request's integrity.
A chain like this

processing -> long running -> released -> processing (without a long-running transition)

happening on the same worker would still let a stale event be recognized as valid. However, I doubt this is a relevant scenario in practice.
# Assert that the handler did not fail and no state was corrupted
logs = caplog.getvalue()
assert not logs
assert not wsB.task_prefix_count
I would prefer a test that does not rely on logging. Is this corruption detectable with validate? (If not, can validate be extended to detect it?)
Good point, let me check.
This doesn't seem to work out of the box. We'd either have to log (or hard-fail) on errors in the stimulus handler or validate that scheduler and worker state don't drift apart.
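To make the second option concrete, here is a rough sketch of a validate-style invariant that could catch this kind of corruption. The helper itself is hypothetical, and the attribute names (WorkerState.processing, task_prefix_count, TaskPrefix.name) reflect my understanding of the scheduler state rather than anything in this PR, so treat the details as assumptions.

from collections import Counter

# Hypothetical helper on the Scheduler; not an existing method.
def validate_task_prefix_counts(self) -> None:
    for ws in self.workers.values():
        # Recompute per-prefix counts from the tasks the scheduler believes are
        # processing on this worker and compare with the cached counters.
        expected = Counter(ts.prefix.name for ts in ws.processing)
        actual = {name: count for name, count in ws.task_prefix_count.items() if count}
        assert actual == dict(expected), (
            f"task_prefix_count drifted on {ws.address}: {actual} != {dict(expected)}"
        )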
# Submit task and wait until it executes on a
x = c.submit(
    f,
    block_secede,
    block_long_running,
    key="x",
    workers=[a.address],
)
await wait_for_state("x", "executing", a)
with captured_logger("distributed.scheduler", logging.ERROR) as caplog:
    with freeze_batched_send(a.batched_stream):
For review (and future maintainability) it might be helpful to briefly document in a sentence or two what the code below is constructing and asserting.
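Something along these lines might do, based on my reading of the snippets in this PR; the exact sequence is a guess and would need to be checked against the full test.

# (Suggested wording only, inferred from the PR; may not match the test exactly.)
# Hold back worker a's batched stream so its long-running (secede) message is
# delayed, let the scheduler reassign/rerun "x" in the meantime, then unfreeze
# so the now-stale message arrives. The assertions check that the scheduler
# handles it without logging an error and without corrupting bookkeeping such
# as wsB.task_prefix_count.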
| key="x", | ||
| workers=[a.address], | ||
| ) | ||
| await wait_for_state("x", "executing", a) |
Since you're already dealing with so many events above, why not use an event for this as well? Is it important to interrupt as soon as the task is in this state, i.e. before it's executed on the TPE?
Works for me, no strong preference between polling and adding yet another event. I felt that this was a bit easier to read, but I guess YMMV.
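For reference, the two variants under discussion would look roughly like this inside the async test body. This is a sketch only; ev_executing is a hypothetical extra distributed.Event and not part of the actual test.

# Variant 1: poll the worker's state machine, as the test currently does.
await wait_for_state("x", "executing", a)

# Variant 2: have the task signal explicitly once its body is running.
def f(ev_executing, block_secede, block_long_running):
    ev_executing.set()  # reached the task body on the thread pool
    block_secede.wait()
    distributed.secede()
    block_long_running.wait()

await ev_executing.wait()

The first fires as soon as the worker marks the task as executing; the second only once the function has actually started on the thread pool, which is the distinction behind the question above.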
def f(block_secede, block_long_running):
    block_secede.wait()
    distributed.secede()
Does this also trigger when using worker_client? secede is an API I typically discourage using, mostly because its counterpart rejoin is quite broken.
I strongly suspect it does. The original workload where this popped up had many clients connected to the scheduler.
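For context, worker_client secedes from the worker's thread pool when entered (and rejoins on exit), so a workload like the sketch below should exercise the same long-running path. This is an illustration, not the original workload.

from distributed import worker_client

def combine(parts):
    # Entering worker_client calls secede() under the hood, which emits the
    # same long-running message to the scheduler as an explicit secede().
    with worker_client() as client:
        futures = client.map(lambda p: p + 1, parts)
        return sum(client.gather(futures))

# e.g. client.submit(combine, [1, 2, 3]) runs this inside a task on a worker.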
fjetter left a comment
LGTM. If we can rewrite the test to use validate (or extend validate), that'd be great, but it's not a blocker.
pre-commit run --all-files