Flakiness in test_shuffle.py #8074

Open
@crusaderky

Several tests in test_shuffle.py are very flaky.

If I change .github/workflows/tests.yaml as follows, so that each test runs 20 times per environment (10 repetitions each for ci1 and not ci1):

          pytest distributed/shuffle/tests/test_shuffle.py --count=10 --runslow \
              --leaks=...

I get the following failure rates:

| Test | Number of failures |
|------|--------------------|
| distributed/shuffle/tests/test_shuffle.py::test_clean_after_close | 1 |
| distributed/shuffle/tests/test_shuffle.py::test_closed_input_only_worker_during_transfer | 1 |
| distributed/shuffle/tests/test_shuffle.py::test_closed_worker_during_transfer | 29 |
| distributed/shuffle/tests/test_shuffle.py::test_crashed_worker_during_transfer | 6 |
| distributed/shuffle/tests/test_shuffle.py::test_restarting_during_transfer_raises_killed_worker | 38 |
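
For anyone who wants to reproduce a tally like this, here is a minimal sketch of one way to count failures per test from a JUnit XML report (as written by pytest's `--junitxml` option). It is not necessarily how the numbers above were collected. The helper name `count_failures` is hypothetical, and the sketch assumes that repeated runs carry a trailing bracketed id such as `[3-10]`, which is how pytest-repeat is assumed to label repetitions.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: each repetition carries a trailing bracketed id such as "[3-10]".
# Stripping it groups all repetitions of a test under one key. Note that this
# also merges genuinely parametrized variants of the same test.
_PARAM_SUFFIX = re.compile(r"\[[^\]]*\]$")


def count_failures(junit_xml: str) -> Counter:
    """Count failed or errored test cases per test in a JUnit XML report."""
    failures = Counter()
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        # A test case counts as failed if it has a <failure> or <error> child.
        if case.find("failure") is None and case.find("error") is None:
            continue
        name = _PARAM_SUFFIX.sub("", case.get("name", ""))
        failures[f"{case.get('classname', '')}::{name}"] += 1
    return failures
```

Because the function returns a `Counter`, the results from several reports (for example, one per CI environment) can be summed with `+` to get a combined table.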

Additionally, test_crashed_worker_during_transfer deadlocks irrecoverably on Windows, which causes the whole test suite to be killed by the pytest-timeout settings in distributed/pyproject.toml (lines 155 to 162 in eb297b3):

    # pytest-timeout settings
    # 'thread' kills off the whole test suite. 'signal' only kills the offending test.
    # However, 'signal' doesn't work on Windows (due to lack of SIGALRM).
    # The CI script modifies this config file on the fly on Linux and MacOS.
    timeout_method = "thread"
    # This should not be reduced; Windows CI has been observed to be occasionally
    # exceptionally slow.
    timeout = 300
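
The platform constraint described in the config comments can be illustrated with a short check. This is only a sketch of the underlying condition, not the CI script itself; `pick_timeout_method` is a hypothetical helper name.

```python
import signal


def pick_timeout_method() -> str:
    """Return "signal" where SIGALRM exists, otherwise fall back to "thread"."""
    # Windows does not provide SIGALRM, so only the "thread" method is usable
    # there; with "thread", a hung test takes the whole pytest process down.
    return "signal" if hasattr(signal, "SIGALRM") else "thread"
```

On Linux and macOS this returns `"signal"`, so a single deadlocked test can be interrupted without ending the run. On Windows it returns `"thread"`, which is why the deadlock above kills the entire suite.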

logs: https://github.com/crusaderky/distributed/actions/runs/5761255813

CC @hendrikmakait

Labels: flaky test (intermittent failures on CI), tests (unit tests and/or continuous integration)
