Description
Many tests start the kopf controller via the `kopf_runner` fixture and perform work within the context manager that it provides.
```python
def test_foo(kopf_runner):
    # Start the controller
    with kopf_runner:
        # do work
        ...
```
Any resources that use `@kopf.timer`, `@kopf.daemon` or `@kopf.on.shutdown` have finalizers set on the resources they relate to, which means the controller must be running at the time those resources are deleted. However, we rely heavily on cascade deletion: the parent resource is removed and the test exits, but the child resources take some time to be reaped. This is a race condition that can end in a deadlock, because once the test has exited the controller is no longer running and the reaping hangs on the unhandled finalizers.
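For context, the finalizer comes from kopf itself. A minimal sketch of such a timer handler (illustrative only, not the actual operator code; the group/version/plural selectors and handler name are assumptions):

```python
import kopf

# Illustrative sketch, not the dask-kubernetes operator code.
# kopf places a finalizer (by default "kopf.zalando.org/KopfFinalizerMarker")
# on every resource matched by a timer/daemon handler, and only a running
# controller removes it when the resource is deleted.
@kopf.timer("kubernetes.dask.org", "v1", "daskautoscalers", interval=5.0)
async def adapt_workers(spec, name, namespace, **kwargs):
    # Periodically reconcile the worker count between spec["minimum"] and
    # spec["maximum"] (details omitted)
    ...
```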
This happens in `dask_kubernetes/experimental/tests/test_kubecluster.py::test_adapt`.
```python
def test_adapt(kopf_runner, docker_image):
    # Start the controller
    with kopf_runner:
        # Create a DaskCluster resource with KubeCluster
        with KubeCluster(name="adaptive", image=docker_image, n_workers=0) as cluster:
            # Create a DaskAutoscaler resource with cluster.adapt
            # This resource uses a @kopf.timer to periodically update the number of workers
            # and therefore has a finalizer to deactivate the timer
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

        # When leaving the KubeCluster context manager the DaskCluster resource is deleted and
        # the DaskAutoscaler will be deleted at some later point via cascade deletion

    # When leaving the kopf_runner context manager the controller is stopped

    # Eventually the DaskAutoscaler is reaped but there is no controller to handle the finalizer
    # so it never gets deleted
```
The hang happens when the CRDs are deleted by the `customresources` fixture: if any resources with finalizers are left over, the CRDs can never be deleted and the CI times out.
```
==================================== ERRORS ====================================
_______________________ ERROR at teardown of test_adapt ________________________

k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fbf79595040>

    @pytest.fixture(scope="session", autouse=True)
    def customresources(k8s_cluster):
        temp_dir = tempfile.TemporaryDirectory()
        crd_path = os.path.join(DIR, "operator", "customresources")

        for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
            run_generate(
                os.path.join(crd_path, f"{crd}.yaml"),
                os.path.join(crd_path, f"{crd}.patch.yaml"),
                os.path.join(temp_dir.name, f"{crd}.yaml"),
            )

        k8s_cluster.kubectl("apply", "-f", temp_dir.name)
        yield
>       k8s_cluster.kubectl("delete", "-f", temp_dir.name)
...
E       Failed: Timeout >300.0s
```
To work around this particular example we can call `cluster.scale(0)` after testing the adaptive behaviour, which deletes the `DaskAutoscaler` before we leave the `kopf_runner`.
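In the test above that would look roughly like this (only the tail of the inner `with KubeCluster(...)` block is shown, using the same names as before):

```python
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

            # Workaround: switch back to manual scaling, which deletes the
            # DaskAutoscaler while the controller is still running, so its
            # finalizer is handled before we leave kopf_runner
            cluster.scale(0)
```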
Ideally we should update the `customresources` fixture to walk through all custom resources and remove any finalizers before we try to delete the CRDs themselves.
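A rough sketch of what that could look like in the fixture teardown (the plural resource names are assumptions, and `k8s_cluster.kubectl(...)` is assumed to return the command's stdout, as pytest-kind's `KindCluster.kubectl` does):

```python
# Sketch only: clear finalizers on any leftover custom resources so that
# Kubernetes can garbage collect them without the controller running
for plural in ["daskclusters", "daskworkergroups", "daskjobs", "daskautoscalers"]:
    output = k8s_cluster.kubectl(
        "get", plural, "--all-namespaces",
        "-o", 'jsonpath={range .items[*]}{.metadata.namespace}/{.metadata.name}{"\\n"}{end}',
    )
    for item in output.splitlines():
        if not item:
            continue
        namespace, name = item.split("/", 1)
        # A JSON merge patch replaces the finalizers list wholesale,
        # unblocking deletion of the resource and, later, the CRD
        k8s_cluster.kubectl(
            "patch", plural, name, "-n", namespace,
            "--type", "merge", "-p", '{"metadata": {"finalizers": []}}',
        )

# Then delete the CRDs as before
k8s_cluster.kubectl("delete", "-f", temp_dir.name)
```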