Description
Many tests start the kopf controller via the `kopf_runner` fixture and perform work within the context manager that it provides.
```python
def test_foo(kopf_runner):
    # Start the controller
    with kopf_runner:
        # do work
        ...
```
Any resources that use `@kopf.timer`, `@kopf.daemon` or `@kopf.on.shutdown` have finalizers set on the resources they relate to, which means the controller must be running at the time those resources are deleted. However, we rely heavily on cascade deletion: the parent resource is removed and the test exits, but the child resources take some time to be reaped. This is a race condition that can end in a deadlock, because once the test has exited the controller is no longer running and the reaping hangs on the unhandled finalizers.
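For context, the finalizer comes from kopf itself. A minimal sketch of such a timer handler (illustrative only, not the actual operator code; the group/version/plural selectors and handler name are assumptions):

```python
import kopf

# Illustrative sketch, not the dask-kubernetes operator code.
# kopf places a finalizer (by default "kopf.zalando.org/KopfFinalizerMarker")
# on every resource matched by a timer/daemon handler, and only a running
# controller removes it when the resource is deleted.
@kopf.timer("kubernetes.dask.org", "v1", "daskautoscalers", interval=5.0)
async def adapt_workers(spec, name, namespace, **kwargs):
    # Periodically reconcile the worker count between spec["minimum"] and
    # spec["maximum"] (details omitted)
    ...
```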
This happens in `dask_kubernetes/experimental/tests/test_kubecluster.py::test_adapt`.
```python
def test_adapt(kopf_runner, docker_image):
    # Start the controller
    with kopf_runner:
        # Create a DaskCluster resource with KubeCluster
        with KubeCluster(name="adaptive", image=docker_image, n_workers=0) as cluster:
            # Create a DaskAutoscaler resource with cluster.adapt
            # This resource uses a @kopf.timer to periodically update the number of workers
            # and therefore has a finalizer to deactivate the timer
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

        # When leaving the KubeCluster context manager the DaskCluster resource is deleted and
        # the DaskAutoscaler will be deleted at some later point via cascade deletion

    # When leaving the kopf_runner context manager the controller is stopped

    # Eventually the DaskAutoscaler is reaped but there is no controller to handle the finalizer
    # so it never gets deleted
```
The hang happens when the CRDs are deleted by the `customresources` fixture: if any resources with finalizers are left over, the CRDs can never be deleted and the CI times out.
```
==================================== ERRORS ====================================
_______________________ ERROR at teardown of test_adapt ________________________

k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fbf79595040>

    @pytest.fixture(scope="session", autouse=True)
    def customresources(k8s_cluster):
        temp_dir = tempfile.TemporaryDirectory()
        crd_path = os.path.join(DIR, "operator", "customresources")

        for crd in ["daskcluster", "daskworkergroup", "daskjob", "daskautoscaler"]:
            run_generate(
                os.path.join(crd_path, f"{crd}.yaml"),
                os.path.join(crd_path, f"{crd}.patch.yaml"),
                os.path.join(temp_dir.name, f"{crd}.yaml"),
            )

        k8s_cluster.kubectl("apply", "-f", temp_dir.name)
        yield
>       k8s_cluster.kubectl("delete", "-f", temp_dir.name)
...
E       Failed: Timeout >300.0s
```
To work around this particular example we can call `cluster.scale(0)` after testing the adaptive behaviour, which deletes the `DaskAutoscaler` before we leave the `kopf_runner`.
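In the test above that would look roughly like this (only the tail of the inner `with KubeCluster(...)` block is shown, using the same names as before):

```python
            cluster.adapt(minimum=0, maximum=1)

            # Do some testing
            with Client(cluster) as client:
                assert client.submit(lambda x: x + 1, 10).result() == 11

            # Workaround: switch back to manual scaling, which deletes the
            # DaskAutoscaler while the controller is still running, so its
            # finalizer is handled before we leave kopf_runner
            cluster.scale(0)
```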
Ideally we should update the `customresources` fixture to walk through all custom resources and remove any finalizers before we try to delete the CRDs themselves.
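A rough sketch of what that could look like in the fixture teardown (the plural resource names are assumptions, and `k8s_cluster.kubectl(...)` is assumed to return the command's stdout, as pytest-kind's `KindCluster.kubectl` does):

```python
# Sketch only: clear finalizers on any leftover custom resources so that
# Kubernetes can garbage collect them without the controller running
for plural in ["daskclusters", "daskworkergroups", "daskjobs", "daskautoscalers"]:
    output = k8s_cluster.kubectl(
        "get", plural, "--all-namespaces",
        "-o", 'jsonpath={range .items[*]}{.metadata.namespace}/{.metadata.name}{"\\n"}{end}',
    )
    for item in output.splitlines():
        if not item:
            continue
        namespace, name = item.split("/", 1)
        # A JSON merge patch replaces the finalizers list wholesale,
        # unblocking deletion of the resource and, later, the CRD
        k8s_cluster.kubectl(
            "patch", plural, name, "-n", namespace,
            "--type", "merge", "-p", '{"metadata": {"finalizers": []}}',
        )

# Then delete the CRDs as before
k8s_cluster.kubectl("delete", "-f", temp_dir.name)
```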