Preload code provided by --preload does not kill worker on failure #8754

Open
@Fogapod

Describe the issue:

When custom preload code is supplied to a worker via --preload, exceptions raised while the preload starts do not propagate; they are only logged to the console.
Relevant code:

try:
    await preload.start()
except Exception:
    logger.exception("Failed to start preload: %s", preload.name)
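
To make the effect of that handler concrete, here is a minimal standalone sketch that uses only asyncio and logging. `FailingPreload` and `start_preloads` are hypothetical stand-ins invented for illustration, not distributed's actual classes; only the try/except shape is taken from the snippet above.

```python
import asyncio
import logging

logger = logging.getLogger("preload_sketch")


class FailingPreload:
    """Hypothetical stand-in for a preload whose setup raises."""

    name = "failing_preload"

    async def start(self):
        raise RuntimeError("database connection failed")


async def start_preloads(preloads):
    """Same try/except shape as the quoted snippet: log the error and continue."""
    for preload in preloads:
        try:
            await preload.start()
        except Exception:
            logger.exception("Failed to start preload: %s", preload.name)
    # The caller reaches this line even though a preload failed.
    return "started"


result = asyncio.run(start_preloads([FailingPreload()]))
```

The traceback is logged, but the caller still receives a normal return value, so nothing upstream can tell that the preload failed.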

I run critical setup code in the preload, so I need the worker to fail if, for example, a database connection cannot be established, instead of hitting runtime errors later. Currently I use the following hack to at least prevent workers from registering with the scheduler:

from dask.distributed import Worker

# dask calls this function
# all code which might fail must exist inside try block
async def dask_setup(worker: Worker):
    try:
        from backend.dask.preload import preload

        await preload(worker)
    except Exception as e:
        import sys

        print("preload failed:", e)

        # explicitly exit to prevent worker from running without preload code
        # this does not kill pod but prevents it from registering in scheduler
        # worker.stop() does not work for some reason
        sys.exit(1)
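
A likely reason this hack has any effect: `sys.exit` raises `SystemExit`, which derives from `BaseException` rather than `Exception`, so a handler written as `except Exception` does not catch it. The sketch below shows that difference in isolation; `run_setup`, `failing_setup`, and `exiting_setup` are hypothetical names, and the wrapper only imitates the shape of the handler quoted earlier, not distributed's real code path.

```python
import asyncio
import sys


async def run_setup(setup):
    """Hypothetical wrapper that catches only Exception, like the quoted handler."""
    try:
        await setup()
    except Exception:
        return "swallowed"
    return "ok"


async def failing_setup():
    # Ordinary exception: caught by `except Exception`.
    1 / 0


async def exiting_setup():
    # SystemExit is a BaseException, so `except Exception` does not catch it.
    sys.exit(1)


swallowed = asyncio.run(run_setup(failing_setup))

try:
    asyncio.run(run_setup(exiting_setup))
    escaped = False
except SystemExit as exc:
    escaped = exc.code == 1
```

Here the `ZeroDivisionError` is absorbed by the wrapper, while the `SystemExit` propagates out of `asyncio.run` with exit code 1.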

Minimal Complete Verifiable Example:

async def dask_setup(worker):
    1 / 0

This results in:

2024-06-28 14:15:26,013 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.244.9.106:44589'
2024-06-28 14:15:26,476 - distributed.preloading - INFO - Creating preload: /opt/backend/dask/bin.runfiles/_main/backend/dask/preload_entrypoint.py
2024-06-28 14:15:26,477 - distributed.utils - INFO - Reload module preload_entrypoint from .py file
2024-06-28 14:15:26,477 - distributed.preloading - INFO - Import preload module: /opt/backend/dask/bin.runfiles/_main/backend/dask/preload_entrypoint.py
2024-06-28 14:15:26,876 - distributed.preloading - INFO - Run preload setup: /opt/backend/dask/bin.runfiles/_main/backend/dask/preload_entrypoint.py
2024-06-28 14:15:26,876 - distributed.preloading - ERROR - Failed to start preload: /opt/backend/dask/bin.runfiles/_main/backend/dask/preload_entrypoint.py
Traceback (most recent call last):
  File "/opt/backend/dask/bin.runfiles/rules_python~0.26.0~pip~pypi_311_distributed/site-packages/distributed/preloading.py", line 234, in start
    await preload.start()
  File "/opt/backend/dask/bin.runfiles/rules_python~0.26.0~pip~pypi_311_distributed/site-packages/distributed/preloading.py", line 213, in start
    await future
  File "/tmp/dask-scratch-space/worker-78qpzjdj/preload_entrypoint.py", line 8, in dask_setup
    1 / 0
    ~~^~~
ZeroDivisionError: division by zero

But the worker keeps running as if nothing happened.

Anything else we need to know?:

Environment:

  • Dask version: 2024.5.2
  • Python version: 3.11
  • Operating System: google's distroless_python
  • Install method (conda, pip, source): pip
