long-running Flow runs stuck cancelling with recent websockets library versions #18744

@masonmenges

Bug summary

Long-running flow runs (roughly 2+ hours) remain in a Cancelling state, and the flow run keeps executing, after a cancellation is requested through the UI or API.

For all intents and purposes the runs appear to still be running without issue, but at a certain point the Runner seems to stop picking up Cancelling states from the server, so the cancellation request is ignored. So far I've only been able to reproduce this in a containerized environment; it doesn't seem to happen locally or when using serve.

Code Example

import asyncio

from prefect import flow, task, get_run_logger
from prefect.futures import wait


@task
async def long_running_task():
    logger = get_run_logger()
    try:
        while True:
            await asyncio.sleep(5)

    except asyncio.CancelledError:
        logger.warning("got cancellation signal!")
        raise

    except Exception as ex:
        logger.error(f"got unexpected exception: {ex}")
        raise


@flow(log_prints=True)
async def long_running_flow():
    logger = get_run_logger()
    futures = []
    results = []

    # Submit ten tasks that loop until they are cancelled.
    for _ in range(10):
        futures.append(long_running_task.submit())
    logger.info("All tasks submitted!")

    wait(futures)

    try:
        for future in futures:
            results.append(future.result())
    except asyncio.CancelledError:
        logger.warning("main flow got manual cancellation! Notifying all tasks...")
        for future in futures:
            future.cancel()
        raise
    except Exception as ex:
        logger.error(f"main flow unexpectedly failed: {ex}")
        raise

On recent versions of Prefect (3.4.11) with websockets 13.0 or later, once this has been running for an extended period I'm no longer able to cancel the flow run and it remains stuck. Nothing specific shows up in the logs when this occurs beyond the following:

02:36:46.330 | DEBUG   | prefect.events.clients -   pinging...
02:36:46.367 | DEBUG   | prefect.events.clients -   authenticating...
02:36:46.419 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
02:36:46.420 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 02:35:46.420519+00:00...
02:46:46.316 | DEBUG   | prefect.events.clients - Connection closed with 5/10 attempts
02:46:47.318 | DEBUG   | prefect.events.clients - Reconnecting...
02:46:47.396 | DEBUG   | prefect.events.clients -   pinging...
02:46:47.433 | DEBUG   | prefect.events.clients -   authenticating...
02:46:47.503 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
02:46:47.504 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 02:45:47.504513+00:00...
02:56:47.388 | DEBUG   | prefect.events.clients - Connection closed with 6/10 attempts
02:56:48.389 | DEBUG   | prefect.events.clients - Reconnecting...
02:56:48.472 | DEBUG   | prefect.events.clients -   pinging...
02:56:48.509 | DEBUG   | prefect.events.clients -   authenticating...
02:56:48.558 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
02:56:48.559 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 02:55:48.559070+00:00...
03:06:48.463 | DEBUG   | prefect.events.clients - Connection closed with 7/10 attempts
03:06:49.465 | DEBUG   | prefect.events.clients - Reconnecting...
03:06:49.579 | DEBUG   | prefect.events.clients -   pinging...
03:06:49.615 | DEBUG   | prefect.events.clients -   authenticating...
03:06:49.680 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
03:06:49.681 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 03:05:49.681218+00:00...
03:16:49.580 | DEBUG   | prefect.events.clients - Connection closed with 8/10 attempts
03:16:50.583 | DEBUG   | prefect.events.clients - Reconnecting...
03:16:50.669 | DEBUG   | prefect.events.clients -   pinging...
03:16:50.706 | DEBUG   | prefect.events.clients -   authenticating...
03:16:50.768 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
03:16:50.769 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 03:15:50.769308+00:00...
03:26:50.660 | DEBUG   | prefect.events.clients - Connection closed with 9/10 attempts
03:26:51.661 | DEBUG   | prefect.events.clients - Reconnecting...
03:26:51.743 | DEBUG   | prefect.events.clients -   pinging...
03:26:51.778 | DEBUG   | prefect.events.clients -   authenticating...
03:26:51.873 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
03:26:51.874 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 03:25:51.874528+00:00...
03:36:51.731 | DEBUG   | prefect.events.clients - Connection closed with 10/10 attempts
03:36:52.732 | DEBUG   | prefect.events.clients - Reconnecting...
03:36:52.811 | DEBUG   | prefect.events.clients -   pinging...
03:36:52.849 | DEBUG   | prefect.events.clients -   authenticating...
03:36:52.920 | DEBUG   | prefect.events.clients -   auth result {'type': 'auth_success'}
03:36:52.921 | DEBUG   | prefect.events.clients -   filtering events since 2025-08-19 03:35:52.921601+00:00...
03:46:52.798 | DEBUG   | prefect.events.clients - Connection closed with 11/10 attempts
09:05:20.567 | WARNING | opentelemetry.exporter.otlp.proto.http.trace_exporter - Transient error Service Unavailable encountered while exporting span batch, retrying in 1.14s.
13:32:21.193 | WARNING | opentelemetry.exporter.otlp.proto.http.trace_exporter - Transient error Service Unavailable encountered while exporting span batch, retrying in 0.91s.
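
One reading of the log above, offered as a hypothesis rather than a confirmed diagnosis: the connection is closed about every ten minutes, the attempt counter climbs from 5/10 to 11/10 even though each reconnect authenticates successfully, and nothing is logged by prefect.events.clients after the 11/10 line. The sketch below simulates that pattern with a stand-in connection. `FakeConnection`, `consume`, and all parameters are hypothetical names for illustration; this is not Prefect's or websockets' actual code.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a websocket the server closes after a few messages."""

    def __init__(self, messages_before_close: int) -> None:
        self._remaining = messages_before_close

    async def recv(self) -> str:
        if self._remaining == 0:
            raise ConnectionError("connection closed")
        self._remaining -= 1
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return "event"


async def consume(max_attempts: int, total_closes: int, reset_on_success: bool) -> int:
    """Return how many server-side closes the loop survived before giving up."""
    attempts = 0
    closes_survived = 0
    for _ in range(total_closes):
        conn = FakeConnection(messages_before_close=1)
        try:
            while True:
                await conn.recv()
                if reset_on_success:
                    # A healthy connection clears the failure budget.
                    attempts = 0
        except ConnectionError:
            attempts += 1
            if attempts > max_attempts:
                # The subscriber stops here without raising anything further.
                return closes_survived
            closes_survived += 1
    return closes_survived


if __name__ == "__main__":
    print(asyncio.run(consume(10, 50, reset_on_success=False)))
    print(asyncio.run(consume(10, 50, reset_on_success=True)))
```

If the counter is never reset, the loop survives only `max_attempts` closes no matter how healthy each reconnect was. With one close every ten minutes and a budget of ten, that would be a little under two hours, which lines up with the ~2 hour threshold described in the summary.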

On older versions of Prefect, and consequently (as far as I can tell) older versions of the websockets library (<13.0), I'm able to run the above and cancel it successfully after 12+ hours. It's entirely possible this isn't directly related to websockets, but as far as I can tell that's the most likely place this could hang without producing any direct failures.
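
To check whether a given container falls in the range described above, a small standard-library helper along these lines may be useful. The 13.0 threshold is taken from this report and is not a confirmed boundary; the function names are hypothetical.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '13.1'."""
    parts = version.split(".")
    major = int(parts[0])
    # Treat a missing or non-numeric minor component (e.g. '0rc1') as 0.
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
    return major, minor


def in_reported_range(version: str, threshold: Tuple[int, int] = (13, 0)) -> bool:
    """True if the version is at or above the threshold named in this report."""
    return parse_major_minor(version) >= threshold


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for package in ("prefect", "websockets"):
        print(package, installed_version(package))
    ws = installed_version("websockets")
    if ws is not None:
        print("websockets in reported range:", in_reported_range(ws))
```

Running this inside the container where the flow executes shows which websockets version is actually in use there, since it can differ from the local environment.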

Version info

Version:             3.4.12
API version:         0.8.4
Python version:      3.12.11
Git commit:          35e04f52
Built:               Fri, Aug 08, 2025 10:43 PM
OS/Arch:             linux/x86_64
Profile:             ephemeral
Server type:         ephemeral
Pydantic version:    2.11.7
Server:
  Database:          sqlite
  SQLite version:    3.40.1
Integrations:
  prefect-redis:     0.2.3

Additional context

Websockets version: 13.1

Labels

bug (Something isn't working), great writeup (This is a wonderful example of our standards)
