Description
Bug summary
Long-running flow runs (roughly 2+ hours) remain in a Cancelling state, and the underlying flow run continues to execute after attempting to cancel the run through the UI/API.
As far as I can tell the runs themselves continue executing without issue, but at some point the Runner stops picking up Cancelling states from the server, so the cancellation request is effectively ignored. I've only been able to reproduce this in a containerized environment so far; it doesn't seem to happen locally or when using serve.
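For reference, a minimal sketch (not part of the repro itself) of how the stuck state can be confirmed against the API while the container keeps running; the flow run ID is a placeholder:

import asyncio
from uuid import UUID

from prefect import get_client

FLOW_RUN_ID = UUID("00000000-0000-0000-0000-000000000000")  # placeholder

async def check_state():
    async with get_client() as client:
        # The server keeps reporting Cancelling while the container keeps executing.
        flow_run = await client.read_flow_run(FLOW_RUN_ID)
        print(flow_run.state.type, flow_run.state.name)

asyncio.run(check_state())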
Code Example
import asyncio

from prefect import flow, task, get_run_logger
from prefect.futures import wait


@task
async def long_running_task():
    logger = get_run_logger()
    try:
        # Sleep forever so the flow run stays active until it is cancelled.
        while True:
            await asyncio.sleep(5)
    except asyncio.CancelledError:
        logger.warning("got cancellation signal!")
        raise
    except Exception as ex:
        logger.error(f"got unexpected exception: {ex}")
        raise


@flow(log_prints=True)
async def long_running_flow():
    logger = get_run_logger()
    tasks = []
    results = []
    for _ in range(10):
        tasks.append(long_running_task.submit())
    logger.info("All tasks submitted!")
    wait(tasks)
    try:
        for future in tasks:
            results.append(future.result())
    except asyncio.CancelledError:
        logger.warning("main flow got manual cancellation! Notifying all tasks...")
        for future in tasks:
            future.cancel()
        raise
    except Exception as ex:
        logger.error(f"main flow unexpectedly failed: {ex}")
        raise

On recent versions of Prefect (3.4.11) with websockets 13.0 or later, after this executes for an extended period of time I'm no longer able to cancel the flow run and it remains stuck. Nothing especially specific shows up in the logs when this happens, beyond the following:
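For completeness, the programmatic equivalent of the UI cancellation I'm attempting is roughly the following sketch (the flow run ID is a placeholder); the server records the Cancelling state, but the runner never acts on it:

import asyncio
from uuid import UUID

from prefect import get_client
from prefect.states import Cancelling

FLOW_RUN_ID = UUID("00000000-0000-0000-0000-000000000000")  # placeholder

async def cancel_run():
    async with get_client() as client:
        # Ask the orchestrator to move the run into Cancelling, as the UI/API does.
        result = await client.set_flow_run_state(FLOW_RUN_ID, state=Cancelling())
        print(result.status)

asyncio.run(cancel_run())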
02:36:46.330 | DEBUG | prefect.events.clients - pinging...
02:36:46.367 | DEBUG | prefect.events.clients - authenticating...
02:36:46.419 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
02:36:46.420 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 02:35:46.420519+00:00...
02:46:46.316 | DEBUG | prefect.events.clients - Connection closed with 5/10 attempts
02:46:47.318 | DEBUG | prefect.events.clients - Reconnecting...
02:46:47.396 | DEBUG | prefect.events.clients - pinging...
02:46:47.433 | DEBUG | prefect.events.clients - authenticating...
02:46:47.503 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
02:46:47.504 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 02:45:47.504513+00:00...
02:56:47.388 | DEBUG | prefect.events.clients - Connection closed with 6/10 attempts
02:56:48.389 | DEBUG | prefect.events.clients - Reconnecting...
02:56:48.472 | DEBUG | prefect.events.clients - pinging...
02:56:48.509 | DEBUG | prefect.events.clients - authenticating...
02:56:48.558 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
02:56:48.559 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 02:55:48.559070+00:00...
03:06:48.463 | DEBUG | prefect.events.clients - Connection closed with 7/10 attempts
03:06:49.465 | DEBUG | prefect.events.clients - Reconnecting...
03:06:49.579 | DEBUG | prefect.events.clients - pinging...
03:06:49.615 | DEBUG | prefect.events.clients - authenticating...
03:06:49.680 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
03:06:49.681 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 03:05:49.681218+00:00...
03:16:49.580 | DEBUG | prefect.events.clients - Connection closed with 8/10 attempts
03:16:50.583 | DEBUG | prefect.events.clients - Reconnecting...
03:16:50.669 | DEBUG | prefect.events.clients - pinging...
03:16:50.706 | DEBUG | prefect.events.clients - authenticating...
03:16:50.768 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
03:16:50.769 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 03:15:50.769308+00:00...
03:26:50.660 | DEBUG | prefect.events.clients - Connection closed with 9/10 attempts
03:26:51.661 | DEBUG | prefect.events.clients - Reconnecting...
03:26:51.743 | DEBUG | prefect.events.clients - pinging...
03:26:51.778 | DEBUG | prefect.events.clients - authenticating...
03:26:51.873 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
03:26:51.874 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 03:25:51.874528+00:00...
03:36:51.731 | DEBUG | prefect.events.clients - Connection closed with 10/10 attempts
03:36:52.732 | DEBUG | prefect.events.clients - Reconnecting...
03:36:52.811 | DEBUG | prefect.events.clients - pinging...
03:36:52.849 | DEBUG | prefect.events.clients - authenticating...
03:36:52.920 | DEBUG | prefect.events.clients - auth result {'type': 'auth_success'}
03:36:52.921 | DEBUG | prefect.events.clients - filtering events since 2025-08-19 03:35:52.921601+00:00...
03:46:52.798 | DEBUG | prefect.events.clients - Connection closed with 11/10 attempts
09:05:20.567 | WARNING | opentelemetry.exporter.otlp.proto.http.trace_exporter - Transient error Service Unavailable encountered while exporting span batch, retrying in 1.14s.
13:32:21.193 | WARNING | opentelemetry.exporter.otlp.proto.http.trace_exporter - Transient error Service Unavailable encountered while exporting span batch, retrying in 0.91s.
On older versions of Prefect and, consequently, older versions of the websockets library (<13.0), I'm able to run the above and cancel it successfully even after 12+ hours. It's entirely possible this isn't directly related to websockets, but as far as I can tell that's the most likely place this could be hanging without producing any direct failures.
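A small sketch for confirming which versions are actually installed inside the container when comparing a failing run against a working one (the version numbers in the comment reflect my environments):

from importlib.metadata import version

# Failing environment: prefect 3.4.12 with websockets 13.1; working environment: websockets < 13.0.
print("prefect:", version("prefect"))
print("websockets:", version("websockets"))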
Version info
Version: 3.4.12
API version: 0.8.4
Python version: 3.12.11
Git commit: 35e04f52
Built: Fri, Aug 08, 2025 10:43 PM
OS/Arch: linux/x86_64
Profile: ephemeral
Server type: ephemeral
Pydantic version: 2.11.7
Server:
  Database: sqlite
  SQLite version: 3.40.1
Integrations:
  prefect-redis: 0.2.3
Additional context
Websockets version: 13.1