Summary
flytekit.remote.FlyteRemote.sync_node_execution recursively syncs launched-LP executions and parent-node children with no depth limit, no visited-set, and no cycle detection. A workflow graph with deeply nested launched LPs (or, in principle, a server response producing a cycle) crashes the client with RecursionError.
Affected lines (flytekit master @ 41a9f8009128db1391697cec0441e924e7e27d68):
None of these have a depth bound, visited-set, or cycle check. Note this is a different code path from #6147 (which was tracker.py:_resolve_abs_module_name).
Impact
A user with workflow-define permissions can author a workflow whose launched-LP nesting depth exceeds Python's default recursion limit (~1000) — or, more practically, whose nesting is deeper than sys.getrecursionlimit() minus the call-stack overhead at the point sync_node_execution is invoked. When ANY caller (CLI, web UI, programmatic) calls FlyteRemote.sync() on such an execution, the Python process terminates with RecursionError.
This is a per-client DoS (not server-wide), reachable only by authenticated users with workflow-define rights. The author's own users (and themselves) are affected; multi-tenant Flyte deployments amplify the radius.
The realistic trigger isn't a cycle — the Flyte server should reject those during compile — but rather deeply nested launch plans in legitimate workflows that grow over time. Once the user's workflow chain crosses the recursion threshold, every pyflyte run --remote against that execution crashes.
Reproducer
The pytest tests below pass on master @ 41a9f80. Drop into tests/flytekit/unit/remote/test_recursion_repro.py and run:
pytest tests/flytekit/unit/remote/test_recursion_repro.py -v
"""Reproducer: FlyteRemote.sync_node_execution unbounded recursion."""
from __future__ import annotations
import inspect
import sys
from unittest.mock import MagicMock, patch
import pytest
from flytekit.configuration import Config
from flytekit.remote.remote import FlyteRemote
@pytest.fixture
def remote():
with patch("flytekit.clients.friendly.SynchronousFlyteClient"):
flyte_remote = FlyteRemote(
config=Config.auto(),
default_project="p1",
default_domain="d1",
)
flyte_remote._client_initialized = True
flyte_remote._client = MagicMock()
return flyte_remote
def test_sync_node_execution_lacks_depth_or_visited_check(remote):
"""Static-analysis assertion: the recursive call sites lack any
depth/visited mechanism. Source-level evidence of the gap."""
sync_node_src = inspect.getsource(remote.sync_node_execution)
sync_exec_src = inspect.getsource(remote.sync_execution)
# Confirms the recursive call sites exist
assert "self.sync_execution(launched_exec" in sync_node_src
assert "self.sync_node_execution(" in sync_node_src
# No depth parameter or visited set
assert "depth" not in sync_node_src.lower() or "depth=" not in sync_node_src
assert "visited" not in sync_node_src.lower()
assert "depth" not in sync_exec_src.lower() or "max_depth" not in sync_exec_src
def test_recursion_limit_demonstration():
"""Synthetic mirror of the sync_node_execution call shape; crashes
with RecursionError when given a deeply-nested input graph at low
recursion limit. Shows the consequence of the missing bound."""
def synthetic_sync_node_execution(node, depth_observed=None):
if depth_observed is None:
depth_observed = []
depth_observed.append(len(depth_observed))
if node.get("launched_exec_id"):
for child in node["launched_exec_id"].get("nodes", []):
synthetic_sync_node_execution(child, depth_observed)
return depth_observed
def chain(n):
# ITERATIVE builder so we don't hit the recursion limit at
# construction time
node = {"launched_exec_id": None}
for _ in range(n):
node = {"launched_exec_id": {"nodes": [node]}}
return node
deep_node = chain(60)
original = sys.getrecursionlimit()
try:
sys.setrecursionlimit(50)
with pytest.raises(RecursionError):
synthetic_sync_node_execution(deep_node)
finally:
sys.setrecursionlimit(original)
Both tests pass on master @ 41a9f80:
tests/flytekit/unit/remote/test_recursion_repro.py::test_sync_node_execution_lacks_depth_or_visited_check PASSED
tests/flytekit/unit/remote/test_recursion_repro.py::test_recursion_limit_demonstration PASSED
The first test is source-level evidence. The second is a runtime demonstration of the call-shape — the synthetic mirror crashes under a low limit, mirroring what happens to flytekit users whose workflow graphs exceed sys.getrecursionlimit().
Suggested fix
Add a depth parameter that's incremented on every recursive call:
def sync_execution(
self,
execution: FlyteWorkflowExecution,
entity_definition: typing.Union[FlyteWorkflow, FlyteTask] = None,
sync_nodes: bool = False,
+ _depth: int = 0,
+ _max_depth: int = 50,
) -> FlyteWorkflowExecution:
+ if _depth > _max_depth:
+ raise FlyteAssertion(
+ f"Workflow execution {execution.id} has nesting deeper than "
+ f"_max_depth={_max_depth}; refusing to recurse further to "
+ f"avoid RecursionError."
+ )
...
if sync_nodes:
node_execs[n.id.node_id] = self.sync_node_execution(
- n, node_mapping
+ n, node_mapping, _depth=_depth + 1, _max_depth=_max_depth,
)
_max_depth=50 is a generous default — well above any reasonable user nesting and far below Python's default recursion limit. Users with legitimately-deep graphs can pass a larger value; everyone else gets a clear FlyteAssertion instead of a RecursionError.
A visited-set keyed on WorkflowExecutionIdentifier would catch genuine cycles separately (server-side cycles shouldn't reach the client, but defense-in-depth is cheap).
Environment
- flytekit:
master @ 41a9f8009128db1391697cec0441e924e7e27d68
- Python: 3.12, default recursion limit 1000
- Reproducer environment:
pip install -e . from flytekit checkout
Severity
Medium-Low reliability flaw, not a security finding. Per-client DoS, requires authenticated workflow-define rights, fix is small.
Summary
flytekit.remote.FlyteRemote.sync_node_executionrecursively syncs launched-LP executions and parent-node children with no depth limit, no visited-set, and no cycle detection. A workflow graph with deeply nested launched LPs (or, in principle, a server response producing a cycle) crashes the client withRecursionError.Affected lines (flytekit
master @ 41a9f8009128db1391697cec0441e924e7e27d68):remote.py:2683-2686— recursion intoself.sync_execution(launched_exec, sync_nodes=True)for nodes withclosure.workflow_node_metadata.remote.py:2716-2719— recursion intosync_node_executionfor dynamic-subworkflow children.remote.py:2730-2733— same for static subworkflow children.remote.py:2751-2754— same for branch-node children.None of these have a depth bound, visited-set, or cycle check. Note this is a different code path from #6147 (which was
tracker.py:_resolve_abs_module_name).Impact
A user with workflow-define permissions can author a workflow whose launched-LP nesting depth exceeds Python's default recursion limit (~1000) — or, more practically, whose nesting is deeper than
sys.getrecursionlimit()minus the call-stack overhead at the pointsync_node_executionis invoked. When ANY caller (CLI, web UI, programmatic) callsFlyteRemote.sync()on such an execution, the Python process terminates withRecursionError.This is a per-client DoS (not server-wide), reachable only by authenticated users with workflow-define rights. The author's own users (and themselves) are affected; multi-tenant Flyte deployments amplify the radius.
The realistic trigger isn't a cycle — the Flyte server should reject those during compile — but rather deeply nested launch plans in legitimate workflows that grow over time. Once the user's workflow chain crosses the recursion threshold, every
pyflyte run --remoteagainst that execution crashes.Reproducer
The pytest tests below pass on
master @ 41a9f80. Drop intotests/flytekit/unit/remote/test_recursion_repro.pyand run:Both tests pass on
master @ 41a9f80:The first test is source-level evidence. The second is a runtime demonstration of the call-shape — the synthetic mirror crashes under a low limit, mirroring what happens to flytekit users whose workflow graphs exceed
sys.getrecursionlimit().Suggested fix
Add a depth parameter that's incremented on every recursive call:
_max_depth=50is a generous default — well above any reasonable user nesting and far below Python's default recursion limit. Users with legitimately-deep graphs can pass a larger value; everyone else gets a clearFlyteAssertioninstead of aRecursionError.A visited-set keyed on
WorkflowExecutionIdentifierwould catch genuine cycles separately (server-side cycles shouldn't reach the client, but defense-in-depth is cheap).Environment
master @ 41a9f8009128db1391697cec0441e924e7e27d68pip install -e .from flytekit checkoutSeverity
Medium-Low reliability flaw, not a security finding. Per-client DoS, requires authenticated workflow-define rights, fix is small.