Skip to content

[BUG] FlyteRemote.sync_node_execution: unbounded recursion (no depth check, no cycle detection) #7338

@Lougarou

Description

@Lougarou

Summary

flytekit.remote.FlyteRemote.sync_node_execution recursively syncs launched-LP executions and parent-node children with no depth limit, no visited-set, and no cycle detection. A workflow graph with deeply nested launched LPs (or, in principle, a server response producing a cycle) crashes the client with RecursionError.

Affected lines (flytekit master @ 41a9f8009128db1391697cec0441e924e7e27d68):

None of these have a depth bound, visited-set, or cycle check. Note this is a different code path from #6147 (which was tracker.py:_resolve_abs_module_name).

Impact

A user with workflow-define permissions can author a workflow whose launched-LP nesting depth exceeds Python's default recursion limit (~1000) — or, more practically, whose nesting is deeper than sys.getrecursionlimit() minus the call-stack overhead at the point sync_node_execution is invoked. When ANY caller (CLI, web UI, programmatic) calls FlyteRemote.sync() on such an execution, the Python process terminates with RecursionError.

This is a per-client DoS (not server-wide), reachable only by authenticated users with workflow-define rights. The author's own users (and themselves) are affected; multi-tenant Flyte deployments amplify the radius.

The realistic trigger isn't a cycle — the Flyte server should reject those during compile — but rather deeply nested launch plans in legitimate workflows that grow over time. Once the user's workflow chain crosses the recursion threshold, every pyflyte run --remote against that execution crashes.

Reproducer

The pytest tests below pass on master @ 41a9f80. Drop into tests/flytekit/unit/remote/test_recursion_repro.py and run:

pytest tests/flytekit/unit/remote/test_recursion_repro.py -v
"""Reproducer: FlyteRemote.sync_node_execution unbounded recursion."""
from __future__ import annotations

import inspect
import sys
from unittest.mock import MagicMock, patch

import pytest

from flytekit.configuration import Config
from flytekit.remote.remote import FlyteRemote


@pytest.fixture
def remote():
    with patch("flytekit.clients.friendly.SynchronousFlyteClient"):
        flyte_remote = FlyteRemote(
            config=Config.auto(),
            default_project="p1",
            default_domain="d1",
        )
        flyte_remote._client_initialized = True
        flyte_remote._client = MagicMock()
        return flyte_remote


def test_sync_node_execution_lacks_depth_or_visited_check(remote):
    """Static-analysis assertion: the recursive call sites lack any
    depth/visited mechanism. Source-level evidence of the gap."""
    sync_node_src = inspect.getsource(remote.sync_node_execution)
    sync_exec_src = inspect.getsource(remote.sync_execution)

    # Confirms the recursive call sites exist
    assert "self.sync_execution(launched_exec" in sync_node_src
    assert "self.sync_node_execution(" in sync_node_src

    # No depth parameter or visited set
    assert "depth" not in sync_node_src.lower() or "depth=" not in sync_node_src
    assert "visited" not in sync_node_src.lower()
    assert "depth" not in sync_exec_src.lower() or "max_depth" not in sync_exec_src


def test_recursion_limit_demonstration():
    """Synthetic mirror of the sync_node_execution call shape; crashes
    with RecursionError when given a deeply-nested input graph at low
    recursion limit. Shows the consequence of the missing bound."""
    def synthetic_sync_node_execution(node, depth_observed=None):
        if depth_observed is None:
            depth_observed = []
        depth_observed.append(len(depth_observed))
        if node.get("launched_exec_id"):
            for child in node["launched_exec_id"].get("nodes", []):
                synthetic_sync_node_execution(child, depth_observed)
        return depth_observed

    def chain(n):
        # ITERATIVE builder so we don't hit the recursion limit at
        # construction time
        node = {"launched_exec_id": None}
        for _ in range(n):
            node = {"launched_exec_id": {"nodes": [node]}}
        return node

    deep_node = chain(60)
    original = sys.getrecursionlimit()
    try:
        sys.setrecursionlimit(50)
        with pytest.raises(RecursionError):
            synthetic_sync_node_execution(deep_node)
    finally:
        sys.setrecursionlimit(original)

Both tests pass on master @ 41a9f80:

tests/flytekit/unit/remote/test_recursion_repro.py::test_sync_node_execution_lacks_depth_or_visited_check PASSED
tests/flytekit/unit/remote/test_recursion_repro.py::test_recursion_limit_demonstration PASSED

The first test is source-level evidence. The second is a runtime demonstration of the call-shape — the synthetic mirror crashes under a low limit, mirroring what happens to flytekit users whose workflow graphs exceed sys.getrecursionlimit().

Suggested fix

Add a depth parameter that's incremented on every recursive call:

def sync_execution(
    self,
    execution: FlyteWorkflowExecution,
    entity_definition: typing.Union[FlyteWorkflow, FlyteTask] = None,
    sync_nodes: bool = False,
+   _depth: int = 0,
+   _max_depth: int = 50,
) -> FlyteWorkflowExecution:
+   if _depth > _max_depth:
+       raise FlyteAssertion(
+           f"Workflow execution {execution.id} has nesting deeper than "
+           f"_max_depth={_max_depth}; refusing to recurse further to "
+           f"avoid RecursionError."
+       )
    ...
    if sync_nodes:
        node_execs[n.id.node_id] = self.sync_node_execution(
-           n, node_mapping
+           n, node_mapping, _depth=_depth + 1, _max_depth=_max_depth,
        )

_max_depth=50 is a generous default — well above any reasonable user nesting and far below Python's default recursion limit. Users with legitimately-deep graphs can pass a larger value; everyone else gets a clear FlyteAssertion instead of a RecursionError.

A visited-set keyed on WorkflowExecutionIdentifier would catch genuine cycles separately (server-side cycles shouldn't reach the client, but defense-in-depth is cheap).

Environment

  • flytekit: master @ 41a9f8009128db1391697cec0441e924e7e27d68
  • Python: 3.12, default recursion limit 1000
  • Reproducer environment: pip install -e . from flytekit checkout

Severity

Medium-Low reliability flaw, not a security finding. Per-client DoS, requires authenticated workflow-define rights, fix is small.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions