[feat][hiearchical_swarm][heartbeat and worker timeout with reassignment] #1554

@kyegomez

Problem

Workers in HierarchicalSwarm that call external tools (MCP, browsers, shell) can hang indefinitely. There is no heartbeat, no timeout, and no reassignment — one stuck worker blocks the entire plan.

Proposed feature

  • Per-worker heartbeat interval (default 30s).
  • Per-subtask timeout (configurable on the swarm and overridable per subtask).
  • On timeout: cancel the worker's task, log the failure, and reassign the subtask to another worker (or mark it FAILED after N retries).
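
The issue does not spell out how the heartbeat would be tracked. One possible reading, sketched below, is that each worker periodically records a timestamp and the swarm treats the worker as hung once several intervals pass without a beat. The `Heartbeat` class, its `missed_beats` parameter, and the method names are hypothetical and not part of HierarchicalSwarm.

```python
import threading
import time
from typing import Optional


class Heartbeat:
    """Hypothetical helper: records when a worker last reported progress."""

    def __init__(self, interval: float = 30.0, missed_beats: int = 2) -> None:
        self.interval = interval          # seconds between expected beats (30s default from the proposal)
        self.missed_beats = missed_beats  # assumption: tolerate this many missed beats before flagging
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        """Called by the worker (for example between tool calls) to signal liveness."""
        with self._lock:
            self._last_beat = time.monotonic()

    def is_stale(self, now: Optional[float] = None) -> bool:
        """Return True if no beat arrived within interval * missed_beats seconds."""
        current = time.monotonic() if now is None else now
        with self._lock:
            return current - self._last_beat > self.interval * self.missed_beats
```

A monitor loop in the swarm could poll `is_stale()` for each active worker and trigger the timeout path when it returns True. `time.monotonic()` is used so that wall-clock adjustments do not produce false positives.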

Design sketch

  • HierarchicalSwarm.__init__ gains worker_timeout: Optional[int] = None and heartbeat_interval: int = 30.
  • Worker execution wrapped in a ThreadPoolExecutor future with .result(timeout=...).
  • On concurrent.futures.TimeoutError, the subtask goes back into the queue with an incremented retry count.
  • After max_retries (default 2), the subtask is marked FAILED and surfaced in the judge verdict.
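
The design sketch above could look roughly like the following. This is a minimal, sequential sketch under stated assumptions, not the actual HierarchicalSwarm code: `Subtask`, `run_subtasks`, the `Worker` callable signature, and the status strings are all hypothetical names chosen for illustration. Only `worker_timeout` and `max_retries` come from the issue.

```python
import concurrent.futures
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical worker signature: takes the subtask text, returns a result string.
Worker = Callable[[str], str]


@dataclass
class Subtask:
    description: str
    timeout: Optional[float] = None  # per-subtask override of the swarm-level timeout
    retries: int = 0
    status: str = "PENDING"
    result: Optional[str] = None


def run_subtasks(
    subtasks: List[Subtask],
    workers: List[Worker],
    worker_timeout: Optional[float] = None,
    max_retries: int = 2,
) -> List[Subtask]:
    """Run subtasks one at a time, requeueing any that exceed their timeout."""
    queue = deque(subtasks)
    while queue:
        task = queue.popleft()
        # Rotate by retry count so a retried subtask goes to a different worker
        # whenever more than one worker is available.
        worker = workers[task.retries % len(workers)]
        timeout = task.timeout if task.timeout is not None else worker_timeout
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(worker, task.description)
        try:
            task.result = future.result(timeout=timeout)
            task.status = "DONE"
        except concurrent.futures.TimeoutError:
            task.retries += 1
            if task.retries > max_retries:
                task.status = "FAILED"  # would be surfaced in the judge verdict
            else:
                queue.append(task)  # back into the queue for another worker
        finally:
            # Do not wait: a hung worker thread would otherwise block here.
            executor.shutdown(wait=False)
    return subtasks


def hung_worker(task: str) -> str:
    time.sleep(0.3)  # stands in for a tool call that hangs
    return "late"


def healthy_worker(task: str) -> str:
    return "ok:" + task


if __name__ == "__main__":
    done = run_subtasks([Subtask("fetch page")], [hung_worker, healthy_worker], worker_timeout=0.05)
    print(done[0].status, done[0].result, done[0].retries)
```

One caveat with this approach: `.result(timeout=...)` only stops waiting. It does not stop the underlying thread, and `Future.cancel()` has no effect on a call that is already running, so a truly hung worker keeps its thread until the tool call returns. Actually cancelling the work would need cooperative cancellation inside the worker or process-based isolation. Exceptions raised by a worker are also not handled here and would propagate out of `future.result()`.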

Files

  • swarms/structs/hiearchical_swarm.py

Why

Production deployments routinely hit hung workers when tools misbehave. Without timeouts, one hung worker stalls the entire swarm run.
