Problem
Workers in HierarchicalSwarm that call external tools (MCP, browsers, shell) can hang indefinitely. There is no heartbeat, no timeout, and no reassignment — one stuck worker blocks the entire plan.
Proposed feature
- Per-worker heartbeat interval (default 30s).
- Per-subtask timeout (configurable on the swarm and overridable per subtask).
- On timeout: cancel the worker's task, log the failure, and reassign the subtask to another worker (or mark it FAILED after N retries).
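One way the heartbeat bullet above could work is sketched here. The `Heartbeat` class and its `missed_allowed` parameter are hypothetical names for illustration, not part of the existing swarms API. The idea is that a worker calls `beat()` between tool calls, and a supervisor treats the worker as hung once no beat has arrived for a few intervals. A worker blocked inside a tool call cannot beat, which is exactly the signal the supervisor looks for.

```python
import threading
import time
from typing import Optional


class Heartbeat:
    """Hypothetical helper: records the last time a worker reported progress."""

    def __init__(self, interval: float = 30.0, missed_allowed: int = 2) -> None:
        self.interval = interval              # expected seconds between beats
        self.missed_allowed = missed_allowed  # assumption: tolerate this many missed beats
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        """Called by the worker whenever it makes progress."""
        with self._lock:
            self._last_beat = time.monotonic()

    def is_stale(self, now: Optional[float] = None) -> bool:
        """Called by the supervisor; True if the worker has gone quiet for too long."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return now - self._last_beat > self.interval * self.missed_allowed
```

`time.monotonic()` is used rather than wall-clock time so that system clock adjustments cannot make a healthy worker look stale.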
Design sketch
- HierarchicalSwarm.__init__ gains worker_timeout: Optional[int] = None and heartbeat_interval: int = 30.
- Worker execution wrapped in a ThreadPoolExecutor future with .result(timeout=...).
- On concurrent.futures.TimeoutError, the subtask goes back into the queue with an incremented retry count.
- After max_retries (default 2), the subtask is marked FAILED and surfaced in the judge verdict.
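The timeout and retry flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `run_with_timeouts`, the `subtasks` mapping, and the `("DONE" | "FAILED", result)` tuples are hypothetical, and subtasks run one at a time to keep the sketch short. One caveat worth noting: `Future.cancel()` has no effect on a call that is already running, so a hung worker thread keeps running in the background after the timeout. Stopping it for real would need cooperative cancellation or a process-based worker.

```python
import concurrent.futures
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple


def run_with_timeouts(
    subtasks: Dict[str, Callable[[], str]],
    worker_timeout: Optional[float] = None,
    max_retries: int = 2,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Run each subtask with a timeout; requeue on timeout, FAILED after max_retries."""
    # Queue entries are (subtask name, retries used so far).
    queue: Deque[Tuple[str, int]] = deque((name, 0) for name in subtasks)
    results: Dict[str, Tuple[str, Optional[str]]] = {}
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
    try:
        while queue:
            name, retries = queue.popleft()
            future = executor.submit(subtasks[name])
            try:
                results[name] = ("DONE", future.result(timeout=worker_timeout))
            except concurrent.futures.TimeoutError:
                # cancel() only helps if the call has not started yet;
                # a running thread cannot be interrupted this way.
                future.cancel()
                if retries < max_retries:
                    queue.append((name, retries + 1))  # back into the queue
                else:
                    results[name] = ("FAILED", None)   # surfaced to the caller
    finally:
        # Do not block on threads that may still be hung.
        executor.shutdown(wait=False)
    return results
```

With `max_retries=2`, a subtask that always times out is attempted three times in total (one initial attempt plus two retries) before it is marked FAILED. Exceptions raised by a worker are not handled here and propagate out of `future.result()`.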
Files
swarms/structs/hiearchical_swarm.py
Why
Production deployments routinely hit hung workers when tools misbehave. Without timeouts, one bad worker kills the entire swarm run.