fix(agents): bound and cancel tool execution so hung tools fail loudly #1591
A hung agent tool (e.g. a stuck connector token call) left the Agent UI "thinking" forever: the base loop called tool(**args) with no timeout and process_query never observed any cancel signal, so the producer thread leaked. The consumer's 600s cap only broke the SSE loop, never the worker.

Add two composable, fail-loud guards in the base Agent (all agents benefit):

- Per-tool execution timeout in _execute_tool: the tool body runs in a daemon worker joined with a timeout; on timeout it raises ToolExecutionTimeout -> actionable error dict (names the tool and the GAIA_AGENT_TOOL_TIMEOUT knob). Default 180s via tool_execution_timeout() (mirrors default_max_steps: lazy env read, raises on invalid). New @tool(timeout=...) override; generate_image opts out at 900s since SD model download can take ~600s.
- Cooperative cancel: Agent._cancel_event checked at each step boundary. _stream_chat_response assigns a fresh event per request and sets it in _cleanup_stream, so the existing producer.join(5s) finally reaps the thread instead of logging "still running". Per-tool timeouts keep each step bounded, guaranteeing the boundary is reached.

Together a typical hung tool surfaces an error in <=180s (well under 600s), and the 600s cap now actually tears the producer down.

Tests: tests/unit/test_agent_tool_timeout.py covers bounded actionable error (not a 10-min hang), per-tool override, env parsing incl. invalid-raises, the SD 900s opt-out, and producer-not-outliving-request.
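The per-tool timeout guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: ToolTimeoutError, run_with_timeout, execute_tool, slow_tool and the dict keys are hypothetical names, and only the GAIA_AGENT_TOOL_TIMEOUT knob is taken from the commit message.

```python
import threading
import time


class ToolTimeoutError(Exception):
    """Hypothetical stand-in for the PR's ToolExecutionTimeout."""


def run_with_timeout(func, args, timeout_s, tool_name):
    """Run func(**args) in a daemon worker and stop waiting after timeout_s."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(**args)
        except Exception as exc:  # hand tool errors back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # Python cannot kill the worker; it is abandoned and keeps running.
        raise ToolTimeoutError(f"tool '{tool_name}' exceeded {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def execute_tool(func, args, timeout_s, tool_name):
    """Turn a timeout into an error dict so the agent loop can continue."""
    try:
        result = run_with_timeout(func, args, timeout_s, tool_name)
        return {"status": "ok", "result": result}
    except ToolTimeoutError as exc:
        hint = "raise GAIA_AGENT_TOOL_TIMEOUT if this tool is legitimately slow"
        return {"status": "error", "error": f"{exc}; {hint}"}


def slow_tool(seconds):
    """Hypothetical tool that simulates a hang by sleeping."""
    time.sleep(seconds)
    return "done"
```

Calling execute_tool(slow_tool, {"seconds": 2.0}, 0.1, "slow_tool") returns an error dict after about 0.1s, while the abandoned worker sleeps on in the background. That lingering worker is the limitation the review raises for tools that write to shared state.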
Summary

Solid, well-scoped fix for the hung-tool dead-end behind #1579: a per-tool execution timeout plus a cooperative cancel signal, both living in the base

Issues

🟡 Important: long-running index/summarize tools will hit the 180s cap and be abandoned mid-write
Two consequences for these: (1) the user gets a spurious timeout error on a perfectly valid operation, and (2) because the worker is only abandoned, not killed, the zombie thread keeps writing to the FAISS index / DB after the agent loop has moved on: concurrent writes to shared state, exactly the kind of corruption the daemon-thread approach risks for stateful tools. (The PR already documents the can't-kill-the-thread limit; the stateful-write angle is the part worth weighing.)

Please audit the long-running tools and either add explicit

@tool(
    name="index_directory",
    timeout=900,  # bulk ingest of a directory can legitimately run long
    ...
)

🟢 Minor: cancel signal is wired only on the streaming path
🟢 Minor: cancelled run returns a success-shaped final answer

On cancel, the loop sets

Strengths
Verdict

Request changes, only for the 🟡: make a conscious call on the long-running index/summarize tools (override or confirm 180s is fine) so this doesn't ship a silent regression for the RAG and code-index agents. The two 🟢 items are optional. Everything else is ready to merge.
The 180s default per-tool timeout would abandon legitimately long-running, stateful tools mid-write, corrupting the shared FAISS index / DB the same way the daemon-thread approach risks for any stateful tool.

Add explicit @tool(timeout=...) opt-outs for the bulk-ingest / summarize family that routinely runs past 180s:

- index_directory -> 900s (bulk directory ingest)
- index_codebase -> 900s (whole-repo embedding)
- index_document -> 600s (large PDF parse + chunk + embed)
- summarize_document -> 600s (iterative section summarization on local NPU)

Lock the opt-outs with a regression test that resolves each through the real registration path + _resolve_tool_timeout.
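How a per-tool override can take precedence over the global default is sketched below. The PR text does not show the real decorator or resolver, so the registry layout, this simplified tool decorator, resolve_timeout and the read_file tool are all assumptions made for illustration; only the 180s default and the 900s value for index_directory come from the commit message.

```python
DEFAULT_TIMEOUT_S = 180.0  # global default named in the PR

# Hypothetical registry: tool name -> {"func": callable, "timeout": float | None}
_REGISTRY = {}


def tool(func=None, *, timeout=None):
    """Register a tool; timeout=None means 'fall back to the global default'."""

    def register(f):
        _REGISTRY[f.__name__] = {"func": f, "timeout": timeout}
        return f

    # Support both the bare @tool form and @tool(timeout=...).
    return register(func) if func is not None else register


def resolve_timeout(name, default=DEFAULT_TIMEOUT_S):
    """Return the per-tool override when one was registered, else the default."""
    override = _REGISTRY[name]["timeout"]
    return float(override) if override is not None else default


@tool(timeout=900)
def index_directory(path):
    return f"indexed {path}"


@tool
def read_file(path):  # hypothetical short-running tool with no override
    return f"read {path}"
```

Resolving through the registry rather than reading a constant is what allows a regression test to assert on the value a tool would actually run with.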
Thanks, addressed in

🟡 Long-running index/summarize tools (fixed). Audited the tool surface and added explicit

Locked with a regression test that resolves each through the real registration path +

🟢 Cancel wired only on the streaming path (confirmed, no change needed). Streaming is the only path with the producer-thread + consumer-timeout split that can leak a thread. The non-streaming chat path (

🟢 Cancelled run returns a success-shaped answer (intentional, left as-is). The cancel branch targets teardown on stream-timeout/disconnect, where the consumer has already given up and discards the result, so the run returns the actionable string rather than a distinct status. Tagging adds little for the only caller today; happy to add a
Why this matters
Before: a hung agent tool (e.g. the stuck connector token call behind #1579) left the Agent UI stuck in "thinking" with no error, indefinitely. The base agent loop called tool(**args) with no timeout and process_query never observed any cancel signal, so the worker (threading.Thread(target=_run_agent)) leaked. The only backstop was the consumer-side 600s SSE timeout, which took a full 10 minutes and only broke the consumer loop, never the producer.

After: every tool call is bounded (default 180s), so a hung tool surfaces an actionable error well under 600s instead of an infinite hang, and the producer thread is actually torn down rather than leaked. #1589 fixed the specific connector root cause; this closes the systemic gap so the next tool hang can't reproduce the same dead-end UX.
Two composable, fail-loud guards live in the base Agent, so all agents benefit:

- Per-tool execution timeout (_execute_tool): the tool body runs in a daemon worker joined with the resolved timeout; on timeout it raises ToolExecutionTimeout → an actionable error dict naming the tool and the GAIA_AGENT_TOOL_TIMEOUT knob. Default 180s via tool_execution_timeout() (mirrors the existing default_max_steps(): lazy env read, raises on an invalid value). New @tool(timeout=...) override; generate_image opts out at 900s because SD model download can legitimately take ~600s.
- Cooperative cancel (Agent._cancel_event): checked at each step boundary in _process_query_impl. _stream_chat_response assigns a fresh event per request and sets it in _cleanup_stream, so the existing producer.join(5s) finally reaps the thread. The per-tool timeout keeps each step bounded, guaranteeing the cancel boundary is reached.

Known, documented limit: Python can't kill the timed-out worker thread, so a hung tool's worker keeps running until the process exits, but it's a daemon and no longer blocks the agent loop or leaks the producer.
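The cooperative-cancel half of the design can be illustrated with a threading.Event that is checked between steps. This is a sketch under assumptions: run_steps, demo and the result dict are hypothetical, and the distinct "cancelled" status is a choice made here for clarity (per the discussion above, the PR itself returns an actionable string instead).

```python
import threading


def run_steps(steps, cancel_event):
    """Run callables in order, checking the cancel flag at each step boundary."""
    completed = []
    for step in steps:
        if cancel_event.is_set():
            return {"status": "cancelled", "completed": completed}
        completed.append(step())
    return {"status": "done", "completed": completed}


def demo():
    """Simulate a consumer that gives up while the first step is running."""
    cancel_event = threading.Event()  # fresh event for this request
    result = {}

    def step_one():
        cancel_event.set()  # stands in for the cleanup hook setting the event
        return "one"

    def step_two():
        return "two"  # never runs: the boundary check fires first

    producer = threading.Thread(
        target=lambda: result.update(run_steps([step_one, step_two], cancel_event)),
        daemon=True,
    )
    producer.start()
    producer.join(5)  # bounded wait, as with the PR's producer.join(5s)
    return result, producer.is_alive()
```

The check only runs between steps, so it helps only if every step eventually returns. This is why the per-tool timeout and the cancel event are described as composable.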
Closes #1579.
Test plan
- python -m pytest tests/unit/test_agent_tool_timeout.py -q (14 tests): blocking tool → bounded actionable error in <5s (not 30s/600s); per-tool override beats global; tool_execution_timeout() parsing incl. invalid-raises; SD generate_image 900s opt-out resolves through _resolve_tool_timeout; producer-style thread does not outlive the request on mid-run cancel (asserts send_messages called once + thread dead).
- python -m pytest tests/unit/test_tool_decorator.py tests/unit/test_tool_registry_isolation.py tests/unit/agents/test_null_tool_name.py tests/unit/agents/test_tool_not_found_error.py tests/unit/chat/ui/test_chat_helpers.py -q: regression set, green.
- python util/lint.py --black --isort: clean on the changed files.
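For readers unfamiliar with the "lazy env read, raises on invalid" behavior the first test group exercises, one possible shape is sketched here. The function name read_tool_timeout, the injectable environ parameter, and the treatment of an empty string as unset are assumptions; the PR text only states the variable name, the 180s default, and that invalid values raise.

```python
import math
import os

ENV_VAR = "GAIA_AGENT_TOOL_TIMEOUT"
DEFAULT_TIMEOUT_S = 180.0


def read_tool_timeout(environ=None):
    """Read the timeout at call time (not import time) and fail loudly if invalid."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT_S  # assumption: empty is treated like unset
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(
            f"{ENV_VAR} must be a number of seconds, got {raw!r}"
        ) from None
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{ENV_VAR} must be a positive finite number, got {raw!r}")
    return value
```

Reading the variable inside the function rather than at module import means a test can change the environment between calls without reloading the module.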