fix(agents): bound and cancel tool execution so hung tools fail loudly by itomek · Pull Request #1591 · amd/gaia

itomek · 2026-06-11T15:23:16Z

Why this matters

Before: a hung agent tool — e.g. the stuck connector token call behind #1579 — left the Agent UI stuck in "thinking" with no error, indefinitely. The base agent loop called tool(**args) with no timeout and process_query never observed any cancel signal, so the worker (threading.Thread(target=_run_agent)) leaked. The only backstop was the consumer-side 600s SSE timeout, which took a full 10 minutes and only broke the consumer loop, never the producer.

After: every tool call is bounded (default 180s), so a hung tool surfaces an actionable error well under 600s instead of an infinite hang — and the producer thread is actually torn down rather than leaked. #1589 fixed the specific connector root cause; this closes the systemic gap so the next tool hang can't reproduce the same dead-end UX.

Two composable, fail-loud guards live in the base Agent, so all agents benefit:

Per-tool execution timeout (_execute_tool): the tool body runs in a daemon worker joined with the resolved timeout; on timeout it raises ToolExecutionTimeout → an actionable error dict naming the tool and the GAIA_AGENT_TOOL_TIMEOUT knob. Default 180s via tool_execution_timeout() (mirrors the existing default_max_steps() — lazy env read, raises on an invalid value). New @tool(timeout=...) override; generate_image opts out at 900s because SD model download can legitimately take ~600s.
Cooperative cancel (Agent._cancel_event): checked at each step boundary in _process_query_impl. _stream_chat_response assigns a fresh event per request and sets it in _cleanup_stream, so the existing producer.join(5s) finally reaps the thread. The per-tool timeout keeps each step bounded, guaranteeing the cancel boundary is reached.

Known, documented limit: Python can't kill the timed-out worker thread, so a hung tool's worker keeps running until the process exits — but it's a daemon and no longer blocks the agent loop or leaks the producer.

Closes #1579.

Test plan

python -m pytest tests/unit/test_agent_tool_timeout.py -q — 14 tests: blocking tool → bounded actionable error in <5s (not 30s/600s); per-tool override beats global; tool_execution_timeout() parsing incl. invalid-raises; SD generate_image 900s opt-out resolves through _resolve_tool_timeout; producer-style thread does not outlive the request on mid-run cancel (asserts send_messages called once + thread dead).
python -m pytest tests/unit/test_tool_decorator.py tests/unit/test_tool_registry_isolation.py tests/unit/agents/test_null_tool_name.py tests/unit/agents/test_tool_not_found_error.py tests/unit/chat/ui/test_chat_helpers.py -q — regression set, green.
python util/lint.py --black --isort — clean on the changed files.

@tool

A hung agent tool (e.g. a stuck connector token call) left the Agent UI "thinking" forever: the base loop called tool(**args) with no timeout and process_query never observed any cancel signal, so the producer thread leaked. The consumer's 600s cap only broke the SSE loop, never the worker. Add two composable, fail-loud guards in the base Agent (all agents benefit): - Per-tool execution timeout in _execute_tool: the tool body runs in a daemon worker joined with a timeout; on timeout it raises ToolExecutionTimeout -> actionable error dict (names the tool and the GAIA_AGENT_TOOL_TIMEOUT knob). Default 180s via tool_execution_timeout() (mirrors default_max_steps: lazy env read, raises on invalid). New @tool(timeout=...) override; generate_image opts out at 900s since SD model download can take ~600s. - Cooperative cancel: Agent._cancel_event checked at each step boundary. _stream_chat_response assigns a fresh event per request and sets it in _cleanup_stream, so the existing producer.join(5s) finally reaps the thread instead of logging "still running". Per-tool timeouts keep each step bounded, guaranteeing the boundary is reached. Together a typical hung tool surfaces an error in <=180s (well under 600s), and the 600s cap now actually tears the producer down. Tests: tests/unit/test_agent_tool_timeout.py covers bounded actionable error (not a 10-min hang), per-tool override, env parsing incl. invalid-raises, the SD 900s opt-out, and producer-not-outliving-request.

github-actions · 2026-06-11T15:31:05Z

Summary

Solid, well-scoped fix for the hung-tool dead-end behind #1579: a per-tool execution timeout plus a cooperative cancel signal, both living in the base Agent so every agent benefits. The design mirrors the existing default_max_steps() pattern (lazy env read, fail-loud on bad input), is fail-loud throughout, and ships with a thorough 14-test suite and CLI docs. The one thing worth a decision before merge: the 180s default will now abort legitimately long-running tools that didn't get a @tool(timeout=...) opt-out — generate_image was handled, but RAG/code indexing and document summarization were not, and they can run well past 180s.

Issues

🟡 Important — long-running index/summarize tools will hit the 180s cap and be abandoned mid-write

generate_image correctly opts out at 900s, but several other tools routinely run longer than the new 180s default and got no override:

index_directory (src/gaia/agents/tools/rag_tools.py:1943) loops over every supported file in a directory with no internal bound — a folder of PDFs easily exceeds 180s.
index_codebase (src/gaia/agents/tools/code_index_tools.py:99) embeds a whole repo; large repos take minutes.
summarize_document (src/gaia/agents/tools/rag_tools.py:1450) over a large doc on a local NPU can exceed 180s.

Two consequences for these: (1) the user gets a spurious timeout error on a perfectly valid operation, and (2) because the worker is only abandoned, not killed, the zombie thread keeps writing to the FAISS index / DB after the agent loop has moved on — concurrent writes to shared state, exactly the kind of corruption the daemon-thread approach risks for stateful tools. (The PR already documents the can't-kill-the-thread limit; the stateful-write angle is the part worth weighing.)

Please audit the long-running tools and either add explicit @tool(timeout=...) overrides where the duration is legitimate, or confirm 180s is acceptable for them. Same one-line pattern already used for SD:

@tool(
    name="index_directory",
    timeout=900,  # bulk ingest of a directory can legitimately run long
    ...
)

🟢 Minor — cancel signal is wired only on the streaming path

agent._cancel_event is assigned in _stream_chat_response (src/gaia/ui/_chat_helpers.py:1832). If any non-streaming agent run can leak a producer thread the same way, it won't observe the cancel. If streaming is the only producer-thread path today, ignore this — worth a one-line confirmation in the PR thread.

🟢 Minor — cancelled run returns a success-shaped final answer

On cancel, the loop sets final_answer to a user-facing string and breaks (src/gaia/agents/base/agent.py:2614), so the run returns as a normal answer rather than a cancelled/error status. Fine for the teardown case it targets, but a caller inspecting the result can't distinguish "completed" from "cancelled." Consider tagging the result so downstream can tell them apart.

Strengths

Pattern-consistent and fail-loud: tool_execution_timeout() mirrors default_max_steps() exactly (lazy env read, raises on present-but-invalid, no silent fallback), and the timeout error names the tool, the window, and the GAIA_AGENT_TOOL_TIMEOUT knob — textbook actionable error.
Clean composition: per-tool @tool(timeout=...) override beating the global default is the right layering; confirmation-gated tools still confirm before the bounded call, so that guard is untouched.
Genuinely good test coverage: bounded-in-<5s assertions (not just "eventually"), invalid-env parametrization, per-tool-beats-global, the SD 900s opt-out resolved through the real _resolve_tool_timeout, and producer-teardown verified by thread liveness + a single send_messages call.

Verdict

Request changes — only for the 🟡: make a conscious call on the long-running index/summarize tools (override or confirm 180s is fine) so this doesn't ship a silent regression for the RAG and code-index agents. The two 🟢 items are optional. Everything else is ready to merge.

@tool

The 180s default per-tool timeout would abandon legitimately long-running, stateful tools mid-write — corrupting the shared FAISS index / DB the same way the daemon-thread approach risks for any stateful tool. Add explicit @tool(timeout=...) opt-outs for the bulk-ingest / summarize family that routinely runs past 180s: - index_directory -> 900s (bulk directory ingest) - index_codebase -> 900s (whole-repo embedding) - index_document -> 600s (large PDF parse + chunk + embed) - summarize_document -> 600s (iterative section summarization on local NPU) Lock the opt-outs with a regression test that resolves each through the real registration path + _resolve_tool_timeout.

itomek · 2026-06-11T15:36:18Z

Thanks — addressed in 10a1a564.

🟡 Long-running index/summarize tools (fixed). Audited the tool surface and added explicit @tool(timeout=...) opt-outs for the bulk-ingest / summarize family that legitimately runs past 180s and writes to shared FAISS/DB state, so a valid operation is never abandoned mid-write:

index_directory → 900s
index_codebase → 900s
index_document → 600s (large PDF parse + chunk + embed)
summarize_document → 600s

Locked with a regression test that resolves each through the real registration path + _resolve_tool_timeout. The 180s default now only bites genuinely quick-by-nature tools (the hung-connector case #1579 targets). Everything else (fast query/search/status tools) stays bounded.

🟢 Cancel wired only on the streaming path (confirmed, no change needed). Streaming is the only path with the producer-thread + consumer-timeout split that can leak a thread. The non-streaming chat path (_chat_helpers.py:1337) and the autonomous loop (agent_loop.py) call process_query directly and await it — no producer/consumer split, nothing to leak. And the per-tool timeout (layer 1) is universal — it lives in the base Agent, so a hung tool is bounded in every path (CLI, streaming, non-streaming, autonomous); only the additional cooperative cancel is streaming-specific, which is exactly where it's needed.

🟢 Cancelled run returns a success-shaped answer (intentional, left as-is). The cancel branch targets teardown on stream-timeout/disconnect, where the consumer has already given up and discards the result — so the run returns the actionable string rather than a distinct status. Tagging adds little for the only caller today; happy to add a status: "cancelled" field in a follow-up if a downstream consumer needs to distinguish it.

github-actions Bot added documentation Documentation changes tests Test changes agents labels Jun 11, 2026

itomek self-assigned this Jun 11, 2026

itomek marked this pull request as ready for review June 11, 2026 15:27

itomek requested a review from kovtcharov-amd as a code owner June 11, 2026 15:27

itomek added this to the v0.21.0 - Email Agent & Platform Foundations milestone Jun 11, 2026

itomek enabled auto-merge June 11, 2026 15:29

kovtcharov-amd approved these changes Jun 11, 2026

View reviewed changes

itomek added this pull request to the merge queue Jun 11, 2026

Merged via the queue into main with commit 8cd5c5c Jun 11, 2026
48 checks passed

itomek deleted the tmi/quirky-moore-c021cf branch June 11, 2026 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): bound and cancel tool execution so hung tools fail loudly#1591

fix(agents): bound and cancel tool execution so hung tools fail loudly#1591
itomek merged 2 commits into
mainfrom
tmi/quirky-moore-c021cf

itomek commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

itomek commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

itomek commented Jun 11, 2026

Why this matters

Test plan

Uh oh!

github-actions Bot commented Jun 11, 2026

Summary

Issues

🟡 Important — long-running index/summarize tools will hit the 180s cap and be abandoned mid-write

🟢 Minor — cancel signal is wired only on the streaming path

🟢 Minor — cancelled run returns a success-shaped final answer

Strengths

Verdict

Uh oh!

itomek commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants