Skip to content

fix(agents): bound and cancel tool execution so hung tools fail loudly#1591

Merged
itomek merged 2 commits into
mainfrom
tmi/quirky-moore-c021cf
Jun 11, 2026
Merged

fix(agents): bound and cancel tool execution so hung tools fail loudly#1591
itomek merged 2 commits into
mainfrom
tmi/quirky-moore-c021cf

Conversation

@itomek

@itomek itomek commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Why this matters

Before: a hung agent tool — e.g. the stuck connector token call behind #1579 — left the Agent UI stuck in "thinking" with no error, indefinitely. The base agent loop called tool(**args) with no timeout and process_query never observed any cancel signal, so the worker (threading.Thread(target=_run_agent)) leaked. The only backstop was the consumer-side 600s SSE timeout, which took a full 10 minutes and only broke the consumer loop, never the producer.

After: every tool call is bounded (default 180s), so a hung tool surfaces an actionable error well under 600s instead of an infinite hang — and the producer thread is actually torn down rather than leaked. #1589 fixed the specific connector root cause; this closes the systemic gap so the next tool hang can't reproduce the same dead-end UX.

Two composable, fail-loud guards live in the base Agent, so all agents benefit:

  • Per-tool execution timeout (_execute_tool): the tool body runs in a daemon worker joined with the resolved timeout; on timeout it raises ToolExecutionTimeout → an actionable error dict naming the tool and the GAIA_AGENT_TOOL_TIMEOUT knob. Default 180s via tool_execution_timeout() (mirrors the existing default_max_steps() — lazy env read, raises on an invalid value). New @tool(timeout=...) override; generate_image opts out at 900s because SD model download can legitimately take ~600s.
  • Cooperative cancel (Agent._cancel_event): checked at each step boundary in _process_query_impl. _stream_chat_response assigns a fresh event per request and sets it in _cleanup_stream, so the existing producer.join(5s) finally reaps the thread. The per-tool timeout keeps each step bounded, guaranteeing the cancel boundary is reached.

Known, documented limit: Python can't kill the timed-out worker thread, so a hung tool's worker keeps running until the process exits — but it's a daemon and no longer blocks the agent loop or leaks the producer.

Closes #1579.

Test plan

  • python -m pytest tests/unit/test_agent_tool_timeout.py -q — 14 tests: blocking tool → bounded actionable error in <5s (not 30s/600s); per-tool override beats global; tool_execution_timeout() parsing incl. invalid-raises; SD generate_image 900s opt-out resolves through _resolve_tool_timeout; producer-style thread does not outlive the request on mid-run cancel (asserts send_messages called once + thread dead).
  • python -m pytest tests/unit/test_tool_decorator.py tests/unit/test_tool_registry_isolation.py tests/unit/agents/test_null_tool_name.py tests/unit/agents/test_tool_not_found_error.py tests/unit/chat/ui/test_chat_helpers.py -q — regression set, green.
  • python util/lint.py --black --isort — clean on the changed files.

A hung agent tool (e.g. a stuck connector token call) left the Agent UI
"thinking" forever: the base loop called tool(**args) with no timeout and
process_query never observed any cancel signal, so the producer thread
leaked. The consumer's 600s cap only broke the SSE loop, never the worker.

Add two composable, fail-loud guards in the base Agent (all agents benefit):

- Per-tool execution timeout in _execute_tool: the tool body runs in a
  daemon worker joined with a timeout; on timeout it raises
  ToolExecutionTimeout -> actionable error dict (names the tool and the
  GAIA_AGENT_TOOL_TIMEOUT knob). Default 180s via tool_execution_timeout()
  (mirrors default_max_steps: lazy env read, raises on invalid). New
  @tool(timeout=...) override; generate_image opts out at 900s since SD
  model download can take ~600s.

- Cooperative cancel: Agent._cancel_event checked at each step boundary.
  _stream_chat_response assigns a fresh event per request and sets it in
  _cleanup_stream, so the existing producer.join(5s) finally reaps the
  thread instead of logging "still running". Per-tool timeouts keep each
  step bounded, guaranteeing the boundary is reached.

Together a typical hung tool surfaces an error in <=180s (well under 600s),
and the 600s cap now actually tears the producer down.

Tests: tests/unit/test_agent_tool_timeout.py covers bounded actionable
error (not a 10-min hang), per-tool override, env parsing incl.
invalid-raises, the SD 900s opt-out, and producer-not-outliving-request.
@github-actions github-actions Bot added documentation Documentation changes tests Test changes agents labels Jun 11, 2026
@itomek itomek self-assigned this Jun 11, 2026
@itomek itomek marked this pull request as ready for review June 11, 2026 15:27
@itomek itomek requested a review from kovtcharov-amd as a code owner June 11, 2026 15:27
@itomek itomek enabled auto-merge June 11, 2026 15:29
@github-actions

Copy link
Copy Markdown
Contributor

Summary

Solid, well-scoped fix for the hung-tool dead-end behind #1579: a per-tool execution timeout plus a cooperative cancel signal, both living in the base Agent so every agent benefits. The design mirrors the existing default_max_steps() pattern (lazy env read, fail-loud on bad input), is fail-loud throughout, and ships with a thorough 14-test suite and CLI docs. The one thing worth a decision before merge: the 180s default will now abort legitimately long-running tools that didn't get a @tool(timeout=...) opt-out — generate_image was handled, but RAG/code indexing and document summarization were not, and they can run well past 180s.

Issues

🟡 Important — long-running index/summarize tools will hit the 180s cap and be abandoned mid-write

generate_image correctly opts out at 900s, but several other tools routinely run longer than the new 180s default and got no override:

  • index_directory (src/gaia/agents/tools/rag_tools.py:1943) loops over every supported file in a directory with no internal bound — a folder of PDFs easily exceeds 180s.
  • index_codebase (src/gaia/agents/tools/code_index_tools.py:99) embeds a whole repo; large repos take minutes.
  • summarize_document (src/gaia/agents/tools/rag_tools.py:1450) over a large doc on a local NPU can exceed 180s.

Two consequences for these: (1) the user gets a spurious timeout error on a perfectly valid operation, and (2) because the worker is only abandoned, not killed, the zombie thread keeps writing to the FAISS index / DB after the agent loop has moved on — concurrent writes to shared state, exactly the kind of corruption the daemon-thread approach risks for stateful tools. (The PR already documents the can't-kill-the-thread limit; the stateful-write angle is the part worth weighing.)

Please audit the long-running tools and either add explicit @tool(timeout=...) overrides where the duration is legitimate, or confirm 180s is acceptable for them. Same one-line pattern already used for SD:

@tool(
    name="index_directory",
    timeout=900,  # bulk ingest of a directory can legitimately run long
    ...
)

🟢 Minor — cancel signal is wired only on the streaming path

agent._cancel_event is assigned in _stream_chat_response (src/gaia/ui/_chat_helpers.py:1832). If any non-streaming agent run can leak a producer thread the same way, it won't observe the cancel. If streaming is the only producer-thread path today, ignore this — worth a one-line confirmation in the PR thread.

🟢 Minor — cancelled run returns a success-shaped final answer

On cancel, the loop sets final_answer to a user-facing string and breaks (src/gaia/agents/base/agent.py:2614), so the run returns as a normal answer rather than a cancelled/error status. Fine for the teardown case it targets, but a caller inspecting the result can't distinguish "completed" from "cancelled." Consider tagging the result so downstream can tell them apart.

Strengths

  • Pattern-consistent and fail-loud: tool_execution_timeout() mirrors default_max_steps() exactly (lazy env read, raises on present-but-invalid, no silent fallback), and the timeout error names the tool, the window, and the GAIA_AGENT_TOOL_TIMEOUT knob — textbook actionable error.
  • Clean composition: per-tool @tool(timeout=...) override beating the global default is the right layering; confirmation-gated tools still confirm before the bounded call, so that guard is untouched.
  • Genuinely good test coverage: bounded-in-<5s assertions (not just "eventually"), invalid-env parametrization, per-tool-beats-global, the SD 900s opt-out resolved through the real _resolve_tool_timeout, and producer-teardown verified by thread liveness + a single send_messages call.

Verdict

Request changes — only for the 🟡: make a conscious call on the long-running index/summarize tools (override or confirm 180s is fine) so this doesn't ship a silent regression for the RAG and code-index agents. The two 🟢 items are optional. Everything else is ready to merge.

The 180s default per-tool timeout would abandon legitimately long-running,
stateful tools mid-write — corrupting the shared FAISS index / DB the same
way the daemon-thread approach risks for any stateful tool. Add explicit
@tool(timeout=...) opt-outs for the bulk-ingest / summarize family that
routinely runs past 180s:

- index_directory  -> 900s (bulk directory ingest)
- index_codebase   -> 900s (whole-repo embedding)
- index_document   -> 600s (large PDF parse + chunk + embed)
- summarize_document -> 600s (iterative section summarization on local NPU)

Lock the opt-outs with a regression test that resolves each through the
real registration path + _resolve_tool_timeout.
@itomek

itomek commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks — addressed in 10a1a564.

🟡 Long-running index/summarize tools (fixed). Audited the tool surface and added explicit @tool(timeout=...) opt-outs for the bulk-ingest / summarize family that legitimately runs past 180s and writes to shared FAISS/DB state, so a valid operation is never abandoned mid-write:

  • index_directory → 900s
  • index_codebase → 900s
  • index_document → 600s (large PDF parse + chunk + embed)
  • summarize_document → 600s

Locked with a regression test that resolves each through the real registration path + _resolve_tool_timeout. The 180s default now only bites genuinely quick-by-nature tools (the hung-connector case #1579 targets). Everything else (fast query/search/status tools) stays bounded.

🟢 Cancel wired only on the streaming path (confirmed, no change needed). Streaming is the only path with the producer-thread + consumer-timeout split that can leak a thread. The non-streaming chat path (_chat_helpers.py:1337) and the autonomous loop (agent_loop.py) call process_query directly and await it — no producer/consumer split, nothing to leak. And the per-tool timeout (layer 1) is universal — it lives in the base Agent, so a hung tool is bounded in every path (CLI, streaming, non-streaming, autonomous); only the additional cooperative cancel is streaming-specific, which is exactly where it's needed.

🟢 Cancelled run returns a success-shaped answer (intentional, left as-is). The cancel branch targets teardown on stream-timeout/disconnect, where the consumer has already given up and discards the result — so the run returns the actionable string rather than a distinct status. Tagging adds little for the only caller today; happy to add a status: "cancelled" field in a follow-up if a downstream consumer needs to distinguish it.

@itomek itomek added this pull request to the merge queue Jun 11, 2026
Merged via the queue into main with commit 8cd5c5c Jun 11, 2026
48 checks passed
@itomek itomek deleted the tmi/quirky-moore-c021cf branch June 11, 2026 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents documentation Documentation changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Email Triage agent waits indefinitely without generating any output

2 participants