Skip to content

fix(agents): early dedup for repeated mutation tool calls#1557

Merged
itomek merged 4 commits into
mainfrom
autofix/issue-1317
Jun 11, 2026
Merged

fix(agents): early dedup for repeated mutation tool calls#1557
itomek merged 4 commits into
mainfrom
autofix/issue-1317

Conversation

@kovtcharov-amd

Copy link
Copy Markdown
Collaborator

The base agent caught repeated query tool calls at the first repeat but had no equivalent for mutation tools. Before: a small model that re-issued an identical mutation (e.g. mark_read on an already-read id) burned ~4 planning steps before the reactive loop-detector halted it, and could drop the last item in a mutation sequence. After: an identical mutation re-issue is caught at the first repeat and the agent gets a corrective re-plan signal, at parity with query dedup. The cache is keyed on (tool, normalized args) so mutations on different ids are never suppressed, and the signal is injected after execution — no call is silently dropped.

Closes #1317.

Test plan

  • pytest tests/unit/agents/test_mutation_dedup.py — 6 tests (first-repeat detection, distinct-ids not suppressed, arg-order normalization, batch variants, non-mutation no-op, unhashable-args fallback)
  • python util/lint.py --all
  • Required before merge — base agent loop is LLM-affecting: run gaia eval agent on the relevant category and --compare against the committed baseline; confirm no regression (acceptance criterion in fix(agents): early dedup for repeated mutation tool calls #1317; not optional for base-agent changes)

@github-actions github-actions Bot added tests Test changes agents labels Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review: early dedup for repeated mutation tool calls

Summary

Solid, well-scoped fix that brings mutation tools to parity with the existing query-dedup signal. The core insight — key mutations on (tool, normalized args) rather than the result hash, because re-issuing mark_read on an already-read id returns a different result — is correct and clearly documented. No silent drops: the mutation still executes and a re-plan nudge is injected, consistent with GAIA's fail-loudly philosophy. The code change is small and the test coverage is deterministic and thoughtful.

The single most important thing: this touches the base agent loop, which CLAUDE.md classifies as LLM-affecting. The PR's own test plan lists gaia eval agent --compare as required before merge and it's still unchecked. That eval is the merge gate here — the unit tests pass code paths but won't catch a behavioral regression in the loop.

Issues Found

🟡 Eval run is a hard merge gate, not optional (src/gaia/agents/base/agent.py)
Per CLAUDE.md, changes to the base agent loop require a gaia eval agent run compared against the committed baseline before merge — and you've correctly flagged this in the PR description as required. Just calling it out explicitly: don't merge on green unit tests alone. The injected [SYSTEM] re-plan message changes what the model sees mid-loop, which is exactly the kind of change unit tests can't validate.

🟡 _MUTATION_TOOLS omits two real mutation tools (agent.py:14)
delete_tools.py also registers permanent_delete and restore_message as @tools — both mutate external state, and permanent_delete is the most destructive of the set. They'll fall back to the slow reactive loop-detector that this PR exists to front-run. Either include them or note why they're excluded (e.g. permanent_delete erroring on a second call is "loud enough"):

    "remove_star_batch",
    "trash_message",
    "restore_message",
    "permanent_delete",
)

(Confirmed there is no trash_message_batch tool, so its absence is correct.)

🟢 Verify provider tolerance for the user-message-between-tool-results ordering (agent.py:3557, agent.py:3783)
The corrective message is appended to messages before the current call's tool-result message, so in the native fan-out path you can get assistant(tool_calls) → user(dedup) → tool(result). Mutation tools (batch email) fan out far more than query tools do, so this interleaving will be exercised harder than the query-dedup path that established the pattern. It matches the existing _QUERY_TOOLS placement so it's not a new bug — but it's worth confirming the native tool-calling path tolerates a user turn sandwiched between an assistant's tool_calls and their tool responses. The eval run should surface it if it's a problem.

🟢 Domain tool names hardcoded in the base class (agent.py:14)
_MUTATION_TOOLS couples the base agent to EmailTriageAgent's tool vocabulary, the same way _QUERY_TOOLS couples it to RAG names. It matches precedent so I won't block on it, but an overridable hook (e.g. _get_mutation_tool_names() that agents extend) would keep the base layer domain-agnostic and let other agents opt in. Fine to defer.

Strengths

  • The "why args, not result" rationale in both the module comment and the docstring is genuinely useful — it documents the non-obvious invariant a future reader would otherwise re-derive (or break).
  • Test suite is deterministic (no live Lemonade/LLM) and covers the right edges: first-repeat detection, distinct-ids-not-suppressed, arg-order normalization, batch variants, non-mutation no-op (asserting cache == {}), and the unhashable-args str() fallback.
  • Clean parity with the established query-dedup pattern — re-plan signal instead of a silent drop, cache scoped to the loop run, threshold mirroring >= 2.

Verdict

Approve with suggestions — the code is sound and the tests are strong. No code-level blocker, but treat the gaia eval agent comparison as a hard pre-merge gate (it's already in your test plan, just unchecked), and please decide on permanent_delete/restore_message before merging.

@itomek itomek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additive advisory signal keyed on (tool, args); mutation still always executes. LGTM.


Generated by Claude Code

@itomek itomek added this pull request to the merge queue Jun 11, 2026
Merged via the queue into main with commit b0c03c5 Jun 11, 2026
37 checks passed
@itomek itomek deleted the autofix/issue-1317 branch June 11, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(agents): early dedup for repeated mutation tool calls

2 participants