Skip to content

fix(agent): repair Windows-path tool args; gate SD override (#1023)#1027

Open
kovtcharov wants to merge 2 commits intomainfrom
fix/sd-windows-path-escape-1023
Open

fix(agent): repair Windows-path tool args; gate SD override (#1023)#1027
kovtcharov wants to merge 2 commits intomainfrom
fix/sd-windows-path-escape-1023

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Why this matters

gaia sd <prompt> on Windows would generate an image successfully in Step 1 but then return "Image generation is not available — start GAIA with the --sd flag to enable it" in Step 2 — telling the user to enable a feature that was already on, while silently dropping the story output.

Two stacked bugs:

  • Parse fragility — the native tool_calls parser strictly required JSON-valid backslash escapes, but the LLM probabilistically emits C:\Users\Klaus (single-escape) which JSON rejects as Invalid \escape: \U. Now repaired before re-raise; idempotent and lossless on already-valid input.
  • Overaggressive verbose-failure override — the post-failure guard fired whenever generate_image had been called this turn, even when it succeeded. Now gated on the latest capability call actually erroring.

Test plan

  • pytest tests/unit/agents/test_parse_error_recovery.py — 5 new tests + 6 pre-existing all pass
  • pytest tests/unit/agents/ tests/unit/test_sd_agent.py tests/unit/test_tool_call_priority.py — 190 pass, no regressions
  • Manual on Windows: gaia sd "a forest" --logging-level DEBUG — Step 2 parses the Windows path, story appears in the final answer (not the misleading "--sd" message)

Closes #1023

`gaia sd <prompt>` on Windows would generate an image successfully in
Step 1 but then return *"Image generation is not available — start
GAIA with the `--sd` flag to enable it"* in Step 2 — telling the user
to enable a feature that was already on, while silently dropping the
story output.

Two stacked bugs:

A. Parse fragility: the native tool_calls parser strictly required
   JSON-valid backslash escapes.  Smaller LLMs (Gemma-4-E4B-class)
   probabilistically emit Windows paths with single backslashes
   (`C:\Users\Klaus`), which strict JSON rejects as `Invalid \escape`.
   `_parse_llm_response` now retries once with `\X` (where X is not a
   valid JSON escape char) doubled to `\\X` — idempotent on
   already-valid input.  Truly malformed JSON still raises loudly.

B. Overaggressive verbose-failure override: the post-failure guard at
   `_process_query_impl` fired whenever `generate_image` had been
   *called* this turn — with no check on whether the call actually
   returned an error.  When generate_image succeeded and a *different*
   tool's parse error provoked a verbose apology, the guard clobbered
   the model's reply with the misleading "not available" message.
   Track the latest capability-tool outcome and gate the override on
   `capability_tool_last_succeeded is False`.
Review pass on the #1023 fix surfaced one inconsistency and one missing
regression guard.

The new ``capability_tool_last_succeeded`` tracker matched
``tool_name.startswith(...)`` while the pre-existing
``has_tried_capability_tool`` check uses ``.lower().startswith(...)``.
A model emitting ``Generate_Image`` would slip past the tracker (case
mismatch) but still trigger ``has_tried_capability_tool`` (after lower)
-- so on a *failed* call the gate evaluated ``True and (None is False)``
= False and the override silently failed to fire, leaking the verbose
apology to the user instead of the canonical "not available" message.
Both tracker call sites now apply ``.lower()`` to match.

Tests: added ``test_override_still_fires_when_generate_image_failed``
as a regression guard for the legitimate failure path (without it, a
future refactor could remove the gate entirely and only the
post-success test would catch it), and replaced the previously-useless
mixed-case success test with ``test_override_fires_after_mixed_case_capability_failure``
-- the success variant passed regardless of the bug because both
branches yield "no override" when ``capability_tool_last_succeeded`` is
``None``; only the failure variant pins the ``.lower()`` semantics
down.  Red-green verified by stashing the ``.lower()`` fix and watching
the new test fail with the exact verbose-apology leak.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: tool call parsing fails in SD step

1 participant