fix(agent): repair Windows-path tool args; gate SD override (#1023) #1027
Open
kovtcharov wants to merge 2 commits into main
`gaia sd <prompt>` on Windows would generate an image successfully in Step 1 but then return *"Image generation is not available — start GAIA with the `--sd` flag to enable it"* in Step 2, telling the user to enable a feature that was already on while silently dropping the story output. Two stacked bugs:

A. Parse fragility: the native tool_calls parser strictly required JSON-valid backslash escapes. Smaller LLMs (Gemma-4-E4B-class) probabilistically emit Windows paths with single backslashes (`C:\Users\Klaus`), which strict JSON rejects as `Invalid \escape`. `_parse_llm_response` now retries once with `\X` (where X is not a valid JSON escape char) doubled to `\\X`; the repair is idempotent on already-valid input. Truly malformed JSON still raises loudly.

B. Overaggressive verbose-failure override: the post-failure guard at `_process_query_impl` fired whenever `generate_image` had been *called* this turn, with no check on whether the call actually returned an error. When `generate_image` succeeded and a *different* tool's parse error provoked a verbose apology, the guard clobbered the model's reply with the misleading "not available" message. The fix tracks the latest capability-tool outcome and gates the override on `capability_tool_last_succeeded is False`.
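The retry-once repair described in (A) can be sketched roughly as follows. This is a minimal illustration, not the code in `_parse_llm_response`: the names `repair_invalid_escapes` and `parse_tool_args` are hypothetical, and the sketch assumes the tool arguments arrive as a single JSON string.

```python
import json
import re

# Characters that may legally follow a backslash in a JSON string.
_VALID_ESCAPE_CHARS = set('"\\/bfnrtu')

# Match a backslash together with the character after it, so an already
# doubled backslash is consumed as one pair and never re-examined.
_ESCAPE_PAIR = re.compile(r"\\(.)", re.DOTALL)


def repair_invalid_escapes(text: str) -> str:
    """Double any backslash that does not start a valid JSON escape (hypothetical helper)."""

    def fix(match: re.Match) -> str:
        ch = match.group(1)
        if ch in _VALID_ESCAPE_CHARS:
            return match.group(0)  # valid escape: leave untouched
        return "\\\\" + ch  # invalid escape: \X becomes \\X

    return _ESCAPE_PAIR.sub(fix, text)


def parse_tool_args(raw: str) -> dict:
    """Parse strictly first; on failure, retry once with repaired escapes (hypothetical helper)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = repair_invalid_escapes(raw)
        if repaired == raw:
            raise  # nothing to repair: surface the original error
        return json.loads(repaired)  # still raises if truly malformed
```

Under these assumptions, `parse_tool_args(r'{"path": "C:\Users\Klaus"}')` succeeds on the second attempt, while input that is already valid passes through the repair unchanged. One limitation of this sketch: a path segment that begins with a valid escape character (for example `\t` in `C:\temp`) is not doubled, so it would decode as a control character rather than as a literal backslash.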
Review pass on the #1023 fix surfaced one inconsistency and one missing regression guard.

The new ``capability_tool_last_succeeded`` tracker matched ``tool_name.startswith(...)`` while the pre-existing ``has_tried_capability_tool`` check uses ``.lower().startswith(...)``. A model emitting ``Generate_Image`` would slip past the tracker (case mismatch) but still trigger ``has_tried_capability_tool`` (after lower) -- so on a *failed* call the gate evaluated ``True and (None is False)`` = False and the override silently failed to fire, leaking the verbose apology to the user instead of the canonical "not available" message. Both tracker call sites now apply ``.lower()`` to match.

Tests: added ``test_override_still_fires_when_generate_image_failed`` as a regression guard for the legitimate failure path (without it, a future refactor could remove the gate entirely and only the post-success test would catch it), and replaced the previously-useless mixed-case success test with ``test_override_fires_after_mixed_case_capability_failure`` -- the success variant passed regardless of the bug because both branches yield "no override" when ``capability_tool_last_succeeded`` is ``None``; only the failure variant pins the ``.lower()`` semantics down. Red-green verified by stashing the ``.lower()`` fix and watching the new test fail with the exact verbose-apology leak.
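The gate and the case-insensitive matching can be illustrated with a small sketch. Only the two flag names, ``has_tried_capability_tool`` and ``capability_tool_last_succeeded``, come from this PR; the ``CapabilityTracker`` class, its methods, and the ``CAPABILITY_TOOL_PREFIXES`` tuple are hypothetical stand-ins for the logic inside the agent loop.

```python
from typing import Optional

# Hypothetical prefix list; the real agent may track other capability tools.
CAPABILITY_TOOL_PREFIXES = ("generate_image",)


class CapabilityTracker:
    """Tracks whether a capability tool was tried and how the latest call ended."""

    def __init__(self) -> None:
        self.has_tried_capability_tool = False
        # None means no capability call has been recorded this turn.
        self.capability_tool_last_succeeded: Optional[bool] = None

    def record(self, tool_name: str, succeeded: bool) -> None:
        # Lowercase before matching so "Generate_Image" updates both flags;
        # matching only one of them case-sensitively reintroduces the bug.
        if tool_name.lower().startswith(CAPABILITY_TOOL_PREFIXES):
            self.has_tried_capability_tool = True
            self.capability_tool_last_succeeded = succeeded

    def should_override(self) -> bool:
        # Fire only when the latest capability call actually failed.
        # "is False" deliberately excludes None (no outcome recorded).
        return (
            self.has_tried_capability_tool
            and self.capability_tool_last_succeeded is False
        )
```

With this shape, a failed ``Generate_Image`` call makes ``should_override()`` return ``True``, a later successful ``generate_image`` call turns it back to ``False``, and a failure in an unrelated tool leaves the flags untouched.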
Why this matters
`gaia sd <prompt>` on Windows would generate an image successfully in Step 1 but then return "Image generation is not available — start GAIA with the `--sd` flag to enable it" in Step 2, telling the user to enable a feature that was already on while silently dropping the story output.

Two stacked bugs:
- Parse fragility: smaller LLMs emit Windows paths such as `C:\Users\Klaus` (single-escape), which JSON rejects as `Invalid \escape: \U`. Now repaired before re-raise; idempotent and lossless on already-valid input.
- Overaggressive override: the verbose-failure override fired whenever `generate_image` had been called this turn, even when it succeeded. Now gated on the latest capability call actually erroring.

Test plan

- `pytest tests/unit/agents/test_parse_error_recovery.py`: 5 new tests + 6 pre-existing all pass
- `pytest tests/unit/agents/ tests/unit/test_sd_agent.py tests/unit/test_tool_call_priority.py`: 190 pass, no regressions
- `gaia sd "a forest" --logging-level DEBUG`: Step 2 parses the Windows path, and the story appears in the final answer (not the misleading "--sd" message)

Closes #1023