Prerequisites
Problem Description
When Sisyphus runs on a model that natively supports image input (e.g. Gemini 3 Pro, GPT-5.2, GLM-4.1V), the agent still calls the `look_at` tool for image files in the working directory, which delegates analysis to the `multimodal-looker` agent (typically configured with a different model). This bypasses the current model's own vision capabilities entirely.
This causes three concrete problems:
- Capability bypass: The current model's visual understanding is skipped; analysis is actually performed by whichever model `multimodal-looker` uses. If the current model is the stronger multimodal model, users receive weaker analysis.
- Unnecessary latency and token cost: An extra inter-agent call and an independent LLM invocation are added for every image, increasing response time and cost.
- Analysis quality loss: `multimodal-looker` returns a text summary of the image. The current model can only work from that summary and cannot directly inspect image details, catch information missed by the summary, or form independent visual judgments.
Reproduction:
- Configure Sisyphus with a vision-capable model (e.g. `gemini-3-pro`)
- Place an image file (PNG, JPG, screenshot, etc.) in the working directory
- Ask the agent to inspect the image (e.g. "check the screenshot")
- Observe: agent calls `look_at` → delegates to `multimodal-looker` → a different model analyzes the image → returns a text summary
In contrast, calling the `read` tool directly on the same image file works correctly on vision-capable models: the model receives the image inline and processes it with its own visual understanding, with no delegation overhead.
Proposed Solution
When `look_at` is invoked, check whether the current agent's model natively supports image input (via `modalities.input` or `model.capabilities.input.image`). Two approaches:
Option A (Recommended), direct `read` fallback: If the model supports vision, internally redirect to the `read` tool path (OpenCode's native `read` already supports images). The model receives the image inline and processes it directly.
Option B, in-place processing: Keep the `look_at` API contract but skip the `multimodal-looker` dispatch internally, reading and returning the image content within the same tool call. This preserves tool-call consistency without the cross-agent overhead.
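Option B could look roughly like the sketch below. The `LookAtResult` shape, `mimeForPath`, and `lookAtInPlace` are hypothetical names introduced only for illustration; the actual `look_at` contract and how image bytes are attached to a tool result may differ. The caller is assumed to have already read the file and base64-encoded it.

```typescript
// Hypothetical result shape for look_at; the real tool contract may differ.
type LookAtResult =
  | { kind: "image"; mime: string; base64: string } // image returned inline
  | { kind: "delegate"; agent: "multimodal-looker" }; // existing behavior

// Extensions treated as images in this sketch (not an exhaustive list).
const IMAGE_MIME = new Map<string, string>([
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["gif", "image/gif"],
  ["webp", "image/webp"],
]);

// Map a file path to an image MIME type, or undefined if it is not a known image.
function mimeForPath(path: string): string | undefined {
  const dot = path.lastIndexOf(".");
  if (dot < 0) return undefined;
  return IMAGE_MIME.get(path.slice(dot + 1).toLowerCase());
}

// Return the image inline when the current model has vision and the file is a
// known image type; otherwise signal that the existing delegation should run.
function lookAtInPlace(
  path: string,
  base64: string,
  modelHasVision: boolean,
): LookAtResult {
  const mime = mimeForPath(path);
  if (modelHasVision && mime !== undefined) {
    return { kind: "image", mime, base64 };
  }
  return { kind: "delegate", agent: "multimodal-looker" };
}
```

Keeping the decision in one function means the tool name and arguments seen by the agent stay the same; only the dispatch behind them changes.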
When the model does not support vision, preserve the current behavior and delegate to `multimodal-looker`.
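The routing decision for Option A can be sketched as a small pure function. This is a minimal sketch under assumptions: the `ModelInfo` interface and the function names are hypothetical, and the field layout simply mirrors the two lookups named above (`modalities.input` and `capabilities.input.image`); the real model metadata types may be shaped differently.

```typescript
// Hypothetical model metadata; field names mirror the two lookups named in
// the proposal, but the real types may differ.
interface ModelInfo {
  id: string;
  modalities?: { input?: string[] };
  capabilities?: { input?: { image?: boolean } };
}

type LookAtRoute = "read" | "multimodal-looker";

// True when either metadata source reports native image input.
function supportsImageInput(model: ModelInfo): boolean {
  if (model.capabilities?.input?.image === true) return true;
  return model.modalities?.input?.includes("image") ?? false;
}

// Option A: vision-capable models take the read path; all others keep the
// current delegation to multimodal-looker.
function routeLookAt(model: ModelInfo): LookAtRoute {
  return supportsImageInput(model) ? "read" : "multimodal-looker";
}
```

Missing or incomplete metadata falls through to delegation, so a model whose capabilities are unknown behaves exactly as it does today.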
Alternatives Considered
Adding an instruction in `AGENTS.md`:
DO NOT use the `look_at` tool for images. Use the `read` tool instead.
This is a partial workaround but does not reliably propagate to subagents and places the burden on the user rather than fixing the routing logic.
Doctor Output (Optional)
N/A — this is a design/behavior issue, not environment-specific.
Additional Context
Related issues:
- … `look_at` crashing when `multimodal-looker` is disabled; the bypass itself was not resolved.
- … `multimodal-looker` when the main model does not support images.
This issue fills the gap: what should happen when the model already supports vision.
Feature Type
New Tool
Contribution