[Feature]: Skip multimodal-looker delegation in look_at when the current model natively supports vision #4624

@leostudiooo

Prerequisites

  • I will write this issue in English (see our Language Policy)
  • I have searched existing issues and discussions to avoid duplicates
  • This feature request is specific to oh-my-opencode (not OpenCode core)
  • I have read the documentation or asked an AI coding agent with this project's GitHub URL loaded and couldn't find the answer

Problem Description

When Sisyphus runs on a model that natively supports image input (e.g. Gemini 3 Pro, GPT-5.2, GLM-4.1V), the agent still calls the look_at tool for image files in the working directory. look_at delegates the analysis to the multimodal-looker agent, which is typically configured with a different model, so the current model's own vision capabilities are bypassed entirely.

This causes three concrete problems:

  1. Capability bypass: The current model's visual understanding is skipped; the analysis is performed by whichever model multimodal-looker uses. If the current model is the stronger multimodal model, users receive weaker analysis.
  2. Unnecessary latency and token cost: Every image adds an extra inter-agent call and an independent LLM invocation, increasing response time and cost.
  3. Analysis quality loss: multimodal-looker returns only a text summary of the image. The current model must work from that summary and cannot inspect image details directly, catch information the summary missed, or form independent visual judgments.

Reproduction:

  1. Configure Sisyphus with a vision-capable model (e.g. gemini-3-pro)
  2. Place an image file (PNG, JPG, screenshot, etc.) in the working directory
  3. Ask the agent to inspect the image (e.g. "check the screenshot")
  4. Observe: agent calls look_at → delegates to multimodal-looker → a different model analyzes the image → returns a text summary

In contrast, calling the read tool directly on the same image file works correctly on vision-capable models — the model receives the image inline and processes it with its own visual understanding, with no delegation overhead.

Proposed Solution

When look_at is invoked, check whether the current agent's model natively supports image input (via modalities.input or model.capabilities.input.image). Two approaches:

Option A (Recommended) — Direct read fallback: If the model supports vision, internally redirect to the read tool path (OpenCode's native read already supports images). The model receives the image inline and processes it directly.

Option B — In-place processing: Keep the look_at API contract but skip the multimodal-looker dispatch internally, reading and returning the image content within the same tool call. Preserves tool-call consistency without the cross-agent overhead.

When the model does not support vision, preserve the current behavior and delegate to multimodal-looker.
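
The routing decision described above could look roughly like the sketch below. It is only an illustration: the `ModelInfo` shape, the `supportsImageInput` and `chooseLookAtRoute` helpers, and the route names are hypothetical and not taken from the oh-my-opencode codebase. The two metadata paths are the ones named in this proposal, and their exact shape in the real model metadata may differ.

```typescript
// Hypothetical model metadata shape. The field names mirror the two paths
// mentioned in this proposal (modalities.input, capabilities.input.image);
// the real types in OpenCode / oh-my-opencode may look different.
interface ModelInfo {
  modalities?: { input?: string[] };
  capabilities?: { input?: { image?: boolean } };
}

// Hypothetical route names, for illustration only.
type LookAtRoute = "read-inline" | "delegate-to-multimodal-looker";

// Returns true only when the metadata explicitly advertises image input.
// Missing or unknown metadata is treated as "no vision", so the current
// delegation behavior stays the default.
function supportsImageInput(model: ModelInfo | undefined): boolean {
  if (!model) return false;
  if (model.capabilities?.input?.image === true) return true;
  return model.modalities?.input?.includes("image") ?? false;
}

// Decides how look_at should handle an image for the current agent's model.
function chooseLookAtRoute(model: ModelInfo | undefined): LookAtRoute {
  return supportsImageInput(model)
    ? "read-inline"
    : "delegate-to-multimodal-looker";
}
```

Defaulting to delegation when metadata is absent keeps text-only models and unrecognized models on the existing path, so the change would only affect models that explicitly declare image input.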

Alternatives Considered

Adding an instruction in AGENTS.md:

DO NOT use the `look_at` tool for images. Use the `read` tool instead.

This is a partial workaround but does not reliably propagate to subagents and places the burden on the user rather than fixing the routing logic.

Doctor Output (Optional)

N/A — this is a design/behavior issue, not environment-specific.

Additional Context

Related issues:

This issue fills the gap: what should happen when the model already supports vision.

Feature Type

New Tool

Contribution

  • I'm willing to submit a PR for this feature
  • I can help with testing
  • I can help with documentation
