Support visual refs generated from OCR and image anchors #11

@vorojar

Problem

Autofish's ref-based interaction model is a strong fit for agent control because actions can target stable refs instead of brittle raw coordinates. However, some Android surfaces are only partially represented, or not represented at all, in the Accessibility tree:

  • games
  • custom Canvas/OpenGL views
  • heavily customized WebViews
  • media or short-video apps
  • UI elements rendered as images

In those cases, the current fallback is usually coordinate-based tapping, which loses the main advantage of Autofish's agent-first design.

Proposal

Add optional visual refs alongside Accessibility refs:

af observe page --field refs --field visual-refs
af act tap --by ref --value @v1

Example output shape:

{
  "visual_refs": [
    {
      "ref": "@v1",
      "type": "text",
      "text": "Log in",
      "bounds": [120, 840, 310, 900],
      "source": "ocr",
      "confidence": 0.94,
      "frame_id": "..."
    },
    {
      "ref": "@v2",
      "type": "image",
      "label": "cart icon",
      "bounds": [930, 120, 1000, 190],
      "source": "image-anchor",
      "confidence": 0.88,
      "frame_id": "..."
    }
  ]
}

The important part is preserving the existing Autofish interaction contract: agents still act through refs, but refs can come from more than the Accessibility tree.
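Assuming the output shape above, an agent-side selection step can stay entirely ref-based. The `pick_visual_ref` helper and the confidence threshold below are illustrative only, not part of any existing Autofish API:

```python
import json

# Hypothetical observation payload in the shape proposed above.
OBSERVATION = json.loads("""
{
  "visual_refs": [
    {"ref": "@v1", "type": "text", "text": "Log in",
     "bounds": [120, 840, 310, 900], "source": "ocr", "confidence": 0.94},
    {"ref": "@v2", "type": "image", "label": "cart icon",
     "bounds": [930, 120, 1000, 190], "source": "image-anchor", "confidence": 0.88}
  ]
}
""")

def pick_visual_ref(observation, target_text, min_confidence=0.8):
    """Return the highest-confidence OCR ref whose text matches target_text."""
    candidates = [
        r for r in observation["visual_refs"]
        if r["source"] == "ocr"
        and r.get("text", "").lower() == target_text.lower()
        and r["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda r: r["confidence"], default=None)

ref = pick_visual_ref(OBSERVATION, "Log in")
if ref is not None:
    # The agent then acts through the normal ref-based contract:
    #   af act tap --by ref --value @v1
    print(ref["ref"])  # prints "@v1"
```

The agent never touches `bounds` directly; coordinates remain an internal detail behind the ref.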

Design notes

  • Visual refs should be explicitly marked with their source and confidence.
  • Visual refs should be tied to a frame_id or observation timestamp so agents can avoid acting on stale frames.
  • This should be opt-in, not always-on, because OCR/image detection may add latency or require extra dependencies.
  • The first implementation could be OCR-only; image anchors or template matching can come later.
  • Autofish does not need to become a vision or LLM framework. It only needs to expose deterministic visual candidates that an agent can reason about.
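The frame-freshness note above could be enforced on the agent side. A minimal sketch, where the field names, the timestamp handling, and the staleness threshold are all assumptions for illustration:

```python
import time

STALE_AFTER_S = 2.0  # illustrative budget, not a proposed default

def is_fresh(visual_ref, current_frame_id, observed_at, now=None):
    """Reject a visual ref if its frame no longer matches the live screen,
    or if the observation is older than the staleness budget."""
    now = time.monotonic() if now is None else now
    if visual_ref.get("frame_id") != current_frame_id:
        return False  # screen changed since the ref was observed
    return (now - observed_at) <= STALE_AFTER_S

ref = {"ref": "@v1", "frame_id": "f-101"}
assert is_fresh(ref, "f-101", observed_at=10.0, now=11.0)
assert not is_fresh(ref, "f-102", observed_at=10.0, now=11.0)  # screen changed
assert not is_fresh(ref, "f-101", observed_at=10.0, now=20.0)  # too old
```

If the check fails, the agent re-runs `observe page` rather than tapping a stale ref.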

Why this improves capability

This would expand Autofish from standard app automation into a much wider class of real-world Android tasks while preserving its core principle: stable, observable, ref-based actions rather than raw coordinate scripts.

Acceptance criteria

  • observe page can optionally return visual refs.
  • act tap --by ref can target those refs just like Accessibility refs.
  • The output makes source, confidence, bounds, and frame freshness explicit.
  • Existing Accessibility refs and current workflows continue to work unchanged.
