Support visual refs generated from OCR and image anchors #11

@vorojar

Problem

Autofish's ref-based interaction model is a strong fit for agent control because actions can target stable refs instead of brittle raw coordinates. However, some Android surfaces are only partially represented, or not represented at all, in the Accessibility tree:

  • games
  • custom Canvas/OpenGL views
  • heavily customized WebViews
  • media or short-video apps
  • UI elements rendered as images

In those cases, the current fallback is usually coordinate-based tapping, which loses the main advantage of Autofish's agent-first design.

Proposal

Add optional visual refs alongside Accessibility refs:

af observe page --field refs --field visual-refs
af act tap --by ref --value @v1

Example output shape:

{
  "visual_refs": [
    {
      "ref": "@v1",
      "type": "text",
      "text": "Log in",
      "bounds": [120, 840, 310, 900],
      "source": "ocr",
      "confidence": 0.94,
      "frame_id": "..."
    },
    {
      "ref": "@v2",
      "type": "image",
      "label": "cart icon",
      "bounds": [930, 120, 1000, 190],
      "source": "image-anchor",
      "confidence": 0.88,
      "frame_id": "..."
    }
  ]
}

The important part is preserving the existing Autofish interaction contract: agents still act through refs, but refs can come from more than the Accessibility tree.
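Assuming the output shape above, an agent-side selection step can stay entirely ref-based. The `pick_visual_ref` helper and the confidence threshold below are illustrative only, not part of any existing Autofish API:

```python
import json

# Hypothetical observation payload in the shape proposed above.
OBSERVATION = json.loads("""
{
  "visual_refs": [
    {"ref": "@v1", "type": "text", "text": "Log in",
     "bounds": [120, 840, 310, 900], "source": "ocr", "confidence": 0.94},
    {"ref": "@v2", "type": "image", "label": "cart icon",
     "bounds": [930, 120, 1000, 190], "source": "image-anchor", "confidence": 0.88}
  ]
}
""")

def pick_visual_ref(observation, target_text, min_confidence=0.8):
    """Return the highest-confidence OCR ref whose text matches target_text."""
    candidates = [
        r for r in observation["visual_refs"]
        if r["source"] == "ocr"
        and r.get("text", "").lower() == target_text.lower()
        and r["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda r: r["confidence"], default=None)

ref = pick_visual_ref(OBSERVATION, "Log in")
if ref is not None:
    # The agent then acts through the normal ref-based contract:
    #   af act tap --by ref --value @v1
    print(ref["ref"])  # prints "@v1"
```

The agent never touches `bounds` directly; coordinates remain an internal detail behind the ref.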

Design notes

  • Visual refs should be explicitly marked with their source and confidence.
  • Visual refs should be tied to a frame_id or observation timestamp so agents can avoid acting on stale frames.
  • This should be opt-in, not always-on, because OCR/image detection may add latency or require extra dependencies.
  • The first implementation could be OCR-only; image anchors or template matching can come later.
  • Autofish does not need to become a vision or LLM framework. It only needs to expose deterministic visual candidates that an agent can reason about.
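The frame-freshness note above could be enforced on the agent side. A minimal sketch, where the field names, the timestamp handling, and the staleness threshold are all assumptions for illustration:

```python
import time

STALE_AFTER_S = 2.0  # illustrative budget, not a proposed default

def is_fresh(visual_ref, current_frame_id, observed_at, now=None):
    """Reject a visual ref if its frame no longer matches the live screen,
    or if the observation is older than the staleness budget."""
    now = time.monotonic() if now is None else now
    if visual_ref.get("frame_id") != current_frame_id:
        return False  # screen changed since the ref was observed
    return (now - observed_at) <= STALE_AFTER_S

ref = {"ref": "@v1", "frame_id": "f-101"}
assert is_fresh(ref, "f-101", observed_at=10.0, now=11.0)
assert not is_fresh(ref, "f-102", observed_at=10.0, now=11.0)  # screen changed
assert not is_fresh(ref, "f-101", observed_at=10.0, now=20.0)  # too old
```

If the check fails, the agent re-runs `observe page` rather than tapping a stale ref.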

Why this improves capability

This would expand Autofish from standard app automation into a much wider class of real-world Android tasks while preserving its core principle: stable, observable, ref-based actions rather than raw coordinate scripts.

Acceptance criteria

  • observe page can optionally return visual refs.
  • act tap --by ref can target those refs just like Accessibility refs.
  • The output makes source, confidence, bounds, and frame freshness explicit.
  • Existing Accessibility refs and current workflows continue to work unchanged.
