Problem
Autofish's ref-based interaction model is a strong fit for agent control because actions can target stable refs instead of brittle raw coordinates. However, some Android surfaces are only partially represented, or not represented at all, in the Accessibility tree:
- games
- custom Canvas/OpenGL views
- heavily customized WebViews
- media or short-video apps
- UI elements rendered as images
In those cases, the current fallback is usually coordinate-based tapping, which loses the main advantage of Autofish's agent-first design.
Proposal
Add optional visual refs alongside Accessibility refs:
af observe page --field refs --field visual-refs
af act tap --by ref --value @v1
Example output shape:
```json
{
  "visual_refs": [
    {
      "ref": "@v1",
      "type": "text",
      "text": "Log in",
      "bounds": [120, 840, 310, 900],
      "source": "ocr",
      "confidence": 0.94,
      "frame_id": "..."
    },
    {
      "ref": "@v2",
      "type": "image",
      "label": "cart icon",
      "bounds": [930, 120, 1000, 190],
      "source": "image-anchor",
      "confidence": 0.88,
      "frame_id": "..."
    }
  ]
}
```
The important part is preserving the existing Autofish interaction contract: agents still act through refs, but refs can now come from sources other than the Accessibility tree.
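To make that contract concrete, here is a minimal agent-side sketch in Python. It assumes `af observe page` prints the JSON shown above on stdout; the helper names and the exact output envelope are illustrative, not part of the proposal.

```python
# Minimal sketch: observe, pick a visual ref, act through the ref (not coordinates).
import json
import subprocess

def observe_visual_refs() -> list[dict]:
    # Opt in to visual refs alongside the normal Accessibility refs.
    out = subprocess.run(
        ["af", "observe", "page", "--field", "refs", "--field", "visual-refs"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out).get("visual_refs", [])

def tap_ref(ref: str) -> None:
    # Same interaction contract as Accessibility refs: act on the ref, never coordinates.
    subprocess.run(["af", "act", "tap", "--by", "ref", "--value", ref], check=True)

# Example: tap the "Log in" text found by OCR, if present.
login = next((r for r in observe_visual_refs()
              if r.get("type") == "text" and r.get("text") == "Log in"), None)
if login is not None:
    tap_ref(login["ref"])
```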
Design notes
- Visual refs should be explicitly marked with their source and confidence.
- Visual refs should be tied to a frame_id or observation timestamp so agents can avoid acting on stale frames (a freshness guard along these lines is sketched after this list).
- This should be opt-in, not always-on, because OCR/image detection may have extra latency or dependencies.
- The first implementation could be OCR-only; image anchors or template matching can come later (a rough OCR-to-ref mapping is also sketched after this list).
- Autofish does not need to become a vision or LLM framework. It only needs to expose deterministic visual candidates that an agent can reason about.
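As a sketch of how an agent might use source, confidence, and frame freshness before acting: the thresholds below and the `observed_at` timestamp are illustrative assumptions, not part of the proposal.

```python
# Illustrative guard: only act on visual refs that look trustworthy and fresh.
import time

MIN_CONFIDENCE = 0.8          # assumed threshold, tune per task
MAX_FRAME_AGE_SECONDS = 2.0   # assumed staleness budget

def is_actionable(ref: dict, observed_at: float) -> bool:
    """Return True if a visual ref looks safe to act on.

    observed_at is the epoch time the observation was captured; how that
    timestamp (or a frame_id comparison) is surfaced is an open question.
    """
    if ref.get("source") not in ("ocr", "image-anchor"):
        return False
    if ref.get("confidence", 0.0) < MIN_CONFIDENCE:
        return False
    if time.time() - observed_at > MAX_FRAME_AGE_SECONDS:
        # Likely stale; re-observe and compare frame_id before tapping.
        return False
    return True
```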
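And a rough sketch of the OCR-only first pass: turning OCR word boxes from a screenshot into records matching the shape above. pytesseract stands in for whatever OCR engine the implementation would actually use, and the field mapping is an assumption.

```python
# Rough sketch: map OCR word boxes from a screenshot to visual_refs records.
import pytesseract
from PIL import Image

def ocr_visual_refs(screenshot_path: str, frame_id: str) -> list[dict]:
    data = pytesseract.image_to_data(
        Image.open(screenshot_path), output_type=pytesseract.Output.DICT
    )
    refs = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if not text.strip() or conf < 0:
            continue  # skip empty boxes and entries Tesseract marks as non-text
        left, top = data["left"][i], data["top"][i]
        width, height = data["width"][i], data["height"][i]
        refs.append({
            "ref": f"@v{len(refs) + 1}",
            "type": "text",
            "text": text,
            "bounds": [left, top, left + width, top + height],
            "source": "ocr",
            "confidence": conf / 100.0,   # Tesseract reports confidence as 0-100
            "frame_id": frame_id,
        })
    return refs
```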
Why this improves capability
This would expand Autofish from standard app automation into a much wider class of real-world Android tasks while preserving its core principle: stable, observable, ref-based actions rather than raw coordinate scripts.
Acceptance criteria
- observe page can optionally return visual refs.
- act tap --by ref can target those refs just like Accessibility refs.
- The output makes source, confidence, bounds, and frame freshness explicit.
- Existing Accessibility refs and current workflows continue to work unchanged.