
Improve clicking accuracy

Gil Megidish edited this page Jun 9, 2025 · 2 revisions

TL;DR

  • When clicking on elements, try to make your LLM use the mobile_list_elements_on_screen tool.
  • If you are building something stable, use the element's identifier when telling the LLM which element to click on.
  • Screenshots should be a last resort, as current models cannot reliably recognize exact element boundaries.

Overview

LLMs are terrific at understanding unstructured data, whether that is a list of elements on screen or a screenshot. LLMs that can also do image recognition (including models from OpenAI, Claude, and Gemini) are able to explain what's on screen.

A prompt such as "Take a screenshot and explain what you're seeing on screen" will most likely invoke mobile_take_screenshot and then pass the image to the LLM to analyze. It will tell you the clock time, list the icons on screen, explain the photo that's currently being viewed, and so on. It will probably even do so across multiple spoken languages.

When it comes to clicking an element, though, accuracy is important. Clicking even 1 pixel outside the boundaries of an element might yield unexpected results.

Option 1: Using mobile_list_elements_on_screen

The mobile_list_elements_on_screen tool dumps the view hierarchy using WebDriverAgent's /source on iOS or uiautomator dump on Android. Both produce something like a stripped-down DOM of what's visible on screen.

The result includes elements and their label, name, value, type (class), accessibility hints (identifier) and, most importantly, boundary. mobile-mcp translates the results from both tools into simple JSON that is easy for an LLM to understand.

This is the preferred way of clicking an element on screen. Your model is provided with the x, y, width and height of each element, and may click on any pixel within those boundaries using mobile_click_on_screen_at_coordinates.
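To make the "any pixel within those boundaries" point concrete, here is a minimal Python sketch of picking a safe click point from an element's rectangle. The dictionary shape (keys x, y, width, height) and the helper names center_of and contains are assumptions for illustration only; check the actual JSON returned by mobile_list_elements_on_screen for the real field names.

```python
from typing import TypedDict


class Rect(TypedDict):
    # Assumed shape of an element's boundary; the real field names may differ.
    x: int
    y: int
    width: int
    height: int


def center_of(rect: Rect) -> tuple[int, int]:
    """Return the midpoint of the rectangle, a point well inside its boundaries."""
    return rect["x"] + rect["width"] // 2, rect["y"] + rect["height"] // 2


def contains(rect: Rect, px: int, py: int) -> bool:
    """Check whether pixel (px, py) falls inside the rectangle (right and bottom edges excluded)."""
    inside_x = rect["x"] <= px < rect["x"] + rect["width"]
    inside_y = rect["y"] <= py < rect["y"] + rect["height"]
    return inside_x and inside_y
```

Aiming for the center rather than an edge leaves the widest margin for error, which matters given that a click even 1 pixel outside the element can hit something else.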

When automating using mobile-mcp, make sure your prompt allows the LLM to understand what it needs to click on. You can ask it to "List elements on screen" so you know which identifier to use when clicking. Most of the time the LLM is perfectly capable of figuring out which element to click on based on the JSON in the response, but for deterministic results, use the identifier.
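If you post-process the element list yourself, the deterministic lookup by identifier could look roughly like the following Python sketch. The "identifier" key and the find_by_identifier helper are hypothetical, as is the sample data in the usage note; the sketch only shows why an exact identifier match is more predictable than letting the model guess from labels.

```python
from typing import Any


def find_by_identifier(elements: list[dict[str, Any]], identifier: str) -> dict[str, Any]:
    """Return the single element whose identifier matches exactly.

    Raises LookupError when there is no match or more than one,
    so an ambiguous screen never results in a silent wrong click.
    """
    # "identifier" is an assumed key name, not confirmed by mobile-mcp's output.
    matches = [e for e in elements if e.get("identifier") == identifier]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one element with identifier {identifier!r}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

For example, given a hypothetical list containing {"identifier": "like_button", "label": "Like"}, calling find_by_identifier(elements, "like_button") returns that element, while an unknown identifier raises LookupError instead of falling back to a guess.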

Option 2: Using a screenshot

Sometimes the developer of the app didn't add accessibility hints (identifiers) for the elements you want to interact with. While LLMs such as Claude, Gemini and OpenAI's models are capable of recognizing the right element, they often fail to give its exact coordinates within the screenshot. Furthermore, prompting the model multiple times with "Get the screen boundaries of the LIKE image" might give you different results on each call.
