Improve clicking accuracy
- When clicking on elements, try making your LLM use the mobile_list_elements_on_screen tool.
- If you are building something stable, use identifier when telling the LLM what element to click on.
- Screenshots should be a last resort, as current models are incapable of exact recognition of boundaries.
LLMs are terrific at understanding unstructured data. That could be a list of elements on screen, or it could be a screenshot. LLMs that can also do image recognition (including OpenAI, Claude, Gemini) are able to explain what's on screen.
A prompt of "Take a screenshot and explain what you're seeing on screen" will most likely invoke mobile_take_screenshot and then pass the image for the LLM to analyze. It will tell you the clock time, list the icons on screen, explain the photo that's currently being viewed and so on. It will probably even handle multiple spoken languages.
When it comes to clicking an element, though, accuracy is important. Clicking even 1 pixel outside the boundaries of an element might yield unexpected results.
The tool mobile_list_elements_on_screen dumps the view hierarchy using WebDriverAgent's /source on iOS or uiautomator dump on Android. The output roughly represents a stripped-down DOM of what's visible on screen.
The result includes elements and their label, name, value, type (class), accessibility hints (identifier) and, most importantly, boundary. mobile-mcp translates the results from both tools into simple JSON that is easy for an LLM to understand.
This is the preferred way of clicking an element on screen. Your model will be provided with the x, y, width and height of each element, and may click on any pixel within those boundaries using mobile_click_on_screen_at_coordinates.
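To illustrate why boundaries matter, here is a minimal sketch of picking a safe click point from an element's x, y, width and height. The Rect type and the helper names are hypothetical and not part of mobile-mcp; the sketch also assumes coordinates are whole pixels with the origin at the top-left of the screen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical boundary of one element, in screen pixels."""
    x: int
    y: int
    width: int
    height: int


def center_of(rect: Rect) -> tuple[int, int]:
    """Return the pixel in the middle of the element, the point furthest from its edges."""
    if rect.width <= 0 or rect.height <= 0:
        raise ValueError("element has no clickable area")
    return rect.x + rect.width // 2, rect.y + rect.height // 2


def contains(rect: Rect, px: int, py: int) -> bool:
    """True if (px, py) falls inside the element's boundaries."""
    return (rect.x <= px < rect.x + rect.width
            and rect.y <= py < rect.y + rect.height)


# Made-up boundary for a "Like" button.
like_button = Rect(x=24, y=600, width=48, height=48)
click_x, click_y = center_of(like_button)

print(click_x, click_y)                 # 48 624
print(contains(like_button, 23, 624))   # False: one pixel left of the element
```

Aiming for the center rather than a corner leaves the largest margin for error if the reported boundaries are slightly off.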
When automating using mobile-mcp, make sure your prompt allows the LLM to understand what it needs to click on. You can ask it to "List elements on screen" so you know which identifier to use when clicking. Most of the time the LLM is capable of figuring out which element to click on based on the JSON in the response, but for deterministic results, use identifier.
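The difference between letting the model guess and selecting by identifier can be sketched as follows. The JSON shape below (the "identifier", "label" and "rect" keys) and the sample values are assumptions made for illustration; the actual output of mobile_list_elements_on_screen may be structured differently.

```python
import json

# Hypothetical element list, loosely modelled on the fields described above.
SAMPLE_ELEMENTS = json.loads("""
[
  {"type": "Button", "label": "Like",  "identifier": "like_button",
   "rect": {"x": 24, "y": 600, "width": 48, "height": 48}},
  {"type": "Button", "label": "Share", "identifier": "share_button",
   "rect": {"x": 96, "y": 600, "width": 48, "height": 48}},
  {"type": "Button", "label": "Like",  "identifier": "",
   "rect": {"x": 24, "y": 900, "width": 48, "height": 48}}
]
""")


def find_by_identifier(elements: list[dict], identifier: str) -> dict:
    """Return the single element with this identifier, or fail loudly."""
    matches = [e for e in elements if e.get("identifier") == identifier]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one element with identifier {identifier!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Two elements share the label "Like", so matching on label alone is ambiguous.
by_label = [e for e in SAMPLE_ELEMENTS if e["label"] == "Like"]
print(len(by_label))  # 2

# The identifier singles out one element every time.
target = find_by_identifier(SAMPLE_ELEMENTS, "like_button")
print(target["rect"])  # {'x': 24, 'y': 600, 'width': 48, 'height': 48}
```

Failing when zero or several elements match is a deliberate choice here: a stable automation is better off stopping than clicking the wrong element.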
Sometimes the developer of the app didn't add accessibility hints (identifiers) for the elements you're interested in interacting with. While LLMs like Claude, Gemini and OpenAI are capable of recognizing the right element, they often fail to give the exact coordinates within the screenshot. Furthermore, prompting the model multiple times with "Get the screen boundaries of the LIKE image" might give you different results on each call.
From Mobile Next: We are building the future of mobile development 📱 🚀