feat(core): support searchArea for extract-type methods to improve accuracy on small elements #2439
Open
Postroggy wants to merge 1 commit into
Add optional `region` (Rect) to `ServiceExtractOption` and `focusLocate` (string) to `AgentAssertOpt` / `MidsceneYamlFlowItemAIAssert`. When a region is provided the screenshot is cropped before being sent to the model, improving accuracy on small target elements from ~10% to ~70%.
- `region` is forwarded through the extract pipeline and applied in `AiExtractElementInfo` via the existing `cropByRect` utility.
- `focusLocate` triggers an `aiLocate` call first to derive the rect automatically from a natural-language description.
- No breaking changes: all new options are optional.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
When `aiAssert`/`aiBoolean`/`aiQuery` evaluate a full-screen screenshot, small target elements (e.g. a 40×40px icon) occupy too little of the image for the model to judge accurately.
Reproduction data:
Same model, same prompt, same image: cropping to the target region alone raised accuracy from 10% to 70%.
This PR adds an optional `searchArea` field to `ServiceExtractOption`, letting extract-type methods (`aiAssert`, `aiBoolean`, `aiQuery`, `aiNumber`, `aiString`) crop the screenshot to a sub-region before sending it to the model. A companion `locatePrompt` field on `AgentAssertOpt` and `MidsceneYamlFlowItemAIAssert` lets users describe the target in natural language; `aiLocate` resolves it to a rect automatically. This is especially useful in YAML scripts, where passing coordinates directly is error-prone: users only need to describe what they're looking for, and the model figures out where it is.
Motivation
Vision models struggle when the target element is small relative to the full screenshot. The issue is not the model's capability, but the signal-to-noise ratio: a 40×40px element on a 1280×800 screen takes up only ~0.16% of the image area (1,600 of 1,024,000 pixels), making subtle visual details (e.g. the direction an icon is facing) nearly impossible to judge reliably.
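The area figure above can be checked with a few lines of arithmetic. The helper below is purely illustrative (`areaFraction` and `Size` are names made up for this sketch, not part of the codebase):

```typescript
interface Size {
  width: number;
  height: number;
}

// Fraction of the screenshot's pixel area covered by the element.
function areaFraction(element: Size, screen: Size): number {
  return (element.width * element.height) / (screen.width * screen.height);
}

// 40×40 icon on a 1280×800 screenshot: 1,600 / 1,024,000 pixels.
const fraction = areaFraction(
  { width: 40, height: 40 },
  { width: 1280, height: 800 },
);
console.log(`${(fraction * 100).toFixed(2)}%`); // prints "0.16%"
```

Cropping to, say, a 200×200 region around the same icon would raise that share to 4% of the image the model sees.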
The fix is simple: crop the screenshot to the relevant region before sending it to the model. This PR makes that possible without any changes to prompts or the model-calling layer.
Changes
New field: `searchArea` on `ServiceExtractOption`
All extract-type methods (`aiAssert`, `aiBoolean`, `aiQuery`, `aiNumber`, `aiString`) accept an optional `searchArea: Rect`. When provided, the screenshot is cropped to that region before being sent to the model. When `screenshotIncluded` is `false`, the crop is skipped because there is no image to forward.
Scenario: you already know the coordinates (TypeScript)
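A minimal sketch of this scenario, under stated assumptions: the `Rect` shape (`left`/`top`/`width`/`height`), the position of the options argument in `aiAssert`, and the coordinates themselves are all illustrative. `AgentLike` and `stubAgent` are local stand-ins so the snippet runs on its own; they are not the real agent class.

```typescript
// Local stand-ins for illustration only; the real types live in @midscene/core.
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}
interface AssertOptions {
  searchArea?: Rect;
}
interface AgentLike {
  // Assumed argument order: assertion, optional error message, options.
  aiAssert(assertion: string, errorMsg?: string, options?: AssertOptions): Promise<void>;
}

// Stub that records what it was called with instead of calling a model.
const recordedCalls: Array<{ assertion: string; options?: AssertOptions }> = [];
const stubAgent: AgentLike = {
  async aiAssert(assertion, _errorMsg, options) {
    recordedCalls.push({ assertion, options });
  },
};

async function assertIconDirection(agent: AgentLike): Promise<void> {
  // Hypothetical screenshot-space coordinates of a 40×40px icon.
  const iconArea: Rect = { left: 600, top: 120, width: 40, height: 40 };
  await agent.aiAssert('the arrow icon points to the right', undefined, {
    searchArea: iconArea,
  });
}

void assertIconDirection(stubAgent);
```

With a real agent, only the cropped region would be sent to the model for this assertion.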
New field: `locatePrompt` on `AgentAssertOpt`
For cases where the exact rect is unknown, `locatePrompt` accepts a natural-language description. Internally, `aiAssert` calls `aiLocate` (which supports `deepLocate` for small or hard-to-find elements) to resolve the rect, then uses it as `searchArea`. This avoids the need for users to manually calculate screenshot-space coordinates.
Scenario: you don't know the coordinates, describe the target instead (TypeScript)
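The resolution flow described above might look roughly like the sketch below. `resolveSearchArea` and the `Locate` callback are hypothetical names standing in for the logic inside `aiAssert` and for `aiLocate`; the `Rect` shape and the rule that an explicit `searchArea` takes precedence over `locatePrompt` are assumptions, not confirmed behavior.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}
interface AssertOpt {
  searchArea?: Rect;
  locatePrompt?: string;
  screenshotIncluded?: boolean;
}
// Hypothetical stand-in for aiLocate: description in, rect out.
type Locate = (prompt: string) => Promise<Rect | undefined>;

async function resolveSearchArea(
  opt: AssertOpt,
  locate: Locate,
): Promise<Rect | undefined> {
  // No screenshot is forwarded, so there is nothing to crop or locate.
  if (opt.screenshotIncluded === false) return undefined;
  // Assumption: an explicit rect wins over a natural-language description.
  if (opt.searchArea) return opt.searchArea;
  if (opt.locatePrompt) return locate(opt.locatePrompt);
  return undefined;
}

// Fake locator that always "finds" the same 40×40px icon.
const fakeLocate: Locate = async () => ({ left: 600, top: 120, width: 40, height: 40 });

resolveSearchArea({ locatePrompt: 'the arrow icon in the toolbar' }, fakeLocate)
  .then((rect) => console.log(rect));
```

From the caller's side, the only change is passing `locatePrompt` with a description such as `'the arrow icon in the toolbar'` instead of a rect.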
Scenario: YAML scripts, where `locatePrompt` is the only practical option
In YAML there is no way to run `aiLocate` and pass its result to a subsequent step, and writing raw screenshot-space coordinates by hand is error-prone. `locatePrompt` solves both problems.
Implementation
| File | Change |
| --- | --- |
| `packages/core/src/yaml.ts` | `ServiceExtractOption` adds `searchArea?: Rect`; `MidsceneYamlFlowItemAIAssert` adds `locatePrompt?: string` |
| `packages/core/src/types.ts` | `AgentAssertOpt` adds `locatePrompt?: string` |
| `packages/core/src/agent/agent.ts` | `aiAssert` resolves `locatePrompt` → `aiLocate` → `searchArea`; both are skipped when `screenshotIncluded` is `false` |
| `packages/core/src/ai-model/inspect.ts` | `AiExtractElementInfo` crops screenshot via existing `cropByRect` when `searchArea` is set and `screenshotIncluded !== false` |

No breaking changes: all new fields are optional.
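The crop condition described for `AiExtractElementInfo` can be sketched as a small decision function. Everything here is an assumption made for illustration: `imageForModel` is a made-up name, the `Crop` callback is a stand-in for `cropByRect` (whose real signature is not shown in this PR description), and the `Rect` shape is assumed.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}
interface ExtractOption {
  searchArea?: Rect;
  screenshotIncluded?: boolean;
}
// Hypothetical stand-in for cropByRect; the real signature may differ.
type Crop = (imageBase64: string, rect: Rect) => Promise<string>;

// Decide which image, if any, is forwarded to the model.
async function imageForModel(
  screenshotBase64: string,
  opt: ExtractOption,
  crop: Crop,
): Promise<string | undefined> {
  // No image is sent at all, so the crop is skipped.
  if (opt.screenshotIncluded === false) return undefined;
  // Without a search area the full screenshot is used unchanged.
  if (!opt.searchArea) return screenshotBase64;
  // Otherwise only the cropped region is forwarded.
  return crop(screenshotBase64, opt.searchArea);
}

// Fake crop that tags the result instead of touching pixels.
const fakeCrop: Crop = async (_image, rect) =>
  `cropped:${rect.left},${rect.top},${rect.width}x${rect.height}`;

imageForModel('full', { searchArea: { left: 600, top: 120, width: 40, height: 40 } }, fakeCrop)
  .then((image) => console.log(image)); // prints "cropped:600,120,40x40"
```

Keeping the decision in one place means prompts and the model-calling layer stay untouched, which matches the stated goal of the change.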
Validation
- `pnpm run lint`
- `npx nx build @midscene/core`
- `npx nx test @midscene/core`: 825 passed

New test file `packages/core/tests/unit-test/aiassert-search-area.test.ts` covers:
- no crop when `searchArea` is absent
- `cropByRect` called with correct rect when `searchArea` is provided
- crop skipped for `screenshotIncluded: false` even if `searchArea` is set
- cropped `imageBase64` forwarded to model instead of original full screenshot