feat(core): support searchArea for extract-type methods to improve accuracy on small elements #2439

Open: Postroggy wants to merge 1 commit into web-infra-dev:main from Postroggy:feat/aiassert-region-focus

Summary

When aiAssert / aiBoolean / aiQuery evaluate a full-screen screenshot, small target elements (e.g. a 40×40px icon) occupy too little of the image for the model to judge accurately.

Reproduction data:

| Model | Image | Accuracy |
| --- | --- | --- |
| doubao-seed-2.0 | full 1280×800 | 10% (1/10) |
| doubao-seed-2.0 | cropped to top 12.5% (1280×100) | 70% (7/10) |

Same model and same prompt, with the second run using a crop of the same screenshot: cropping to the target region alone raised accuracy from 10% to 70%.

This PR adds an optional searchArea field to ServiceExtractOption, letting extract-type methods (aiAssert, aiBoolean, aiQuery, aiNumber, aiString) crop the screenshot to a sub-region before sending it to the model. A companion locatePrompt field on AgentAssertOpt and MidsceneYamlFlowItemAIAssert lets users describe the target in natural language; aiLocate resolves it to a rect automatically. This is especially useful in YAML scripts where passing coordinates directly is error-prone — users only need to describe what they're looking for, and the model figures out where it is.


Motivation

Vision models struggle when the target element is small relative to the full screenshot. The issue is not the model's capability, but the signal-to-noise ratio: a 40×40px element on a 1280×800 screen takes up only ~0.16% of the image area (1,600 of 1,024,000 pixels), making subtle visual details (e.g. the direction an icon is facing) nearly impossible to judge reliably.

The fix is simple: crop the screenshot to the relevant region before sending it to the model. This PR makes that possible without any changes to prompts or the model-calling layer.
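The area argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch: `Size` and `areaPercent` are hypothetical names, not part of the PR, and the 1280×100 strip is the crop used in the reproduction data.

```typescript
interface Size {
  width: number;
  height: number;
}

// Percentage of the image area covered by the target element.
function areaPercent(target: Size, image: Size): number {
  return (100 * target.width * target.height) / (image.width * image.height);
}

const icon: Size = { width: 40, height: 40 };
const fullScreen: Size = { width: 1280, height: 800 };
const croppedStrip: Size = { width: 1280, height: 100 };

console.log(areaPercent(icon, fullScreen).toFixed(2)); // "0.16"
console.log(areaPercent(icon, croppedStrip).toFixed(2)); // "1.25"
```

Cropping to the 1280×100 strip makes the same icon occupy eight times as much of the image the model sees.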


Changes

New field: searchArea on ServiceExtractOption

All extract-type methods (aiAssert, aiBoolean, aiQuery, aiNumber, aiString) accept an optional searchArea: Rect. When provided, the screenshot is cropped to that region before being sent to the model. When screenshotIncluded is false, the crop is skipped because there is no image to forward.

Scenario: you already know the coordinates (TypeScript)

// Assert on a small icon at a known position
await agent.aiAssert('The eyes are facing right', 'IP mascot eyes are not facing right', {
  searchArea: { left: 580, top: 20, width: 120, height: 80 },
});

// Boolean check on a badge counter
await agent.aiBoolean('The cart badge number is greater than 0', {
  searchArea: { left: 1100, top: 0, width: 180, height: 80 },
});

// Query within a specific panel
await agent.aiQuery('Extract the current price', {
  searchArea: { left: 0, top: 600, width: 400, height: 200 },
});
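The crop rule described for searchArea can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the PR's code: `ExtractOption`, `CropFn` and `imageForModel` are hypothetical names, and the injected `crop` callback merely stands in for the existing `cropByRect` utility, whose real signature is not shown in this description.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical subset of the extract options discussed above.
interface ExtractOption {
  searchArea?: Rect;
  screenshotIncluded?: boolean;
}

// Stand-in for a crop utility; the real cropByRect signature is an assumption here.
type CropFn = (imageBase64: string, rect: Rect) => string;

// Returns the image to forward to the model, or undefined when no image is sent.
function imageForModel(
  imageBase64: string,
  opt: ExtractOption,
  crop: CropFn,
): string | undefined {
  // No screenshot is forwarded at all, so there is nothing to crop.
  if (opt.screenshotIncluded === false) return undefined;
  // No searchArea: forward the full screenshot unchanged.
  if (!opt.searchArea) return imageBase64;
  // searchArea set: forward only the cropped region.
  return crop(imageBase64, opt.searchArea);
}
```

Passing the crop function in as a parameter keeps the decision logic testable without touching real image data.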

New field: locatePrompt on AgentAssertOpt

For cases where the exact rect is unknown, locatePrompt accepts a natural-language description. Internally, aiAssert calls aiLocate (which supports deepLocate for small or hard-to-find elements) to resolve the rect, then uses it as searchArea. This avoids the need for users to manually calculate screenshot-space coordinates.

Scenario: you don't know the coordinates, describe the target instead (TypeScript)

// aiLocate resolves the element rect automatically
await agent.aiAssert('The eyes are facing right', 'IP mascot eyes are not facing right', {
  locatePrompt: 'the green IP mascot character',
});
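The locatePrompt resolution described above might look roughly like the following. It is a hedged sketch, not the PR's implementation: `resolveSearchArea`, `AssertOpt`, `LocateFn` and `LocateResult` are hypothetical names, the locate result is assumed to expose a `rect` field, and letting an explicit searchArea take precedence over locatePrompt is an assumption the description does not confirm.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Assumed shape of a locate result; only rect is used in this sketch.
interface LocateResult {
  rect: Rect;
}

// Stand-in for agent.aiLocate, injected so the sketch stays self-contained.
type LocateFn = (prompt: string) => Promise<LocateResult>;

// Hypothetical subset of the assert options discussed above.
interface AssertOpt {
  searchArea?: Rect;
  locatePrompt?: string;
  screenshotIncluded?: boolean;
}

async function resolveSearchArea(
  opt: AssertOpt,
  locate: LocateFn,
): Promise<Rect | undefined> {
  // Without a screenshot, both the locate call and the crop are skipped.
  if (opt.screenshotIncluded === false) return undefined;
  // Assumption: an explicit rect wins over a natural-language description.
  if (opt.searchArea) return opt.searchArea;
  if (!opt.locatePrompt) return undefined;
  // Resolve the description to a rect, then use it as the search area.
  const { rect } = await locate(opt.locatePrompt);
  return rect;
}
```

The returned rect would then be passed along as searchArea, so the extract pipeline itself does not need to know whether the coordinates were typed by hand or located by the model.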

Scenario: YAML scripts — locatePrompt is the only practical option

In YAML there is no way to run aiLocate and pass its result to a subsequent step, and writing raw screenshot-space coordinates by hand is error-prone. locatePrompt solves both problems:

- aiAssert: The small IP mascot's eyes are facing right
  locatePrompt: the green IP mascot character
  errorMessage: The IP mascot's eyes did not turn to the right
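To illustrate how such a YAML step could reach the TypeScript API, the sketch below maps a parsed flow item onto the argument list used in the aiAssert examples earlier. `FlowItemAIAssert` and `toAssertArgs` are hypothetical names that mirror only the three keys shown in the YAML; the actual YAML player in the repository may dispatch differently.

```typescript
// Hypothetical mirror of the YAML keys shown above.
interface FlowItemAIAssert {
  aiAssert: string;
  errorMessage?: string;
  locatePrompt?: string;
}

// Argument tuple in the (assertion, message, options) order used by the examples.
type AssertArgs = [
  assertion: string,
  msg: string | undefined,
  opt: { locatePrompt?: string },
];

function toAssertArgs(item: FlowItemAIAssert): AssertArgs {
  // Only forward locatePrompt when the YAML step actually sets it.
  const opt = item.locatePrompt ? { locatePrompt: item.locatePrompt } : {};
  return [item.aiAssert, item.errorMessage, opt];
}

const args = toAssertArgs({
  aiAssert: "The small IP mascot's eyes are facing right",
  locatePrompt: 'the green IP mascot character',
  errorMessage: "The IP mascot's eyes did not turn to the right",
});
console.log(args[2]); // { locatePrompt: 'the green IP mascot character' }
```

Because the step carries only a description, the YAML author never has to write screenshot-space coordinates.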

Implementation

| File | Change |
| --- | --- |
| packages/core/src/yaml.ts | ServiceExtractOption adds searchArea?: Rect; MidsceneYamlFlowItemAIAssert adds locatePrompt?: string |
| packages/core/src/types.ts | AgentAssertOpt adds locatePrompt?: string |
| packages/core/src/agent/agent.ts | aiAssert resolves locatePrompt → aiLocate → searchArea; both are skipped when screenshotIncluded is false |
| packages/core/src/ai-model/inspect.ts | AiExtractElementInfo crops screenshot via existing cropByRect when searchArea is set and screenshotIncluded !== false |

No breaking changes — all new fields are optional.


Validation

  • pnpm run lint
  • npx nx build @midscene/core
  • npx nx test @midscene/core — 825 passed

New test file packages/core/tests/unit-test/aiassert-search-area.test.ts covers:

  • No crop when searchArea is absent
  • cropByRect called with correct rect when searchArea is provided
  • No crop when screenshotIncluded: false even if searchArea is set
  • Cropped imageBase64 forwarded to model instead of original full screenshot

Add optional `region` (Rect) to `ServiceExtractOption` and `focusLocate`
(string) to `AgentAssertOpt` / `MidsceneYamlFlowItemAIAssert`. When a
region is provided the screenshot is cropped before being sent to the
model, improving accuracy on small target elements from ~10% to ~70%.

- `region` is forwarded through the extract pipeline and applied in
  `AiExtractElementInfo` via the existing `cropByRect` utility.
- `focusLocate` triggers an `aiLocate` call first to derive the rect
  automatically from a natural-language description.
- No breaking changes: all new options are optional.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>