Feat: browser visual pixel mode #3767

nitpicker55555 · 2026-02-01T21:36:33Z

Description

Based on Google search experiments across four paginated news result pages, the DOM-based interaction mode consumes approximately 2.5× more tokens than the current pure pixel-based visual interaction mode. For the latest multimodal models, including GPT-5.2, Gemini 2.5 Pro/Flash, and Gemini 3 Pro/Flash, image and text tokens are priced identically.

This PR introduces two key improvements: a screenshot pixel scale and a misclick fallback mechanism. When a click lands on a non-canvas element and no snapshot change is detected, which indicates an incorrect click position, the system invokes the original get_som_screenshot element coordinate function to retrieve the five DOM elements closest to the click location and passes them to the agent so it can adjust the position.
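To illustrate the fallback idea, here is a minimal Python sketch of ranking elements by distance to a click point. It is not the PR's implementation (that lives in the TypeScript layer): `ElementBox`, `distance_to_element`, and `find_nearest_elements` are hypothetical names, and measuring distance to the nearest edge of the bounding box (zero when the click is inside it) is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ElementBox:
    """Hypothetical element record: a ref plus its bounding box in viewport pixels."""
    ref: str
    x: float
    y: float
    width: float
    height: float


def distance_to_element(px: float, py: float, el: ElementBox) -> float:
    """Distance from (px, py) to the nearest edge of the box; 0 if the point is inside."""
    dx = max(el.x - px, 0.0, px - (el.x + el.width))
    dy = max(el.y - py, 0.0, py - (el.y + el.height))
    return math.hypot(dx, dy)


def find_nearest_elements(
    px: float, py: float, elements: List[ElementBox], count: int = 5
) -> List[ElementBox]:
    """Return up to `count` elements, closest to the click point first."""
    ranked = sorted(elements, key=lambda el: distance_to_element(px, py, el))
    return ranked[: max(0, count)]


if __name__ == "__main__":
    boxes = [
        ElementBox("e1", 40, 40, 20, 20),
        ElementBox("e2", 100, 50, 10, 10),
        ElementBox("e3", 0, 0, 10, 10),
    ]
    for el in find_nearest_elements(50, 50, boxes, count=2):
        print(el.ref, round(distance_to_element(50, 50, el), 1))
```

The agent would then receive the refs and positions of the returned elements as a hint for its next click.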

Experiments show that without the screenshot pixel scale, GPT-5.2 and Gemini-2.5-Pro often fail to click the Google search box reliably. With the pixel scale enabled, the success rate approaches 100%.
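As a rough sketch of what a pixel scale involves, the snippet below computes where ruler ticks and their labels would fall along one axis of a screenshot. The helper name `ruler_ticks` is hypothetical, the 100-pixel default interval is an assumption, and the actual drawing onto the image is omitted.

```python
from typing import List, Tuple


def ruler_ticks(length: int, tick_interval: int = 100) -> List[Tuple[int, str]]:
    """Return (pixel position, label) pairs for one ruler axis of the given length."""
    if tick_interval <= 0:
        raise ValueError("tick_interval must be positive")
    return [(pos, str(pos)) for pos in range(0, length, tick_interval)]


if __name__ == "__main__":
    # Example viewport size; any width/height works the same way.
    width, height = 1280, 720
    print("x ticks:", ruler_ticks(width))
    print("y ticks:", ruler_ticks(height))
```

Labelled ticks like these give the model explicit reference points for estimating click coordinates, rather than leaving it to infer positions from the raw image.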

The misclick fallback mechanism is also critical. Using only a screenshot of the clicked position as error feedback is ineffective and typically requires multiple iterations for correction. In contrast, identifying the nearest DOM elements to the click position significantly improves fallback efficiency and greatly increases the success rate.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

I have read the CONTRIBUTION guide (required)
I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
I have updated the tests accordingly (required for a bug fix or a new feature)
I have updated the documentation if needed:
I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

coderabbitai · 2026-02-01T21:36:42Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR extends the hybrid browser toolkit with pixel-based interaction modes, adds ruler overlay visualization for full visual mode screenshots, implements cross-frame element location and coordinate enrichment, introduces nearest-element detection for click feedback, and refactors APIs (browser_click, browser_type, browser_mouse_drag, mouse_drag) to support both reference-based and coordinate-based inputs across Python and TypeScript layers with mode-aware wrappers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Python Toolkit - Full Visual Mode & Mode-Aware Wrappers `camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py` | Added `_add_rulers_to_image` for rendering horizontal/vertical ruler guides on images; extended `browser_get_screenshot` and `browser_get_som_screenshot` to process images with rulers and return `ToolResult` when `read_image` is enabled. Introduced `_TOOLS_EXCLUDED_IN_VISUAL_MODE` and `_create_mode_wrapper` to dynamically generate mode-specific tool wrappers. Extended `browser_click`, `browser_type`, and `browser_mouse_drag` to support both reference-based and pixel-coordinate inputs with validation. Replaced internal `_send_command` calls with `batch_keyboard_input` for keyboard operations. Modified `get_tools` to apply mode-specific wrappers and exclude tools in full visual mode. |
| TypeScript Browser Session - Cross-Frame Lookup & Coordinate Enrichment `camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts` | Added cross-frame element search utilities (`parseFrameRef`, `tryFindInFrame`, `findElementAcrossFrames`) to locate elements in iframes with fallback to full-frame search. Introduced coordinate enrichment workflow (`getElementCoordinates`) with batched parallelized processing to update element metadata with coordinates and frame context. Added proximity utilities (`distanceToElement`, `findNearestElements`, `parseElementNamesFromSnapshot`, `formatNearestElementsMessage`). Enhanced click/type/select/drag/mouse_control action implementations to use cross-frame resolution and return feedback (diffSnapshot or resultMessage) when interactions have no effect. Extended `takeScreenshot` return value to include viewport dimensions. |
| TypeScript Types & Interactive Element Detection `camel/toolkits/hybrid_browser_toolkit/ts/src/types.ts` | Added `NearestElementInfo` interface for nearest element metadata; exported `INTERACTIVE_ROLES` constant defining interactive element roles for click detection. Made `MouseDragAction.from_ref` and `to_ref` optional and added pixel-coordinate fields (`from_x`, `from_y`, `to_x`, `to_y`) for dual-mode drag support. |
| TypeScript Toolkit & SOM Screenshot `camel/toolkits/hybrid_browser_toolkit/ts/src/hybrid-browser-toolkit.ts`, `camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts` | Added `getScreenshot` method on HybridBrowserToolkit with viewport and timing metadata. Enhanced `parseClickableElements` to detect interactive elements via `[ref=...]` patterns, cursor styling, and `INTERACTIVE_ROLES`. Refactored `mouseDrag` signature to accept flexible parameter object supporting both ref-based and coordinate-based inputs. Updated `som-screenshot-injected.ts` with iframe-aware visibility logic for enriched elements; removed filter timing measurements. |
| Configuration & WebSocket Communication `camel/toolkits/hybrid_browser_toolkit/ts/src/config-loader.ts`, `camel/toolkits/hybrid_browser_toolkit/ts/websocket-server.js` | Added `nearestElementsCount` field to `WebSocketConfig` interface with default value of 5. Added `get_screenshot` command handler in WebSocket server. Refactored `mouse_drag` command to accept single params object instead of separate from_ref/to_ref arguments. |
| Python WebSocket Wrapper `camel/toolkits/hybrid_browser_toolkit/ws_wrapper.py` | Refactored `mouse_drag` signature to support dual-mode input (ref-based or coordinate-based) with validation. Added new `batch_keyboard_input` method for executing batched keyboard operations. Updated `get_screenshot` and `get_som_screenshot` to return `ToolResult` objects constructed from response text and images; updated logging and docstrings accordingly. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant TS as TypeScript<br/>Browser Session
    participant CrossFrame as Cross-Frame<br/>Lookup
    participant Coord as Coordinate<br/>Enrichment
    participant Click as Click<br/>Executor
    participant Feedback as Feedback<br/>Generator

    User->>TS: Click action (ref or x,y)
    TS->>CrossFrame: Parse ref or search by position
    CrossFrame->>CrossFrame: tryFindInFrame (frame-prefixed)
    CrossFrame->>CrossFrame: findElementAcrossFrames (fallback)
    CrossFrame-->>TS: Element found + frame context
    
    TS->>Coord: Get element coordinates & viewport
    Coord->>Coord: Batch parallelize coordinate retrieval
    Coord-->>TS: Elements with x, y, frameIndex, frameUrl
    
    TS->>Click: Execute click at resolved position
    Click-->>TS: Click result / no visible effect
    
    alt Click had effect
        TS-->>User: Return success + new snapshot
    else Ineffective click
        TS->>Feedback: Find nearest interactive elements
        Feedback->>Feedback: Calculate distance to candidates
        Feedback-->>TS: Nearest elements list
        TS-->>User: Return guidance message + nearest elements
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix: browser_data_dir #3097: Overlapping changes to mouse_drag signature, pixel coordinate support, and screenshot/get_screenshot wiring in hybrid browser toolkit.
enhance:hybrid browser #2825: Direct code-level overlap in mouse_drag/mouseDrag signatures with pixel coordinates, INTERACTIVE_ROLES integration, and screenshot/get_screenshot pipeline modifications.
enhance: browser_som_screenshot #3059: Shared modifications to fullVisualMode/SOM screenshot pipeline including config-loader and related type/flag wiring.

Suggested labels

enhancement, Tool

Suggested reviewers

hu-xianglong
Wendong-Fan
Ishneet0710

Poem

🐰 A rabbit hops through frames with grace,
Rulers mark each pixel place,
Pixels now speak where refs once did dwell,
Nearest helpers answer when clicks don't land well,
Visual mode blooms—what a tale to tell! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Feat: browser visual pixel mode' directly summarizes the main feature being introduced across multiple files—support for pixel-based coordinate interactions alongside existing ref-based interactions in the browser toolkit. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 96.43% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, explaining the browser visual pixel mode feature with specific improvements like screenshot pixel scale and misclick fallback mechanism. |

nitpicker55555 · 2026-02-01T21:38:43Z

@coderabbitai review

coderabbitai · 2026-02-01T21:38:52Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts`:
- Around line 701-722: When enriching elements with coordinates in the
includeCoordinates block, if parseFrameRef(ref) yields frameIndex 0 but
findElementAcrossFrames(ref) returns a locator from a non-main frame, derive and
store the actual frame info from the resolved result instead of relying solely
on the ref format; update the logic inside the Promise.all mapping (where
findElementAcrossFrames, getElementCoordinates, parseFrameRef, and
playwrightMapping are used) to set frameIndex and frameUrl from result.frame
(e.g., detect whether result.frame is mainFrame or get its index/URL) whenever
the parsed frameIndex is missing/zero so downstream iframe visibility handling
receives correct frame context.
- Around line 1543-1595: The pixel-mode branch in performMouseDrag currently
uses raw from_x/from_y/to_x/to_y without bounds validation; add a check (similar
to mouse_control) that retrieves the page viewport (e.g., via
page.viewportSize() or your existing viewport helper) and clamps or rejects
coordinates outside [0,width) and [0,height) before calling showClickHighlight
and performing the drag; if coordinates are out of bounds return a clear error
from performMouseDrag, otherwise proceed as before.

In `@camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts`:
- Around line 49-67: The current logic in the visibility check (using
elementInfo.frameIndex/frameUrl, coords, and document.elementsFromPoint)
defaults to returning 'visible' when no iframe is found at the element center,
which can produce false positives; change the fallback so that when
elementInfo?.frameIndex > 0 || elementInfo?.frameUrl is true but no IFRAME
element is found at centerX/centerY, you return 'partial' (or 'hidden' per
policy) instead of 'visible', while still returning 'visible' immediately if an
IFRAME element is detected; update the branch in som-screenshot-injected.ts that
uses elementInfo, coords, centerX/centerY, and elementsAtCenter to implement
this fallback.

In `@camel/toolkits/hybrid_browser_toolkit/ws_wrapper.py`:
- Around line 912-936: The mouse_drag implementation allows mixed inputs where
refs and coordinates are both provided (the coordinate branch currently wins);
change mouse_drag to enforce exclusivity by detecting when from_ref or to_ref is
provided together with any of from_x, from_y, to_x, to_y and raising a
ValueError, keep the existing branches for the two valid cases (both from_ref
and to_ref OR all four coordinates) and then call
self._send_command('mouse_drag', params) as before.
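
The exclusivity rule requested for `mouse_drag` could be sketched as below. This is a standalone illustration, not the code in `ws_wrapper.py`: the helper name `build_mouse_drag_params` is hypothetical, and the parameter names simply mirror those mentioned in the review.

```python
from typing import Any, Dict, Optional


def build_mouse_drag_params(
    from_ref: Optional[str] = None,
    to_ref: Optional[str] = None,
    from_x: Optional[float] = None,
    from_y: Optional[float] = None,
    to_x: Optional[float] = None,
    to_y: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate drag inputs and return the params dict for exactly one mode."""
    coords = (from_x, from_y, to_x, to_y)
    has_ref = from_ref is not None or to_ref is not None
    has_coord = any(c is not None for c in coords)

    # Reject mixed input instead of letting one mode silently win.
    if has_ref and has_coord:
        raise ValueError("Provide either refs or pixel coordinates, not both")

    # Ref mode: both endpoints must be given.
    if from_ref is not None and to_ref is not None:
        return {"from_ref": from_ref, "to_ref": to_ref}

    # Pixel mode: all four coordinates must be given.
    if all(c is not None for c in coords):
        return {"from_x": from_x, "from_y": from_y, "to_x": to_x, "to_y": to_y}

    raise ValueError(
        "Provide both from_ref and to_ref, or all of from_x, from_y, to_x, to_y"
    )
```

In the wrapper, the returned dict would then be passed to `self._send_command('mouse_drag', params)` as the review describes.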

🧹 Nitpick comments (9)

camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts (1)

548-549: Timing/filtered metrics are now hard-coded to zero.

This makes telemetry look like no filtering happened. If these metrics still matter, consider passing real values from the caller or removing the fields to avoid confusion.

camel/toolkits/hybrid_browser_toolkit/ts/src/config-loader.ts (1)

223-227: Clamp nearestElementsCount to a non‑negative integer.

Negative/NaN values can lead to surprising slices downstream. Consider normalizing on ingest.
💡 Suggested change
-    if (config.nearestElementsCount !== undefined) wsConfig.nearestElementsCount = config.nearestElementsCount;
+    if (config.nearestElementsCount !== undefined) {
+      const count = Number(config.nearestElementsCount);
+      if (Number.isFinite(count)) {
+        wsConfig.nearestElementsCount = Math.max(0, Math.floor(count));
+      }
+    }
camel/toolkits/hybrid_browser_toolkit/ts/src/hybrid-browser-toolkit.ts (1)

235-263: Clarify coordinate guidance when full‑page screenshots are enabled.

If fullPageScreenshot is true, the image can exceed viewport size but the message still advertises viewport bounds. Consider explicitly stating that coordinates remain viewport‑relative (or include full‑page dimensions) to prevent misclicks.

camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts (1)

1460-1534: Skip expensive snapshotting when nearest‑element suggestions are disabled.

Right now every non‑canvas click builds a full snapshot+coordinates even if nearestElementsCount is 0. Consider gating that work on nearestElementsCount > 0 to reduce latency in non‑visual mode.

camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py (5)

82-91: Font type annotation uses Python 3.10+ union syntax.

The type annotation ImageFont.FreeTypeFont | ImageFont.ImageFont on line 82 uses the union syntax that requires Python 3.10+ at runtime (unless from __future__ import annotations is present at module level, which it is not). Consider using Union from typing for broader compatibility.

Additionally, the font fallback chain assumes specific system paths. The macOS-specific path /System/Library/Fonts/Helvetica.ttc won't exist on Linux servers where this toolkit might commonly run.
♻️ Suggested compatibility improvement
+from typing import Union
+
 def _add_rulers_to_image(
     image_bytes: bytes, tick_interval: int = 100
 ) -> bytes:
     ...
     # Try to load a font, fall back to default if not available
-    font: ImageFont.FreeTypeFont | ImageFont.ImageFont
+    font: Union[ImageFont.FreeTypeFont, ImageFont.ImageFont]
     try:
-        font = ImageFont.truetype(
-            "/System/Library/Fonts/Helvetica.ttc", font_size
-        )
+        # Try common cross-platform font paths
+        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
     except (IOError, OSError):
         try:
-            font = ImageFont.truetype("arial.ttf", font_size)
+            font = ImageFont.truetype("/System/Library/Fonts/Helvetica.ttc", font_size)
         except (IOError, OSError):
-            font = ImageFont.load_default()
+            try:
+                font = ImageFont.truetype("arial.ttf", font_size)
+            except (IOError, OSError):
+                font = ImageFont.load_default()
929-942: Potential bug: processed_images may be empty when checking images to return.

If result.images has images but none of them starts with 'data:image/png;base64,' (line 902), processed_images will remain empty. The condition at lines 935-938 would then fall back to result.images, which is correct, but the logic is fragile.

Also, the loop iterates with enumerate but discards the index and breaks after the first match. A direct approach would be clearer.
♻️ Suggested simplification
                 # Process images and optionally add rulers
                 processed_images = []
-                for _, image_data in enumerate(result.images):
+                for image_data in result.images:
                     if image_data.startswith('data:image/png;base64,'):
                         base64_data = image_data.split(',', 1)[1]
                         image_bytes = base64.b64decode(base64_data)
                         # ... rest of processing ...
                         break
+                    # Handle non-PNG images if needed
973-980: Validation logic is correct but could be more explicit.

The validation correctly requires either ref or both x and y. The current priority (x,y check first) means pixel mode takes precedence when all three are provided, which may not be intentional.
♻️ Optional: Add explicit conflict check
             if x is not None and y is not None:
+                if ref is not None:
+                    logger.warning(
+                        "Both ref and x,y provided to browser_click; using pixel mode"
+                    )
                 result = await ws_wrapper.mouse_control('click', x, y)
             elif ref is not None:
                 result = await ws_wrapper.click(ref)
1016-1022: Hardcoded sleep duration for focus wait may be unreliable.

The asyncio.sleep(0.1) at line 1019 is a fixed delay that may be insufficient for slow-loading pages or excessive for fast ones. Consider making this configurable or using a more robust focus detection mechanism.
♻️ Suggested improvement
             if x is not None and y is not None and text is not None:
                 # Pixel mode: click to focus, then type using keyboard
                 await ws_wrapper.mouse_control('click', x, y)
-                await asyncio.sleep(0.1)  # Wait for focus
+                # Wait for focus - this delay may need tuning based on page responsiveness
+                await asyncio.sleep(0.15)
                 result = await ws_wrapper.batch_keyboard_input(
                     [{"type": "type", "text": text, "delay": 0}]
                 )
Alternatively, consider adding a focus_delay parameter to the method signature for configurability.
946-948: Consider using logger.exception for better error diagnostics.

As flagged by static analysis, logger.exception automatically includes the stack trace, which is helpful for debugging screenshot capture failures.
♻️ Suggested improvement
         except Exception as e:
-            logger.error(f"Failed to get screenshot: {e}")
+            logger.exception("Failed to get screenshot")
             return f"Error capturing screenshot: {e}"

Wendong-Fan

thanks @nitpicker55555 !

nitpicker55555 added 12 commits January 27, 2026 01:28

optimize code

0936194

optimize code

5cc7c31

highlight coordinate

0f20ef0

update viewport coodinates and fix logs

9b0b448

add ruler in screenshot

274a4f9

chore:som screeshot cross frame

27b5b34

chore:som screeshot cross frame

e7866fa

chore:som screeshot cross frame parallel

0caffaf

find closet 5 elements in misclick

2a4c752

chore code

3c3b643

chore code

5f3c739

chore pre-commit

27cc9a3

nitpicker55555 self-assigned this Feb 1, 2026

Merge branch 'master' into feat/browser_visual_pixel_mode

c5b6b95

github-actions bot added the Review Required PR need to be reviewed label Feb 1, 2026

coderabbitai bot reviewed Feb 1, 2026

chore

5f643c9

nitpicker55555 linked an issue Feb 1, 2026 that may be closed by this pull request

[Feature Request] Browser visual Interaction based on pixels #3680

Closed

2 tasks

nitpicker55555 and others added 3 commits February 2, 2026 21:25

fix element seeable bug

2bcc215

fix element seeable bug

9a678c4

Merge branch 'master' into feat/browser_visual_pixel_mode

68d2a8d

Wendong-Fan approved these changes Feb 3, 2026

Wendong-Fan added this to the Sprint 48 milestone Feb 3, 2026

Wendong-Fan removed the Review Required PR need to be reviewed label Feb 3, 2026

Merge branch 'master' into feat/browser_visual_pixel_mode

0897105

Wendong-Fan merged commit 400c01b into master Feb 3, 2026
11 of 12 checks passed

Wendong-Fan deleted the feat/browser_visual_pixel_mode branch February 3, 2026 21:36

2 participants
