@nitpicker55555 nitpicker55555 commented Feb 1, 2026

Description

Based on Google search experiments across four paginated news result pages, the DOM-based interaction mode consumes approximately 2.5× more tokens than the current pure pixel-based visual interaction mode. For the latest multimodal models, including GPT-5.2, Gemini 2.5 Pro/Flash, and Gemini 3 Pro/Flash, image and text tokens are priced identically, so the difference in token count carries over directly to cost.

This PR introduces two key improvements: a pixel scale (ruler) overlaid on screenshots and a misclick fallback mechanism. When a click lands on a non-canvas element and no snapshot change is detected, which indicates an incorrect click position, the system calls the existing get_som_screenshot element-coordinate function to retrieve the five DOM elements closest to the click location and returns them to the agent so it can adjust the position.
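The nearest-element lookup behind the fallback can be sketched as follows. This is a minimal illustration, not the PR's implementation: `ElementBox`, `distance_to_element`, and `find_nearest_elements` are hypothetical Python names, and the sketch assumes each element is described by a bounding box in viewport pixels.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class ElementBox:
    """Hypothetical element record: a ref plus a bounding box in viewport pixels."""
    ref: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def distance_to_element(px: float, py: float, el: ElementBox) -> float:
    """Distance from a point to the closest point of the box (0 if the point is inside)."""
    dx = max(el.x - px, 0.0, px - (el.x + el.width))
    dy = max(el.y - py, 0.0, py - (el.y + el.height))
    return hypot(dx, dy)


def find_nearest_elements(
    px: float, py: float, elements: List[ElementBox], count: int = 5
) -> List[ElementBox]:
    """Return up to `count` elements, ordered by distance to the click position."""
    ranked = sorted(elements, key=lambda el: distance_to_element(px, py, el))
    return ranked[: max(0, count)]
```

Measuring to the box edge rather than the box center means a click just outside a wide element, such as a search box, still ranks that element first.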

Experiments show that without the screenshot pixel scale, GPT-5.2 and Gemini 2.5 Pro often fail to click the Google search box. With the pixel scale enabled, the success rate approaches 100%.
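As a rough illustration of what the pixel scale gives the model, the sketch below computes where ruler ticks would fall for a given screenshot size. It only covers tick placement; the actual drawing in `_add_rulers_to_image` is not reproduced here. The helper names are hypothetical, and the assumption is that ticks start at pixel 0 and repeat every `tick_interval` pixels (100 is the default in the reviewed signature).

```python
from typing import Dict, List


def ruler_ticks(length: int, tick_interval: int = 100) -> List[int]:
    """Pixel offsets of ruler ticks along one axis of the given length."""
    if length <= 0 or tick_interval <= 0:
        return []
    return list(range(0, length, tick_interval))


def ruler_layout(
    width: int, height: int, tick_interval: int = 100
) -> Dict[str, List[int]]:
    """Tick offsets for the horizontal (x) and vertical (y) rulers of a screenshot."""
    return {
        "x": ruler_ticks(width, tick_interval),
        "y": ruler_ticks(height, tick_interval),
    }
```

With labeled ticks on both axes, the model can read a target's coordinates off the nearest labels instead of estimating absolute pixel positions from the image alone.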

The misclick fallback mechanism is also critical. Using only a screenshot of the clicked position as error feedback is ineffective and typically requires multiple iterations for correction. In contrast, identifying the nearest DOM elements to the click position significantly improves fallback efficiency and greatly increases the success rate.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

@nitpicker55555 nitpicker55555 self-assigned this Feb 1, 2026
coderabbitai bot commented Feb 1, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR extends the hybrid browser toolkit with pixel-based interaction modes, adds ruler overlay visualization for full visual mode screenshots, implements cross-frame element location and coordinate enrichment, introduces nearest-element detection for click feedback, and refactors APIs (browser_click, browser_type, browser_mouse_drag, mouse_drag) to support both reference-based and coordinate-based inputs across Python and TypeScript layers with mode-aware wrappers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Python Toolkit - Full Visual Mode & Mode-Aware Wrappers: camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py | Added _add_rulers_to_image for rendering horizontal/vertical ruler guides on images; extended browser_get_screenshot and browser_get_som_screenshot to process images with rulers and return ToolResult when read_image is enabled. Introduced _TOOLS_EXCLUDED_IN_VISUAL_MODE and _create_mode_wrapper to dynamically generate mode-specific tool wrappers. Extended browser_click, browser_type, and browser_mouse_drag to support both reference-based and pixel-coordinate inputs with validation. Replaced internal _send_command calls with batch_keyboard_input for keyboard operations. Modified get_tools to apply mode-specific wrappers and exclude tools in full visual mode. |
| TypeScript Browser Session - Cross-Frame Lookup & Coordinate Enrichment: camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts | Added cross-frame element search utilities (parseFrameRef, tryFindInFrame, findElementAcrossFrames) to locate elements in iframes with fallback to full-frame search. Introduced coordinate enrichment workflow (getElementCoordinates) with batched parallelized processing to update element metadata with coordinates and frame context. Added proximity utilities (distanceToElement, findNearestElements, parseElementNamesFromSnapshot, formatNearestElementsMessage). Enhanced click/type/select/drag/mouse_control action implementations to use cross-frame resolution and return feedback (diffSnapshot or resultMessage) when interactions have no effect. Extended takeScreenshot return value to include viewport dimensions. |
| TypeScript Types & Interactive Element Detection: camel/toolkits/hybrid_browser_toolkit/ts/src/types.ts | Added NearestElementInfo interface for nearest element metadata; exported INTERACTIVE_ROLES constant defining interactive element roles for click detection. Made MouseDragAction.from_ref and to_ref optional and added pixel-coordinate fields (from_x, from_y, to_x, to_y) for dual-mode drag support. |
| TypeScript Toolkit & SOM Screenshot: camel/toolkits/hybrid_browser_toolkit/ts/src/hybrid-browser-toolkit.ts, camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts | Added getScreenshot method on HybridBrowserToolkit with viewport and timing metadata. Enhanced parseClickableElements to detect interactive elements via [ref=...] patterns, cursor styling, and INTERACTIVE_ROLES. Refactored mouseDrag signature to accept flexible parameter object supporting both ref-based and coordinate-based inputs. Updated som-screenshot-injected.ts with iframe-aware visibility logic for enriched elements; removed filter timing measurements. |
| Configuration & WebSocket Communication: camel/toolkits/hybrid_browser_toolkit/ts/src/config-loader.ts, camel/toolkits/hybrid_browser_toolkit/ts/websocket-server.js | Added nearestElementsCount field to WebSocketConfig interface with default value of 5. Added get_screenshot command handler in WebSocket server. Refactored mouse_drag command to accept single params object instead of separate from_ref/to_ref arguments. |
| Python WebSocket Wrapper: camel/toolkits/hybrid_browser_toolkit/ws_wrapper.py | Refactored mouse_drag signature to support dual-mode input (ref-based or coordinate-based) with validation. Added new batch_keyboard_input method for executing batched keyboard operations. Updated get_screenshot and get_som_screenshot to return ToolResult objects constructed from response text and images; updated logging and docstrings accordingly. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant TS as TypeScript<br/>Browser Session
    participant CrossFrame as Cross-Frame<br/>Lookup
    participant Coord as Coordinate<br/>Enrichment
    participant Click as Click<br/>Executor
    participant Feedback as Feedback<br/>Generator

    User->>TS: Click action (ref or x,y)
    TS->>CrossFrame: Parse ref or search by position
    CrossFrame->>CrossFrame: tryFindInFrame (frame-prefixed)
    CrossFrame->>CrossFrame: findElementAcrossFrames (fallback)
    CrossFrame-->>TS: Element found + frame context
    
    TS->>Coord: Get element coordinates & viewport
    Coord->>Coord: Batch parallelize coordinate retrieval
    Coord-->>TS: Elements with x, y, frameIndex, frameUrl
    
    TS->>Click: Execute click at resolved position
    Click-->>TS: Click result / no visible effect
    
    alt Click had effect
        TS-->>User: Return success + new snapshot
    else Ineffective click
        TS->>Feedback: Find nearest interactive elements
        Feedback->>Feedback: Calculate distance to candidates
        Feedback-->>TS: Nearest elements list
        TS-->>User: Return guidance message + nearest elements
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fix: browser_data_dir #3097: Overlapping changes to mouse_drag signature, pixel coordinate support, and screenshot/get_screenshot wiring in hybrid browser toolkit.
  • enhance:hybrid browser #2825: Direct code-level overlap in mouse_drag/mouseDrag signatures with pixel coordinates, INTERACTIVE_ROLES integration, and screenshot/get_screenshot pipeline modifications.
  • enhance: browser_som_screenshot #3059: Shared modifications to fullVisualMode/SOM screenshot pipeline including config-loader and related type/flag wiring.

Suggested labels

enhancement, Tool

Suggested reviewers

  • hu-xianglong
  • Wendong-Fan
  • Ishneet0710

Poem

🐰 A rabbit hops through frames with grace,
Rulers mark each pixel place,
Pixels now speak where refs once did dwell,
Nearest helpers answer when clicks don't land well,
Visual mode blooms—what a tale to tell! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Feat: browser visual pixel mode' directly summarizes the main feature being introduced across multiple files: support for pixel-based coordinate interactions alongside existing ref-based interactions in the browser toolkit. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 96.43% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, explaining the browser visual pixel mode feature with specific improvements like screenshot pixel scale and misclick fallback mechanism. |

@github-actions github-actions bot added the Review Required PR need to be reviewed label Feb 1, 2026
@nitpicker55555

@coderabbitai review

coderabbitai bot commented Feb 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts`:
- Around line 701-722: When enriching elements with coordinates in the
includeCoordinates block, if parseFrameRef(ref) yields frameIndex 0 but
findElementAcrossFrames(ref) returns a locator from a non-main frame, derive and
store the actual frame info from the resolved result instead of relying solely
on the ref format; update the logic inside the Promise.all mapping (where
findElementAcrossFrames, getElementCoordinates, parseFrameRef, and
playwrightMapping are used) to set frameIndex and frameUrl from result.frame
(e.g., detect whether result.frame is mainFrame or get its index/URL) whenever
the parsed frameIndex is missing/zero so downstream iframe visibility handling
receives correct frame context.
- Around line 1543-1595: The pixel-mode branch in performMouseDrag currently
uses raw from_x/from_y/to_x/to_y without bounds validation; add a check (similar
to mouse_control) that retrieves the page viewport (e.g., via
page.viewportSize() or your existing viewport helper) and clamps or rejects
coordinates outside [0,width) and [0,height) before calling showClickHighlight
and performing the drag; if coordinates are out of bounds return a clear error
from performMouseDrag, otherwise proceed as before.

In `@camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts`:
- Around line 49-67: The current logic in the visibility check (using
elementInfo.frameIndex/frameUrl, coords, and document.elementsFromPoint)
defaults to returning 'visible' when no iframe is found at the element center,
which can produce false positives; change the fallback so that when
elementInfo?.frameIndex > 0 || elementInfo?.frameUrl is true but no IFRAME
element is found at centerX/centerY, you return 'partial' (or 'hidden' per
policy) instead of 'visible', while still returning 'visible' immediately if an
IFRAME element is detected; update the branch in som-screenshot-injected.ts that
uses elementInfo, coords, centerX/centerY, and elementsAtCenter to implement
this fallback.

In `@camel/toolkits/hybrid_browser_toolkit/ws_wrapper.py`:
- Around line 912-936: The mouse_drag implementation allows mixed inputs where
refs and coordinates are both provided (the coordinate branch currently wins);
change mouse_drag to enforce exclusivity by detecting when from_ref or to_ref is
provided together with any of from_x, from_y, to_x, to_y and raising a
ValueError, keep the existing branches for the two valid cases (both from_ref
and to_ref OR all four coordinates) and then call
self._send_command('mouse_drag', params) as before.
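
The exclusivity rule requested for mouse_drag in the last item can be sketched as a small parameter builder. This is a hypothetical standalone helper, not the wrapper's actual code: `build_mouse_drag_params` and the error messages are assumptions, and the real method would go on to pass the result to `self._send_command('mouse_drag', params)`.

```python
from typing import Any, Dict, Optional


def build_mouse_drag_params(
    from_ref: Optional[str] = None,
    to_ref: Optional[str] = None,
    from_x: Optional[float] = None,
    from_y: Optional[float] = None,
    to_x: Optional[float] = None,
    to_y: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate drag inputs and return the params dict for exactly one mode."""
    coords = (from_x, from_y, to_x, to_y)
    has_ref = from_ref is not None or to_ref is not None
    has_coord = any(c is not None for c in coords)

    # Reject mixed input instead of silently letting one mode win.
    if has_ref and has_coord:
        raise ValueError("Provide either refs or pixel coordinates, not both")

    # Ref mode: both endpoints must be given.
    if from_ref is not None and to_ref is not None:
        return {"from_ref": from_ref, "to_ref": to_ref}

    # Pixel mode: all four coordinates must be given.
    if all(c is not None for c in coords):
        return {"from_x": from_x, "from_y": from_y, "to_x": to_x, "to_y": to_y}

    raise ValueError(
        "Provide both from_ref and to_ref, or all of from_x, from_y, to_x, to_y"
    )
```

Raising on mixed input makes the caller's mistake visible, whereas a silent precedence rule would hide it behind a drag that may target the wrong element.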
🧹 Nitpick comments (9)
camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts (1)

548-549: Timing/filtered metrics are now hard-coded to zero.

This makes telemetry look like no filtering happened. If these metrics still matter, consider passing real values from the caller or removing the fields to avoid confusion.

camel/toolkits/hybrid_browser_toolkit/ts/src/config-loader.ts (1)

223-227: Clamp nearestElementsCount to a non‑negative integer.

Negative/NaN values can lead to surprising slices downstream. Consider normalizing on ingest.

💡 Suggested change
-    if (config.nearestElementsCount !== undefined) wsConfig.nearestElementsCount = config.nearestElementsCount;
+    if (config.nearestElementsCount !== undefined) {
+      const count = Number(config.nearestElementsCount);
+      if (Number.isFinite(count)) {
+        wsConfig.nearestElementsCount = Math.max(0, Math.floor(count));
+      }
+    }
camel/toolkits/hybrid_browser_toolkit/ts/src/hybrid-browser-toolkit.ts (1)

235-263: Clarify coordinate guidance when full‑page screenshots are enabled.

If fullPageScreenshot is true, the image can exceed viewport size but the message still advertises viewport bounds. Consider explicitly stating that coordinates remain viewport‑relative (or include full‑page dimensions) to prevent misclicks.

camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts (1)

1460-1534: Skip expensive snapshotting when nearest‑element suggestions are disabled.

Right now every non‑canvas click builds a full snapshot+coordinates even if nearestElementsCount is 0. Consider gating that work on nearestElementsCount > 0 to reduce latency in non‑visual mode.

camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py (5)

82-91: Font type annotation uses Python 3.10+ union syntax.

The type annotation ImageFont.FreeTypeFont | ImageFont.ImageFont on line 82 uses the union syntax that requires Python 3.10+ at runtime (unless from __future__ import annotations is present at module level, which it is not). Consider using Union from typing for broader compatibility.

Additionally, the font fallback chain assumes specific system paths. The macOS-specific path /System/Library/Fonts/Helvetica.ttc won't exist on Linux servers where this toolkit might commonly run.

♻️ Suggested compatibility improvement
+from typing import Union
+
 def _add_rulers_to_image(
     image_bytes: bytes, tick_interval: int = 100
 ) -> bytes:
     ...
     # Try to load a font, fall back to default if not available
-    font: ImageFont.FreeTypeFont | ImageFont.ImageFont
+    font: Union[ImageFont.FreeTypeFont, ImageFont.ImageFont]
     try:
-        font = ImageFont.truetype(
-            "/System/Library/Fonts/Helvetica.ttc", font_size
-        )
+        # Try common cross-platform font paths
+        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
     except (IOError, OSError):
         try:
-            font = ImageFont.truetype("arial.ttf", font_size)
+            font = ImageFont.truetype("/System/Library/Fonts/Helvetica.ttc", font_size)
         except (IOError, OSError):
-            font = ImageFont.load_default()
+            try:
+                font = ImageFont.truetype("arial.ttf", font_size)
+            except (IOError, OSError):
+                font = ImageFont.load_default()

929-942: Potential bug: processed_images may be empty when checking images to return.

If result.images has images but the image doesn't start with 'data:image/png;base64,' (line 902), processed_images will remain empty. The condition at line 935-938 would then fall back to result.images, which is correct, but the logic is fragile.

Also, the loop iterates with enumerate but discards the index and breaks after the first match. A direct approach would be clearer.

♻️ Suggested simplification
                 # Process images and optionally add rulers
                 processed_images = []
-                for _, image_data in enumerate(result.images):
+                for image_data in result.images:
                     if image_data.startswith('data:image/png;base64,'):
                         base64_data = image_data.split(',', 1)[1]
                         image_bytes = base64.b64decode(base64_data)
                         # ... rest of processing ...
                         break
+                    # Handle non-PNG images if needed

973-980: Validation logic is correct but could be more explicit.

The validation correctly requires either ref or both x and y. The current priority (x,y check first) means pixel mode takes precedence when all three are provided, which may not be intentional.

♻️ Optional: Add explicit conflict check
             if x is not None and y is not None:
+                if ref is not None:
+                    logger.warning(
+                        "Both ref and x,y provided to browser_click; using pixel mode"
+                    )
                 result = await ws_wrapper.mouse_control('click', x, y)
             elif ref is not None:
                 result = await ws_wrapper.click(ref)

1016-1022: Hardcoded sleep duration for focus wait may be unreliable.

The asyncio.sleep(0.1) at line 1019 is a fixed delay that may be insufficient for slow-loading pages or excessive for fast ones. Consider making this configurable or using a more robust focus detection mechanism.

♻️ Suggested improvement
             if x is not None and y is not None and text is not None:
                 # Pixel mode: click to focus, then type using keyboard
                 await ws_wrapper.mouse_control('click', x, y)
-                await asyncio.sleep(0.1)  # Wait for focus
+                # Wait for focus - this delay may need tuning based on page responsiveness
+                await asyncio.sleep(0.15)
                 result = await ws_wrapper.batch_keyboard_input(
                     [{"type": "type", "text": text, "delay": 0}]
                 )

Alternatively, consider adding a focus_delay parameter to the method signature for configurability.
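
A sketch of that alternative, assuming a `focus_delay` keyword is added to the pixel-mode typing path. `type_at_pixel` and `FakeWrapper` are hypothetical names used only for illustration; the `mouse_control('click', x, y)` and `batch_keyboard_input([...])` calls mirror the snippet above, and the return value shape is an assumption.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class FakeWrapper:
    """Stand-in for the WebSocket wrapper; records calls instead of driving a browser."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    async def mouse_control(self, action: str, x: float, y: float) -> None:
        self.calls.append(("mouse_control", action, x, y))

    async def batch_keyboard_input(
        self, operations: List[Dict[str, Any]]
    ) -> Dict[str, Any]:
        self.calls.append(("batch_keyboard_input", operations))
        return {"success": True}  # assumed result shape


async def type_at_pixel(
    ws_wrapper: Any, x: float, y: float, text: str, focus_delay: float = 0.1
) -> Dict[str, Any]:
    """Click at (x, y) to focus, wait `focus_delay` seconds, then type `text`."""
    if focus_delay < 0:
        raise ValueError("focus_delay must be non-negative")
    await ws_wrapper.mouse_control("click", x, y)
    await asyncio.sleep(focus_delay)  # caller-tunable instead of hardcoded
    return await ws_wrapper.batch_keyboard_input(
        [{"type": "type", "text": text, "delay": 0}]
    )
```

Keeping 0.1 as the default preserves the current behavior, while callers on slow pages can pass a larger value without changing the toolkit.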


946-948: Consider using logger.exception for better error diagnostics.

As flagged by static analysis, logger.exception automatically includes the stack trace, which is helpful for debugging screenshot capture failures.

♻️ Suggested improvement
         except Exception as e:
-            logger.error(f"Failed to get screenshot: {e}")
+            logger.exception("Failed to get screenshot")
             return f"Error capturing screenshot: {e}"

@nitpicker55555 nitpicker55555 linked an issue Feb 1, 2026 that may be closed by this pull request
2 tasks
@Wendong-Fan Wendong-Fan left a comment

thanks @nitpicker55555 !

@Wendong-Fan Wendong-Fan added this to the Sprint 48 milestone Feb 3, 2026
@Wendong-Fan Wendong-Fan removed the Review Required PR need to be reviewed label Feb 3, 2026
@Wendong-Fan Wendong-Fan merged commit 400c01b into master Feb 3, 2026
11 of 12 checks passed
@Wendong-Fan Wendong-Fan deleted the feat/browser_visual_pixel_mode branch February 3, 2026 21:36
Development

Successfully merging this pull request may close these issues.

[Feature Request] Browser visual Interaction based on pixels

2 participants