-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Feat: browser visual pixel mode #3767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
WalkthroughThis PR extends the hybrid browser toolkit with pixel-based interaction modes, adds ruler overlay visualization for full visual mode screenshots, implements cross-frame element location and coordinate enrichment, introduces nearest-element detection for click feedback, and refactors APIs (browser_click, browser_type, browser_mouse_drag, mouse_drag) to support both reference-based and coordinate-based inputs across Python and TypeScript layers with mode-aware wrappers. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant TS as TypeScript<br/>Browser Session
participant CrossFrame as Cross-Frame<br/>Lookup
participant Coord as Coordinate<br/>Enrichment
participant Click as Click<br/>Executor
participant Feedback as Feedback<br/>Generator
User->>TS: Click action (ref or x,y)
TS->>CrossFrame: Parse ref or search by position
CrossFrame->>CrossFrame: tryFindInFrame (frame-prefixed)
CrossFrame->>CrossFrame: findElementAcrossFrames (fallback)
CrossFrame-->>TS: Element found + frame context
TS->>Coord: Get element coordinates & viewport
Coord->>Coord: Batch parallelize coordinate retrieval
Coord-->>TS: Elements with x, y, frameIndex, frameUrl
TS->>Click: Execute click at resolved position
Click-->>TS: Click result / no visible effect
alt Click had effect
TS-->>User: Return success + new snapshot
else Ineffective click
TS->>Feedback: Find nearest interactive elements
Feedback->>Feedback: Calculate distance to candidates
Feedback-->>TS: Nearest elements list
TS-->>User: Return guidance message + nearest elements
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts`:
- Around line 701-722: When enriching elements with coordinates in the
includeCoordinates block, if parseFrameRef(ref) yields frameIndex 0 but
findElementAcrossFrames(ref) returns a locator from a non-main frame, derive and
store the actual frame info from the resolved result instead of relying solely
on the ref format; update the logic inside the Promise.all mapping (where
findElementAcrossFrames, getElementCoordinates, parseFrameRef, and
playwrightMapping are used) to set frameIndex and frameUrl from result.frame
(e.g., detect whether result.frame is mainFrame or get its index/URL) whenever
the parsed frameIndex is missing/zero so downstream iframe visibility handling
receives correct frame context.
- Around line 1543-1595: The pixel-mode branch in performMouseDrag currently
uses raw from_x/from_y/to_x/to_y without bounds validation; add a check (similar
to mouse_control) that retrieves the page viewport (e.g., via
page.viewportSize() or your existing viewport helper) and clamps or rejects
coordinates outside [0,width) and [0,height) before calling showClickHighlight
and performing the drag; if coordinates are out of bounds return a clear error
from performMouseDrag, otherwise proceed as before.
In `@camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts`:
- Around line 49-67: The current logic in the visibility check (using
elementInfo.frameIndex/frameUrl, coords, and document.elementsFromPoint)
defaults to returning 'visible' when no iframe is found at the element center,
which can produce false positives; change the fallback so that when
elementInfo?.frameIndex > 0 || elementInfo?.frameUrl is true but no IFRAME
element is found at centerX/centerY, you return 'partial' (or 'hidden' per
policy) instead of 'visible', while still returning 'visible' immediately if an
IFRAME element is detected; update the branch in som-screenshot-injected.ts that
uses elementInfo, coords, centerX/centerY, and elementsAtCenter to implement
this fallback.
In `@camel/toolkits/hybrid_browser_toolkit/ws_wrapper.py`:
- Around line 912-936: The mouse_drag implementation allows mixed inputs where
refs and coordinates are both provided (the coordinate branch currently wins);
change mouse_drag to enforce exclusivity by detecting when from_ref or to_ref is
provided together with any of from_x, from_y, to_x, to_y and raising a
ValueError, keep the existing branches for the two valid cases (both from_ref
and to_ref OR all four coordinates) and then call
self._send_command('mouse_drag', params) as before.
🧹 Nitpick comments (9)
camel/toolkits/hybrid_browser_toolkit/ts/src/som-screenshot-injected.ts (1)
548-549: Timing/filtered metrics are now hard-coded to zero.This makes telemetry look like no filtering happened. If these metrics still matter, consider passing real values from the caller or removing the fields to avoid confusion.
camel/toolkits/hybrid_browser_toolkit/ts/src/config-loader.ts (1)
223-227: Clamp nearestElementsCount to a non‑negative integer.Negative/NaN values can lead to surprising slices downstream. Consider normalizing on ingest.
💡 Suggested change
- if (config.nearestElementsCount !== undefined) wsConfig.nearestElementsCount = config.nearestElementsCount; + if (config.nearestElementsCount !== undefined) { + const count = Number(config.nearestElementsCount); + if (Number.isFinite(count)) { + wsConfig.nearestElementsCount = Math.max(0, Math.floor(count)); + } + }camel/toolkits/hybrid_browser_toolkit/ts/src/hybrid-browser-toolkit.ts (1)
235-263: Clarify coordinate guidance when full‑page screenshots are enabled.If fullPageScreenshot is true, the image can exceed viewport size but the message still advertises viewport bounds. Consider explicitly stating that coordinates remain viewport‑relative (or include full‑page dimensions) to prevent misclicks.
camel/toolkits/hybrid_browser_toolkit/ts/src/browser-session.ts (1)
1460-1534: Skip expensive snapshotting when nearest‑element suggestions are disabled.Right now every non‑canvas click builds a full snapshot+coordinates even if
nearestElementsCountis 0. Consider gating that work onnearestElementsCount > 0to reduce latency in non‑visual mode.camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py (5)
82-91: Font type annotation uses Python 3.10+ union syntax.The type annotation
ImageFont.FreeTypeFont | ImageFont.ImageFonton line 82 uses the union syntax that requires Python 3.10+ at runtime (unlessfrom __future__ import annotationsis present at module level, which it is not). Consider usingUnionfromtypingfor broader compatibility.Additionally, the font fallback chain assumes specific system paths. The macOS-specific path
/System/Library/Fonts/Helvetica.ttcwon't exist on Linux servers where this toolkit might commonly run.♻️ Suggested compatibility improvement
+from typing import Union + def _add_rulers_to_image( image_bytes: bytes, tick_interval: int = 100 ) -> bytes: ... # Try to load a font, fall back to default if not available - font: ImageFont.FreeTypeFont | ImageFont.ImageFont + font: Union[ImageFont.FreeTypeFont, ImageFont.ImageFont] try: - font = ImageFont.truetype( - "/System/Library/Fonts/Helvetica.ttc", font_size - ) + # Try common cross-platform font paths + font = ImageFont.truetype("DejaVuSans.ttf", font_size) except (IOError, OSError): try: - font = ImageFont.truetype("arial.ttf", font_size) + font = ImageFont.truetype("/System/Library/Fonts/Helvetica.ttc", font_size) except (IOError, OSError): - font = ImageFont.load_default() + try: + font = ImageFont.truetype("arial.ttf", font_size) + except (IOError, OSError): + font = ImageFont.load_default()
929-942: Potential bug:processed_imagesmay be empty when checking images to return.If
result.imageshas images but the image doesn't start with'data:image/png;base64,'(line 902),processed_imageswill remain empty. The condition at line 935-938 would then fall back toresult.images, which is correct, but the logic is fragile.Also, the loop iterates with
enumeratebut discards the index and breaks after the first match. A direct approach would be clearer.♻️ Suggested simplification
# Process images and optionally add rulers processed_images = [] - for _, image_data in enumerate(result.images): + for image_data in result.images: if image_data.startswith('data:image/png;base64,'): base64_data = image_data.split(',', 1)[1] image_bytes = base64.b64decode(base64_data) # ... rest of processing ... break + # Handle non-PNG images if needed
973-980: Validation logic is correct but could be more explicit.The validation correctly requires either
refor bothxandy. The current priority (x,y check first) means pixel mode takes precedence when all three are provided, which may not be intentional.♻️ Optional: Add explicit conflict check
if x is not None and y is not None: + if ref is not None: + logger.warning( + "Both ref and x,y provided to browser_click; using pixel mode" + ) result = await ws_wrapper.mouse_control('click', x, y) elif ref is not None: result = await ws_wrapper.click(ref)
1016-1022: Hardcoded sleep duration for focus wait may be unreliable.The
asyncio.sleep(0.1)at line 1019 is a fixed delay that may be insufficient for slow-loading pages or excessive for fast ones. Consider making this configurable or using a more robust focus detection mechanism.♻️ Suggested improvement
if x is not None and y is not None and text is not None: # Pixel mode: click to focus, then type using keyboard await ws_wrapper.mouse_control('click', x, y) - await asyncio.sleep(0.1) # Wait for focus + # Wait for focus - this delay may need tuning based on page responsiveness + await asyncio.sleep(0.15) result = await ws_wrapper.batch_keyboard_input( [{"type": "type", "text": text, "delay": 0}] )Alternatively, consider adding a
focus_delayparameter to the method signature for configurability.
946-948: Consider usinglogger.exceptionfor better error diagnostics.As flagged by static analysis,
logger.exceptionautomatically includes the stack trace, which is helpful for debugging screenshot capture failures.♻️ Suggested improvement
except Exception as e: - logger.error(f"Failed to get screenshot: {e}") + logger.exception("Failed to get screenshot") return f"Error capturing screenshot: {e}"
Wendong-Fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @nitpicker55555 !
Description
Based on Google search experiments across four paginated news result pages, the DOM-based interaction mode consumes approximately 2.5× more tokens than the current pure pixel-based visual interaction mode. For the latest multimodal models, including GPT-5.2 and Gemini 2.5 Pro/Flash and Gemini 3 Pro/Flash, image and text tokens are priced identically.
This PR introduces two key improvements: a screenshot pixel scale and a misclick fallback mechanism. When a click is performed on a non-canvas element and no snapshot change is detected, indicating an incorrect click position, the system invokes the original get_som_screenshot element coordinate function to retrieve the five closest DOM elements to the click location and provides them to the agent for position adjustment.
Experiments show that without the screenshot pixel scale, GPT-5.2 and Gemini-2.5-Pro often fail to click the Google search box reliably. With the pixel scale enabled, the success rate approaches 100%.
The misclick fallback mechanism is also critical. Using only a screenshot of the clicked position as error feedback is ineffective and typically requires multiple iterations for correction. In contrast, identifying the nearest DOM elements to the click position significantly improves fallback efficiency and greatly increases the success rate.
Checklist
Go over all the following points, and put an
xin all the boxes that apply.Fixes #issue-numberin the PR description (required)pyproject.tomlanduv lockIf you are unsure about any of these, don't hesitate to ask. We are here to help!