feat(pixel-interaction): Add BrowserInteractionMode enum and config support #3708

LIHUA919 · 2026-01-18T02:52:11Z

Summary

This PR implements Phase 1 of the pixel interaction mode feature (Issue #3680), adding foundational support for vision-based browser automation using pixel coordinates.

Phase 1 Changes:

Added BrowserInteractionMode enum with three modes: DEFAULT, FULL_VISION, PIXEL_INTERACTION
Extended BrowserConfig with interaction_mode field (defaults to DEFAULT for backward compatibility)
Updated ConfigLoader.from_kwargs() to accept interactionMode parameter with validation
Full backward compatibility - existing code continues to work without changes

Test Results

✅ 19/19 unit tests passed - test_pixel_interaction_mode.py
✅ 11/11 backward compatibility tests passed - verify_phase1_backward_compat.py
✅ 27/27 existing tests passed - test_hybrid_browser_toolkit.py (no regression)

Files Modified

camel/types/enums.py: Added BrowserInteractionMode enum
camel/toolkits/hybrid_browser_toolkit/config_loader.py: Added interaction_mode support

Design

See design document: docs/design/pixel_interaction_mode.md

Next Steps (Phase 2)

Phase 2 will integrate ToolGenerator to provide dynamic schema generation based on interaction mode, enabling pixel-only operations for vision models.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 [email protected]

…bridBrowserToolkit Fixes camel-ai#3695

Add BrowserInteractionMode enum and interaction_mode configuration to support pixel-based browser interaction for vision models.

This is Phase 1 (Foundation) of pixel interaction mode implementation. Phase 2 will add the ToolGenerator and integration.

Changes:
- Add BrowserInteractionMode enum (DEFAULT, FULL_VISION, PIXEL_INTERACTION)
- Update BrowserConfig to include interaction_mode field with DEFAULT value
- Add interactionMode parameter handling in ConfigLoader.from_kwargs()
- Supports both string and enum values for interaction_mode parameter
- Includes validation with helpful error messages

Related issue: camel-ai#3680
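The string-or-enum handling described in this commit could look roughly like the sketch below. The enum members mirror the ones added in this PR; the helper name `parse_interaction_mode`, the case-insensitive matching, and the error wording are assumptions for illustration, not the code in `config_loader.py`.

```python
from enum import Enum
from typing import Union


class BrowserInteractionMode(Enum):
    DEFAULT = "default"
    FULL_VISION = "full_vision"
    PIXEL_INTERACTION = "pixel_interaction"


def parse_interaction_mode(
    value: Union[str, BrowserInteractionMode, None],
) -> BrowserInteractionMode:
    """Hypothetical helper: normalize a user-supplied mode to the enum."""
    if value is None:
        # Omitting the parameter keeps the existing (DEFAULT) behavior.
        return BrowserInteractionMode.DEFAULT
    if isinstance(value, BrowserInteractionMode):
        return value
    if isinstance(value, str):
        try:
            # Assumption: accept any casing and surrounding whitespace.
            return BrowserInteractionMode(value.strip().lower())
        except ValueError:
            pass  # fall through to the shared error below
    valid = ", ".join(m.value for m in BrowserInteractionMode)
    raise ValueError(
        f"Invalid interactionMode {value!r}; expected one of: {valid}"
    )
```

Returning `DEFAULT` when the parameter is omitted is what keeps callers that never pass `interactionMode` working unchanged.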

coderabbitai · 2026-01-18T02:52:18Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


…lGenerator

Phase 2 implements dynamic tool schema generation based on interaction mode.

## Key Changes

### 1. ToolGenerator Class (NEW)
- generate_click_tool(): Mode-specific schemas
  * PIXEL_INTERACTION: x, y parameters
  * DEFAULT/FULL_VISION: ref parameter
- generate_type_tool(): Mode-specific schemas
  * PIXEL_INTERACTION: x, y, text parameters
  * DEFAULT/FULL_VISION: ref, text parameters
- generate_screenshot_tool(): Mode-specific include_labels defaults

### 2. Method Signature Updates
- browser_click: Supports ref, x, y, reason parameters
  * Routes to DOM/pixel based on provided params
  * FR3: Only returns snapshot in DEFAULT mode
- browser_type: Supports ref, text, inputs, x, y, reason
  * Routes to DOM/pixel based on provided params
  * FR3: Only returns snapshot in DEFAULT mode

### 3. Dynamic Schema Generation in get_tools()
- Uses ToolGenerator to create mode-aware OpenAI function schemas
- Agent sees different parameters based on interaction_mode:
  * PIXEL_INTERACTION: Only x, y (no ref)
  * DEFAULT: Only ref (no x, y)

### 4. FR3: No Snapshot Behavior
- FULL_VISION mode: No snapshots after operations
- PIXEL_INTERACTION mode: No snapshots after operations
- DEFAULT mode: Returns snapshots (backward compatible)

## Design Document Compliance
✅ FR1: New interaction_mode parameter
✅ FR2: Pixel-only operations with dynamic schemas
✅ FR3: No snapshot behavior
✅ FR4: Dynamic docstrings via schema generation
✅ FR5: Clean screenshots (include_labels defaults to False)

## Verification
✅ All syntax checks passed
✅ Logic verification completed
✅ Design document compliance confirmed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
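To make the mode-specific schemas concrete, here is a minimal sketch of how a click schema might vary by mode, together with the FR3 snapshot rule. The function names `generate_click_schema` and `returns_snapshot`, the description strings, and the omission of the optional `reason` parameter are assumptions for illustration; this is not the `ToolGenerator` code from the PR.

```python
from enum import Enum
from typing import Any, Dict


class BrowserInteractionMode(Enum):
    DEFAULT = "default"
    FULL_VISION = "full_vision"
    PIXEL_INTERACTION = "pixel_interaction"


def generate_click_schema(mode: BrowserInteractionMode) -> Dict[str, Any]:
    """Illustrative only: build a function-call schema for browser_click."""
    if mode is BrowserInteractionMode.PIXEL_INTERACTION:
        # Pixel mode exposes coordinates and hides the DOM ref entirely.
        properties = {
            "x": {"type": "number", "description": "X-coordinate in pixels."},
            "y": {"type": "number", "description": "Y-coordinate in pixels."},
        }
        required = ["x", "y"]
    else:
        # DEFAULT and FULL_VISION both address elements by ref.
        properties = {
            "ref": {
                "type": "string",
                "description": "The ref ID of the element to click.",
            },
        }
        required = ["ref"]
    return {
        "type": "function",
        "function": {
            "name": "browser_click",
            "description": "Click an element on the page.",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def returns_snapshot(mode: BrowserInteractionMode) -> bool:
    """FR3 as described above: only DEFAULT mode returns a snapshot."""
    return mode is BrowserInteractionMode.DEFAULT
```

Because the agent only ever sees the schema for the active mode, it cannot mix `ref` and coordinate arguments in a single call.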

This commit completes the pixel interaction mode feature by adding TypeScript backend support and fixing a missing FR3 implementation.

**Phase 3: TypeScript Backend Integration**

1. Clean Screenshots (FR5)
   - Add `includeLabels` parameter to captureOptimized() in som-screenshot-injected.ts
   - When includeLabels=false, skip SOM overlay rendering for pixel mode
   - Default to true for backward compatibility
2. Mode Configuration Passing
   - Add interactionMode to BrowserToolkitConfig in types.ts
   - Update ConfigLoader.to_ws_config() to pass interactionMode.value to TypeScript
   - Enables TypeScript backend to receive mode configuration
3. API Updates
   - getSomScreenshot(includeLabels: boolean = true) in hybrid-browser-toolkit.ts
   - Passes includeLabels through to captureOptimized()

**FR3 Fix: No Snapshot Behavior**
- Add missing FR3 implementation to browser_mouse_control()
- Remove snapshot in PIXEL/FULL_VISION modes (same as browser_click)
- Ensures consistency across all interaction methods

**Integration:**
- Phase 1: BrowserInteractionMode enum + ConfigLoader support
- Phase 2: Dynamic schema generation + FR3 (browser_click)
- Phase 3: TypeScript backend + FR3 fix (browser_mouse_control)

All three phases work together to enable vision-based models to interact with web pages using pixel coordinates instead of DOM references.

**Verification:**
- TypeScript compilation: ✅
- Runtime configuration: ✅
- Real browser operations: ✅
- Browser initialization with PIXEL_INTERACTION mode
- mouse_control(x, y) clicking
- FR3 snapshot filtering
- Backward compatibility maintained

**Compliance:**
- Design document: Pixel Interaction Mode Implementation Plan
- Issue: camel-ai#3680
- Backward compatibility: Maintained (defaults to existing behavior)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
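The Python side of the configuration passing might be sketched as follows. The `interactionMode` key and the use of the enum's `.value` come from the commit message; the trimmed-down `BrowserConfig`, the free-function form of `to_ws_config`, and the helper `default_include_labels` are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class BrowserInteractionMode(Enum):
    DEFAULT = "default"
    FULL_VISION = "full_vision"
    PIXEL_INTERACTION = "pixel_interaction"


@dataclass
class BrowserConfig:
    # Stand-in for the real config class; only interaction_mode is
    # taken from the PR description.
    interaction_mode: BrowserInteractionMode = BrowserInteractionMode.DEFAULT


def to_ws_config(config: BrowserConfig) -> Dict[str, Any]:
    """Send the enum's string value so the payload is JSON-serializable."""
    return {"interactionMode": config.interaction_mode.value}


def default_include_labels(mode: BrowserInteractionMode) -> bool:
    """Pixel mode gets clean screenshots; other modes keep SOM labels."""
    return mode is not BrowserInteractionMode.PIXEL_INTERACTION
```

Passing the plain string rather than the enum object means the TypeScript side only has to compare against known string literals.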

nitpicker55555 · 2026-01-19T13:32:47Z

camel/toolkits/hybrid_browser_toolkit/tool_generator.py

+class ToolGenerator:
+    r"""Generates dynamic tool schemas based on browser interaction mode.
+
+    This class provides methods to generate OpenAI function call compatible
+    schemas for browser tools that adapt their parameters based on the
+    interaction mode (DEFAULT, FULL_VISION, or PIXEL_INTERACTION).
+
+    In PIXEL_INTERACTION mode:
+    - browser_click: Uses (x, y) coordinates instead of ref
+    - browser_type: Uses (x, y) coordinates instead of ref
+    - Screenshots default to no labels (include_labels=False)
+
+    In DEFAULT/FULL_VISION modes:
+    - browser_click: Uses ref parameter
+    - browser_type: Uses ref parameter
+    - Screenshots default to labels (include_labels=True)
+    """
+


Hi, I am not sure why we need this new file. And why is it specific to OpenAI function calls?

nitpicker55555 · 2026-01-19T13:33:54Z

camel/types/enums.py

+class BrowserInteractionMode(Enum):
+    r"""Browser interaction mode for computer-use models.
+
+    This enum defines different modes for browser automation, particularly
+    for vision-based models that may prefer pixel-based interactions over
+    DOM element references.
+
+    Attributes:
+        DEFAULT: Use DOM element references, return snapshots after
+            operations (SOM-labeled screenshots).
+        FULL_VISION: Use DOM element references, but don't return snapshots
+            after operations (only explicit screenshot requests).
+        PIXEL_INTERACTION: Use pixel coordinates only (no DOM refs), don't
+            return snapshots after operations. Designed for vision models
+            that can visually locate UI elements.
+    """
+    DEFAULT = "default"
+    FULL_VISION = "full_vision"
+    PIXEL_INTERACTION = "pixel_interaction"


Why do we need to modify this? It seems a modification inside HybridBrowserToolkit would be sufficient.

nitpicker55555 · 2026-01-19T13:34:22Z

test/toolkits/test_proxy_handling.py

+
+"""Unit tests for proxy handling in WebSocketBrowserWrapper."""
+


For the proxy improvement, please split it into a new PR.

nitpicker55555 · 2026-01-19T13:36:15Z

camel/toolkits/hybrid_browser_toolkit/hybrid_browser_toolkit_ts.py

+            ref (Optional[str]): The `ref` ID of the element to click.
+                Used in DEFAULT/FULL_VISION modes.
+            x (Optional[float]): X-coordinate for the click (in pixels).
+                Used in PIXEL_INTERACTION mode.
+            y (Optional[float]): Y-coordinate for the click (in pixels).
+                Used in PIXEL_INTERACTION mode.
+            reason (Optional[str]): Reason for clicking at this location.
+                Used for logging and debugging.


We had better make the docstring for pixel interaction dynamic; in normal mode the agent should not see the ref docstring.
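One way a mode-dependent docstring could be assembled is sketched below, so that only the arguments relevant to the active mode are shown to the agent. The function name `build_click_docstring` and the summary line are hypothetical; the argument descriptions are taken from the diff above.

```python
from enum import Enum


class BrowserInteractionMode(Enum):
    DEFAULT = "default"
    FULL_VISION = "full_vision"
    PIXEL_INTERACTION = "pixel_interaction"


# Argument descriptions, split by the mode in which they apply.
_REF_ARGS = (
    "ref (str): The `ref` ID of the element to click.",
)
_PIXEL_ARGS = (
    "x (float): X-coordinate for the click (in pixels).",
    "y (float): Y-coordinate for the click (in pixels).",
)


def build_click_docstring(mode: BrowserInteractionMode) -> str:
    """Hypothetical helper: emit only the args relevant to `mode`."""
    if mode is BrowserInteractionMode.PIXEL_INTERACTION:
        args = _PIXEL_ARGS
    else:
        args = _REF_ARGS
    lines = ["Performs a click on the page.", "", "Args:"]
    lines += [f"    {arg}" for arg in args]
    return "\n".join(lines)
```

The resulting string could then be assigned to the tool function's `__doc__` before the tool is registered, which is standard Python and independent of this toolkit.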

nitpicker55555 · 2026-02-01T15:58:48Z

This PR was closed because it was not updated for 2 weeks.

LIHUA919 added 2 commits January 18, 2026 01:33

fix: prevent proxies from causing WebSocket connection failures in Hy…

1aac8be

…bridBrowserToolkit Fixes camel-ai#3695

LIHUA919 force-pushed the feat/pixel-interaction-phase1 branch from 726ae95 to 0e082b2 Compare January 18, 2026 03:42

LIHUA919 changed the title ~~feat(pixel-interaction): Phase 1 - Add BrowserInteractionMode enum and config support~~ feat(pixel-interaction): Add BrowserInteractionMode enum and config support Jan 18, 2026

nitpicker55555 requested changes Jan 19, 2026


waleedalzarooni requested review from LuoPengcheng12138 and YunfeiZHAO January 20, 2026 06:16

nitpicker55555 closed this Feb 1, 2026
