Skip to content

Conversation

@LIHUA919
Copy link
Collaborator

Summary

This PR implements Phase 1 of the pixel interaction mode feature (Issue #3680), adding foundational support for vision-based browser automation using pixel coordinates.

Phase 1 Changes:

  • Added BrowserInteractionMode enum with three modes: DEFAULT, FULL_VISION, PIXEL_INTERACTION
  • Extended BrowserConfig with interaction_mode field (defaults to DEFAULT for backward compatibility)
  • Updated ConfigLoader.from_kwargs() to accept interactionMode parameter with validation
  • Full backward compatibility - existing code continues to work without changes

Test Results

19/19 unit tests passed - test_pixel_interaction_mode.py
11/11 backward compatibility tests passed - verify_phase1_backward_compat.py
27/27 existing tests passed - test_hybrid_browser_toolkit.py (no regression)

Files Modified

  • camel/types/enums.py: Added BrowserInteractionMode enum
  • camel/toolkits/hybrid_browser_toolkit/config_loader.py: Added interaction_mode support

Design

See design document: docs/design/pixel_interaction_mode.md

Next Steps (Phase 2)

Phase 2 will integrate ToolGenerator to provide dynamic schema generation based on interaction mode, enabling pixel-only operations for vision models.


🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 [email protected]

Add BrowserInteractionMode enum and interaction_mode configuration
to support pixel-based browser interaction for vision models.

This is Phase 1 (Foundation) of pixel interaction mode implementation.
Phase 2 will add the ToolGenerator and integration.

Changes:
- Add BrowserInteractionMode enum (DEFAULT, FULL_VISION, PIXEL_INTERACTION)
- Update BrowserConfig to include interaction_mode field with DEFAULT value
- Add interactionMode parameter handling in ConfigLoader.from_kwargs()
- Supports both string and enum values for interaction_mode parameter
- Includes validation with helpful error messages

Related issue: camel-ai#3680
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LIHUA919 LIHUA919 force-pushed the feat/pixel-interaction-phase1 branch from 726ae95 to 0e082b2 Compare January 18, 2026 03:42
…lGenerator

Phase 2 implements dynamic tool schema generation based on interaction mode.

## Key Changes

### 1. ToolGenerator Class (NEW)
- generate_click_tool(): Mode-specific schemas
  * PIXEL_INTERACTION: x, y parameters
  * DEFAULT/FULL_VISION: ref parameter
- generate_type_tool(): Mode-specific schemas
  * PIXEL_INTERACTION: x, y, text parameters
  * DEFAULT/FULL_VISION: ref, text parameters
- generate_screenshot_tool(): Mode-specific include_labels defaults

### 2. Method Signature Updates
- browser_click: Supports ref, x, y, reason parameters
  * Routes to DOM/pixel based on provided params
  * FR3: Only returns snapshot in DEFAULT mode
- browser_type: Supports ref, text, inputs, x, y, reason
  * Routes to DOM/pixel based on provided params
  * FR3: Only returns snapshot in DEFAULT mode

### 3. Dynamic Schema Generation in get_tools()
- Uses ToolGenerator to create mode-aware OpenAI function schemas
- Agent sees different parameters based on interaction_mode:
  * PIXEL_INTERACTION: Only x, y (no ref)
  * DEFAULT: Only ref (no x, y)

### 4. FR3: No Snapshot Behavior
- FULL_VISION mode: No snapshots after operations
- PIXEL_INTERACTION mode: No snapshots after operations
- DEFAULT mode: Returns snapshots (backward compatible)

## Design Document Compliance
✅ FR1: New interaction_mode parameter
✅ FR2: Pixel-only operations with dynamic schemas
✅ FR3: No snapshot behavior
✅ FR4: Dynamic docstrings via schema generation
✅ FR5: Clean screenshots (include_labels defaults to False)

## Verification
✅ All syntax checks passed
✅ Logic verification completed
✅ Design document compliance confirmed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@LIHUA919 LIHUA919 changed the title feat(pixel-interaction): Phase 1 - Add BrowserInteractionMode enum and config support feat(pixel-interaction): Add BrowserInteractionMode enum and config support Jan 18, 2026
This commit completes the pixel interaction mode feature by adding
TypeScript backend support and fixing a missing FR3 implementation.

**Phase 3: TypeScript Backend Integration**

1. Clean Screenshots (FR5)
   - Add `includeLabels` parameter to captureOptimized() in som-screenshot-injected.ts
   - When includeLabels=false, skip SOM overlay rendering for pixel mode
   - Default to true for backward compatibility

2. Mode Configuration Passing
   - Add interactionMode to BrowserToolkitConfig in types.ts
   - Update ConfigLoader.to_ws_config() to pass interactionMode.value to TypeScript
   - Enables TypeScript backend to receive mode configuration

3. API Updates
   - getSomScreenshot(includeLabels: boolean = true) in hybrid-browser-toolkit.ts
   - Passes includeLabels through to captureOptimized()

**FR3 Fix: No Snapshot Behavior**

- Add missing FR3 implementation to browser_mouse_control()
- Remove snapshot in PIXEL/FULL_VISION modes (same as browser_click)
- Ensures consistency across all interaction methods

**Integration:**
- Phase 1: BrowserInteractionMode enum + ConfigLoader support
- Phase 2: Dynamic schema generation + FR3 (browser_click)
- Phase 3: TypeScript backend + FR3 fix (browser_mouse_control)

All three phases work together to enable vision-based models to interact
with web pages using pixel coordinates instead of DOM references.

**Verification:**
- TypeScript compilation: ✅
- Runtime configuration: ✅
- Real browser operations: ✅
  - Browser initialization with PIXEL_INTERACTION mode
  - mouse_control(x, y) clicking
  - FR3 snapshot filtering
  - Backward compatibility maintained

**Compliance:**
- Design document: Pixel Interaction Mode Implementation Plan
- Issue: camel-ai#3680
- Backward compatibility: Maintained (defaults to existing behavior)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Comment on lines +20 to +37
class ToolGenerator:
r"""Generates dynamic tool schemas based on browser interaction mode.

This class provides methods to generate OpenAI function call compatible
schemas for browser tools that adapt their parameters based on the
interaction mode (DEFAULT, FULL_VISION, or PIXEL_INTERACTION).

In PIXEL_INTERACTION mode:
- browser_click: Uses (x, y) coordinates instead of ref
- browser_type: Uses (x, y) coordinates instead of ref
- Screenshots default to no labels (include_labels=False)

In DEFAULT/FULL_VISION modes:
- browser_click: Uses ref parameter
- browser_type: Uses ref parameter
- Screenshots default to labels (include_labels=True)
"""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I am not sure why we need this new file? And why specified for OpenAI function call ?

Comment on lines +2075 to +2093
class BrowserInteractionMode(Enum):
r"""Browser interaction mode for computer-use models.

This enum defines different modes for browser automation, particularly
for vision-based models that may prefer pixel-based interactions over
DOM element references.

Attributes:
DEFAULT: Use DOM element references, return snapshots after
operations (SOM-labeled screenshots).
FULL_VISION: Use DOM element references, but don't return snapshots
after operations (only explicit screenshot requests).
PIXEL_INTERACTION: Use pixel coordinates only (no DOM refs), don't
return snapshots after operations. Designed for vision models
that can visually locate UI elements.
"""
DEFAULT = "default"
FULL_VISION = "full_vision"
PIXEL_INTERACTION = "pixel_interaction"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to modify this? It seems modification inside HybridBrowserToolkit can be sufficient?

Comment on lines +14 to +16

"""Unit tests for proxy handling in WebSocketBrowserWrapper."""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For proxy improvement, please split to new pr

Comment on lines +678 to +685
ref (Optional[str]): The `ref` ID of the element to click.
Used in DEFAULT/FULL_VISION modes.
x (Optional[float]): X-coordinate for the click (in pixels).
Used in PIXEL_INTERACTION mode.
y (Optional[float]): Y-coordinate for the click (in pixels).
Used in PIXEL_INTERACTION mode.
reason (Optional[str]): Reason for clicking at this location.
Used for logging and debugging.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We better change the docstring for pixel interaction to dynamic, in normal mode the agent should not see the ref docstring

@nitpicker55555
Copy link
Collaborator

This pr closed since not updated after 2 weeks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants