Skip to content

Conversation

@gswangg
Copy link
Contributor

@gswangg gswangg commented Oct 13, 2025


🤖 This PR adds support for Google's Gemini 2.5 Computer Use model as a new engine option (gemini-cua) in Skyvern, enabling users to leverage Google's latest computer vision and automation capabilities alongside existing OpenAI and Anthropic CUA engines. The implementation includes comprehensive integration across the entire stack from API definitions to frontend UI components.

🔍 Detailed Analysis

Key Changes

  • New Engine Integration: Added gemini-cua as a new RunEngine and RunType throughout the codebase, including API schemas, database models, and client types
  • Gemini Client Setup: Integrated Google's google-genai library (v1.43.0) with proper client initialization and configuration using GEMINI_CUA_MODEL setting
  • Action Parsing System: Implemented comprehensive Gemini-specific action parsing in parse_gemini_cua_actions() that handles computer use function calls and converts them to Skyvern actions
  • Computer Use State Management: Created GeminiComputerUseState class to maintain conversation history and function call context across agent steps
  • Frontend Support: Added Gemini CUA option to the engine selector UI component for user selection
  • New Action Types: Extended action system with NAVIGATE, GO_BACK, GO_FORWARD actions and corresponding handlers for browser navigation

Technical Implementation

sequenceDiagram
    participant User
    participant API
    participant Agent
    participant Gemini
    participant Browser
    
    User->>API: Create task with gemini-cua engine
    API->>Agent: Initialize with GeminiComputerUseState
    Agent->>Gemini: Send screenshot + conversation history
    Gemini->>Agent: Return function calls (click_at, type_text_at, etc.)
    Agent->>Agent: Parse function calls to Skyvern actions
    Agent->>Browser: Execute actions (click, type, navigate)
    Browser->>Agent: Return results + new screenshot
    Agent->>Gemini: Continue conversation with results
Loading

Impact

  • Enhanced Model Options: Users can now choose from OpenAI, Anthropic, or Google's computer use models based on their specific needs and preferences
  • Improved Browser Navigation: New navigation actions (navigate, go_back, go_forward) provide better browser control capabilities
  • Robust Action Mapping: Comprehensive mapping from Gemini's computer use functions to Skyvern's action system ensures reliable automation
  • Scalable Architecture: The implementation follows existing patterns, making it easy to add future computer use models
  • Dependency Updates: Upgraded websockets library to support newer versions and added Google GenAI dependency

Created with Palmier

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 13, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

"""Conversation state for Gemini Computer Use sessions."""

contents: list[Content]
last_response: GenerateContentResponse | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anthropic / OpenAI CUA didn't need this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They probably do - we should make this more general if possible

from skyvern.webeye.actions.responses import ActionResult
from skyvern.webeye.scraper.scraper import ScrapedPage

ComputerUseState: TypeAlias = OpenAIResponse | GeminiComputerUseState
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming should be consistent (either OpenAIComputerUseState or GeminiComputerUseResponse

TBH: Why can't the response object be shared?

Copy link
Contributor

@wintonzheng wintonzheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to update code under skyvern/client. we use a code generator tool to generate the python client code here

action_type: ActionType = ActionType.RELOAD_PAGE


class NavigateAction(Action):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"navigate" is a confusing terminology when it comes to browser automation.

it can mean navigating to a url and it can also mean navigating to a page through website actions like clicking a button.

We try to avoid the "navigate" term inside skyvern. For example we call it "GOTO_URL" block.

As a result, can we call this "GotoUrlAction"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants