add support for gemini 2.5 cua #3697

gswangg · 2025-10-13T02:33:00Z

🤖 This PR adds support for Google's Gemini 2.5 Computer Use model as a new engine option (gemini-cua) in Skyvern, enabling users to leverage Google's latest computer vision and automation capabilities alongside existing OpenAI and Anthropic CUA engines. The implementation includes comprehensive integration across the entire stack from API definitions to frontend UI components.

🔍 Detailed Analysis

Key Changes

New Engine Integration: Added gemini-cua as a new RunEngine and RunType throughout the codebase, including API schemas, database models, and client types
Gemini Client Setup: Integrated Google's google-genai library (v1.43.0) with proper client initialization and configuration using GEMINI_CUA_MODEL setting
Action Parsing System: Implemented comprehensive Gemini-specific action parsing in parse_gemini_cua_actions() that handles computer use function calls and converts them to Skyvern actions
Computer Use State Management: Created GeminiComputerUseState class to maintain conversation history and function call context across agent steps
Frontend Support: Added Gemini CUA option to the engine selector UI component for user selection
New Action Types: Extended action system with NAVIGATE, GO_BACK, GO_FORWARD actions and corresponding handlers for browser navigation

Technical Implementation

sequenceDiagram
    participant User
    participant API
    participant Agent
    participant Gemini
    participant Browser
    
    User->>API: Create task with gemini-cua engine
    API->>Agent: Initialize with GeminiComputerUseState
    Agent->>Gemini: Send screenshot + conversation history
    Gemini->>Agent: Return function calls (click_at, type_text_at, etc.)
    Agent->>Agent: Parse function calls to Skyvern actions
    Agent->>Browser: Execute actions (click, type, navigate)
    Browser->>Agent: Return results + new screenshot
    Agent->>Gemini: Continue conversation with results

Impact

Enhanced Model Options: Users can now choose from OpenAI, Anthropic, or Google's computer use models based on their specific needs and preferences
Improved Browser Navigation: New navigation actions (navigate, go_back, go_forward) provide better browser control capabilities
Robust Action Mapping: Comprehensive mapping from Gemini's computer use functions to Skyvern's action system ensures reliable automation
Scalable Architecture: The implementation follows existing patterns, making it easy to add future computer use models
Dependency Updates: Upgraded websockets library to support newer versions and added Google GenAI dependency

Created with Palmier

coderabbitai · 2025-10-13T02:33:35Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

skyvern/core/script_generations/generate_script.py

suchintan · 2025-10-13T02:41:57Z

skyvern/forge/computer_use/state.py

+    """Conversation state for Gemini Computer Use sessions."""
+
+    contents: list[Content]
+    last_response: GenerateContentResponse | None = None


Anthropic / OpenAI CUA didn't need this?

They probably do - we should make this more general if possible

suchintan · 2025-10-13T02:44:03Z

skyvern/webeye/actions/models.py

 from skyvern.webeye.actions.responses import ActionResult
 from skyvern.webeye.scraper.scraper import ScrapedPage

+ComputerUseState: TypeAlias = OpenAIResponse | GeminiComputerUseState


nit: naming should be consistent (either OpenAIComputerUseState or GeminiComputerUseResponse

TBH: Why can't the response object be shared?

wintonzheng

no need to update code under skyvern/client. we use a code generator tool to generate the python client code here

wintonzheng · 2025-10-13T19:39:26Z

skyvern/webeye/actions/actions.py

    action_type: ActionType = ActionType.RELOAD_PAGE


+class NavigateAction(Action):


"navigate" is a confusing terminology when it comes to browser automation.

it can mean navigating to a url and it can also mean navigating to a page through website actions like clicking a button.

We try to avoid the "navigate" term inside skyvern. For example we call it "GOTO_URL" block.

As a result, can we call this "GotoUrlAction"?

gswangg added 10 commits October 12, 2025 19:30

upgrade websockets and install google-genai

cfba3b0

first pass at gemini-2.5-cua implementation

3124782

update docs

f5114d7

remove arbitrary retry loop in gemini cua

a70fea6

generate_cua_actions -> generate_openai_cua_actions

f73bbcc

use async client instead of executor thread offloading

00223c5

add gemini cua engine to frontend engine selector

2dd1a0d

fix some basic bugs

6397f05

add gemini actions to script generator

3aeb4e6

move action funcs down

ea99305

suchintan reviewed Oct 13, 2025

View reviewed changes

skyvern/core/script_generations/generate_script.py Outdated Show resolved Hide resolved

suchintan reviewed Oct 13, 2025

View reviewed changes

wintonzheng reviewed Oct 13, 2025

View reviewed changes

gswangg added 4 commits October 17, 2025 21:32

revert changes to code under client/

38b1db9

rename navigate -> goto url

a93ab26

Merge remote-tracking branch 'upstream/main' into feature/gemini-2.5-cua

8f827f9

add log for gemini 2.5 cua

3626021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add support for gemini 2.5 cua #3697

add support for gemini 2.5 cua #3697

gswangg commented Oct 13, 2025 •

edited by palmier-app bot

Loading

Uh oh!

coderabbitai bot commented Oct 13, 2025 •

edited

Loading

Review skipped

Uh oh!

Uh oh!

suchintan Oct 13, 2025

Uh oh!

suchintan Oct 13, 2025

Uh oh!

suchintan Oct 13, 2025

Uh oh!

wintonzheng left a comment

Uh oh!

wintonzheng Oct 13, 2025

Uh oh!

gswangg Oct 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		action_type: ActionType = ActionType.RELOAD_PAGE


		class NavigateAction(Action):

add support for gemini 2.5 cua #3697

Are you sure you want to change the base?

add support for gemini 2.5 cua #3697

Conversation

gswangg commented Oct 13, 2025 • edited by palmier-app bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Changes

Technical Implementation

Impact

Uh oh!

coderabbitai bot commented Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Uh oh!

suchintan Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

suchintan Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

suchintan Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

wintonzheng left a comment

Choose a reason for hiding this comment

Uh oh!

wintonzheng Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

gswangg Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gswangg commented Oct 13, 2025 •

edited by palmier-app bot

Loading

coderabbitai bot commented Oct 13, 2025 •

edited

Loading