-
Notifications
You must be signed in to change notification settings - Fork 1.6k
add support for gemini 2.5 cua #3697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| """Conversation state for Gemini Computer Use sessions.""" | ||
|
|
||
| contents: list[Content] | ||
| last_response: GenerateContentResponse | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anthropic / OpenAI CUA didn't need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They probably do - we should make this more general if possible
| from skyvern.webeye.actions.responses import ActionResult | ||
| from skyvern.webeye.scraper.scraper import ScrapedPage | ||
|
|
||
| ComputerUseState: TypeAlias = OpenAIResponse | GeminiComputerUseState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: naming should be consistent (either OpenAIComputerUseState or GeminiComputerUseResponse
TBH: Why can't the response object be shared?
wintonzheng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to update code under skyvern/client. we use a code generator tool to generate the python client code here
skyvern/webeye/actions/actions.py
Outdated
| action_type: ActionType = ActionType.RELOAD_PAGE | ||
|
|
||
|
|
||
| class NavigateAction(Action): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"navigate" is a confusing terminology when it comes to browser automation.
it can mean navigating to a url and it can also mean navigating to a page through website actions like clicking a button.
We try to avoid the "navigate" term inside skyvern. For example we call it "GOTO_URL" block.
As a result, can we call this "GotoUrlAction"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
🤖 This PR adds support for Google's Gemini 2.5 Computer Use model as a new engine option (
gemini-cua) in Skyvern, enabling users to leverage Google's latest computer vision and automation capabilities alongside existing OpenAI and Anthropic CUA engines. The implementation includes comprehensive integration across the entire stack from API definitions to frontend UI components.🔍 Detailed Analysis
Key Changes
gemini-cuaas a newRunEngineandRunTypethroughout the codebase, including API schemas, database models, and client typesgoogle-genailibrary (v1.43.0) with proper client initialization and configuration usingGEMINI_CUA_MODELsettingparse_gemini_cua_actions()that handles computer use function calls and converts them to Skyvern actionsGeminiComputerUseStateclass to maintain conversation history and function call context across agent stepsNAVIGATE,GO_BACK,GO_FORWARDactions and corresponding handlers for browser navigationTechnical Implementation
sequenceDiagram participant User participant API participant Agent participant Gemini participant Browser User->>API: Create task with gemini-cua engine API->>Agent: Initialize with GeminiComputerUseState Agent->>Gemini: Send screenshot + conversation history Gemini->>Agent: Return function calls (click_at, type_text_at, etc.) Agent->>Agent: Parse function calls to Skyvern actions Agent->>Browser: Execute actions (click, type, navigate) Browser->>Agent: Return results + new screenshot Agent->>Gemini: Continue conversation with resultsImpact
Created with Palmier