Replace individual input tools with a WebDriver-like API #824

@OrKoN

Is your feature request related to a problem? Please describe.

Currently, we have hover, click, and press_key tools, which are rather low level: to send several of them, the client needs to issue multiple tool calls (which may involve multiple LLM round trips). In addition, we support some low-level input commands, but not the full set covered by an API like perform_actions.

Describe the solution you'd like

WebDriver BiDi defines the perform actions command (https://www.w3.org/TR/webdriver-bidi/#command-input-performActions), which sends a whole sequence of input actions at once. We can replace our existing low-level tools with an API similar to WebDriver's (but integrated with our other concepts, such as uids), so that a client can send a sequence of actions in one tool call. We should not replace the higher-level tools, such as fill, fill_form, and drag.
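To make the proposal concrete, here is a minimal sketch of how one uid-aware actions tool could map onto WebDriver BiDi. The tool input shape (ToolAction with "move", "down", "up", and so on) and the resolveUid callback are hypothetical illustrations, not an existing API in this project. The output follows the input.performActions parameter shape from the spec, simplified to the fields used here: a pointer source and a key source, each with an action list that is consumed one tick at a time.

```typescript
// Subset of WebDriver BiDi input.performActions parameter types (simplified).
type ElementOrigin = { type: "element"; element: { sharedId: string } };
type Origin = "viewport" | "pointer" | ElementOrigin;

type PauseAction = { type: "pause"; duration?: number };
type KeyAction = PauseAction | { type: "keyDown" | "keyUp"; value: string };
type PointerAction =
  | PauseAction
  | { type: "pointerDown" | "pointerUp"; button: number }
  | { type: "pointerMove"; x: number; y: number; origin?: Origin };

interface PerformActionsParams {
  context: string;
  actions: [
    { type: "pointer"; id: string; actions: PointerAction[] },
    { type: "key"; id: string; actions: KeyAction[] },
  ];
}

// Hypothetical tool input: one flat list, elements addressed by snapshot uid.
type ToolAction =
  | { type: "move"; uid: string }
  | { type: "down" | "up"; button?: number }
  | { type: "keyDown" | "keyUp"; key: string }
  | { type: "pause"; ms: number };

function toPerformActions(
  context: string,
  steps: ToolAction[],
  resolveUid: (uid: string) => string, // hypothetical: uid -> BiDi sharedId
): PerformActionsParams {
  const pointer: PointerAction[] = [];
  const key: KeyAction[] = [];
  for (const step of steps) {
    // Each step becomes one tick: the acting source gets the real action and
    // the other source gets an empty pause, so both lists stay aligned.
    switch (step.type) {
      case "move":
        pointer.push({
          type: "pointerMove",
          x: 0, // offset from the element's in-view center point
          y: 0,
          origin: { type: "element", element: { sharedId: resolveUid(step.uid) } },
        });
        key.push({ type: "pause" });
        break;
      case "down":
      case "up":
        pointer.push({
          type: step.type === "down" ? "pointerDown" : "pointerUp",
          button: step.button ?? 0, // 0 = primary button
        });
        key.push({ type: "pause" });
        break;
      case "keyDown":
      case "keyUp":
        key.push({ type: step.type, value: step.key });
        pointer.push({ type: "pause" });
        break;
      case "pause":
        // Pause both sources for the same duration.
        pointer.push({ type: "pause", duration: step.ms });
        key.push({ type: "pause", duration: step.ms });
        break;
    }
  }
  return {
    context,
    actions: [
      { type: "pointer", id: "mouse", actions: pointer },
      { type: "key", id: "keyboard", actions: key },
    ],
  };
}

// Usage: click the element with (hypothetical) uid "1_5", then type "a".
const params = toPerformActions(
  "ctx-1",
  [
    { type: "move", uid: "1_5" },
    { type: "down" },
    { type: "up" },
    { type: "keyDown", key: "a" },
    { type: "keyUp", key: "a" },
  ],
  (uid) => `ref-${uid}`,
);
console.log(JSON.stringify(params, null, 2));
```

A flat list is likely easier for a model to produce than BiDi's per-source lists; padding the idle source with pauses is one simple way to preserve the intended ordering, at the cost of not expressing concurrent pointer and key input within a single step.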

Describe alternatives you've considered

Keeping the current approach of individual input tools.

Additional context

We need to test whether models have any difficulty using a WebDriver-like API.

Metadata

Assignees: no one assigned
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests