Add WindowsApiToolsMixin for native Windows API tools (bypass MCP server overhead) #332

@itomek

Description

Summary

Add a WindowsApiToolsMixin class that provides native Windows API tools directly to GAIA agents, eliminating the need to route Windows system operations through an external MCP server subprocess. This mixin would follow the existing tool mixin pattern established by CLIToolsMixin (src/gaia/agents/code/tools/cli_tools.py) and integrate with the @tool decorator in src/gaia/agents/base/tools.py.

Motivation

Currently, Windows system operations (GUI automation, window management, system info, clipboard, etc.) require launching an external MCP server process (e.g., uvx windows-mcp) and communicating through MCPClientMixin (src/gaia/mcp/mixin.py). This works but introduces measurable overhead:

  • Process startup cost: Spawning a subprocess for the MCP server
  • IPC serialization: Every tool call crosses a process boundary via JSON-RPC over stdio
  • Connection lifecycle: Initialize handshake, tool discovery, and teardown per session

A native mixin that calls Windows APIs directly from the agent process avoids all three costs. The tools register into _TOOL_REGISTRY the same way any other tool does, and the LLM can invoke them with no IPC overhead.

The MCP server approach remains valuable for cross-machine scenarios and language-agnostic integrations, but for local single-machine usage the native mixin should be the faster default.

Proposed Location

src/gaia/agents/windows/
    __init__.py
    windows_api_tools.py      # WindowsApiToolsMixin class

This mirrors the pattern where CLIToolsMixin lives in src/gaia/agents/code/tools/cli_tools.py.

Tools to Implement

The mixin should expose at minimum the following tool categories, registered via the @tool decorator from src/gaia/agents/base/tools.py:

Window Management

  • list_windows() - Enumerate visible windows (title, handle, position, size)
  • focus_window(title_or_handle) - Bring a window to the foreground
  • move_window(handle, x, y, width, height) - Reposition/resize a window
  • minimize_window(handle) / maximize_window(handle) / close_window(handle) - Minimize, maximize, or close a window
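
To make the scope of list_windows() concrete, here is a minimal sketch using only ctypes, so it assumes nothing about which library we end up standardizing on. The return shape follows the status/success/error/data convention mentioned under Technical Notes; the exact field names inside data (handle, title, x, y, width, height) are assumptions, not an agreed schema.

```python
import sys


def list_windows():
    """Enumerate visible, titled top-level windows (sketch, ctypes only)."""
    if sys.platform != "win32":
        return {"status": "error", "success": False,
                "error": "list_windows requires Windows", "data": []}

    import ctypes
    from ctypes import wintypes

    # Private WinDLL instance so the argtypes below do not leak into
    # the shared ctypes.windll.user32 object.
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    enum_proc_type = ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)
    user32.EnumWindows.argtypes = [enum_proc_type, wintypes.LPARAM]
    user32.IsWindowVisible.argtypes = [wintypes.HWND]
    user32.GetWindowTextLengthW.argtypes = [wintypes.HWND]
    user32.GetWindowTextW.argtypes = [wintypes.HWND, wintypes.LPWSTR, ctypes.c_int]
    user32.GetWindowRect.argtypes = [wintypes.HWND, ctypes.POINTER(wintypes.RECT)]

    windows = []

    def on_window(hwnd, _lparam):
        # Skip hidden windows and windows without a title.
        if user32.IsWindowVisible(hwnd):
            length = user32.GetWindowTextLengthW(hwnd)
            if length > 0:
                title = ctypes.create_unicode_buffer(length + 1)
                user32.GetWindowTextW(hwnd, title, length + 1)
                rect = wintypes.RECT()
                user32.GetWindowRect(hwnd, ctypes.byref(rect))
                windows.append({
                    "handle": hwnd,
                    "title": title.value,
                    "x": rect.left,
                    "y": rect.top,
                    "width": rect.right - rect.left,
                    "height": rect.bottom - rect.top,
                })
        return True  # returning True tells EnumWindows to keep going

    callback = enum_proc_type(on_window)  # keep a reference while enumerating
    user32.EnumWindows(callback, 0)
    return {"status": "ok", "success": True, "error": None, "data": windows}
```

On non-Windows platforms the function returns an error dict instead of raising, which is one of the two behaviors the platform-guard criterion allows.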

System Information

  • get_system_info() - OS version, hostname, CPU, RAM, GPU, NPU presence
  • get_running_processes() - Process list with PID, name, memory usage
  • get_disk_usage() - Drive letters, total/used/free space
  • get_display_info() - Monitor count, resolutions, DPI scaling
  • get_battery_status() - Charge level, AC/battery, estimated time remaining
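
Some of these tools may not need any Windows-specific dependency at all. As an illustration, get_disk_usage() can be sketched with the standard library alone. The "/" fallback on other platforms and the field names are assumptions made so the sketch runs anywhere; they are not part of the proposal.

```python
import os
import shutil
import string
import sys


def get_disk_usage():
    """Report total/used/free bytes per drive (sketch, stdlib only)."""
    if sys.platform == "win32":
        # Probe A:\ through Z:\ and keep the roots that exist.
        roots = [f"{letter}:\\" for letter in string.ascii_uppercase
                 if os.path.exists(f"{letter}:\\")]
    else:
        roots = ["/"]  # assumption: lets the sketch run during development

    drives = []
    for root in roots:
        try:
            usage = shutil.disk_usage(root)
        except OSError:
            continue  # e.g. a drive letter with no media inserted
        drives.append({
            "drive": root,
            "total_bytes": usage.total,
            "used_bytes": usage.used,
            "free_bytes": usage.free,
        })
    return {"status": "ok", "success": True, "error": None, "data": drives}
```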

Clipboard Operations

  • get_clipboard() - Read current clipboard text content
  • set_clipboard(text) - Write text to clipboard

GUI Automation (basic)

  • screenshot(region=None) - Capture full screen or a region, return file path
  • click(x, y) - Simulate mouse click at coordinates
  • type_text(text) - Simulate keyboard input
  • send_keys(keys) - Send key combinations (e.g., Win+D, Alt+Tab)

System Settings

  • get_dark_mode_status() - Check if dark mode is enabled
  • set_dark_mode(enabled) - Toggle dark mode
  • get_volume() / set_volume(level) - Audio volume control

Notifications

  • show_notification(title, message) - Display a Windows toast notification

Acceptance Criteria

  • WindowsApiToolsMixin class created following the mixin pattern in CLIToolsMixin
  • All tools registered via @tool decorator into _TOOL_REGISTRY
  • Mixin is composable with the base Agent class: class MyAgent(Agent, WindowsApiToolsMixin)
  • Platform guard: tools gracefully degrade or raise clear errors on non-Windows platforms
  • No dependency on an external MCP server process for any of the tools listed above
  • Python dependencies are Windows-only extras (e.g., pywin32, pyautogui, pystray) declared in pyproject.toml as a windows optional-dependency extra
  • Unit tests with mocked Windows APIs (runnable on any platform)
  • Integration tests gated behind a @pytest.mark.windows marker
  • CUA integration: the Computer Use Agent (docs/plans/cua.mdx) can use WindowsApiToolsMixin as a native backend instead of (or alongside) an MCP server
  • Documentation page added to docs/guides/ or docs/sdk/
  • docs/docs.json updated with the new documentation page
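
One possible shape for the platform guard is a small decorator that returns a structured error instead of letting an ImportError or AttributeError surface on other platforms. The decorator name windows_only is hypothetical, and applying it to get_dark_mode_status() is just an example; the registry value AppsUseLightTheme is the per-user setting Windows uses for app theme.

```python
import functools
import sys


def windows_only(func):
    """Return a clear error dict when a tool is called off Windows (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if sys.platform != "win32":
            return {
                "status": "error",
                "success": False,
                "error": f"{func.__name__} is only available on Windows "
                         f"(current platform: {sys.platform})",
                "data": None,
            }
        return func(*args, **kwargs)
    return wrapper


@windows_only
def get_dark_mode_status():
    # winreg only exists on Windows, so import it inside the guarded body.
    import winreg

    key_path = r"Software\Microsoft\Windows\CurrentVersion\Themes\Personalize"
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path) as key:
        value, _ = winreg.QueryValueEx(key, "AppsUseLightTheme")
    # AppsUseLightTheme == 0 means apps use the dark theme.
    return {"status": "ok", "success": True, "error": None,
            "data": {"dark_mode": value == 0}}
```

Keeping Windows-only imports inside the guarded function body is what lets the module import cleanly on any platform, which the mocked unit tests depend on.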

Technical Notes

Pattern to Follow

The CLIToolsMixin in src/gaia/agents/code/tools/cli_tools.py is the closest reference. Key patterns:

  1. Inherits via super().__init__(*args, **kwargs) for MRO compatibility
  2. Has a register_*_tools() method that defines @tool-decorated inner functions
  3. Uses _ensure_*_initialized() for lazy initialization of internal state
  4. Returns structured dicts with status, success, error, data fields

CUA Relationship

The CUA plan (docs/plans/cua.mdx) currently assumes all desktop control goes through an external MCP server. WindowsApiToolsMixin offers a native alternative that the CUA agent could use:

class ComputerUseAgent(Agent, WindowsApiToolsMixin, MCPClientMixin):
    """Uses native tools when available, falls back to MCP server."""

This lets the CUA agent prefer native calls, which avoid the subprocess and IPC overhead, for common operations while still supporting arbitrary MCP servers for extended capabilities.
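
The "native first, MCP as fallback" behavior could be as simple as the dispatch below. Everything here is hypothetical: call_with_fallback, the native_tools mapping, and the mcp_call callable are illustration names, and how MCPClientMixin actually routes calls is not assumed.

```python
def call_with_fallback(native_tools, mcp_call, name, **kwargs):
    """Try a native tool first; fall back to MCP if it is missing or fails."""
    native = native_tools.get(name)
    if native is not None:
        result = native(**kwargs)
        if result.get("success"):
            return result
    # No native tool, or it reported failure: route through the MCP server.
    return mcp_call(name, **kwargs)


# Tiny demonstration with fake backends.
def fake_native_get_volume():
    return {"success": True, "data": {"level": 40}, "source": "native"}


def fake_mcp_call(name, **kwargs):
    return {"success": True, "data": {"tool": name, "args": kwargs}, "source": "mcp"}


native_tools = {"get_volume": fake_native_get_volume}
native_result = call_with_fallback(native_tools, fake_mcp_call, "get_volume")
mcp_result = call_with_fallback(native_tools, fake_mcp_call, "open_app", app="notepad")
```

Whether a failed native call should silently fall back or surface its error is a design choice this sketch does not settle.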

Compatibility with MCPClientMixin

Both mixins register tools into the same _TOOL_REGISTRY. Care should be taken to:

  • Namespace tools clearly (e.g., win_list_windows vs mcp_windows_list_windows)
  • Allow both mixins on the same agent without name collisions
  • Document which approach to prefer and when
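
As a sketch of the namespacing idea, registration could apply a prefix and refuse duplicates rather than silently overwriting an existing entry. register_namespaced and the plain-dict registry are stand-ins for illustration; the real _TOOL_REGISTRY may store richer metadata.

```python
def register_namespaced(registry, prefix, func):
    """Register func as '<prefix>_<name>' and fail loudly on a collision."""
    name = f"{prefix}_{func.__name__}"
    if name in registry:
        raise ValueError(f"Tool name collision: {name!r} is already registered")
    registry[name] = func
    return name


def list_windows():
    return {"success": True, "data": []}


registry = {}
native_name = register_namespaced(registry, "win", list_windows)
mcp_name = register_namespaced(registry, "mcp_windows", list_windows)
# registry now holds both 'win_list_windows' and 'mcp_windows_list_windows'
```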

Related Files

  • src/gaia/agents/base/tools.py - @tool decorator and _TOOL_REGISTRY
  • src/gaia/agents/base/agent.py - Base Agent class
  • src/gaia/mcp/mixin.py - MCPClientMixin (current MCP-based approach)
  • src/gaia/agents/code/tools/cli_tools.py - CLIToolsMixin (pattern to follow)
  • docs/plans/cua.mdx - Computer Use Agent roadmap (primary consumer)

Open Questions

  1. Should the mixin auto-register all tools on __init__, or require an explicit register_windows_tools() call (like CLIToolsMixin.register_cli_tools())?
  2. Which Python libraries should we standardize on? Candidates: pywin32 (low-level Win32 API), pyautogui (GUI automation), ctypes (no extra deps).
  3. Should screenshot output be a file path, base64 string, or both?
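
For question 3, one option is to always write the file and include base64 only on request, so callers that just need a path do not pay for the encoded payload. A sketch of that result builder follows; build_screenshot_result and the include_base64 flag are hypothetical names, and the capture step itself is out of scope here.

```python
import base64
import os
import tempfile


def build_screenshot_result(png_bytes, include_base64=False):
    """Write image bytes to a temp file and return path, optionally with base64."""
    fd, path = tempfile.mkstemp(suffix=".png")
    with os.fdopen(fd, "wb") as handle:
        handle.write(png_bytes)

    data = {"path": path, "size_bytes": len(png_bytes)}
    if include_base64:
        data["base64"] = base64.b64encode(png_bytes).decode("ascii")
    return {"status": "ok", "success": True, "error": None, "data": data}
```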
