An AI-powered desktop automation agent that uses Google's Gemini 2.5 Computer-Use model to autonomously execute Windows/macOS/Linux desktop tasks through natural language commands.
This agent enables autonomous control of your desktop using AI. It continuously takes screenshots, sends them to the Gemini model, and executes desktop actions (clicking, double-clicking, typing text, pressing keys) based on AI decisions.
Platform Support: Works on Windows 11, macOS, and Linux. For macOS/Linux, you need to adjust the SYSTEM_PROMPT in agent.py to match your operating system.
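If you switch between operating systems, one option is to build the prompt from the OS name at runtime instead of editing it by hand. This is a sketch under assumptions, not the code in agent.py. `build_system_prompt` is a hypothetical helper. The Windows wording is assumed to mirror the macOS/Linux prompts shown further below. The `...` stands for the rest of your existing instructions.

```python
import platform
from typing import Optional

# Map platform.system() values to the OS names used in the prompt.
# Hypothetical helper: agent.py itself defines SYSTEM_PROMPT as a fixed string.
_OS_NAMES = {"Windows": "Windows", "Darwin": "macOS", "Linux": "Linux"}


def build_system_prompt(system_name: Optional[str] = None) -> str:
    """Return a system prompt whose first sentence names the current OS."""
    system_name = system_name or platform.system()
    os_name = _OS_NAMES.get(system_name)
    if os_name is None:
        raise ValueError(f"Unsupported platform: {system_name!r}")
    return (
        f"You are operating a {os_name} desktop. Your task is to achieve the goal "
        "specified by the user by executing a series of actions.\n"
        "...\n"  # placeholder: keep the rest of your existing instructions here
    )
```

In agent.py you would then assign `SYSTEM_PROMPT = build_system_prompt()`.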
Features:
- 🤖 Autonomous desktop control via Gemini 2.5 Computer-Use
- 🖱️ Mouse and keyboard control with pyautogui
- 📸 Screenshot-based visual analysis
- Action history for context-aware decisions
- 🎯 Normalized coordinates for different screen resolutions
- Continuous loop until the goal is achieved
Requirements:
- Python 3.8 or higher
- Anaconda or Miniconda
- Windows 11, macOS, or Linux
- Google Gemini API Key
Installation:
git clone https://github.com/NeverBeLazyG/gemini-pc-use.git
cd gemini-pc-use
conda create -n computer-use-desktop python=3.10
conda activate computer-use-desktop
pip install -r requirements.txt

Create a .env file in the project directory:
GEMINI_API_KEY=your_gemini_api_key_here

You can create an API key at Google AI Studio.
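This README does not show how the project reads the key. A common choice is python-dotenv's `load_dotenv()`. The stdlib-only sketch below is a hypothetical stand-in (`load_env_file` and `get_api_key` are illustrative names). It parses `KEY=value` lines from .env and exposes them through `os.environ` without overriding variables that are already set.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Minimal .env reader: KEY=value lines, '#' comments, optional quotes.

    Variables already present in os.environ are not overwritten.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded: Dict[str, str] = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # drop surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def get_api_key() -> str:
    """Load .env from the current directory and return GEMINI_API_KEY."""
    load_env_file()
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    return key
```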
Usage:
python main.py "Open notepad and write a letter"

# File operations
python main.py "Create a new folder on the desktop named 'Projects'"

# Opening applications
python main.py "Open Chrome browser and go to google.com"

# Complex tasks
python main.py "Open Excel and create a new spreadsheet with sample data"

Project structure:
pc-use/
├── main.py              # Main entry point
├── agent.py             # Agent logic and Gemini integration
├── computer.py          # Desktop control functions
├── logging_config.py    # Logging configuration
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (not in repository)
└── README.md            # This file
How it works:
1. Screenshot: The agent takes a screenshot of the current screen.
2. Analysis: The screenshot is sent to Gemini along with the user's goal.
3. Decision: The model decides on the next action to execute.
4. Execution: The action is executed using pyautogui.
5. Repeat: Steps 1-4 are repeated until the goal is achieved.
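The loop above can be sketched as follows. This is a simplified illustration, not the code in agent.py. `Action`, `run_agent`, and the three callbacks are hypothetical names, and the model call is abstracted behind `decide`. `max_steps` is an added safety cap that the original loop may not have. `step_delay` plays the role of the `time.sleep(5)` wait mentioned later.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Action:
    """One model decision, e.g. Action("click_at", {"x": 500, "y": 300})."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 50,
    step_delay: float = 5.0,
) -> List[Action]:
    """Screenshot -> analysis/decision -> execution, repeated until done()."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()              # 1. Screenshot
        action = decide(goal, screenshot, history)  # 2-3. Analysis and decision
        if action.name == "done":                   # model reports the goal achieved
            return history
        execute(action)                             # 4. Execution
        history.append(action)                      # context for the next decision
        time.sleep(step_delay)                      # give the UI time to react
    raise RuntimeError(f"Goal not reached within {max_steps} steps")
```

Passing stub callbacks (for example, a `decide` that replays a fixed list of actions) lets you exercise the loop without a screen or an API key.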
Available actions:
- click_at(x, y): Single mouse click at coordinates
- double_click_at(x, y): Double-click at coordinates
- type_text_at(x, y, text): Type text at position
- press_key(key): Press a key (e.g., 'enter', 'esc')
- move_mouse(x, y): Move mouse without clicking
- done(): Mark the task as completed
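One way to route the model's function calls to these actions is a name-to-handler table. The sketch assumes the model returns an action name plus keyword arguments. `make_dispatcher` and `pyautogui_handlers` are hypothetical names, and `done()` is left to the agent loop rather than dispatched. The pyautogui calls (`click`, `doubleClick`, `write`, `press`, `moveTo`) are standard pyautogui functions, but computer.py may use them differently.

```python
from typing import Any, Callable, Dict, Tuple

Handler = Callable[..., Any]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str, Dict[str, Any]], Any]:
    """Return execute(name, args), which routes a model function call to its handler."""
    def execute(name: str, args: Dict[str, Any]) -> Any:
        handler = handlers.get(name)
        if handler is None:
            raise ValueError(f"Unknown action: {name!r}")
        return handler(**args)  # e.g. click_at(x=500, y=300)
    return execute


def pyautogui_handlers(to_pixels: Callable[[int, int], Tuple[int, int]]) -> Dict[str, Handler]:
    """Hypothetical pyautogui-backed handlers for the actions listed above.

    to_pixels converts the model's normalized (0-1000) coordinates to screen pixels.
    pyautogui is imported here so the dispatcher can be tested without a display.
    """
    import pyautogui

    def click_at(x, y):
        pyautogui.click(*to_pixels(x, y))

    def double_click_at(x, y):
        pyautogui.doubleClick(*to_pixels(x, y))

    def type_text_at(x, y, text):
        pyautogui.click(*to_pixels(x, y))  # focus the target field first
        pyautogui.write(text)

    def press_key(key):
        pyautogui.press(key)

    def move_mouse(x, y):
        pyautogui.moveTo(*to_pixels(x, y))

    return {
        "click_at": click_at,
        "double_click_at": double_click_at,
        "type_text_at": type_text_at,
        "press_key": press_key,
        "move_mouse": move_mouse,
    }
```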
Important notes:
- Caution: The agent has full control over your mouse and keyboard.
- Do not use the mouse or keyboard while the agent is running.
- The agent uses normalized coordinates (0-1000).
- Screenshots are saved as debug_screenshot.png.
- All actions are logged in computer-use.log.
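Because the model reports positions on a 0-1000 grid, each coordinate has to be scaled to real pixels before pyautogui can use it. Below is a minimal sketch, assuming the scaling is done against the full screen size (the project may instead scale against the screenshot size). `to_pixels` is a hypothetical name.

```python
from typing import Tuple

NORMALIZED_MAX = 1000  # the model's coordinates span 0..1000 on each axis


def to_pixels(x: int, y: int, screen_width: int, screen_height: int) -> Tuple[int, int]:
    """Convert normalized (0-1000) coordinates to pixel coordinates on this screen."""
    px = round(x / NORMALIZED_MAX * screen_width)
    py = round(y / NORMALIZED_MAX * screen_height)
    # Clamp so that 1000 maps to the last pixel rather than one past the edge.
    px = min(max(px, 0), screen_width - 1)
    py = min(max(py, 0), screen_height - 1)
    return px, py
```

With pyautogui, the screen size comes from `pyautogui.size()`, so `to_pixels(500, 500, *pyautogui.size())` targets the center of the screen.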
If you're using macOS or Linux, you must modify the SYSTEM_PROMPT in agent.py to match your operating system:
# For macOS
SYSTEM_PROMPT = """
You are operating a macOS desktop. Your task is to achieve the goal specified by the user by executing a series of actions.
...
"""
# For Linux
SYSTEM_PROMPT = """
You are operating a Linux desktop. Your task is to achieve the goal specified by the user by executing a series of actions.
...
"""

The system prompt in agent.py can be customized to change the agent's behavior:
SYSTEM_PROMPT = """
Your custom instructions here...
"""

In agent.py, you can change the wait time between actions:
time.sleep(5)  # Wait time in seconds

Debugging:
- Check the saved screenshot in debug_screenshot.png
- Review the logs in computer-use.log
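This README does not reproduce logging_config.py. As a hypothetical illustration of a setup that writes to computer-use.log, the sketch below sends each record both to the file and to the console. The logger name `computer_use`, the format string, and `setup_logging` are assumptions.

```python
import logging


def setup_logging(log_file: str = "computer-use.log", level: int = logging.INFO) -> logging.Logger:
    """Log to a file and to the console (a guess at what logging_config.py might do)."""
    logger = logging.getLogger("computer_use")
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers if called twice
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        for handler in (logging.FileHandler(log_file, encoding="utf-8"), logging.StreamHandler()):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```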
Troubleshooting:
- Ensure your GEMINI_API_KEY is valid
- Check your internet connection
- Verify that pyautogui is correctly installed
- Disable other programs that control the mouse or keyboard
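Some of these checks can be scripted. The sketch below is a hypothetical helper, not part of the project. It only confirms that GEMINI_API_KEY is set and that pyautogui can be imported. It cannot tell whether the key is valid, which requires an actual API call.

```python
import importlib.util
import os
from typing import List


def check_environment() -> List[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if not os.environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set (check your .env file)")
    if importlib.util.find_spec("pyautogui") is None:
        problems.append("pyautogui is not installed (pip install -r requirements.txt)")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("Problem:", issue)
    print("OK" if not issues else f"{len(issues)} problem(s) found")
```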
This project is licensed under the MIT License. See LICENSE for details.
Contributions are welcome! Please create a pull request or open an issue for suggestions.
See CONTRIBUTING.md for detailed guidelines.
This tool gives the AI model full control over your desktop. Use at your own risk. The author assumes no liability for damages arising from the use of this tool.
For questions or issues, please open an issue on GitHub.
Note: This project uses the experimental Gemini 2.5 Computer-Use API, which may change as Google releases updates.