MAI-UI: A Unified Mobile GUI Agent Framework
MAI-UI-8B is a compact yet powerful mobile GUI agent that achieves state-of-the-art performance on mobile UI tasks.
MAI-UI is a unified mobile GUI agent framework that enables intelligent automation of mobile device interactions through natural language instructions. Powered by vision-language models, MAI-UI can understand screen content, reason about tasks, and execute precise GUI actions.
The framework consists of two main components (a usage sketch follows the lists below):
- MAIUINaivigationAgent: High-level navigation agent for complex task planning
  - Natural language instruction understanding
  - Multi-step task planning and execution
  - Action history tracking for context-aware decisions
- MAIGroundingAgent: Low-level grounding agent for precise UI element localization
  - Accurate UI element localization
  - Coordinate prediction for clicks and gestures
  - Support for various UI element types
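The sketch below illustrates this division of labor. It is a conceptual sketch only: `plan_next_step` and `ground_element` are hypothetical stand-ins for calls into MAIUINaivigationAgent and MAIGroundingAgent, and their names, arguments, and return shapes are assumptions rather than the repository's actual API.

```python
# Conceptual sketch only: plan_next_step and ground_element are hypothetical
# stand-ins for MAIUINaivigationAgent and MAIGroundingAgent; the real classes
# may expose different names and signatures.
from typing import Dict, List, Tuple


def plan_next_step(instruction: str, screenshot_png: bytes, history: List[Dict]) -> Dict:
    """Navigation-agent role: decide WHAT to do next,
    e.g. {"action": "click", "target": "the Settings icon"}."""
    raise NotImplementedError  # backed by a vision-language model in practice


def ground_element(target_description: str, screenshot_png: bytes) -> Tuple[int, int]:
    """Grounding-agent role: decide WHERE the described element is, as (x, y) pixels."""
    raise NotImplementedError  # backed by a vision-language model in practice


def next_action(instruction: str, screenshot_png: bytes, history: List[Dict]) -> Dict:
    """Compose the two roles: plan an abstract action, then ground it to coordinates."""
    plan = plan_next_step(instruction, screenshot_png, history)
    if plan.get("target"):  # clicks, long presses, etc. need screen coordinates
        plan["coordinates"] = ground_element(plan["target"], screenshot_png)
    return plan
```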
Supported device actions (a data-structure sketch follows the list):
- Click: Tap on screen elements
- Swipe: Scroll and navigate between screens
- Type: Input text into fields
- System Buttons: Home, Back, Power, and Volume controls
- Long Press: Extended touch actions
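One possible way to model this action space as plain Python data is sketched below; the `UIAction` class and its field names are illustrative assumptions, not the repository's actual schema.

```python
# Illustrative only: UIAction and its fields are assumptions, not MAI-UI's schema.
from dataclasses import dataclass
from typing import Literal, Optional, Tuple


@dataclass
class UIAction:
    kind: Literal["click", "swipe", "type", "system_button", "long_press"]
    point: Optional[Tuple[int, int]] = None    # target for click / long_press
    start: Optional[Tuple[int, int]] = None    # swipe start point
    end: Optional[Tuple[int, int]] = None      # swipe end point
    text: Optional[str] = None                 # text to type
    button: Optional[str] = None               # "home", "back", "power", "volume_up", ...
    duration_ms: int = 0                       # hold time for long_press


# Example: long-press at pixel (540, 1200) for one second.
press = UIAction(kind="long_press", point=(540, 1200), duration_ms=1000)
```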
Additional framework features:
- Streamlit-based web interface
- ADB connectivity for device control
- Configurable model parameters
- MCP tool support for extended capabilities
MAI-UI/
├── MAI_UI/
│ ├── base.py # Base agent class
│ ├── mai_naivigation_agent.py # Navigation agent implementation
│ ├── mai_grounding_agent.py # Grounding agent implementation
│ ├── unified_memory.py # Trajectory memory management
│ ├── prompt.py # System prompts
│ └── utils.py # Utility functions
├── app.py # Streamlit web application
├── requirements.txt # Python dependencies
├── ScreenShot.png # Demo screenshot
└── Video.mp4 # Demo video
How it works (a loop sketch follows the list):
- Input: User provides a natural language instruction
- Capture: A screenshot is captured from the device via ADB
- Analyze: The navigation agent analyzes the screen and plans the next action
- Ground: The grounding agent locates the target UI element
- Execute: The action is executed on the device after user approval
- Iterate: The process continues until the task is complete
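A minimal sketch of this loop is shown below. The five helper callables are injected rather than imported because their real names and signatures in MAI-UI are not shown here; only the control flow mirrors the steps above.

```python
# Control-flow sketch of the capture -> plan -> ground -> review -> execute loop.
# All five callables are hypothetical; only the loop structure mirrors the steps above.
from typing import Callable, Dict, List, Tuple


def run_task(
    instruction: str,
    capture_screenshot: Callable[[], bytes],
    plan_next_step: Callable[[str, bytes, List[Dict]], Dict],
    ground_element: Callable[[str, bytes], Tuple[int, int]],
    ask_user_approval: Callable[[Dict], bool],
    execute_on_device: Callable[[Dict], None],
    max_steps: int = 20,
) -> List[Dict]:
    """Run the interaction loop until the agent reports completion or max_steps is hit."""
    history: List[Dict] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()                        # Capture: screenshot via ADB
        plan = plan_next_step(instruction, screenshot, history)  # Analyze: navigation agent
        if plan.get("action") == "done":                         # Iterate: stop when complete
            break
        if plan.get("target"):
            plan["coordinates"] = ground_element(plan["target"], screenshot)  # Ground
        if ask_user_approval(plan):                              # Execute: after user approval
            execute_on_device(plan)
            history.append(plan)
    return history
```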
Prerequisites:
- Python 3.8+
- ADB (Android Debug Bridge)
- A connected Android device (or emulator)
Installation and setup:
- Clone the repository:
  git clone https://github.com/Tongyi-MAI/MAI-UI.git
  cd MAI-UI
- Install dependencies:
  pip install -r requirements.txt
- Configure ADB:
  # Connect your device via USB or Wi-Fi
  adb devices
  # If using Wi-Fi, connect:
  adb connect <device_ip>:<port>
- Launch the web app:
  streamlit run app.py

In the sidebar, configure:
- ADB Device Address: Device IP and port (e.g., 192.168.50.67:41117)
- LLM Base URL: API endpoint for the vision-language model (see the client sketch after this list)
- Model Name: Model identifier (e.g., MAI-UI-8B)
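If the configured LLM Base URL is an OpenAI-compatible endpoint (as is common when serving such models with vLLM or similar tooling; this is an assumption, not something the project states), the sidebar values map onto a client call roughly as below. The URL, API key, image path, and prompt are placeholders.

```python
# Illustration only: assumes an OpenAI-compatible serving endpoint behind the
# configured "LLM Base URL"; the URL, api_key, and prompt below are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.10:8000/v1", api_key="EMPTY")  # LLM Base URL

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="MAI-UI-8B",  # the "Model Name" configured in the sidebar
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Open the Settings app."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```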
Typical usage flow (an ADB helper sketch follows the list):
- Connect Device: Click "Connect Device" to establish the ADB connection
- Enter Instruction: Describe the task in natural language
- Take Screenshot: Capture the current screen state
- Analyze: The AI analyzes the screen and predicts the next action
- Review: Visual feedback shows the predicted action location
- Execute: Approve the action for device execution
- Iterate: Continue until the task is complete
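The screenshot and execution steps ultimately come down to ADB commands. The helpers below are a small sketch of how they can be issued from Python with the standard adb CLI; the device serial is just the example address from above, and these helpers are not part of the repository.

```python
# Sketch of the ADB side of "Take Screenshot" and "Execute"; not part of MAI-UI itself.
import subprocess

DEVICE = "192.168.50.67:41117"  # example ADB device address from the sidebar


def capture_screenshot(path: str = "screen.png") -> str:
    """Pull a PNG screenshot from the device using `adb exec-out screencap -p`."""
    png = subprocess.run(
        ["adb", "-s", DEVICE, "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path


def tap(x: int, y: int) -> None:
    """Send a tap at pixel (x, y) using `adb shell input tap`."""
    subprocess.run(
        ["adb", "-s", DEVICE, "shell", "input", "tap", str(x), str(y)],
        check=True,
    )
```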
MAI-UI-8B achieves competitive performance on mobile GUI benchmarks:
| Benchmark | Score |
|---|---|
| ScreenSpot | 85.2% |
| AMEX | 78.5% |
| MAA | 72.3% |
Model highlights:
- Compact Size: 8B parameters for efficient deployment
- High Accuracy: State-of-the-art UI element localization
- Fast Inference: Optimized for real-time interaction
- Multi-language: Supports UI text in multiple languages
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Tongyi Lab for developing MAI-UI
- Hugging Face for model hosting
- Contributors and community for feedback and improvements
