Description
Required prerequisites
- I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Motivation
Refer to:
- https://github.com/camel-ai/crab and other computer-use or mobile-use projects.
- https://github.com/camel-ai/camel/blob/master/camel/runtimes/ubuntu_docker_runtime.py
Summary
Add full Ubuntu and Android execution environments, including toolkits, VMs/emulators, sandboxes, action/observation spaces, and orchestration, so CAMEL agents can operate across desktop and mobile ecosystems through GUI, API, or hybrid interaction modes.
This upgrade will enable richer multi-device autonomy and real-world agent behaviors.
Note: some of these features are already supported in https://github.com/camel-ai/camel/blob/master/camel/runtimes/ubuntu_docker_runtime.py and https://github.com/camel-ai/crab
🎯 Motivation
To build general autonomous agents, CAMEL needs platform-level environments where agents can interact with real operating systems, software, GUIs, and devices.
Right now, CAMEL lacks:
- OS-specific toolkits for Ubuntu and Android
- Full execution sandboxes
- Standardized action/observation spaces for GUI, API, or hybrid control
- Runtime orchestration across heterogeneous platforms
- VNC/noVNC-based graphical access for agent visualization and debugging
Adding Ubuntu and Android support unlocks significant new research and practical applications.
📦 Proposed Additions
1. Ubuntu Toolkit + Execution Environment
Ubuntu Toolkit Capabilities
- Standardized command execution
- File system operations (read/write/search)
- GUI automation (via pyautogui, X11, Wayland, or browser-based toolkit)
- Package management (APT)
- Networking tools
- Optional agent MCP integrations
- Execution of preinstalled software
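To make the command-execution capability concrete, here is a minimal sketch of what such a toolkit wrapper could look like. The class name, whitelist behaviour, and method signature are illustrative assumptions, not part of the existing CAMEL API:

```python
import shlex
import subprocess


class UbuntuToolkit:
    """Sketch of a command-execution tool for an Ubuntu runtime.

    Assumes commands run inside an already-sandboxed environment; the
    whitelist here is a second, defence-in-depth layer.
    """

    def __init__(self, allowed=("echo", "ls", "cat", "apt-get")):
        self.allowed = set(allowed)

    def exec(self, command: str, timeout: int = 30) -> str:
        """Run a whitelisted command and return its stdout."""
        argv = shlex.split(command)
        if not argv or argv[0] not in self.allowed:
            raise PermissionError(f"command not allowed: {command}")
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr.strip())
        return result.stdout.strip()
```

File-system operations, APT management, and GUI automation could be exposed as further methods on the same class, all sharing the whitelist and timeout machinery.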
Ubuntu VM / Sandbox / Runtime
- Based on Ubuntu 22.04 LTS
- Runs either as:
  - VM
  - Docker sandbox
  - Agent-safe runtime
- With optional GUI stack using:
  - VNC server
  - noVNC browser-based access
Preinstalled Ubuntu Software
Potential defaults:
- Python, Node, Java toolchains
- Browsers (Firefox / Chromium)
- Developer tools (git, curl, build-essential)
- Automation packages (xdotool, wmctrl)
- Optional AI/ML toolchains
- Any desired MCPs or agent toolkits
2. Android Toolkit + Execution Environment
Android Toolkit Capabilities
- ADB commands
- App installation/removal
- Input simulation: tap, swipe, long press
- Typing/text events
- Screenshot + screen recording
- UI hierarchy extraction
- Intent launching and permission control
- Optional UI automation via:
  - uiautomator2
  - Appium
  - Espresso (advanced)
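Most of these capabilities map directly onto `adb shell input` and related ADB subcommands. A sketch of a thin wrapper that builds the ADB argument lists (class and method names are assumptions; the `%s` escaping for `input text` follows ADB's documented behaviour):

```python
class AndroidToolkit:
    """Sketch of an ADB-backed Android tool targeting one device serial."""

    def __init__(self, serial: str = "emulator-5554"):
        self.serial = serial

    def _adb(self, *args: str) -> list:
        # Prefix every call with the target device serial.
        return ["adb", "-s", self.serial, *args]

    def tap(self, x: int, y: int) -> list:
        return self._adb("shell", "input", "tap", str(x), str(y))

    def swipe(self, x1, y1, x2, y2, duration_ms=300) -> list:
        return self._adb("shell", "input", "swipe",
                         str(x1), str(y1), str(x2), str(y2), str(duration_ms))

    def type_text(self, text: str) -> list:
        # `input text` requires spaces to be escaped as %s.
        return self._adb("shell", "input", "text", text.replace(" ", "%s"))

    def screenshot(self, remote: str = "/sdcard/screen.png") -> list:
        return self._adb("shell", "screencap", "-p", remote)
```

Returning argument lists (rather than executing directly) keeps the wrapper testable without a device; a runtime layer would pass them to `subprocess.run`.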
Android Execution Environment
- Android Emulator (x86/ARM)
- Sandbox with ADB bridge
- Optional GUI access via:
  - VNC server in emulator
  - noVNC in browser
- Configurable:
  - Android version
  - Screen resolution
  - Device profile
  - Preinstalled apps
3. Action and Observation Spaces (GUI / API / Hybrid)
Action Spaces
Agents should be able to choose actions across different modalities:
GUI-Based Actions
- Mouse movement/click
- Keyboard events
- Touch gestures (Android)
- Window focus / switching
API-Based Actions
- System commands
- API calls exposed by toolkits
- ADB commands
- High-level task actions (e.g., “open browser”, “install package”)
Hybrid Actions
- GUI fallback when API fails
- API introspection + GUI execution
- Multi-step execution chains across OS boundaries
Observation Spaces
- Full-screen screenshots
- Bounding-box detected UI elements
- OCR text extraction
- System logs
- Output of terminal/ADB commands
- Telemetry (CPU, RAM, network)
- File system state
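One way to standardize these spaces is a small shared schema that both platforms emit and consume. The dataclass names and fields below are a proposal sketch, not an existing CAMEL schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Modality(Enum):
    GUI = "gui"
    API = "api"
    HYBRID = "hybrid"


@dataclass
class Action:
    """A single agent action, tagged with its interaction modality."""
    modality: Modality
    name: str                              # e.g. "click", "run_command", "adb_tap"
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Observation:
    """What the environment returns after each step."""
    screenshot: Optional[bytes] = None     # full-screen capture
    ui_elements: List[dict] = field(default_factory=list)  # bounding boxes + labels
    terminal_output: Optional[str] = None  # stdout of shell/ADB commands
    logs: List[str] = field(default_factory=list)          # system logs / telemetry
```

A hybrid policy could then express "try the API, fall back to GUI" as two `Action` objects with the same intent but different `modality` tags.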
4. Orchestration Layer for Runtimes and Emulators
Centralized orchestration for:
- Managing Ubuntu VMs/sandboxes
- Launching Android emulators
- Starting/stopping runtimes
- Maintaining lifecycle of multiple environments
- Synchronizing agent interactions
- Logging, replay, and deterministic stepping
Possible orchestrator modes:
- Local multi-runtime
- Cluster/distributed runtimes
- Dockerized
- CI-friendly headless mode
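The lifecycle-management part of the orchestrator could be as simple as a registry over a common runtime interface. A sketch (interface and class names are assumptions; concrete runtimes would wrap Docker, a VM manager, or the Android emulator):

```python
from abc import ABC, abstractmethod


class Runtime(ABC):
    """Common interface for any managed environment
    (Ubuntu sandbox, Android emulator, ...)."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...


class Orchestrator:
    """Hypothetical local multi-runtime orchestrator."""

    def __init__(self):
        self._runtimes = {}

    def register(self, name: str, runtime: Runtime) -> None:
        self._runtimes[name] = runtime

    def start_all(self) -> None:
        for rt in self._runtimes.values():
            rt.start()

    def stop_all(self) -> None:
        # Tear down in reverse registration order, so dependents
        # (e.g. an emulator using a shared network) stop first.
        for rt in reversed(list(self._runtimes.values())):
            rt.stop()
```

Cluster/distributed and CI-headless modes would swap in different `Runtime` implementations behind the same registry.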
5. Integration of VNC / noVNC
Why
To give agents and developers GUI visibility.
What to integrate
- VNC servers for Ubuntu GUI
- VNC embedded in Android Emulator
- noVNC to expose GUI in browser
- Agent-accessible screenshot + OCR utilities
- Ability to switch between screen rendering modes
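A typical headless GUI stack is Xvfb for the virtual display, x11vnc to serve it, and websockify/noVNC to expose it in a browser. A sketch of the process invocations involved (display number, ports, and the noVNC web path are illustrative defaults, not fixed choices):

```python
def vnc_stack_argv(display=":1", vnc_port=5900, web_port=6080):
    """Build the three commands for a headless GUI stack:
    Xvfb (virtual X display) -> x11vnc (VNC server) -> websockify (noVNC bridge).
    Returns argv lists; a runtime would launch each with subprocess.Popen."""
    return [
        ["Xvfb", display, "-screen", "0", "1280x800x24"],
        ["x11vnc", "-display", display, "-rfbport", str(vnc_port),
         "-forever", "-nopw"],
        ["websockify", "--web", "/usr/share/novnc",
         str(web_port), f"localhost:{vnc_port}"],
    ]
```

With this in place, a developer watches the agent at `http://<host>:6080/vnc.html` while the agent itself consumes screenshots of the same display.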
Development + Debugging
- Human-in-the-loop control
- Replay and time-travel debugging
- Parallel monitoring of multiple device screens
🔒 Security Model
- Container-level sandboxing
- Command whitelisting
- File-system isolation
- Resource limits (CPU, RAM, GPU)
- Network policies
- Debug mode vs. locked-down mode
- Resilience to untrusted agent actions
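Several of these controls map directly onto `docker run` flags. A sketch of a helper that builds a locked-down invocation (the function name and default limits are assumptions; the flags themselves are standard Docker options):

```python
def sandboxed_docker_argv(image, cmd, memory="512m", cpus="1.0"):
    """Build a locked-down `docker run` invocation for untrusted agent code."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,          # RAM cap
        "--cpus", cpus,              # CPU cap
        "--network", "none",         # no network access (locked-down mode)
        "--read-only",               # read-only root filesystem
        "--pids-limit", "128",       # bound process count (fork-bomb guard)
        "--cap-drop", "ALL",         # drop all Linux capabilities
        image, *cmd,
    ]
```

Debug mode would relax individual flags (e.g. restore networking) rather than bypassing the sandbox entirely.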
🧩 Integration Points with CAMEL
- Standard tool interface for both Ubuntu and Android
- Unified API for actions/observations
- Compatibility with current multi-agent workflows
- Optional plugin system for additional OS/toolkits
- Shared schema for tool results
📈 Expected Impact
- Agents can operate full computers and mobile devices
- Enables experiments in GUI agents, multimodal autonomy, tool learning, device automation, and multi-device coordination
- Bridges the gap between purely text-based agents and real-world embodied software agents
- Supports the long-term goal of building universal, general-purpose agents
🙋 Request for Feedback
Seeking input on:
- Environment packaging (VMs vs. containers vs. hybrid)
- What should be included by default in Ubuntu/Android runtimes
- Standardization of GUI/API action schemas
- Orchestrator design
- Security model and execution safety boundaries
- How to align this with future MCP/toolkit ecosystems
If helpful, I can:
👉 Write scaffolding code for Ubuntu/Android tool wrappers
👉 Design the action/observation schema
👉 Architect the multi-runtime orchestrator
👉 Prepare a companion PR
👉 Build a roadmap or RFC for the whole system
Just tell me!
Solution
No response
Alternatives
No response
Additional context
No response