This repository contains a complete system for:
- Real-time teleoperation of a bimanual ALOHA robot in the MuJoCo physics simulator using only a standard webcam and bare hands.
- An imitation learning pipeline for collecting expert demonstrations and training advanced policies such as the Action Chunking Transformer (ACT).
- Intuitive Hand-Tracking Control: Uses Google's MediaPipe to translate your hand movements into precise robot actions.
- High-Fidelity Simulation: Leverages the power of MuJoCo and the bigym framework for realistic physics and complex interaction tasks.
- Robust Client-Server Architecture: Decouples the vision processing (client) from the physics simulation (server) for maximum performance and stability.
- Reproducible & Cross-Platform: The entire simulation environment is containerized using Docker, ensuring it runs identically anywhere.
- AI-Ready: Built from the ground up to serve as a data collection tool for modern imitation learning algorithms.
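On the last point: the teleoperation loop already produces synchronized observation/action pairs at every simulation step, so recording a demonstration is mostly a matter of persisting them. A minimal sketch of what that could look like (the `save_episode` helper, file format, and array names below are illustrative assumptions, not the repository's actual recording format):

```python
import numpy as np

def save_episode(path, observations, actions):
    """Hypothetical helper: persist one teleoperated demo for ACT-style training.

    observations: per-step state vectors logged from the simulator
    actions: per-step joint targets produced by the IK solver
    """
    np.savez_compressed(
        path,
        observations=np.asarray(observations, dtype=np.float32),
        actions=np.asarray(actions, dtype=np.float32),
    )

# Called once per finished demo, e.g. save_episode("demos/ep_000.npz", obs_log, act_log)
```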
- Simulation: MuJoCo Physics Engine
- Environment: Bigym & Gymnasium
- Robot Model: ALOHA Bimanual Manipulator
- Hand Tracking: Google MediaPipe
- Inverse Kinematics (IK): mink
- Containerization: Docker on WSL2
The system operates on a client-server model to ensure real-time performance by separating concerns:
```
+--------------------------+         +--------------------------------+
|      HOST (Windows)      |         |        DOCKER CONTAINER        |
|                          |         |      (Ubuntu + CUDA libs)      |
|  +-------------------+   |         |                                |
|  |    Webcam Feed    |   |         |  +--------------------------+  |
|  +--------+----------+   |         |  |      MuJoCo Physics      |  |
|           |              |         |  |    (ALOHA Robot Sim)     |  |
|           v              |         |  +------------+-------------+  |
|  +-------------------+   |         |               ^                |
|  |   run_client.py   |   |         |               | Robot Commands |
|  |  (MediaPipe Hand  |   |         |               |                |
|  |     Tracking)     |   |         |  +------------+-------------+  |
|  +--------+----------+   |         |  |  control/hand_teleop.py  |  |
|           | Hand Coords  |         |  |  (IK Solver & Sim Loop)  |  |
|           +--------------+-------->+  +--------------------------+  |
|                  UDP     |         |                                |
|                          |         |                                |
+--------------------------+         +--------------------------------+
```
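The two processes only have to agree on a tiny wire format for the hand coordinates. As a minimal sketch, assuming a JSON payload and port 5005 (both are illustrative; the actual message layout and port are defined in run_client.py and control/hand_teleop.py):

```python
import json
import socket

PORT = 5005                # assumed port, not necessarily the repo's choice
SERVER_IP = "172.17.0.2"   # example container IP; see the client setup below

# Client side: one small datagram per tracked frame, fire-and-forget
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
msg = {"left": [0.10, 0.25, 0.30], "right": [0.40, 0.25, 0.30]}
sock.sendto(json.dumps(msg).encode(), (SERVER_IP, PORT))

# Server side: receive the latest hand coordinates inside the sim loop
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("0.0.0.0", PORT))
data, _ = srv.recvfrom(1024)
hand_coords = json.loads(data.decode())
```

UDP is a natural fit here: a stale hand pose is worthless, so it is better to drop a late packet and act on the next one than to pay TCP's retransmission and head-of-line-blocking latency.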
- Git
- Python 3.10+
- Docker Desktop with WSL2 backend enabled.
- An X server for Windows, such as VcXsrv, to view the simulation GUI from Docker. Remember to check "Disable access control" when launching VcXsrv.
This process builds the container with all necessary robotics libraries and applies required patches.
```
# Clone this repository
git clone https://github.com/Rahul-Lashkari/aloha-vision.git
cd aloha-vision

# Build the Docker image. This may take several minutes.
docker build -t aloha-server .
```

Next, create a local Python environment for the webcam client.
```
# From the project root directory
# Create and activate a virtual environment
python -m venv .venv
.\.venv\Scripts\activate

# Install required packages
pip install mediapipe opencv-python
```
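Before wiring everything together, you can sanity-check the client install with a few lines (assumes your webcam is device 0):

```python
import cv2
import mediapipe as mp

cap = cv2.VideoCapture(0)   # default webcam
ok, _ = cap.read()
cap.release()
print("webcam ok:", ok, "| mediapipe:", mp.__version__)
```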
The system requires two terminals running simultaneously.

Terminal 1: Launch the Server (Docker)
```
# Start the Docker container with display forwarding and volume mounting
docker run -it --rm -e "DISPLAY=host.docker.internal:0.0" -v "%cd%:/app" -v /app/.venv aloha-server

# Inside the container, activate the environment and run the server script
source .venv/bin/activate
python control/hand_teleop.py
```

A MuJoCo window showing the robot should appear on your desktop. The server is now listening.
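Conceptually, the server pairs a UDP receive with a differential-IK step. The sketch below shows the shape of that loop using mink's task-based API; the model path, site name, costs, and the `receive_hand_coords` helper are all assumptions for illustration, and the real logic lives in control/hand_teleop.py:

```python
import mujoco
import numpy as np
import mink

model = mujoco.MjModel.from_xml_path("aloha_scene.xml")  # assumed model path
configuration = mink.Configuration(model)

# One frame task per arm; "left_gripper" is an assumed site name in the model
left_task = mink.FrameTask(
    frame_name="left_gripper",
    frame_type="site",
    position_cost=1.0,
    orientation_cost=0.1,
)

dt = 0.02
while True:
    target = receive_hand_coords()  # hypothetical UDP receive (see sketch above)
    left_task.set_target(
        mink.SE3.from_rotation_and_translation(mink.SO3.identity(), np.asarray(target))
    )
    # Solve for joint velocities that move the gripper toward the target,
    # then integrate them into the configuration for this timestep.
    # "quadprog" assumes that qpsolvers backend is installed.
    velocity = mink.solve_ik(configuration, [left_task], dt, solver="quadprog")
    configuration.integrate_inplace(velocity, dt)
```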
Terminal 2: Launch the Client (Windows)
1. Find the Docker container's IP address. Open a new terminal and run:

   ```
   wsl -d Docker-Desktop ifconfig
   ```

   Look for the `inet` address under the `eth0` interface (e.g., `172.x.x.x`).

2. Set the environment variable for the server's IP:

   ```
   # For Command Prompt:
   set ALOHA_SERVER_IP=172.x.x.x

   # For PowerShell:
   $env:ALOHA_SERVER_IP="172.x.x.x"
   ```

   (Replace `172.x.x.x` with the actual IP you found.)

3. Run the client script from your activated local environment:

   ```
   # Make sure you're in the project root and your venv is active
   python run_client.py
   ```
A webcam window will open. Your hand movements will now control the robot in the MuJoCo simulation!
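Under the hood, the client loop is conceptually: grab a frame, run MediaPipe Hands, and ship the wrist landmark to the server. A minimal sketch, again assuming the illustrative port 5005 and a single-wrist JSON message (run_client.py tracks more than this):

```python
import json
import os
import socket

import cv2
import mediapipe as mp

server_ip = os.environ.get("ALOHA_SERVER_IP", "127.0.0.1")
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV captures BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        wrist = results.multi_hand_landmarks[0].landmark[0]  # landmark 0 = wrist
        sock.sendto(json.dumps([wrist.x, wrist.y, wrist.z]).encode(),
                    (server_ip, 5005))
    cv2.imshow("hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```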
- This project stands on the shoulders of giants. It is heavily inspired by the work of AlmondGod on the Nintendo-Aloha project.
- The simulation environment is built upon the excellent bigym framework.
- Inverse Kinematics are solved using the powerful mink library.
```bibtex
@article{chernyadev2024bigym,
  title={BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark},
  author={Chernyadev, Nikita and Backshall, Nicholas and Ma, Xiao and Lu, Yunfan and Seo, Younggyo and James, Stephen},
  journal={arXiv preprint arXiv:2407.07788},
  year={2024}
}
```