This repository contains a demonstration of Language-Guided Manipulation using the SO-101 robot arm. It uses Gemini (`gemini-3-flash-preview` by default, or `gemini-robotics-er-1.5-preview`) for zero-shot object detection and pointing: Gemini processes camera images and returns the 2D pixel coordinates of the queried objects, and a kinematics module then calculates the joint angles required to reach the corresponding target position in 3D space.
Before running the script, set up your environment and install dependencies.
The recommended way to set up your environment is based on the LeRobot installation instructions.
- Install Conda (Miniforge recommended):

  ```bash
  wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
  bash Miniforge3-$(uname)-$(uname -m).sh
  ```

- Create and activate the environment:

  ```bash
  conda create -y -n gemini-robotics-pointing python=3.10
  conda activate gemini-robotics-pointing
  ```

- Install ffmpeg:

  ```bash
  conda install ffmpeg -c conda-forge
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
This script (`workshop.py`) integrates vision, language, and action into a single loop. Here is how it works:
- Robot: Connects to the SO-101 arm via serial port.
- Camera: Opens the USB camera for video capture.
- Gemini: Initializes the Google GenAI client with your API key.
- Kinematics: Loads a MuJoCo model of the robot to calculate the joint angles required to reach specific 3D coordinates using Inverse Kinematics.
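The script's solver works on the full SO-101 MuJoCo model, but the core idea of inverse kinematics is easier to see on a planar 2-link arm. The sketch below is purely illustrative (made-up link lengths, analytic elbow-down solution), not the script's actual solver:

```python
import numpy as np

def two_link_ik(x, y, l1=0.15, l2=0.15):
    """Analytic IK for a planar 2-link arm (illustrative link lengths).

    Returns (shoulder, elbow) joint angles in radians that place the
    end effector at (x, y).
    """
    r2 = x**2 + y**2
    # Law of cosines gives the elbow angle directly.
    cos_elbow = (r2 - l1**2 - l2**2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target out of reach")
    elbow = np.arccos(cos_elbow)
    shoulder = np.arctan2(y, x) - np.arctan2(
        l2 * np.sin(elbow), l1 + l2 * np.cos(elbow))
    return shoulder, elbow

def forward(shoulder, elbow, l1=0.15, l2=0.15):
    """Forward kinematics, used here to verify the IK solution."""
    x = l1 * np.cos(shoulder) + l2 * np.cos(shoulder + elbow)
    y = l1 * np.sin(shoulder) + l2 * np.sin(shoulder + elbow)
    return x, y

s, e = two_link_ik(0.2, 0.1)
print(forward(s, e))  # ≈ (0.2, 0.1)
```

MuJoCo-based IK generalizes this to all of the arm's joints by iteratively minimizing the end-effector position error instead of solving in closed form.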
Before the robot can point at what it sees, it needs to know how pixels in the camera image relate to meters in the real world.
- The script looks for a ChArUco board on the table.
- It calculates a Homography Matrix, which maps 2D image points to 2D table coordinates.
- This calibration is saved to `homography_calibration.npy` so you don't have to recalibrate every time.
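Under the hood, this calibration amounts to fitting a 3x3 homography from pixel/table point correspondences. The script likely relies on OpenCV's ChArUco utilities for detection; the numpy-only sketch below (with synthetic correspondences) shows the underlying Direct Linear Transform fit:

```python
import numpy as np

def fit_homography(img_pts, table_pts):
    """Estimate a 3x3 homography mapping image pixels to table
    coordinates via the Direct Linear Transform (DLT)."""
    A = []
    for (u, v), (x, y) in zip(img_pts, table_pts):
        # Each correspondence contributes two linear constraints on H.
        A.append([-u, -v, -1, 0, 0, 0, u * x, v * x, x])
        A.append([0, 0, 0, -u, -v, -1, u * y, v * y, y])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Synthetic example: four detected board corners (pixels) and their
# known table-frame positions (meters). Real values come from the
# ChArUco detection.
img = [(100, 100), (500, 100), (500, 400), (100, 400)]
tbl = [(0.29, 0.0525), (0.29, -0.10), (0.45, -0.10), (0.45, 0.0525)]
H = fit_homography(img, tbl)
# np.save("homography_calibration.npy", H)  # persisted for later runs
```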
Once calibrated, the system enters a continuous loop:
- Observe: The camera captures a static image of the workspace.
- Ask: You type a natural language query (e.g., "Where is the blue block?").
- Think (Gemini): The system sends the current image and your text to Gemini. Gemini analyzes the image and returns the 2D pixel coordinates of the object.
- Ground: The script uses the calibration matrix to convert those pixels into real-world robot coordinates (X, Y, Z).
- Act: The robot calculates the necessary joint angles (Inverse Kinematics) and moves its arm to point at the object.
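The "Ground" step above is a perspective transform through the saved matrix, with a fixed Z attached since the table height is known. A minimal sketch (function name and the identity-like demo matrix are illustrative, not from the script):

```python
import numpy as np

def pixel_to_robot(H, u, v, z=0.02):
    """Map an image pixel (u, v) through homography H to table-plane
    (X, Y), attaching a fixed Z height above the table (2 cm here,
    matching the pointing height used in the demo)."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2], z

# Demo with a toy homography that simply scales pixels to meters.
H = np.diag([0.001, 0.001, 1.0])
print(pixel_to_robot(H, 300, 150))  # X=0.3 m, Y=0.15 m, Z=0.02 m
```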
To identify the USB port for your robot arm, use the `lerobot-find-port` command. To identify the correct camera index, use the `lerobot-find-cameras opencv` command.
Provide configuration parameters as command-line arguments.
Before running the script, manually position the arm so it does not block the camera's view of the ChArUco board.
To run the script, provide your API key and adjust the other required values as needed:
```bash
python workshop.py \
    --api-key "MY_GEMINI_API_KEY" \
    --port "/dev/tty.usbmodem12345" \
    --robot-id "my_so101_follower" \
    --camera-index 0
```

Need a Gemini API key? Go to Google AI Studio to get one!
To use an existing arm calibration file, include the `--calibration-dir` argument:
```bash
python workshop.py \
    --calibration-dir "path/to/calib_dir" \
    ...rest of args...
```

This should point to the directory that contains the calibration file, not the file itself. The filename must match the `--robot-id`, e.g. `my_so101_follower.json`.
The first time you run the script, or if you use the `--recalibrate` flag, it will perform the ChArUco calibration.
- Ensure the ChArUco board is in the camera's view and unobstructed before running the script.
- The script will display a window showing the detected board (if successful) and save the homography matrix to `homography_calibration.npy`. Press `q` to close the window and continue the script.
- The workshop's ChArUco board includes a registration outline that sets a known physical X (forward) and Y (left) distance from the robot's base to the Anchor Corner (ID 0) of the board. If you're placing the board differently, measure and provide the actual distance via the `--board-origin X Y` argument.
After successful calibration, the script will enter an interactive loop:
- The robot moves to the home position.
- The script prompts: "⌨️ What should I point at? (e.g., 'blue block', 'pen'):"
- The camera captures an image, and the image is sent to Gemini with your prompt.
- Gemini returns the 2D pixel coordinate of the object's center.
- The script uses the saved Homography matrix to convert the pixel coordinate to real-world (X, Y) robot coordinates.
- The robot executes a sequence: Home -> Hover 10cm above table -> Descend to Point 2cm above table.
- The loop repeats until you type `q`.
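Gemini's pointing responses are typically JSON with points given in `[y, x]` order, normalized to a 0-1000 range; that convention is an assumption here, so verify it against the model you actually call. A sketch of converting such a response to pixel coordinates:

```python
import json

def parse_points(response_text, img_w, img_h):
    """Convert a Gemini pointing response to pixel coordinates.

    Assumes points arrive as [y, x] pairs normalized to 0-1000
    (Gemini's documented pointing convention; confirm for your model).
    """
    points = []
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]
        points.append({
            "label": item.get("label", ""),
            "x": x_norm / 1000 * img_w,  # scale to image width
            "y": y_norm / 1000 * img_h,  # scale to image height
        })
    return points

sample = '[{"point": [500, 250], "label": "blue block"}]'
print(parse_points(sample, img_w=640, img_h=480))
# → [{'label': 'blue block', 'x': 160.0, 'y': 240.0}]
```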
| Argument | Description | Default Value | Required? |
|---|---|---|---|
| `--api-key` | Your Google AI Studio API key. | N/A | Yes |
| `--port` | Serial port connected to the robot arm. | N/A | Yes |
| `--robot-id` | ID of the robot arm; must match the calibration filename without extension. | N/A | Yes |
| `--camera-index` | Index of the USB camera (try 0, 1, or 2). | N/A | Yes |
| `--calibration-dir` | Directory containing the arm calibration files, when not using the default location or the `lerobot-calibrate` command. | N/A | No |
| `--board-origin` | X (forward) and Y (left) robot coordinates (meters) of the ChArUco board's Anchor Corner (ID 0). | 0.29 0.0525 | No |
| `--recalibrate` | Flag to force recalibration. | N/A | No |
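The table above maps onto an `argparse` definition along these lines. This is a sketch matching the table, not necessarily the script's exact definition:

```python
import argparse

def build_parser():
    """CLI matching the argument table above (illustrative sketch)."""
    p = argparse.ArgumentParser(description="Language-guided pointing demo")
    p.add_argument("--api-key", required=True,
                   help="Google AI Studio API key")
    p.add_argument("--port", required=True,
                   help="Serial port connected to the robot arm")
    p.add_argument("--robot-id", required=True,
                   help="Robot ID; must match the calibration filename")
    p.add_argument("--camera-index", type=int, required=True,
                   help="USB camera index (try 0, 1, or 2)")
    p.add_argument("--calibration-dir", default=None,
                   help="Directory containing arm calibration files")
    p.add_argument("--board-origin", nargs=2, type=float,
                   default=[0.29, 0.0525], metavar=("X", "Y"),
                   help="Robot-frame X/Y of the board's Anchor Corner (m)")
    p.add_argument("--recalibrate", action="store_true",
                   help="Force ChArUco recalibration")
    return p

# Example invocation with only the required arguments.
args = build_parser().parse_args(
    ["--api-key", "KEY", "--port", "/dev/ttyACM0",
     "--robot-id", "my_so101_follower", "--camera-index", "0"])
```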