
Commit a153041

Merge pull request #6 from xpfgg/main
Change VLA language to Language-Guided Manipulation
2 parents 912f75f + 68214bd

File tree: 1 file changed (+5, -7 lines)


README.md

Lines changed: 5 additions & 7 deletions
```diff
@@ -1,8 +1,6 @@
 # Vision-Guided Manipulation with Gemini & Robot Arms
 
-This repository contains a demonstration of a Vision-Language-Action (VLA)
-system using the SO-101 robot arm, an attached USB camera, and Gemini
-models for zero-shot object detection and pointing.
+This repository contains a demonstration of Language-Guided Manipulation using the SO-101 robot arm. It utilizes Gemini (gemini-3-flash-preview by default or gemini-robotics-er-1.5-preview) to enable zero-shot detection and pointing. Gemini is used to process images and return the 2D pixel coordinates of objects in the query. Then, the script uses a kinematics module to calculate the joint angles required to reach a target position in 3D space.
 
 ## 1. Prerequisites
 
```
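The new description says Gemini returns the 2D pixel coordinates of the queried objects. As a minimal sketch of what consuming that reply might look like: pointing prompts for Gemini typically ask for JSON of the form `[{"point": [y, x], "label": ...}]` with coordinates normalized to a 0-1000 grid, so the raw response has to be parsed and rescaled to image pixels. The function name and exact response shape here are illustrative assumptions, not code from `workshop.py`.

```python
import json

def parse_gemini_points(response_text: str, img_w: int, img_h: int):
    """Convert a Gemini pointing reply into pixel coordinates.

    Assumes (illustrative) a JSON list like
    [{"point": [y, x], "label": "blue block"}] with values
    normalized to a 0-1000 grid, as pointing prompts commonly request.
    """
    points = []
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]
        # Scale the normalized 0-1000 grid to actual image pixels.
        points.append((item.get("label", ""),
                       x_norm / 1000.0 * img_w,
                       y_norm / 1000.0 * img_h))
    return points

# Example: a single object detected in a 640x480 frame.
reply = '[{"point": [500, 250], "label": "blue block"}]'
print(parse_gemini_points(reply, 640, 480))
# -> [('blue block', 160.0, 240.0)]
```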
```diff
@@ -41,23 +39,23 @@ The recommended way to set up your environment is based on the
 
 ## 2. Conceptual Walkthrough
 
-This script (`workshop.py`) integrates vision, language, and action into a single loop. Here is how it works:
+This script (`workshop.py`) integrates vision, language, and action through a single loop. Here is how it works:
 
 ### 1. Initialization
 - **Robot**: Connects to the SO-101 arm via serial port.
 - **Camera**: Opens the USB camera for video capture.
 - **Gemini**: Initializes the Google GenAI client with your API key.
-- **Kinematics**: Loads the MuJoCo physics engine to calculate how to move the robot's joints to reach specific 3D coordinates (Inverse Kinematics).
+- **Kinematics**: Loads a MuJoCo model of the robot to calculate the joint angles required to reach specific 3D coordinates using Inverse Kinematics.
 
 ### 2. Calibration (The "Eye-Hand" Connection)
 Before the robot can point at what it sees, it needs to know how pixels in the camera image relate to meters in the real world.
 - The script looks for a **ChArUco board** on the table.
 - It calculates a **Homography Matrix**, which maps 2D image points to 2D table coordinates.
 - This calibration is saved to `homography_calibration.npy` so you don't have to recalibrate every time.
 
-### 3. The Main Loop (Vision-Language-Action)
+### 3. The Vision-Guided Control Loop
 Once calibrated, the system enters a continuous loop:
-1. **Observe**: The camera shows a live feed of the workspace.
+1. **Observe**: The camera captures a static image of the workspace.
 2. **Ask**: You type a natural language query (e.g., "Where is the blue block?").
 3. **Think (Gemini)**: The system sends the current image and your text to Gemini. Gemini analyzes the image and returns the 2D pixel coordinates of the object.
 4. **Ground**: The script uses the calibration matrix to convert those pixels into real-world robot coordinates (X, Y, Z).
```
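The **Ground** step in the walkthrough applies the saved 3x3 homography to a detected pixel. A minimal NumPy sketch of that mapping, including the perspective divide: the example matrix below is made up for illustration (a real one would come from the ChArUco calibration, e.g. via `np.load("homography_calibration.npy")`).

```python
import numpy as np

def pixel_to_table(H: np.ndarray, u: float, v: float) -> tuple[float, float]:
    """Map an image pixel (u, v) to 2D table coordinates (meters)
    via a 3x3 homography, applying the perspective divide."""
    x, y, w = H @ np.array([u, v, 1.0])
    return (float(x / w), float(y / w))

# Illustrative homography: roughly 1 px -> 1 mm plus an origin shift.
H = np.array([
    [0.001, 0.0,   -0.2],
    [0.0,   0.001, -0.1],
    [0.0,   0.0,    1.0],
])
print(pixel_to_table(H, 320.0, 240.0))
# -> approximately (0.12, 0.14)
```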

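The walkthrough's kinematics step solves Inverse Kinematics with a MuJoCo model of the SO-101. To illustrate the idea without MuJoCo, here is the classic closed-form IK for a planar 2-link arm; the link lengths are hypothetical and this is not the repository's solver, just a sketch of what "compute joint angles to reach a target" means.

```python
import math

def two_link_ik(x: float, y: float, l1: float, l2: float):
    """Closed-form IK for a planar 2-link arm: returns
    (shoulder, elbow) angles in radians that place the end
    effector at (x, y), or None if the target is unreachable."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        return None  # outside the reachable workspace
    elbow = math.acos(cos_elbow)  # elbow-down solution
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

# Forward-check: the solved angles should land back on the target.
s, e = two_link_ik(0.2, 0.1, 0.15, 0.15)
fx = 0.15 * math.cos(s) + 0.15 * math.cos(s + e)
fy = 0.15 * math.sin(s) + 0.15 * math.sin(s + e)
print(round(fx, 6), round(fy, 6))
# -> 0.2 0.1
```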