# Vision-Guided Manipulation with Gemini & Robot Arms

This repository contains a demonstration of Language-Guided Manipulation using the SO-101 robot arm. It uses Gemini (`gemini-3-flash-preview` by default, or `gemini-robotics-er-1.5-preview`) for zero-shot object detection and pointing: Gemini processes camera images and returns the 2D pixel coordinates of the objects named in a query. A kinematics module then calculates the joint angles required to reach the corresponding target position in 3D space.

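Gemini's pointing responses arrive as text. The schema assumed below (a JSON list of `{"point": [y, x], "label": ...}` entries with coordinates normalized to a 0–1000 grid) follows the documented Gemini pointing format, but treat it as an assumption for your model version. A minimal parsing sketch:

```python
import json

def parse_points(response_text: str, width: int, height: int):
    """Convert Gemini point annotations into pixel coordinates.

    Assumes the model returns JSON like [{"point": [y, x], "label": ...}]
    with coordinates normalized to a 0-1000 grid (an assumption about the
    response schema; check the output of your model version).
    """
    points = json.loads(response_text)
    results = []
    for p in points:
        y_norm, x_norm = p["point"]
        u = x_norm / 1000.0 * width   # pixel column
        v = y_norm / 1000.0 * height  # pixel row
        results.append((p.get("label", ""), (u, v)))
    return results

# Illustrative response for a 640x480 image.
reply = '[{"point": [500, 250], "label": "blue block"}]'
print(parse_points(reply, 640, 480))  # → [('blue block', (160.0, 240.0))]
```
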
## 1. Prerequisites

## 2. Conceptual Walkthrough

This script (`workshop.py`) integrates vision, language, and action into a single loop. Here is how it works:

### 1. Initialization
- **Robot**: Connects to the SO-101 arm via serial port.
- **Camera**: Opens the USB camera for video capture.
- **Gemini**: Initializes the Google GenAI client with your API key.
- **Kinematics**: Loads a MuJoCo model of the robot to calculate the joint angles required to reach specific 3D coordinates using Inverse Kinematics.

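The repository solves inverse kinematics with its MuJoCo model; as a simplified illustration of the same idea, here is an analytic solver for a planar two-link arm (the link lengths are made-up values, not the SO-101's), verified by running the result back through forward kinematics:

```python
import math

def two_link_ik(x, y, l1=0.12, l2=0.12):
    """Analytic inverse kinematics for a planar two-link arm.

    A simplified stand-in for the MuJoCo-based solver in the repo: given a
    target (x, y) in the arm's plane, return shoulder and elbow angles
    (radians) that place the end effector there.
    """
    r2 = x * x + y * y
    # Law of cosines gives the elbow angle.
    c2 = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(c2) > 1:
        raise ValueError("target out of reach")
    theta2 = math.acos(c2)  # elbow-down solution
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2

def forward(theta1, theta2, l1=0.12, l2=0.12):
    """Forward kinematics, used here to verify the IK solution."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

t1, t2 = two_link_ik(0.15, 0.10)
print(forward(t1, t2))  # recovers approximately (0.15, 0.10)
```

The real 6-DoF solver is numerical rather than analytic, but the contract is the same: target position in, joint angles out.
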
### 2. Calibration (The "Eye-Hand" Connection)
Before the robot can point at what it sees, it needs to know how pixels in the camera image relate to meters in the real world.
- The script looks for a **ChArUco board** on the table.
- It calculates a **Homography Matrix**, which maps 2D image points to 2D table coordinates.
- This calibration is saved to `homography_calibration.npy` so you don't have to recalibrate every time.

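Applying the saved homography is a single matrix multiply followed by a perspective divide. The matrix values below are invented for illustration (roughly 1000 px per metre with an image-centre origin); in the real script the matrix comes from the ChArUco detection and is persisted with `np.save`:

```python
import numpy as np

def pixel_to_table(H, u, v):
    """Map an image pixel (u, v) to table-plane coordinates via homography H."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]  # perspective divide

# Illustrative matrix only: ~1000 px per metre, origin at pixel (320, 240).
H = np.array([[1e-3, 0.0, -0.32],
              [0.0, 1e-3, -0.24],
              [0.0,  0.0,  1.0]])
print(pixel_to_table(H, 420, 300))  # → approximately [0.10, 0.06] metres
```
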
### 3. The Vision-Guided Control Loop
Once calibrated, the system enters a continuous loop:
1. **Observe**: The camera captures a static image of the workspace.
2. **Ask**: You type a natural language query (e.g., "Where is the blue block?").
3. **Think (Gemini)**: The system sends the current image and your text to Gemini. Gemini analyzes the image and returns the 2D pixel coordinates of the object.
4. **Ground**: The script uses the calibration matrix to convert those pixels into real-world robot coordinates (X, Y, Z).
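
The steps above can be sketched as a single control step. Here `detect_stub` stands in for the Gemini call (which needs a camera frame and an API key), and the homography and table height are placeholder values:

```python
import numpy as np

def detect_stub(query):
    """Stand-in for the Gemini pointing call: returns a fixed pixel."""
    return (420.0, 300.0)

def pixel_to_table(H, u, v):
    """Map an image pixel (u, v) to table-plane coordinates via homography H."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]

def control_step(query, H, table_height=0.02):
    # Observe + Think: get the object's pixel from the (stubbed) model.
    u, v = detect_stub(query)
    # Ground: the homography maps the pixel onto the table plane.
    x, y = pixel_to_table(H, u, v)
    # The planar point plus a fixed hover height gives the 3D target
    # that the inverse-kinematics solver would be asked to reach.
    return (x, y, table_height)

# Illustrative calibration matrix (same placeholder values as above).
H = np.array([[1e-3, 0.0, -0.32],
              [0.0, 1e-3, -0.24],
              [0.0,  0.0,  1.0]])
print(control_step("Where is the blue block?", H))
```

In `workshop.py` this step runs inside a loop, with the resulting (X, Y, Z) target handed to the MuJoCo IK module to produce joint commands.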