
Commit dd1f823
update docs
1 parent a0a6e0e commit dd1f823

File tree: 3 files changed (+20, -7 lines)

README.md

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
-# <a href="https://https://demo-generation.github.io/">𝑫𝒆𝒎𝒐𝑮𝒆𝒏: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning</a>
+# <a href="https://demo-generation.github.io/">𝑫𝒆𝒎𝒐𝑮𝒆𝒏: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning</a>

-<a href="https://https://demo-generation.github.io/"><strong>Project Page</strong></a> | <a href="https://arxiv.org/abs/2502.16932"><strong>arXiv</strong></a> | <a href="https://x.com/ZhengrongX/status/1899134914416800123"><strong>Twitter</strong></a>
+<a href="https://demo-generation.github.io/"><strong>Project Page</strong></a> | <a href="https://arxiv.org/abs/2502.16932"><strong>arXiv</strong></a> | <a href="https://x.com/ZhengrongX/status/1899134914416800123"><strong>Twitter</strong></a>
@@ -25,7 +25,7 @@ For action generation, 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 adopts the idea of Task and
* **2025/04/02**, Officially released 𝑫𝒆𝒎𝒐𝑮𝒆𝒏.

-# 🚀 Quick Try
+# 🚀 Quick Try in 5 Minutes
## 1. Minimal Installation
#### 1.0. Create conda Env
```bash

docs/1_data_collection.md

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,6 @@
# Data Collection (for Your Own Task)

-We provide some source demos under the `data/datasets/source` folder. If you only want to get a sense of how 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 works, you can directly start from the provided demos, and jump to [data_generation](./2_data_generation.md). If you want to collect your own data, you can follow the steps below.
+We provide some source demos under the `data/datasets/source` folder. If you only want to get a sense of how 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 works, you can start directly from the provided demos and jump to the Quick Try in the [README](../README.md) or the instructions in [data_generation](./2_data_generation.md). If you want to collect your own data, you can follow the steps below.
@@ -18,8 +18,10 @@ These dimension informations should be specified in the `shape_meta:` configurat
## Data Requirements
𝑫𝒆𝒎𝒐𝑮𝒆𝒏 can be applied to various platforms, including bimanual manipulation and dexterous-hand end-effectors. We provide an interface for collecting demos with keyboard for your reference in `real_world/collect_demo.py`.

-To facilitate synthetic generation of visual observations, 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 require the access to **3D point clouds**. This asks for preliminary camera calibration of the depth camera. You can follow the procedures in a [note](https://gist.github.com/hshi74/edabc1e9bed6ea988a2abd1308e1cc96) by Haochen Shi. The camera-related parameters should be noted in the beginning of `real_world/utils/pcd_process.py`.
+To facilitate synthetic generation of visual observations, 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 requires access to **3D point clouds**. This calls for preliminary calibration of the depth camera; you can follow the procedures in this [note](https://gist.github.com/hshi74/edabc1e9bed6ea988a2abd1308e1cc96) by Haochen Shi. The camera-related parameters should be specified at the beginning of `real_world/utils/pcd_process.py`.
+
+Similar to many previous works that use 3D point clouds as the visual observation, unrelated points (i.e., those from the background and the table surface) should be **cropped out**. Once the camera calibration is ready, this can easily be done by specifying a workspace bounding box and discarding all points outside the workspace.

The point cloud we use is projected from **single-view** depth image instead of multi-view, since (1) the calibration process is time-consuming, (2) single-view camera is more practical for mobile platforms, e.g., ego-centric vision on a humanoid.

-Like DP3, we recommend the use of RealSense **L515** rather than the more commonly seen D435, because L515 captures higher-quality point clouds, e.g., fewer holes on the object surface, clearer boundaries between objects and background. We add a DBSCAN clustering step to discard the outlier points in the processing pipeline, which we found could effectively improve the quality of point clouds.
+Like DP3, we recommend the RealSense **L515** over the more commonly seen D435, because the L515 captures higher-quality point clouds, e.g., fewer holes on object surfaces and clearer boundaries between objects and the background. We add a DBSCAN clustering step to the processing pipeline to discard outlier points, which we found effectively improves point cloud quality. Afterwards, the point cloud is downsampled with farthest point sampling (FPS) to a fixed number of points, e.g., `1024`.

docs/2_data_generation.md

Lines changed: 12 additions & 1 deletion
@@ -1,3 +1,14 @@
# Data Generation with 𝑫𝒆𝒎𝒐𝑮𝒆𝒏

-TBD.
+𝑫𝒆𝒎𝒐𝑮𝒆𝒏 is designed for the automatic generation of synthetic demonstrations. The unavoidable human effort in the 𝑫𝒆𝒎𝒐𝑮𝒆𝒏 pipeline lies in pre-processing, i.e., (1) segmenting the point cloud observation *only for the first frame*, and (2) parsing the source trajectory into object-centric segments.
+
+## Point Cloud Segmentation
+Once the unrelated points outside the workspace have been excluded and the point cloud processed with clustering and FPS, the points in the first-frame cloud should belong to either the robot end-effector or the object(s). In many cases, they can be easily separated by manually specifying a bounding box for the object; the rest of the cloud is then assigned to the robot end-effector.
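
As a rough illustration only (not the repository's code), such a bounding-box split of the first-frame cloud might look like the following, with NumPy and a hypothetical, task-specific object box:

```python
import numpy as np

# Hypothetical object bounding box in the workspace frame (meters); chosen per task.
OBJ_MIN = np.array([0.05, -0.10, 0.00])
OBJ_MAX = np.array([0.20, 0.10, 0.08])

def split_first_frame(points: np.ndarray):
    """Split the first-frame cloud into (object points, end-effector points)."""
    in_box = np.all((points >= OBJ_MIN) & (points <= OBJ_MAX), axis=1)
    return points[in_box], points[~in_box]
```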
+
+We also provide a more elegant way to automate this process by leveraging open-vocabulary segmentation models (e.g., [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) or [LangSAM](https://github.com/luca-medeiros/lang-segment-anything)). More specifically, we only need to describe the manipulated objects in natural language. Taking the language prompt as input, these models segment the corresponding objects in the RGB image. Since the RGB and depth images are pixel-aligned, we can then project the segmentation masks onto the depth image to obtain the point cloud segmentation. An implementation of this process is provided in `demo_generation/demo_generation/mask_util.py`.
+
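
For intuition, a hedged sketch of the mask-to-point-cloud projection (not the actual `mask_util.py` implementation), assuming a boolean mask already predicted on the pixel-aligned RGB image and known pinhole intrinsics:

```python
import numpy as np

def mask_to_points(depth: np.ndarray, mask: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project the depth pixels selected by a 2D segmentation mask into 3D.

    depth: (H, W) depth image in meters, pixel-aligned with the RGB image.
    mask:  (H, W) boolean mask from an open-vocabulary segmentation model.
    K:     (3, 3) camera intrinsics.
    """
    v, u = np.nonzero(mask & (depth > 0))   # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]         # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]         # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) object points in the camera frame
```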
+## Source Trajectory Parsing
+The source trajectory needs to be parsed into object-centric segments. Each object manipulated in the task is associated with two sub-segments: (1) a *motion* segment that approaches the object, and (2) a *skill* segment that manipulates the object through contact. We offer two options for trajectory parsing. The more straightforward one is to manually specify the start frame of each sub-segment: run the demo generation code with `generation:range_name: src` and `generation:render_video: True` to obtain a rendered video of the source demonstration, then read off the parsing frames from the frame index marked in the top-left corner of the video.
+
+Alternatively, we provide a more automated way: checking whether the distance between the robot end-effector and the object point cloud falls below a threshold. While this automates the parsing process, it may require some manual tuning of the threshold, and is therefore not always as practical as manual specification. The implementations are provided in the `parse_frames_two_stage` and `parse_frames_one_stage` functions in `demo_generation/demo_generation/demogen.py`.
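
For illustration, a minimal sketch of such distance-threshold parsing (not the actual `parse_frames_one_stage` / `parse_frames_two_stage` code), assuming per-frame end-effector positions and a segmented object cloud:

```python
import numpy as np

def parse_contact_frame(eef_positions: np.ndarray,
                        object_points: np.ndarray,
                        threshold: float = 0.02) -> int:
    """Return the first frame where the end-effector comes within `threshold`
    meters of the object point cloud, i.e., the motion -> skill boundary."""
    for t, eef in enumerate(eef_positions):           # eef_positions: (T, 3)
        dists = np.linalg.norm(object_points - eef, axis=1)
        if dists.min() < threshold:
            return t
    return len(eef_positions) - 1                      # fallback: no contact detected
```

The `threshold` (2 cm here) is exactly the parameter that tends to need per-task tuning, which is why manual specification often remains the more practical choice.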
