Chuhao Liu1, Zhijian Qiao1, Jieqi Shi2,*, Ke Wang3, Peize Liu1 and Shaojie Shen1
1HKUST Aerial Robotics Group
2Nanjing University
3Chang'an University
*Corresponding Author
- [21 Apr 2025] Publish the initial version of the code.
- [19 Apr 2025] Our paper is accepted by IEEE T-RO as a regular paper.
- [8 Oct 2024] Paper submitted to IEEE T-RO.
In this work, we learn to register two semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. To achieve generalizable registration in the real world, we design a scene graph network that encodes multiple modalities of semantic nodes: open-set semantic features, local topology with spatial awareness, and shape features. SG-Reg represents a dense indoor scene with coarse node features and dense point features. In multi-agent SLAM systems, this representation supports both coarse-to-fine localization and bandwidth-efficient communication. We generate semantic scene graphs using vision foundation models and the semantic mapping module FM-Fusion, which eliminates the need for ground-truth semantic annotations and enables fully self-supervised network training. We evaluate our method on real-world RGB-D sequences: ScanNet, 3RScan, and data self-collected with a RealSense D435i.
Create a conda virtual environment,
conda create -n sgreg python=3.9
Install PyTorch 2.1.2 and other dependencies.
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
python setup.py build develop
Download the 3RScan (RIO) data via the Nutstore (坚果云) link and extract it into a RIO_DATAROOT directory. The data are organized in the following structure:
|--val
|--scenexxxx_00a % each individual scene graph
|-- ....
|--splits
|-- val.txt
|--gt
|-- SRCSCENE-REFSCENE.txt % T_ref_src
|--matches
|-- SRCSCENE-REFSCENE.pth % ground-truth node matches
|--output
|--CHECKPOINT_NAME % default: sgnet_scannet_0080
|--SRCSCENE-REFSCENE % results of scene pair
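The ground-truth files above are plain text. Assuming each SRCSCENE-REFSCENE.txt stores the 4x4 transform T_ref_src as four whitespace-separated rows (an assumption about the layout, not something the repo documents here), a minimal stdlib sketch to load it could look like:

```python
from pathlib import Path


def load_gt_transform(path):
    """Parse a 4x4 homogeneous transform from a plain-text file.

    Assumes four rows of four whitespace-separated floats (row-major
    T_ref_src); adjust the parsing if the files use another layout.
    """
    rows = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([float(v) for v in line.split()])
    assert len(rows) == 4 and all(len(r) == 4 for r in rows), "expected a 4x4 matrix"
    return rows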
We also provide another 50 pairs of ScanNet scenes. Please download the ScanNet data using this Nutstore (坚果云) link. They are organized in the same data structure as the 3RScan data.
*Note: We did not use any ground-truth semantic annotation from 3RScan or ScanNet. The downloaded scene graphs are reconstructed using FM-Fusion. You can also download the original RGB-D sequences and build your scene graphs using FM-Fusion. If you want to try, ScanNet sequences should be easier to start with.
Open config/rio.yaml and set its dataroot entry to the RIO_DATAROOT
directory on your machine. Then run the inference program,
python sgreg/val.py --cfg_file config/rio.yaml
It runs inference on all of the downloaded 3RScan scene pairs. The registration results, including matched nodes, point correspondences, and the predicted transformation, are saved at RIO_DATAROOT/output/CHECKPOINT_NAME/SRCSCENE-REFSCENE. You can then visualize the registration results,
python sgreg/visualize.py --dataroot ${RIO_DATAROOT} --viz_mode 1 --find_gt --viz_translation [3.0,5.0,0.0]
It should visualize the results as below,
On the left column, you can select the entities you want to visualize. If you run the program on a remote server, rerun supports remote visualization (see rerun connect_tcp). Check the argument descriptions in visualize.py to customize your visualization.
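Beyond the viewer, the saved results can be inspected programmatically. For instance, a predicted transformation can be applied to the source points to check the alignment yourself; a minimal stdlib sketch, assuming a row-major 4x4 matrix and a list of xyz points (names are illustrative, not the repo's API):

```python
def apply_transform(T, points):
    """Apply a 4x4 homogeneous transform T (row-major, list of lists)
    to an iterable of (x, y, z) points, returning transformed tuples."""
    out = []
    for x, y, z in points:
        out.append(tuple(
            T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
            for i in range(3)
        ))
    return out
```

Calling apply_transform(T_ref_src, src_points) maps source-frame points into the reference frame, where they should overlap the reference scene if the registration succeeded.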
[Optional] If you want to evaluate SG-Reg on ScanNet sequences, adjust the running options as below,
python sgreg/val.py --cfg_file config/scannet.yaml
python sgreg/visualize.py --dataroot ${SCANNET_DATAROOT} --viz_mode 1 --augment_transform --viz_translation [3.0,5.0,0.0]
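When the ground-truth T_ref_src is available, registration accuracy is commonly summarized as a rotation error (in degrees) and a translation error (Euclidean distance). A minimal stdlib sketch over row-major 4x4 matrices, offered as an illustration rather than the evaluation code used in the paper:

```python
import math


def registration_error(T_est, T_gt):
    """Return (rotation error in degrees, translation error) between
    two row-major 4x4 homogeneous transforms."""
    # Trace of R_gt^T @ R_est over the top-left 3x3 rotation blocks.
    trace = sum(T_gt[k][i] * T_est[k][i] for i in range(3) for k in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    rot_err_deg = math.degrees(math.acos(cos_angle))
    trans_err = math.sqrt(sum((T_est[i][3] - T_gt[i][3]) ** 2 for i in range(3)))
    return rot_err_deg, trans_err
```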
We believe generalization remains a key challenge in 3D semantic perception. If you are interested in this task, we encourage you to collect your own RGB-D sequences for evaluation. This requires VINS-Mono to compute camera poses, Grounded-SAM to generate semantic labels, and FM-Fusion to reconstruct a semantic scene graph. We will add detailed instructions later on how to build your own data.
- Release the scene graph network code and verify its inference.
- Remove unnecessary dependencies.
- Clean the data structure.
- Visualize the results.
- Provide RIO scene graph data for download.
- Provide network weight for download.
- Publish checkpoint on Huggingface Hub and reload.
- Registration back-end with a Python interface. (The version used in the paper is implemented in C++.)
- Validate the entire system on a new computer.
- A tutorial for running the validation.
We will continue to maintain this repo. If you encounter any problems using it, feel free to open an issue. We'll try to help.
We used some code from GeoTransformer, SG-PGM, and LightGlue. SkyLand provided a LiDAR-camera suite that allowed us to evaluate SG-Reg in large-scale scenes (as demonstrated at the end of the video).
The source code is released under the GPLv3 license. For technical issues, please contact Chuhao Liu ([email protected]).