This repository contains the implementation of the paper Language-Assisted 3D Feature Learning for Semantic Scene Understanding.
The code was developed and tested on Ubuntu 18.04 with PyTorch 1.6.0 and CUDA 10.2. Please run the following commands to create a conda environment and install PyTorch:
conda create -n lang-3d python=3.8
conda activate lang-3d
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
Install the necessary packages listed in requirements.txt:
pip install -r requirements.txt
After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:
cd lib/pointnet2
python setup.py install
Before moving on to the next step, please set CONF.PATH.BASE in lib/config.py to the project root path.
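As a rough illustration of what this setting amounts to, the sketch below builds a small config object with a PATH.BASE field and derives a data path from it. The SimpleNamespace container, the placeholder path, and the DATA field are assumptions made for illustration only; the actual contents of lib/config.py may differ.

```python
import os
from types import SimpleNamespace

# Hypothetical stand-in for the CONF object defined in lib/config.py.
CONF = SimpleNamespace()
CONF.PATH = SimpleNamespace()

# Placeholder value: replace it with the absolute path of your local clone.
CONF.PATH.BASE = "/path/to/project/root"

# Illustrative derived path (an assumption): the data folder under the root.
CONF.PATH.DATA = os.path.join(CONF.PATH.BASE, "data")

print(CONF.PATH.DATA)
```

Using an absolute path here keeps the derived paths valid regardless of the directory the scripts are launched from.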
We use the data generated by the ScanRefer codebase. Follow the Data preparation guide to preprocess the data, then put the result under the data folder.
Please follow the README under the language_parser folder, then copy the generated files ScanRefer_filtered_train_parser.json and ScanRefer_filtered_val_parser.json to the data folder.
For convenience, we also release the language parser results ScanRefer_filtered_train_parser.json and ScanRefer_filtered_val_parser.json on the Releases page.
Finally, the dataset files should be organized as follows.
data
├── glove.p
├── scannet
│   ├── batch_load_scannet_data.py
│   ├── load_scannet_data.py
│   ├── meta_data/
│   ├── model_util_scannet.py
│   ├── README.md
│   ├── scannet_data/
│   ├── scannet_utils.py
│   ├── scans/
│   └── visualize.py
├── ScanRefer_filtered_train_parser.json
└── ScanRefer_filtered_val_parser.json
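Before training, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name find_missing_entries and the choice of which entries to check are assumptions, and the script assumes it is run from the project root.

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to the data folder.
EXPECTED_ENTRIES = [
    "glove.p",
    "scannet/meta_data",
    "scannet/scannet_data",
    "scannet/scans",
    "ScanRefer_filtered_train_parser.json",
    "ScanRefer_filtered_val_parser.json",
]


def find_missing_entries(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = find_missing_entries("data")
    if missing:
        print("Missing entries:")
        for entry in missing:
            print("  " + entry)
    else:
        print("All expected entries are present.")
```

The check only tests for existence; it does not validate the contents of any file.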
To train the ScanRefer model for detection with RGB values:
python -u -m torch.distributed.launch --nproc_per_node=8 scripts/train.py --use_color --relation_prediction --color_prediction --size_prediction --shape_prediction --no_reference --batch_size 12 --val_step 1 --lr 8e-3 --epoch 60
To train the ScanRefer model for detection with multiview features:
python -u -m torch.distributed.launch --nproc_per_node=8 scripts/train.py --use_multiview --use_normal --relation_prediction --color_prediction --size_prediction --shape_prediction --no_reference --batch_size 12 --val_step 1 --lr 8e-3 --epoch 60
To train the ScanRefer model for visual grounding with RGB values:
python -u -m torch.distributed.launch --nproc_per_node=8 scripts/train.py --use_color --relation_prediction --color_prediction --size_prediction --shape_prediction --batch_size 12 --val_step 1 --lr 8e-3 --epoch 60
To train the ScanRefer model for visual grounding with multiview features:
python -u -m torch.distributed.launch --nproc_per_node=8 scripts/train.py --use_multiview --use_normal --relation_prediction --color_prediction --size_prediction --shape_prediction --batch_size 12 --val_step 1 --lr 8e-3 --epoch 60
To evaluate the trained ScanRefer models for detection, please find the folder under outputs/ with the current timestamp and run:
python scripts/eval.py --folder <folder_name> --detection --use_color --no_nms --force --repeat 5
To evaluate the trained ScanRefer models for visual grounding, please find the folder under outputs/ with the current timestamp and run:
python scripts/eval.py --folder <folder_name> --reference --use_color --no_nms --force --repeat 5
To visualize the localization results predicted by the trained ScanRefer model in a specific scene, please find the corresponding folder under outputs/ with the current timestamp and run:
python scripts/visualize.py --folder <folder_name> --scene_id <scene_id> --use_color
Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The output .ply files will be stored under outputs/<folder_name>/vis/<scene_id>/
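To look up the folder name and the flags used for training, a small helper along the following lines may be convenient. This is a sketch under stated assumptions rather than part of the repository: the function names are hypothetical, the most recent run is picked by modification time (an assumption, not something the training script guarantees), and info.json is only assumed to be valid JSON.

```python
import json
from pathlib import Path


def latest_output_folder(outputs_root="outputs"):
    """Return the most recently modified sub-folder of outputs_root, or None."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    folders = [p for p in root.iterdir() if p.is_dir()]
    if not folders:
        return None
    # Assumption: the newest run is the folder with the latest modification time.
    return max(folders, key=lambda p: p.stat().st_mtime)


def load_training_info(folder):
    """Load info.json from a training folder; return None if it is absent."""
    info_path = Path(folder) / "info.json"
    if not info_path.is_file():
        return None
    with info_path.open() as f:
        return json.load(f)


if __name__ == "__main__":
    folder = latest_output_folder()
    if folder is None:
        print("No training folder found under outputs/.")
    else:
        print("Latest folder:", folder.name)
        info = load_training_info(folder)
        if info is None:
            print("No info.json found in this folder.")
        else:
            # Print the stored training information to compare against the flags.
            print(json.dumps(info, indent=2))
```

The printed folder name can then be passed as <folder_name> to the evaluation and visualization commands above.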
If you find our work helpful for your research, please consider citing our paper:
@inproceedings{zhang2022language,
title={Language-Assisted 3D Feature Learning for Semantic Scene Understanding},
author={Zhang, Junbo and Fan, Guofan and Wang, Guanghan and Su, Zhengyuan and Ma, Kaisheng and Yi, Li},
booktitle={AAAI},
year={2023}
}
Our code is based on SceneGraphParser and ScanRefer. We thank the authors of both projects.
Language-Assisted-3D is released under the MIT License. See the LICENSE file for more details.
