This repository extends semantic_inference to provide closed-set and open-set semantic segmentation methods. Additionally, it provides methods to extract CLIP embeddings of objects and relational embeddings using Visual Language Models (VLMs).
These instructions assume ros-noetic-desktop-full is installed on Ubuntu 20.04.
Install the general dependencies:
```bash
sudo apt install python3-rosdep python3-catkin-tools
```

Clone the repository and initialize submodules:
```bash
git clone git@github.com:ntnu-arl/semantic_inference_ros.git
git submodule init
git submodule update --recursive
```

It is highly recommended to set up a Python virtual environment to run the ROS Python nodes:
```bash
cd /path/to/catkin_ws/src/semantic_inference/semantic_inference_python
python3.8 -m venv --system-site-packages ros_semantics_env
source ros_semantics_env/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Install ROS dependencies:
```bash
cd /path/to/catkin_ws/src
rosdep install --from-paths . --ignore-src -r -y
```

For closed-set segmentation, follow the setup instructions (skipping the Python utilities) in the semantic_inference closed-set docs.
Build the workspace:
```bash
catkin config -DCMAKE_BUILD_TYPE=Release
catkin build
```

Open-set segmentation consumes RGB-D images and camera information to perform semantic segmentation and extract open-vocabulary features for each object.
- Launch file: openset_segmentation.launch
- Configuration: openset_segmentation.yaml
Supported open-set detectors: YOLOe and YOLOw. Both can detect an arbitrary, user-specified list of object classes without re-training.
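For intuition, here is a minimal standalone sketch of open-vocabulary prompting using the ultralytics YOLO-World API. This is not the ROS node itself (the node reads its detector settings and class list from openset_segmentation.yaml), and the checkpoint name and class list below are placeholders:

```python
# Standalone open-vocabulary detection sketch: the class list is supplied at
# inference time, so no re-training is needed to change the vocabulary.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")  # placeholder checkpoint name
model.set_classes(["chair", "table", "laptop", "fire extinguisher"])

results = model.predict("example_frame.jpg")  # detections limited to that list
for box in results[0].boxes:
    print(results[0].names[int(box.cls)], float(box.conf))
```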
Launch the node:

```bash
roslaunch semantic_inference_ros openset_segmentation.launch
```

The VLM features node takes a segmented image along with its original RGB-D frame and computes visual features for each pair of detected objects. These features can be used to prompt a VLM to reason about their relationship.
- Launch file: vlm_features_node.launch
- Configuration: vlm.yaml
Supported VLMs: InstructBLIP and DeepSeek-VL2.
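For a rough picture of what this looks like outside ROS, here is a minimal InstructBLIP sketch using Hugging Face transformers. The checkpoint, image path, and prompt are placeholders; the node's actual prompting and preprocessing are configured through vlm.yaml:

```python
# Illustrative only: query InstructBLIP about an image crop containing a pair
# of detected objects (placeholder checkpoint, image, and prompt).
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("object_pair_crop.jpg")  # crop containing both objects
prompt = "What is the spatial relationship between the chair and the table?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```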
To use DeepSeek-VL2, first extract the visual encoder as a standalone model. For the large model (used in our experiments), we provide it here.
Alternatively, the models can be extracted with the following command (~100 GB of RAM is required for the large model):
```bash
python semantic_inference_python/scripts/extract_deepseek_visual.py --model_name <model to use> --output_path <path to store model>
```

Then, set the model path in vlm.yaml.
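As a quick sanity check that the extraction succeeded, the resulting file can be loaded with PyTorch. This sketch assumes the script serializes the encoder with torch.save; consult extract_deepseek_visual.py for the actual output format:

```python
# Hypothetical sanity check: assumes extract_deepseek_visual.py wrote a
# torch-serialized module; adjust to the script's actual output format.
import torch

encoder = torch.load(
    "/path/to/deepseek_visual_encoder.pt",  # the --output_path used above
    map_location="cpu",
    weights_only=False,  # required on torch>=2.6 to unpickle full modules
)
print(type(encoder))
```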
Launch the node:
```bash
roslaunch semantic_inference_ros vlm_features.launch
```

This section enables reasoning on the relationship-aware hierarchical scene graph:
- LLMs predict relevant objects and interactions for given tasks
- VLM responses are parsed by LLMs
- An OpenAI API key is required; run:
```bash
export OPENAI_API_KEY=<Your OpenAI API Key>
```

VLM reasoning is performed in the cloud. Use the DeepSeek-VL2 server code to run a FastAPI server.
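Once the server is running (setup steps below), the node queries it over HTTP. The sketch here is purely illustrative: the actual route, payload schema, port, and authentication scheme are defined by the server code and run_server.sh:

```python
# Hypothetical client sketch; match the route, payload, and auth scheme to the
# DeepSeek-VL2 server code before using.
import base64
import os

import requests

SERVER_URL = "http://localhost:8000"  # placeholder; set to your server's URL

with open("object_pair_crop.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    f"{SERVER_URL}/generate",  # placeholder route
    headers={"Authorization": f"Bearer {os.environ['FASTAPI_API_KEY']}"},
    json={
        "image": image_b64,
        "prompt": "Describe the relationship between the two objects.",
    },
    timeout=60,
)
print(response.json())
```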
Steps to set up the server:
- Clone the server repo:
  ```bash
  git clone git@github.com:ntnu-arl/DeepSeek-VL2.git -b server
  cd DeepSeek-VL2
  ```

- Set up the Python virtual environment:

  ```bash
  bash setup.sh
  ```

- Configure the server path, port, and API key in run_server.sh.
- Run the server (the model download may take some time):

  ```bash
  bash run_server.sh
  ```

Finally, set the server URL in vlm_for_navigation.yaml and export your FastAPI API key:
```bash
export FASTAPI_API_KEY=<Your server FastAPI Key>
```

If you find this work useful, please cite:

```bibtex
@inproceedings{puigjaner2026reasoninggraph,
  title={Relationship-Aware Hierarchical 3D Scene Graph},
  author={Gassol Puigjaner, Albert and Zacharia, Angelos and Alexis, Kostas},
  booktitle={2026 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}
```

https://doi.org/10.5281/zenodo.18496220
Released under the BSD-3-Clause license.
This open-source release is based on work supported by the European Commission through:
- Project SYNERGISE, under Horizon Europe Grant Agreement No. 101121321
For questions or support, reach out via GitHub Issues or contact the authors.