
Semantic Segmentation and VLM Reasoning in ROS


This repository extends semantic_inference to provide closed and open set semantic segmentation methods. Additionally, it provides methods to extract CLIP embeddings of objects and relational embeddings using Visual Language Models (VLMs).




Setup

General Requirements

These instructions assume ros-noetic-desktop-full is installed on Ubuntu 20.04.

Install the general dependencies:

sudo apt install python3-rosdep python3-catkin-tools

Clone the repository and initialize submodules:

git clone git@github.com:ntnu-arl/semantic_inference_ros.git
cd semantic_inference_ros
git submodule update --init --recursive

Virtual Environment

It is highly recommended to set up a Python virtual environment to run ROS Python nodes:

cd /path/to/catkin_ws/src/semantic_inference/semantic_inference_python
python3.8 -m venv --system-site-packages ros_semantics_env
source ros_semantics_env/bin/activate
pip install -U pip
pip install -r requirements.txt

Building

Install ROS dependencies:

cd /path/to/catkin_ws/src
rosdep install --from-paths . --ignore-src -r -y

For closed-set segmentation, follow the setup instructions (skip Python utilities) in semantic_inference closed-set docs.

Build the workspace:

catkin config -DCMAKE_BUILD_TYPE=Release
catkin build

Usage

Open-set Segmentation

Open-set segmentation consumes RGB-D images and camera information to perform semantic segmentation and extract open-vocabulary features for each object.

The supported open-set detectors are YOLOe and YOLOw; both can detect an arbitrary list of object classes without re-training.

roslaunch semantic_inference_ros openset_segmentation.launch
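Conceptually, the open-vocabulary step compares each object's CLIP embedding against text embeddings of the prompted class names and assigns the class with the highest cosine similarity. Below is a minimal sketch of that matching step using NumPy with toy embeddings; the actual node uses real CLIP features and its own configuration, so the dimensions and values here are purely illustrative:

```python
import numpy as np

def best_matching_label(object_embedding, label_embeddings, labels):
    """Return the label whose text embedding is most similar to the object embedding."""
    obj = object_embedding / np.linalg.norm(object_embedding)
    txt = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    scores = txt @ obj  # cosine similarity of each label with the object
    return labels[int(np.argmax(scores))]

# Toy 4-dimensional "embeddings" purely for illustration.
labels = ["chair", "door", "fire extinguisher"]
label_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
object_embedding = np.array([0.1, 0.05, 0.9, 0.1])
print(best_matching_label(object_embedding, label_embeddings, labels))  # fire extinguisher
```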

VLM for Object Relationship Embeddings

This method takes a segmented image along with its original RGB-D frame and computes visual features for each pair of detected objects. These features can be used to prompt a VLM for reasoning about relationships.
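The pairing logic above amounts to enumerating every unordered pair of detected objects and computing a joint feature per pair. A minimal sketch of that enumeration, with a placeholder feature function (the real node computes visual features from the segmented RGB-D frame, not strings):

```python
from itertools import combinations

def pairwise_features(objects, feature_fn):
    """Compute a feature for every unordered pair of detected objects."""
    return {(a, b): feature_fn(a, b) for a, b in combinations(objects, 2)}

# Placeholder feature: just records the pair; the actual node would crop and
# encode the image region covering both objects.
detections = ["table", "mug", "laptop"]
features = pairwise_features(detections, lambda a, b: f"feat({a},{b})")
print(len(features))  # 3 objects -> 3 pairs
```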

Supported VLMs: InstructBLIP and DeepSeek-VL2.

To use DeepSeek-VL2, first extract the visual encoder as a standalone model. For the large model (used in our experiments), we provide it here.

Alternatively, the models can be extracted with the following command (roughly 100 GB of RAM is required for the large model):

python semantic_inference_python/scripts/extract_deepseek_visual.py --model_name <model to use> --output_path <path to store model>

Then, set the model path in vlm.yaml.

Launch the node:

roslaunch semantic_inference_ros vlm_features.launch

VLM/LLM Reasoning

This section enables reasoning on the relationship-aware hierarchical scene graph.

  • LLMs predict relevant objects and interactions for given tasks
  • VLM responses are parsed by LLMs
  • An OpenAI API key is required; export it before launching:
export OPENAI_API_KEY=<Your OpenAI API Key>
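A node that depends on the key can fail fast with a clear message when it is missing. A small sketch of such a check (a hypothetical helper, not part of the package):

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error if unset."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before launching the reasoning nodes."
        )
    return key
```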

VLM reasoning is performed remotely. Use the DeepSeek-VL2 server code to run a FastAPI server.

Steps to set up the server:

  1. Clone the server repo:
git clone git@github.com:ntnu-arl/DeepSeek-VL2.git -b server
cd DeepSeek-VL2
  2. Set up the Python virtual environment:
bash setup.sh
  3. Configure the server path, port, and API key in run_server.sh.

  4. Run the server (the model download may take some time):

bash run_server.sh

Finally, set the server URL in vlm_for_navigation.yaml and export your FastAPI API key:

export FASTAPI_API_KEY=<Your server FastAPI Key>
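For illustration, a remote VLM server of this kind is typically queried over HTTP with the API key sent in a request header. The sketch below only assembles such a request; the endpoint path, header name, and payload fields are assumptions for illustration, not the actual server interface:

```python
import os

def build_vlm_request(server_url, prompt, image_b64):
    """Assemble URL, headers, and JSON payload for a hypothetical VLM endpoint."""
    return {
        "url": f"{server_url.rstrip('/')}/generate",  # hypothetical endpoint path
        "headers": {"X-API-Key": os.environ.get("FASTAPI_API_KEY", "")},  # assumed header name
        "json": {"prompt": prompt, "image": image_b64},
    }

req = build_vlm_request("http://localhost:8000", "Describe the relationship.", "<base64 image>")
print(req["url"])
```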

Citation

@inproceedings{puigjaner2026reasoninggraph,
    title={Relationship-Aware Hierarchical 3D Scene Graph},
    author={Gassol Puigjaner, Albert and Zacharia, Angelos and Alexis, Kostas},
    booktitle={2026 IEEE International Conference on Robotics and Automation (ICRA)}, 
    year={2026}
}

Zenodo DOI

https://doi.org/10.5281/zenodo.18496220


License

Released under BSD-3-Clause.


Acknowledgements

This open-source release is based on work supported by the European Commission through:

  • Project SYNERGISE, under Horizon Europe Grant Agreement No. 101121321

Contact

For questions or support, open a GitHub issue or contact the authors.

