Wei Sui², Yuxin Guo¹,³, Hu Su³✉

²D-Robotics
³State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
⁴Horizon Robotics
🎥 Scene showcase video: `SceneShowcase_compressed.mp4`
- [2025-12-10] 🎉 TabletopGen is now open source!
Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI—especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference image to obtain per-instance images. Each instance is reconstructed into a 3D model and aligned to a canonical coordinate frame. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and is capable of generating realistic tabletop scenes with rich stylistic and spatial diversity.
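To make the rotation-recovery idea concrete, here is a minimal, self-contained PyTorch sketch of differentiable rotation optimization. It is an illustration only: it assumes known 3D point correspondences, whereas TabletopGen's actual Differentiable Rotation Optimizer aligns against the 2D reference image; the 6D rotation parameterization used here is simply a common gradient-friendly choice.

```python
# Minimal sketch of differentiable rotation recovery (illustrative only;
# NOT the repo's actual optimizer, which works from the 2D reference).
import torch

def rot6d_to_matrix(r6: torch.Tensor) -> torch.Tensor:
    """Map a 6D vector to a rotation matrix via Gram-Schmidt (Zhou et al., 2019)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / a1.norm()
    b2 = a2 - (b1 @ a2) * b1
    b2 = b2 / b2.norm()
    b3 = torch.linalg.cross(b1, b2)
    return torch.stack([b1, b2, b3], dim=1)

torch.manual_seed(0)
points = torch.randn(100, 3)             # canonical object points
R_gt = rot6d_to_matrix(torch.randn(6))   # unknown target rotation
target = points @ R_gt.T                 # observed, rotated points

r6 = torch.randn(6, requires_grad=True)  # rotation parameters to optimize
optimizer = torch.optim.Adam([r6], lr=0.05)
for step in range(500):
    optimizer.zero_grad()
    R = rot6d_to_matrix(r6)
    loss = ((points @ R.T - target) ** 2).mean()  # differentiable alignment loss
    loss.backward()
    optimizer.step()
print(f"final alignment loss: {loss.item():.2e}")
```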
This project uses two separate environments, `tabletopgen` and `rotation`, to handle complex dependencies.
We provide an automated setup workflow, so you do not need to manually configure the two environments or compile dependencies one by one.
```bash
git clone https://github.com/D-Robotics-AI-Lab/TabletopGen.git
cd TabletopGen
```

We provide a shell script that automatically:

- Creates the primary environment `tabletopgen` (CUDA 11.8, Torch 2.6).
- Compiles Grounded-SAM-2 and installs BiRefNet.
- Creates the secondary environment `rotation` (CUDA 12.1, PyTorch3D).
For Linux Users:
Please export your local CUDA path before running the script (required for compiling Grounded-SAM-2):
```bash
# Replace with your own CUDA path (e.g., /usr/local/cuda-11.8)
export CUDA_HOME=/path/to/cuda-11.8
bash install_env.sh
```

☕ Note: This process involves compiling CUDA extensions locally. It may take a few minutes depending on your network and CPU.
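After the script finishes, you can sanity-check the primary environment with a few lines of Python (the expected versions follow from the environment description above):

```python
# Run inside the activated `tabletopgen` environment.
import torch

print(torch.__version__)          # expect a 2.6.x build
print(torch.version.cuda)         # expect "11.8"
print(torch.cuda.is_available())  # should be True on a working GPU setup
```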
Run this script to automatically download the correct checkpoints for BiRefNet, SAM 2.1, and Grounding DINO to their respective directories.
```bash
# Activate the main environment first
conda activate tabletopgen

# Run the auto-download script
python install_scripts/download_weights.py
```

Before running the pipeline, please configure your API settings (e.g., OpenAI, Hunyuan3D, etc.) in the configuration file:

```
# Edit this file with your own API settings
configs/config.yaml
```
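For illustration, loading those settings in Python might look like the sketch below; the key names here are placeholders, since the repo defines its own schema in `configs/config.yaml`:

```python
# Hypothetical sketch of reading API settings; key names are placeholders,
# not the repository's actual config schema.
import yaml

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)

openai_api_key = cfg.get("openai_api_key")        # placeholder key
hunyuan3d_api_key = cfg.get("hunyuan3d_api_key")  # placeholder key
```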
If you do not have an input image, you can generate one from text using `text2img.py`.

- Arguments:
  - `--doubao_api_key`: Your API key for the generation service.
  - `--text`: Description of the scene (e.g., "A hobby desk with some model cars and tools.").
  - `--id` (Optional): Manually specify the generated image ID. If omitted, it auto-increments.
- Output: Generated images will be saved in `scene_image/`.
```bash
conda activate tabletopgen
python text2img.py --doubao_api_key "YOUR_API_KEY" --text "A hobby desk with some model cars and tools."
```

Run the main pipeline to generate the 3D scene.
Arguments:

- `--input_image` (Required): Path to the input image file.
- `--scene_id` (Optional): Manually specify the Scene ID (directory name).
- `--skip_step` (Optional): Skip specific pipeline steps (space-separated integers). Useful for debugging or resuming; see the sketch below.
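For clarity on how these flags parse, here is a hypothetical argparse reconstruction of the CLI (the real `pipeline.py` may define them differently):

```python
# Hypothetical argparse sketch mirroring the documented flags; illustrative,
# not the repository's actual pipeline.py.
import argparse

parser = argparse.ArgumentParser(description="TabletopGen scene generation")
parser.add_argument("--input_image", required=True,
                    help="Path to the input image file")
parser.add_argument("--scene_id", default=None,
                    help="Optional scene ID (output directory name)")
parser.add_argument("--skip_step", type=int, nargs="*", default=[],
                    help="Pipeline steps to skip, e.g. --skip_step 2 3")
args = parser.parse_args()
```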
Example Commands:
```bash
conda activate tabletopgen
python pipeline.py --input_image scene_image/scene_image_1.png
```
💡 Critical Tip for Best Results: In Step 1 of the pipeline, we strongly recommend adjusting the Grounded-SAM-2 thresholds to ensure all object instances are correctly segmented and extracted. You can tweak the following parameters in the pipeline code:
- `box_threshold`
- `text_threshold`
- `confidence_threshold`
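As a starting point, the tuning might look like the following; these values are assumptions, not the repository's defaults (lower box/text thresholds make detection more permissive, recovering more instances at the cost of false positives):

```python
# Illustrative threshold values; assumptions, not the repo's defaults.
box_threshold = 0.30         # Grounding DINO box-confidence cutoff
text_threshold = 0.25        # text-phrase matching cutoff
confidence_threshold = 0.50  # final per-instance confidence filter
```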
View GLB Model:
Once the generation is complete, you can view the assembled 3D scene at:
```
output_scene/scene_{id}/scene_{id}.glb
```
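One quick way to preview the file, assuming `trimesh` (with a viewer backend such as `pyglet`) is installed; this is not part of the repo's documented tooling:

```python
# Preview the assembled GLB scene with trimesh (assumed installed).
import trimesh

scene = trimesh.load("output_scene/scene_1/scene_1.glb")  # example scene id
scene.show()  # opens an interactive viewer window
```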
NVIDIA Isaac Sim (Physics-based Assembly): For a scene assembly with full physical properties, use the Isaac Sim script.
- Prerequisite: Ensure NVIDIA Isaac Sim is installed (Installation Guide).
```bash
# Run the Isaac Sim visualization script
python isaac_final_scene.py
```

Please scan the QR code to connect with us on WeChat and join the community for the latest updates and discussions with the authors.
We would like to express our gratitude to the following projects and services that made this work possible:
If you use this code in your research, please cite our project:
```bibtex
@article{wang2025tabletopgen,
  title={TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image},
  author={Wang, Ziqian and He, Yonghao and Yang, Licheng and Zou, Wei and Ma, Hongxuan and Liu, Liu and Sui, Wei and Guo, Yuxin and Su, Hu},
  journal={arXiv preprint arXiv:2512.01204},
  year={2025}
}
```