
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors


Tian-Xing Xu1, Xiangjun Gao3, Wenbo Hu2 †, Xiaoyu Li2, Song-Hai Zhang1 †, Ying Shan2
1Tsinghua University 2ARC Lab, Tencent PCG 3HKUST


🔆 Notice

GeometryCrafter is still under active development!

We recommend using English in issues so that developers from around the world can discuss, share experiences, and answer questions together. For further implementation details, please contact [email protected]. For business licensing and other related inquiries, don't hesitate to contact [email protected].

If you find GeometryCrafter useful, please help ⭐ this repo, which means a lot to open-source projects. Thanks!

📝 Introduction

We present GeometryCrafter, a novel approach that estimates temporally consistent, high-quality point maps from open-world videos, facilitating downstream applications such as 3D/4D reconstruction and depth-based video editing or generation.

Release Notes:

  • [28/04/2025] 🤗🤗🤗 We release our implemented SfM method for in-the-wild videos, based on SAM2, glue-factory and SpaTracker.
  • [14/04/2025] 🚀🚀🚀 We provide a low_memory_usage option in the pipeline to reduce GPU memory usage, thanks to calledit's helpful suggestion.
  • [01/04/2025] 🔥🔥🔥 GeometryCrafter is released now, have fun!

🚀 Quick Start

Installation

  1. Clone this repo:
git clone --recursive https://github.com/TencentARC/GeometryCrafter
  2. Install dependencies (please refer to requirements.txt):
pip install -r requirements.txt
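
To verify the environment before running inference, you can do a quick sanity check. This is a minimal sketch using only standard PyTorch calls, nothing repo-specific:

# Sanity check: confirm PyTorch is installed and CUDA sees your GPU.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB")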

Inference

Run the inference code on our provided demo videos at 1.27 FPS, which requires a GPU with ~40 GB of memory for 110 frames at 1024x576 resolution:

python run.py \
  --video_path examples/video1.mp4 \
  --save_folder workspace/examples_output \
  --height 576 --width 1024
  # resize the input video to the target resolution for processing; it must be divisible by 64
  # the output point maps are restored to the original resolution before saving
  # use --downsample_ratio to downsample the input video, or reduce --decode_chunk_size, to lower memory usage
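
If your footage has a different aspect ratio, you can compute a suitable processing resolution yourself. Below is a minimal sketch (a hypothetical helper, not part of the repo) that scales the original frame size to a target long side and rounds both dimensions to multiples of 64, as required above:

# Hypothetical helper: pick a processing resolution divisible by 64,
# scaling the original frame size to roughly a target long side.
def processing_resolution(orig_w: int, orig_h: int, target_long_side: int = 1024):
    scale = target_long_side / max(orig_w, orig_h)
    # Round each dimension to the nearest multiple of 64 (minimum 64).
    w = max(64, round(orig_w * scale / 64) * 64)
    h = max(64, round(orig_h * scale / 64) * 64)
    return w, h

print(processing_resolution(1920, 1080))  # -> (1024, 576)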

Run the inference code with our deterministic variant at 1.50 FPS:

python run.py \
  --video_path examples/video1.mp4 \
  --save_folder workspace/examples_output \
  --height 576 --width 1024 \
  --model_type determ

Run low-resolution processing at 2.49 FPS, which requires a GPU with ~22 GB of memory:

python run.py \
  --video_path examples/video1.mp4 \
  --save_folder workspace/examples_output \
  --height 384 --width 640

Run low-resolution processing at 1.76 FPS with <20 GB of memory, following calledit's advice in Pull Request #1:

python run.py \
  --video_path examples/video1.mp4 \
  --save_folder workspace/examples_output \
  --height 384 --width 640 \
  --low_memory_usage True \
  --decode_chunk_size 6

Visualization

Visualize the predicted point maps with Viser:

python visualize/vis_point_maps.py \
  --video_path examples/video1.mp4 \
  --data_path workspace/examples_output/video1.npz
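
You can also load the saved predictions directly for custom tooling. A minimal sketch, assuming the .npz archive stores a per-frame point map of shape (T, H, W, 3) and a validity mask; the key names here are assumptions, so inspect data.keys() on your own files first:

import numpy as np

data = np.load("workspace/examples_output/video1.npz")
print(list(data.keys()))  # inspect the stored arrays first

# Assumed layout: 'point_map' is (T, H, W, 3) camera-space XYZ, 'mask' is (T, H, W).
points = data["point_map"]
mask = data["mask"].astype(bool)

# Depth is the z-component of the camera-space point map.
depth = points[..., 2]
print(f"{points.shape[0]} frames, depth range: "
      f"{depth[mask].min():.3f}..{depth[mask].max():.3f}")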

🤖 Gradio Demo

📊 Dataset Evaluation

Please check the evaluation folder.

  • To create the dataset we use in the paper, you need to run evaluation/preprocess/gen_{dataset_name}.py.
  • You need to change DATA_DIR and OUTPUT_DIR first according to your working environment.
  • Then you will get the preprocessed datasets containing the extracted RGB videos and point map npz files. We also provide the catalog of these files.
  • Run inference for all datasets:
    bash evaluation/run_batch.sh
    (Remember to replace the data_root_dir and save_root_dir with your path.)
  • Run evaluation for all datasets (scale-invariant point map estimation):
    bash evaluation/eval.sh
    (Remember to replace the pred_data_root_dir and gt_data_root_dir with your path.)
  • Run evaluation for all datasets (affine-invariant depth estimation; a minimal alignment sketch follows this list):
    bash evaluation/eval_depth.sh
    (Remember to replace the pred_data_root_dir and gt_data_root_dir with your path.)
  • We also provide the comparison results of MoGe and the deterministic variant of our method. You can evaluate these methods under the same protocol by uncommenting the corresponding lines in evaluation/run.sh, evaluation/eval.sh, evaluation/run_batch.sh, and evaluation/eval_depth.sh.
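
For intuition about the affine-invariant depth protocol: before computing error metrics, the predicted depth is typically aligned to the ground truth with a least-squares scale and shift. Below is a minimal sketch of that standard alignment step (illustrative only, not the repo's exact evaluation code):

import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Least-squares fit of s, t so that s * pred + t best approximates gt on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

# Example metric after alignment: absolute relative error on valid pixels.
def abs_rel(pred, gt, mask):
    aligned = align_scale_shift(pred, gt, mask)
    return np.mean(np.abs(aligned[mask] - gt[mask]) / gt[mask])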

📷 Camera Pose Estimation for In-the-wild Videos

Leveraging the temporally consistent point maps produced by GeometryCrafter, we implement a camera pose estimation method designed for in-the-wild videos. We hope that our work will serve as a launchpad for 4D reconstruction. Our implementation can be summarized as follows:

  • Segment the dynamic objects in the video with SAM2. We refer to a Hugging Face demo here, thanks to fffiloni's great work.
  • Find a set of feature points in the static background with SIFT and SuperPoint, as implemented in glue-factory.
  • Track these points with SpaTracker, which takes the monocular video and metric depth maps as input.
  • Use gradient descent to solve the point-set rigid transformation problem (3-DoF rotation and 3-DoF translation), based on the tracking results; a minimal optimization sketch follows the example commands below. More details can be found in our paper.
# We provide an example here
VIDEO_PATH=examples/video7.mp4
POINT_MAP_PATH=workspace/examples_output/video7.npz
MASK_PATH=examples/video7_mask.mp4
TRACK_DIR=workspace/trackers/video7
SFM_DIR=workspace/sfm/video7

# Download the checkpoints of SpaTracker and SuperPoint and put them in the following paths
# - pretrained_models/spaT_final.pth
# - pretrained_models/superpoint_v6_from_tf.pth

# Here are the URLs
# - SpaTracker: https://drive.google.com/drive/folders/1UtzUJLPhJdUg2XvemXXz1oe6KUQKVjsZ?usp=sharing
# - SuperPoint: https://github.com/rpautrat/SuperPoint/raw/master/weights/superpoint_v6_from_tf.pth

python sfm/run_track.py \
    --video_path ${VIDEO_PATH} \
    --point_map_path ${POINT_MAP_PATH} \
    --mask_path ${MASK_PATH} \
    --out_dir ${TRACK_DIR} \
    --vis_dir ${TRACK_DIR} \
    --use_ori_res \
    --spatracker_checkpoint pretrained_models/spaT_final.pth \
    --superpoint_checkpoint pretrained_models/superpoint_v6_from_tf.pth

python sfm/run.py \
    --num_iterations 2000 \
    --video_path ${VIDEO_PATH} \
    --point_map_path ${POINT_MAP_PATH} \
    --mask_path ${MASK_PATH} \
    --track_dir ${TRACK_DIR} \
    --out_dir ${SFM_DIR} \
    --use_ori_res

# You'll find the processed dataset used for 4D reconstruction in ${SFM_DIR}
# Visualize per-frame point maps in the world coordinates

python sfm/vis_points.py \
    --sfm_dir ${SFM_DIR}
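
To make the final optimization step concrete, here is a minimal PyTorch sketch of fitting a 3-DoF rotation (axis-angle) and 3-DoF translation between two frames' static-background points by gradient descent. It is illustrative only; the full method lives in sfm/run.py:

import torch

def axis_angle_to_matrix(r: torch.Tensor) -> torch.Tensor:
    # Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3).
    theta = r.norm() + 1e-8
    kx, ky, kz = r / theta
    zero = torch.zeros((), dtype=r.dtype)
    K = torch.stack([
        torch.stack([zero, -kz, ky]),
        torch.stack([kz, zero, -kx]),
        torch.stack([-ky, kx, zero]),
    ])
    return torch.eye(3, dtype=r.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def fit_rigid(src: torch.Tensor, dst: torch.Tensor, steps: int = 500, lr: float = 1e-2):
    # Fit R, t minimizing ||src @ R^T + t - dst||^2 over (N, 3) correspondences.
    r = torch.zeros(3, requires_grad=True)  # axis-angle rotation (3 DoF)
    t = torch.zeros(3, requires_grad=True)  # translation (3 DoF)
    opt = torch.optim.Adam([r, t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(r)
        loss = ((src @ R.T + t - dst) ** 2).sum(dim=-1).mean()
        loss.backward()
        opt.step()
    return axis_angle_to_matrix(r).detach(), t.detach()

# src / dst would come from SpaTracker's 3D tracks of static-background points.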

⚠️ Camera pose estimation is NOT the primary objective or core contribution of GeometryCrafter. This simplified application merely demonstrates the potential for 4D reconstruction using GeometryCrafter. If you find it useful, please help ⭐ this repo.

⚠️ According to our experiments, the method is less robust in certain cases. Camera pose estimation for dynamic videos remains a challenging problem.

🤝 Contributing

  • Issues and pull requests are welcome.
  • Contributions that optimize inference speed and memory usage are welcome, e.g., through model quantization, distillation, or other acceleration techniques.

❤️ Acknowledgement

We have used code from other great research works, including DepthCrafter, MoGe, SAM2, glue-factory, and SpaTracker. We sincerely thank the authors for their awesome work!

📜 Citation

If you find this work helpful, please consider citing:

@article{xu2025geometrycrafter,
  title={GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors},
  author={Xu, Tian-Xing and Gao, Xiangjun and Hu, Wenbo and Li, Xiaoyu and Zhang, Song-Hai and Shan, Ying},
  journal={arXiv preprint arXiv:2504.01016},
  year={2025}
}
