
Tian-Xing Xu1,
Xiangjun Gao3,
Wenbo Hu2 †,
Xiaoyu Li2,
Song-Hai Zhang1 †,
Ying Shan2
1Tsinghua University
2ARC Lab, Tencent PCG
3HKUST
GeometryCrafter is still under active development!
We recommend that everyone use English to communicate on issues, as this helps developers from around the world discuss, share experiences, and answer questions together. For further implementation details, please contact [email protected]. For business licensing and other related inquiries, don't hesitate to contact [email protected].
If you find GeometryCrafter useful, please help ⭐ this repo; stars are important to open-source projects. Thanks!
We present GeometryCrafter, a novel approach that estimates temporally consistent, high-quality point maps from open-world videos, facilitating downstream applications such as 3D/4D reconstruction and depth-based video editing or generation.
Release Notes:
- [28/04/2025] 🤗🤗🤗 We released our SfM method for in-the-wild videos, built on SAM2, glue-factory, and SpaTracker.
- [14/04/2025] 🚀🚀🚀 We added a `low_memory_usage` option to the pipeline to reduce GPU memory usage, thanks to calledit's helpful suggestion.
- [01/04/2025] 🔥🔥🔥 GeometryCrafter is released now, have fun!
- Clone this repo:
git clone --recursive https://github.com/TencentARC/GeometryCrafter
- Install dependencies (please refer to requirements.txt):
pip install -r requirements.txt
Run the inference code on our provided demo videos at 1.27 FPS, which requires a GPU with ~40 GB of memory for 110 frames at 1024x576 resolution:
python run.py \
--video_path examples/video1.mp4 \
--save_folder workspace/examples_output \
--height 576 --width 1024
# The input video is resized to the target resolution for processing; height and width must be divisible by 64.
# The output point maps are restored to the original resolution before saving.
# Use --downsample_ratio to downsample the input video, or reduce --decode_chunk_size to save memory.
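Since the processing resolution must be divisible by 64, a quick way to snap an arbitrary target size to a valid one is the following helper (a minimal sketch, not part of the repo):

```python
# Snap an arbitrary target dimension to the nearest multiple of 64,
# since the model's processing resolution must be divisible by 64.
def snap64(x: int) -> int:
    return max(64, round(x / 64) * 64)

print(snap64(1024), snap64(576), snap64(720))  # 1024 576 704
```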
Run the inference code with our deterministic variant at 1.50 FPS:
python run.py \
--video_path examples/video1.mp4 \
--save_folder workspace/examples_output \
--height 576 --width 1024 \
--model_type determ
Run low-resolution processing at 2.49 FPS, which requires a GPU with ~22 GB of memory:
python run.py \
--video_path examples/video1.mp4 \
--save_folder workspace/examples_output \
--height 384 --width 640
Run low-resolution processing at 1.76 FPS with less than 20 GB of memory usage, following calledit's advice in Pull Request 1:
python run.py \
--video_path examples/video1.mp4 \
--save_folder workspace/examples_output \
--height 384 --width 640 \
--low_memory_usage True \
--decode_chunk_size 6
Visualize the predicted point maps with Viser:
python visualize/vis_point_maps.py \
--video_path examples/video1.mp4 \
--data_path workspace/examples_output/video1.npz
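The saved `.npz` archive can also be inspected directly with NumPy before visualization. The key names stored by the pipeline are not documented here, so the sketch below simply enumerates whatever keys are present; a tiny stand-in archive (with a hypothetical `point_map` key) is created so the snippet runs on its own:

```python
import os
import tempfile
import numpy as np

# Stand-in for workspace/examples_output/video1.npz so the snippet is
# self-contained; point `path` at your real output file instead.
path = os.path.join(tempfile.mkdtemp(), "video1.npz")
np.savez(path, point_map=np.zeros((8, 36, 64, 3), np.float32))

# List every array stored in the archive with its shape and dtype.
data = np.load(path)
for key in data.files:
    print(key, data[key].shape, data[key].dtype)
```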
- Online demo: GeometryCrafter
- Local demo:
gradio app.py
Please check the `evaluation` folder.
- To create the datasets used in the paper, run `evaluation/preprocess/gen_{dataset_name}.py`.
- Change `DATA_DIR` and `OUTPUT_DIR` first according to your working environment.
- You will then get the preprocessed datasets containing the extracted RGB videos and point-map npz files. We also provide a catalog of these files.
- Inference script for all datasets (remember to replace `data_root_dir` and `save_root_dir` with your paths):
bash evaluation/run_batch.sh
- Evaluation script for all datasets, scale-invariant point map estimation (remember to replace `pred_data_root_dir` and `gt_data_root_dir` with your paths):
bash evaluation/eval.sh
- Evaluation script for all datasets, affine-invariant depth estimation (remember to replace `pred_data_root_dir` and `gt_data_root_dir` with your paths):
bash evaluation/eval_depth.sh
- We also provide the comparison results for MoGe and the deterministic variant of our method. You can evaluate these methods under the same protocol by uncommenting the corresponding lines in `evaluation/run.sh`, `evaluation/eval.sh`, `evaluation/run_batch.sh`, and `evaluation/eval_depth.sh`.
Leveraging the temporally consistent point maps output by GeometryCrafter, we implement a camera pose estimation method designed for in-the-wild videos. We hope that our work will serve as a launchpad for 4D reconstruction. Our implementation can be summarized as follows:
- Segment the dynamic objects from the video with SAM2. We refer to a huggingface demo here, thanks to fffiloni's great work.
- Find a set of feature points in the static background with SIFT and SuperPoint, as implemented in glue-factory.
- Track these points with SpaTracker, which takes the monocular video and metric depth maps as input.
- Use gradient descent to solve the point-set rigid transformation problem (3-DoF rotation and 3-DoF translation), based on the tracking results. More details can be found in our paper.
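The last step, solving a point-set rigid transformation (3-DoF rotation and 3-DoF translation) by gradient descent, can be sketched as below. This is an illustrative toy in NumPy, using an axis-angle parameterization and finite-difference gradients on synthetic, noiseless correspondences; it is not the repo's implementation (see the paper for the actual formulation):

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector r (3,) -> 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def fit_rigid(src, dst, lr=1e-2, iters=2000, eps=1e-6):
    """Minimize mean ||R(r) @ x + t - y||^2 over r (3-DoF rotation) and
    t (3-DoF translation) by plain gradient descent with
    finite-difference gradients (illustration only)."""
    params = np.zeros(6)  # [rx, ry, rz, tx, ty, tz]
    def loss(p):
        R = rodrigues(p[:3])
        return np.mean(np.sum((src @ R.T + p[3:] - dst) ** 2, axis=1))
    for _ in range(iters):
        base = loss(params)
        grad = np.zeros(6)
        for i in range(6):
            step = np.zeros(6)
            step[i] = eps
            grad[i] = (loss(params + step) - base) / eps
        params -= lr * grad
    return rodrigues(params[:3]), params[3:]

# Toy check: recover a known transform from noiseless correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true = rodrigues(np.array([0.10, -0.20, 0.05]))
t_true = np.array([0.30, -0.10, 0.20])
dst = src @ R_true.T + t_true
R_est, t_est = fit_rigid(src, dst)
```

In practice the residuals come from the SpaTracker tracks on the static background rather than synthetic points.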
# We provide an example here
VIDEO_PATH=examples/video7.mp4
POINT_MAP_PATH=workspace/examples_output/video7.npz
MASK_PATH=examples/video7_mask.mp4
TRACK_DIR=workspace/trackers/video7
SFM_DIR=workspace/sfm/video7
# Download the SpaTracker and SuperPoint checkpoints and place them at the following paths:
# - pretrained_models/spaT_final.pth
# - pretrained_models/superpoint_v6_from_tf.pth
# Checkpoint URLs:
# - SpaTracker: https://drive.google.com/drive/folders/1UtzUJLPhJdUg2XvemXXz1oe6KUQKVjsZ?usp=sharing
# - SuperPoint: https://github.com/rpautrat/SuperPoint/raw/master/weights/superpoint_v6_from_tf.pth
python sfm/run_track.py \
--video_path ${VIDEO_PATH} \
--point_map_path ${POINT_MAP_PATH} \
--mask_path ${MASK_PATH} \
--out_dir ${TRACK_DIR} \
--vis_dir ${TRACK_DIR} \
--use_ori_res \
--spatracker_checkpoint pretrained_models/spaT_final.pth \
--superpoint_checkpoint pretrained_models/superpoint_v6_from_tf.pth
python sfm/run.py \
--num_iterations 2000 \
--video_path ${VIDEO_PATH} \
--point_map_path ${POINT_MAP_PATH} \
--mask_path ${MASK_PATH} \
--track_dir ${TRACK_DIR} \
--out_dir ${SFM_DIR} \
--use_ori_res
# You'll find the processed dataset used for 4D reconstruction in ${SFM_DIR}
# Visualize per-frame point maps in the world coordinates
python sfm/vis_points.py \
--sfm_dir ${SFM_DIR}
- Issues and pull requests are welcome.
- Contributions that optimize inference speed and memory usage are especially welcome, e.g., model quantization, distillation, or other acceleration techniques.
We have used code from other great research works, including DepthCrafter, MoGe, SAM2, glue-factory, and SpaTracker. We sincerely thank the authors for their awesome work!
If you find this work helpful, please consider citing:
@article{xu2025geometrycrafter,
title={GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors},
author={Xu, Tian-Xing and Gao, Xiangjun and Hu, Wenbo and Li, Xiaoyu and Zhang, Song-Hai and Shan, Ying},
journal={arXiv preprint arXiv:2504.01016},
year={2025}
}