DA3-Streaming: Memory-Efficient Inference for Videos via Chunk Streaming

This repo introduces a streaming pipeline that enables Depth Anything 3 to process ⭐ long video sequences and super-large scale scenes ⭐ under tight CPU/GPU memory budgets by chunking frames and managing state across chunks. Built on the ideas of VGGT-Long, it focuses on memory efficiency and stable online inference for near-real-time video processing.

DA3-Streaming is built on the VGGT-Long and Depth Anything 3.

Updates

[11 Nov 2025] Code of DA3-Streaming release.

Setup, Installation & Running

📮 1 - Clone this project

Clone the repo using the --recursive flag

git clone --recursive https://github.com/ByteDance-Seed/Depth-Anything-3.git

If you forgot --recursive

cd <your_dir>/Depth-Anything-3/
git submodule update --init --recursive .

📦 2 - Environment Setup

Step 1: Dependency Installation

Install Depth-Anything-3 first.

pip install -r requirements.txt

Step 2: Weights Download

Download all the pre-trained weights needed:

bash ./scripts/download_weights.sh

System dependencies you may encounter.

If you encounter an error about libGL.so.1 (the error comes from opencv-python), please run the following cmd to install the system dependencies.

sudo apt-get install -y libgl1-mesa-glx

🚀 3 - Running the code

python da3_streaming.py --image_dir ./path_of_images

python da3_streaming.py --image_dir ./path_of_images --config ./configs/base_config.yaml --output_dir ${OUTPUT_DIR}

You may run the following cmd if you got videos before python da3_streaming.py.

mkdir ./extract_images
ffmpeg -i your_video.mp4 -vf "fps=5,scale=640:-1" ./extract_images/frame_%06d.png

4 - Outputs

Basic Outputs

After running the code, you will get the following outputs:

${OUTPUT_DIR}/camera_poses.txt: The camera poses file. Each line contains the extrinsic matrix parameters of a frame.
${OUTPUT_DIR}/intrinsic.txt: The intrinsic parameters of the camera. Each line contains fx, fy, cx, cy of a frame.
${OUTPUT_DIR}/pcd/combined_pcd.ply: The combined point cloud file. It contains the 3D points from all frames.

Additional Outputs

If setting save_depth_conf_result in config to True, you will get outputs for each frame:

${OUTPUT_DIR}/results_output: The folder that contains the rgb, depth, confidence and intrinsic results for each frame. Note that the minimum value of confidence is 0.

To verify the results, you can use the following cmd to fuse point cloud from ./results_output. The fused point cloud file will be saved as ${OUTPUT_DIR}/output.ply.

python npz_output_process.py --npz_folder ${OUTPUT_DIR}/results_output --pose_file ${OUTPUT_DIR}/camera_poses.txt --output_file ${OUTPUT_DIR}/output.ply

Note on Space Requirements: Please ensure your machine has sufficient disk space before running the code for DA3-Streaming. When finishing, the code will delete these intermediate results to prevent excessive disk usage.

Experiment Results

We conducted some additional experiments to compare the performance differences among different architectures. Below is the comparison of ATE RMSE [m] on KITTI Odometry between DA3-Streaming, VGGT-Long and Pi-Long. All methods are evaluated with overlap equal to half chunk size, comparable resolution (~500px-width), and loop closure with similarity threshold 0.85.

Method	chunk size	AVG	AVG (w/o 01)	00	01	02	03	04	05	06	07	08	09	10
Num. of Frames		2109	2210	4542	1101	4661	801	271	2761	1101	1101	4071	1591	1201
VGGT-Long	120	25.60	22.81	16.13	53.43	51.98	4.37	2.15	12.69	11.33	3.60	70.29	34.55	21.05
Pi-Long	120	21.17	11.81	5.55	114.83	50.29	1.63	1.11	3.48	2.88	3.92	24.25	7.38	17.61
DA3-Streaming	120	18.63	10.42	4.48	100.77	33.41	3.58	2.39	3.95	7.59	2.09	31.20	8.06	7.44
VGGT-Long	60	26.36	19.30	8.06	96.96	34.16	6.83	4.16	9.15	4.68	2.68	63.15	32.24	27.87
Pi-Long	60	30.63	17.10	7.82	165.92	73.59	3.67	0.91	5.16	3.89	3.57	33.97	17.01	21.41
DA3-Streaming	60	16.83	10.64	5.13	78.76	35.64	5.38	3.18	3.04	2.83	2.32	26.55	8.86	13.42

In DA3-Streaming, we restructured the code and accelerated it using GPU technology. Currently, our method achieves a running speed of nearly 10 FPS without the Keyframe strategy (the test time has excluded warm-up, model loading and ply result saving). Following results are evaluated on KITTI sequences 00, 05, and 08, totaling 11,373 frames, using an NVIDIA A100 GPU.

Method	Time	FPS
VGGT-Long	65min 08sec	2.91
Pi-long	60min 09sec	3.15
DA3-Streaming	22min 17sec	8.51

Although the current pipeline is not an SLAM system, DA3-Streaming still has a certain degree of accuracy compared to the uncalibrated SLAM method on TUM RGB-D. DA3-Streaming, VGGT-Long and Pi-Long are evaluated with chunk size 120, overlap 60, comparable resolution (~500px-width) and with loop closure.

Methods	AVG	360	desk	desk2	floor	plant	room	rpy	teddy	xyz
Droid-SLAM Uncalibrated	0.163	0.202	0.032	0.091	0.064	0.045	0.918	0.056	0.045	0.012
Mast3r-SLAM Uncalibrated	0.060	0.070	0.035	0.055	0.056	0.035	0.118	0.041	0.114	0.020
VGGT-Long	0.110	0.118	0.058	0.111	0.118	0.071	0.155	0.140	0.120	0.099
Pi-long	0.094	0.115	0.047	0.052	0.160	0.085	0.114	0.143	0.081	0.052
DA3-Streaming	0.087	0.059	0.034	0.042	0.107	0.060	0.105	0.206	0.126	0.044

We also evaluate DA3-Streaming with different chunk sizes on KITTI (w/o 01) with resolution 504x154 and TUM RGB-D with resolution 504x378. Overlap is set to half of chunk size.

	Chunk size	120	90	60	30
KITTI (504x154)	Peak VRAM [GB]	15.9	14.3	12.7	11.5
	ATE RMSE [m]	10.42	9.38	10.64	19.39
TUM RGB-D (504x378)	Peak VRAM [GB]	28.3	25.1	21.2	18.7
	ATE RMSE [m]	0.087	0.091	0.127	0.227

Acknowledgements

Our project is based on VGGT-Long and Depth Anything 3.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DA3-Streaming: Memory-Efficient Inference for Videos via Chunk Streaming

Updates

Setup, Installation & Running

📮 1 - Clone this project

📦 2 - Environment Setup

Step 1: Dependency Installation

Step 2: Weights Download

System dependencies you may encounter.

🚀 3 - Running the code

4 - Outputs

Basic Outputs

Additional Outputs

Experiment Results

Acknowledgements

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

DA3-Streaming: Memory-Efficient Inference for Videos via Chunk Streaming

Updates

Setup, Installation & Running

📮 1 - Clone this project

📦 2 - Environment Setup

Step 1: Dependency Installation

Step 2: Weights Download

System dependencies you may encounter.

🚀 3 - Running the code

4 - Outputs

Basic Outputs

Additional Outputs

Experiment Results

Acknowledgements