gurki-bajwa-ai (Contributor) commented Nov 13, 2025

Description

When running 3DGS with packed=True, pose_opt=True, and a CUDA device, this campos indexing operation dominates the backward pass. The reason is that the gradients of ALL dirs must be accumulated into the gradients of the camera positions (which are far fewer in number). PyTorch's CUDA backward kernel for this indexing operation is very expensive because it relies on numerous GPU atomic operations.

This PR unrolls the indexing and relies on PyTorch's broadcasting instead, which makes the backward pass much faster while leaving the numerical result exactly the same.
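A minimal sketch (PyTorch) of the two mathematically equivalent formulations. The names `means`, `camtoworlds`, and `camera_ids` are illustrative assumptions, not the exact gsplat internals:

```python
import torch

N, C = 10_000, 4                        # packed splats, cameras
means = torch.randn(N, 3)               # packed Gaussian centers
camtoworlds = torch.randn(C, 4, 4)      # camera-to-world matrices
camera_ids = torch.randint(0, C, (N,))  # camera index per packed splat

# Indexed form: advanced indexing gathers one camera position per splat.
# Its CUDA backward scatter-adds N gradients into C rows with atomic
# operations, which is the reported bottleneck when pose_opt=True.
ctw_a = camtoworlds.clone().requires_grad_(True)
dirs_a = means - ctw_a[camera_ids, :3, 3]          # [N, 3]
dirs_a.sum().backward()

# Unrolled form: loop over the (few) cameras and broadcast each camera
# position against its splats. The backward per camera is a plain
# sum-reduction, with no large atomic scatter.
ctw_b = camtoworlds.clone().requires_grad_(True)
dirs_b = torch.empty(N, 3)
for c in range(C):
    sel = camera_ids == c
    dirs_b[sel] = means[sel] - ctw_b[c, :3, 3]     # broadcast [M, 3] - [3]
dirs_b.sum().backward()

# Both forward values and camera-pose gradients agree exactly.
assert torch.allclose(dirs_a, dirs_b)
assert torch.allclose(ctw_a.grad, ctw_b.grad)
```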

Training times

I ran examples/benchmarks/basic.sh on an RTX 3090. Training time improves substantially when pose_opt and packed are both true; there is no significant effect in the other configuration.

batch_size=1 max_steps=5000 packed=true pose_opt=true

Before this change

| Scene | Train Time | Number of splats | PSNR | SSIM |
| --- | --- | --- | --- | --- |
| garden | 496.45s | 2084780 | 25.25 | 0.77 |
| bicycle | 283.62s | 2048744 | 23.27 | 0.60 |
| stump | 212.59s | 2540310 | 24.32 | 0.64 |
| bonsai | 345.74s | 927848 | 29.14 | 0.92 |
| counter | 370.20s | 622163 | 26.96 | 0.88 |

After this change

| Scene | Train Time | Number of splats | PSNR | SSIM |
| --- | --- | --- | --- | --- |
| garden | 114.48s | 2088360 | 25.30 | 0.76 |
| bicycle | 95.26s | 2026429 | 23.30 | 0.60 |
| stump | 92.29s | 2522318 | 24.32 | 0.64 |
| bonsai | 99.4s | 925365 | 29.22 | 0.92 |
| counter | 97.83s | 622668 | 26.86 | 0.88 |

batch_size=1 max_steps=5000 packed=false pose_opt=false

Before this change

| Scene | Train Time | Number of splats | PSNR | SSIM |
| --- | --- | --- | --- | --- |
| garden | 119.15s | 2076023 | 25.63 | 0.78 |
| bicycle | 96.80s | 2043134 | 23.40 | 0.61 |
| stump | 100.20s | 2592472 | 24.51 | 0.65 |
| bonsai | 99.70s | 911051 | 29.48 | 0.93 |
| counter | 97.50s | 631954 | 26.95 | 0.88 |

After this change

| Scene | Train Time | Number of splats | PSNR | SSIM |
| --- | --- | --- | --- | --- |
| garden | 117.12s | 2069605 | 25.64 | 0.78 |
| bicycle | 96.47s | 2047385 | 23.40 | 0.61 |
| stump | 98.24s | 2522734 | 24.53 | 0.65 |
| bonsai | 98.94s | 917302 | 29.42 | 0.93 |
| counter | 96.19s | 631444 | 26.94 | 0.88 |

gurki-bajwa-ai (Contributor, Author)

@liruilong940607 Please take a look and let me know if the PR text needs some theoretical justification too.

liruilong940607 (Collaborator)

This is a great finding! But I'm concerned that the current way of writing it (looping over B and C) might lead to very slow speeds when B and C are large. Ideally we should vectorize the compute there.
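One way to vectorize this, sketched here as an assumption rather than the merged implementation: replace the gather with a dense one-hot matmul, so the backward reduces over splats in a single deterministic GEMM with no atomics and no Python loop. The dense one-hot is only reasonable while the camera count stays small:

```python
import torch

N, C = 10_000, 4
means = torch.randn(N, 3)
campos = torch.randn(C, 3, requires_grad=True)   # e.g. camtoworlds[:, :3, 3]
camera_ids = torch.randint(0, C, (N,))

# [N, C] selection matrix; onehot @ campos gathers one position per splat.
onehot = torch.nn.functional.one_hot(camera_ids, C).to(means.dtype)
dirs = means - onehot @ campos                   # [N, 3]; backward is a GEMM
dirs.sum().backward()

# Reference gradient: minus the number of splats assigned to each camera,
# repeated along each of the 3 axes.
counts = torch.bincount(camera_ids, minlength=C).to(means.dtype)
assert torch.allclose(campos.grad, -counts[:, None].expand(C, 3))
```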

liruilong940607 (Collaborator)

By the way, please run the formatter so the test passes: `black . gsplat/ tests/ examples/ profiling`

gurki-bajwa-ai (Contributor, Author) commented Nov 14, 2025

I was using Python 3.10's formatter, which is why the GitHub tests failed. I've reformatted with Python 3.8's formatter now, so it should pass.

liruilong940607 (Collaborator)

Merging this now! Thank you for looking into this.

liruilong940607 merged commit e35a43a into nerfstudio-project:main on Nov 17, 2025. 2 checks passed.