Packed backward pass speedup via unrolled camera position indexing #831
Description
When running 3DGS with `packed=True` and `pose_opt=True` on a CUDA device, this campos indexing operation dominates the backward pass. This is because the gradients of ALL dirs must be accumulated into the gradients of the camera positions (which are far fewer in number), and PyTorch's backward CUDA kernel for this indexing is very expensive due to the large number of GPU atomic operations.

This PR unrolls the indexing and relies on PyTorch's broadcasting for a faster backward pass, while keeping the overall numerical calculation exactly the same.
Training times
I ran `examples/benchmarks/basic.sh` on an RTX 3090. There is a major improvement in training time when `pose_opt` and `packed` are both true, and no significant performance effect in the other case.

`batch_size=1 max_steps=5000 packed=true pose_opt=true`
Before this change
After this change
`batch_size=1 max_steps=5000 packed=false pose_opt=false`
Before this change
After this change