Forked from ai4ce/RAP | Original Paper
RAP-ID (RAP with Improvements & Extensions) is an enhanced fork of the RAP system for 6-DoF camera localization. This fork extends the original RAP architecture with three major improvements:
- Uncertainty-Aware Adversarial Synthesis (UAAS) - Predicts pose uncertainty and uses it to guide training data synthesis
- Multi-Hypothesis Probabilistic APR - Handles pose ambiguity through probabilistic multi-hypothesis predictions
- Semantic-Adversarial Scene Synthesis - Generates challenging training samples through semantic-aware scene manipulation
These extensions address key limitations in visual localization: uncertainty quantification, handling ambiguous scenes, and robustness to semantic scene variations.
The UAAS module extends RAPNet to output both pose predictions and explicit uncertainty estimates (epistemic and aleatoric). Key components:
- Uncertainty Prediction: Model outputs log-variance estimates for pose uncertainty
- Targeted Data Synthesis: Uses uncertainty estimates to guide 3DGS rendering toward high-uncertainty regions
- Uncertainty-Weighted Adversarial Loss: Discriminator prioritizes domain adaptation in uncertain regions
- Visualization Tools: Utilities for visualizing uncertainty across training sets and synthetic samples
Usage:
python train.py --config configs/7scenes.txt --trainer_type uaas --run_name experiment_uaas

The Multi-Hypothesis Probabilistic APR module replaces single-point pose predictions with probabilistic outputs capable of expressing multiple plausible pose hypotheses:
- Mixture Density Network (MDN): Models pose distribution as a mixture of Gaussians
- Hypothesis Ranking: Validates hypotheses using 3DGS rendering and image comparison
- Ambiguity Resolution: Downstream selection and refinement modules resolve pose ambiguity
Usage:
python train.py --config configs/7scenes.txt --trainer_type probabilistic --run_name experiment_prob

The Semantic-Adversarial Scene Synthesis module incorporates semantic segmentation into the training pipeline for targeted scene manipulation:
- Semantic Integration: Scene annotated by semantic classes (sky, building, road, etc.)
- Targeted Appearance Variations: 3DGS synthesizer produces variations targeting specific semantic regions
- Adversarial Hard Negative Mining: Creates synthetic scenes designed to maximize prediction errors
- Curriculum Learning: Gradually increases synthetic scene difficulty based on model performance
Usage:
python train.py --config configs/7scenes.txt --trainer_type semantic --run_name experiment_semantic --num_semantic_classes 19

Project structure:

RAP-ID/
├── uaas/ # Uncertainty-Aware Adversarial Synthesis
│ ├── uaas_rap_net.py # Extended RAPNet with uncertainty head
│ ├── loss.py # Uncertainty-weighted adversarial loss
│ ├── sampler.py # Uncertainty-guided data sampling
│ └── trainer.py # UAAS training loop
├── probabilistic/ # Multi-Hypothesis Probabilistic APR
│ ├── probabilistic_rap_net.py # MDN-based pose prediction
│ ├── loss.py # Mixture NLL loss
│ ├── hypothesis_validator.py # Hypothesis ranking via rendering
│ ├── selection.py # Best hypothesis selection
│ └── trainer.py # Probabilistic training loop
├── semantic/ # Semantic-Adversarial Scene Synthesis
│ ├── semantic_rap_net.py # Semantic-aware RAPNet
│ ├── semantic_synthesizer.py # Semantic scene manipulation
│ ├── hard_negative_miner.py # Adversarial hard negative mining
│ ├── curriculum.py # Curriculum learning scheduler
│ └── trainer.py # Semantic training loop
├── common/ # Shared utilities
│ └── uncertainty.py # Uncertainty calculation and visualization
└── train.py # Unified training script
RAP-ID includes comprehensive benchmarking tools to compare all extensions against the baseline and original RAP implementation.
Run parallel benchmarks for all models:
python benchmark_comparison.py \
--config configs/7scenes.txt \
--datadir /path/to/data \
--model_path /path/to/3dgs \
--models baseline uaas probabilistic semantic \
--parallel \
--benchmark_epochs 1 \
--output ./benchmark_results

This generates:
- Training performance metrics (batch time, throughput, memory)
- Evaluation accuracy (translation/rotation errors, success rates)
- Inference speed (FPS, latency)
- Comparison reports showing improvements over baseline
Compare RAP-ID against the original ai4ce/RAP implementation:
# Setup original repo (optional)
python benchmark_vs_original.py --clone_original
# Run comparison
python benchmark_vs_original.py \
--compare \
--rap_id_results ./benchmark_results/benchmark_summary.json \
--original_results /path/to/original/results.json \
--output ./comparison

For detailed benchmarking instructions, see BENCHMARKING_GUIDE.md.
We conducted comprehensive benchmarking experiments on the Cambridge KingsCollege dataset to evaluate the performance improvements of our enhanced models compared to the baseline RAP implementation. This case study demonstrates the practical benefits of each extension across multiple performance dimensions.
Dataset: Cambridge KingsCollege (outdoor scene with varying lighting conditions)
Hardware: NVIDIA H100 PCIe GPU
Evaluation: 50 test samples
Metrics: Inference speed (FPS), pose accuracy (translation/rotation errors), model size
The baseline RAPNet achieves the following performance:
| Metric | Value |
|---|---|
| Inference Speed | 56.44 FPS |
| Translation Error (median) | 2.29 m |
| Rotation Error (median) | 0.00° |
| Model Size | 42.63 MB |
| Parameters | 11,132,909 |
The following table summarizes the performance improvements across all models:
| Model | FPS | Translation Error | Accuracy Improvement | Speed Improvement | Model Size |
|---|---|---|---|---|---|
| Baseline | 56.44 | 2.29 m | -- | -- | 42.63 MB |
| UAAS | 60.50 | 1.45 m | +36.4% | +7.2% | 42.88 MB |
| Probabilistic | 35.16 | 2.16 m | +5.4% | -37.7% | 42.76 MB |
| Semantic | 56.01 | 2.12 m | +7.5% | -0.7% | 42.63 MB |
Note: Lower translation error is better. Improvement % indicates reduction in error.
- Translation Accuracy: 36.4% improvement (1.45 m vs 2.29 m error)
- Inference Speed: 7.2% faster (60.50 FPS vs 56.44 FPS)
- Model Size: Minimal increase (+0.6%, 42.88 MB vs 42.63 MB)
- Key Advantage: Best accuracy improvement while maintaining superior speed
The UAAS model demonstrates the most significant improvements, achieving the best balance between accuracy and speed. This makes it ideal for applications requiring high pose accuracy with real-time performance constraints (e.g., AR/VR systems, autonomous navigation).
- Translation Accuracy: 7.5% improvement (2.12 m vs 2.29 m error)
- Inference Speed: Nearly identical to baseline (-0.7%, 56.01 FPS vs 56.44 FPS)
- Model Size: No increase (42.63 MB, identical to baseline)
- Key Advantage: Accuracy improvement with essentially zero overhead in model size and speed
Semantic RAPNet provides the best efficiency-accuracy trade-off, ideal for resource-constrained environments where memory and computational efficiency are critical.
- Translation Accuracy: 5.4% improvement (2.16 m vs 2.29 m error)
- Inference Speed: 37.7% slower (35.16 FPS vs 56.44 FPS)
- Model Size: Minimal increase (+0.3%, 42.76 MB vs 42.63 MB)
- Key Advantage: Provides uncertainty quantification for pose estimates
While slower due to the mixture-distribution computation, the probabilistic model enables uncertainty-aware operation in safety-critical systems, where pose confidence is essential for decision-making.
The UAAS model achieves the highest inference speed (60.50 FPS), representing a 7.2% improvement over the baseline.
All enhanced models show improved translation accuracy. UAAS achieves the most significant reduction in error (36.4% improvement), followed by Semantic (7.5%) and Probabilistic (5.4%).
Rotation error comparison across models. Note that the baseline shows minimal rotation error variation, while enhanced models may exhibit different rotation accuracy characteristics.
The speedup chart shows relative performance improvements. UAAS demonstrates a 1.07x speedup, while Semantic maintains near-baseline speed.
The radar chart provides a normalized comparison across multiple dimensions, clearly showing UAAS's superior performance in translation accuracy while maintaining competitive speed.
This chart summarizes percentage improvements across all metrics, highlighting the strengths of each model variant.
Accuracy Improvements: All three enhanced models demonstrate measurable improvements in translation accuracy compared to the baseline. The UAAS model's 36.4% improvement is particularly noteworthy, suggesting that uncertainty-aware training and adversarial synthesis effectively improve pose estimation robustness.
Speed Considerations: While UAAS achieves both accuracy and speed improvements, the Probabilistic model sacrifices speed for uncertainty quantification capabilities. This trade-off is acceptable for applications requiring uncertainty estimates.
Practical Implications:
- UAAS: Recommended for applications requiring high accuracy with real-time performance (e.g., AR/VR systems, mobile robotics)
- Semantic: Ideal for resource-constrained environments where memory and computational efficiency are critical
- Probabilistic: Suitable for safety-critical applications requiring uncertainty quantification (e.g., autonomous vehicles)
To reproduce these results:
python benchmark_full_pipeline.py \
--dataset data/Cambridge/KingsCollege/colmap \
--model_path output/Cambridge/KingsCollege \
--device cuda \
--batch_size 8 \
--max_samples 50

Complete results and detailed analysis are available in:

- benchmark_full_pipeline_results.json - Complete metrics data
- BENCHMARK_PAPER.md - Full research paper with detailed analysis
- BENCHMARK_REPORT.md - Quick reference report
- Clone this repository:
git clone https://github.com/Shivam-Bhardwaj/RAP.git
cd RAP

- Install dependencies (Python 3.11+, PyTorch 2.0+, CUDA):

pip install -r requirements.txt

See the original RAP setup instructions for detailed environment requirements.
Use the unified training script with different trainer types:
# Train UAAS model
python train.py --config configs/7scenes.txt --trainer_type uaas --run_name my_uaas_exp
# Train Probabilistic model
python train.py --config configs/7scenes.txt --trainer_type probabilistic --run_name my_prob_exp
# Train Semantic model
python train.py --config configs/7scenes.txt --trainer_type semantic --run_name my_semantic_exp --num_semantic_classes 19

Evaluate trained models:
# Benchmark rendering speed and pose accuracy
python benchmark_speed.py --model_path /path/to/model --benchmark_pose --model_type uaas

Supported model types: uaas, probabilistic, semantic, baseline.
The UAAS extension models pose prediction as a probabilistic regression task, predicting both pose estimates and uncertainty measures. Given an input image $I$, the network predicts

$$(\hat{\mathbf{p}}, \log \boldsymbol{\sigma}^2) = f_\theta(I),$$

where $\hat{\mathbf{p}}$ is the pose estimate and $\boldsymbol{\sigma}^2$ the predicted variance.

Aleatoric Uncertainty: Captures data-dependent uncertainty (noise in observations), given directly by the predicted variance $\boldsymbol{\sigma}^2_{\text{aleatoric}} = \exp(\log \boldsymbol{\sigma}^2)$.

Epistemic Uncertainty: Captures model uncertainty (lack of training data), estimated via Monte Carlo Dropout:

$$\boldsymbol{\sigma}^2_{\text{epistemic}} = \frac{1}{M} \sum_{m=1}^{M} \hat{\mathbf{p}}_m^2 - \left( \frac{1}{M} \sum_{m=1}^{M} \hat{\mathbf{p}}_m \right)^2,$$

where $\hat{\mathbf{p}}_m$ is the prediction of the $m$-th stochastic forward pass with dropout enabled and $M$ is the number of Monte Carlo samples.

Total Uncertainty: The combined uncertainty estimate:

$$\boldsymbol{\sigma}^2_{\text{total}} = \boldsymbol{\sigma}^2_{\text{aleatoric}} + \boldsymbol{\sigma}^2_{\text{epistemic}}.$$

Uncertainty-Aware Pose Loss: The pose loss incorporates aleatoric uncertainty:

$$\mathcal{L}_{\text{pose}} = \frac{1}{2} \exp(-\log \boldsymbol{\sigma}^2) \, \|\hat{\mathbf{p}} - \mathbf{p}^*\|^2 + \frac{1}{2} \log \boldsymbol{\sigma}^2,$$

where the first term is the uncertainty-weighted squared error and the second term prevents the predicted uncertainty from growing unbounded.

Uncertainty-Weighted Adversarial Loss: The discriminator loss is weighted by uncertainty:

$$\mathcal{L}_{\text{adv}}^{\text{weighted}} = w(\boldsymbol{\sigma}_{\text{total}}) \, \mathcal{L}_{\text{adv}},$$

where $w(\cdot)$ increases with the predicted uncertainty, so domain adaptation focuses on samples where the pose network is least confident.

Uncertainty-Guided Sampling: Training samples are synthesized in high-uncertainty regions:

$$\mathcal{S} = \{ \mathbf{p} : \sigma_{\text{total}}(\mathbf{p}) > \tau_{\text{uncertainty}} \},$$

where $\tau_{\text{uncertainty}}$ is the sampling threshold (--uncertainty_threshold) and $\mathcal{S}$ is the set of candidate poses passed to the 3DGS renderer.
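To make the loss and the sampling rule concrete, here is a minimal PyTorch sketch; the function names and tensor shapes are illustrative assumptions rather than the actual RAP-ID implementation.

```python
import torch

def uncertainty_aware_pose_loss(pred_pose, log_var, gt_pose):
    """Aleatoric-uncertainty-weighted regression loss (Kendall & Gal style).

    pred_pose, gt_pose: (B, 6) pose vectors; log_var: (B, 6) predicted log-variance.
    The exp(-log_var) factor down-weights residuals in uncertain dimensions,
    while the +log_var term keeps the predicted variance from growing unbounded.
    """
    sq_err = (pred_pose - gt_pose) ** 2
    loss = 0.5 * torch.exp(-log_var) * sq_err + 0.5 * log_var
    return loss.mean()

def select_uncertain_poses(candidate_poses, total_uncertainty, threshold):
    """Keep candidate poses whose mean total uncertainty exceeds the sampling
    threshold, so 3DGS synthesis can focus on high-uncertainty regions.

    candidate_poses: (N, 6); total_uncertainty: (N, 6); threshold: scalar.
    """
    mask = total_uncertainty.mean(dim=-1) > threshold  # (N,) per-pose score
    return candidate_poses[mask]
```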
The probabilistic model replaces point estimates with a mixture density network (MDN) that outputs a probability distribution over poses:

$$p(\mathbf{p} \mid I) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left( \mathbf{p}; \boldsymbol{\mu}_k, \operatorname{diag}(\boldsymbol{\sigma}_k^2) \right),$$

where $K$ is the number of mixture components and the weights satisfy $\sum_k \pi_k = 1$.

MDN Output: The network outputs parameters for each component:

$$\left( \pi_k, \boldsymbol{\mu}_k, \log \boldsymbol{\sigma}_k \right)_{k=1}^{K} = h_\psi\!\left( f_\theta(I) \right),$$

where $h_\psi$ is the MDN head applied to the backbone features.

Negative Log-Likelihood Loss: Training minimizes:

$$\mathcal{L}_{\text{NLL}} = -\log \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left( \mathbf{p}^*; \boldsymbol{\mu}_k, \operatorname{diag}(\boldsymbol{\sigma}_k^2) \right),$$

where $\mathbf{p}^*$ is the ground-truth pose.

Hypothesis Selection: Given multiple hypotheses $\{\mathbf{p}_k\}_{k=1}^{K}$ sampled from the mixture, the best hypothesis is selected via rendering-based validation:

$$\mathbf{p}^\star = \arg\min_{k} \, d\!\left( I, \mathcal{R}(\mathbf{p}_k) \right),$$

where $\mathcal{R}(\cdot)$ renders the scene from a hypothesized pose with 3DGS and $d(\cdot, \cdot)$ measures the image distance between the query image and the rendering.
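A minimal sketch of rendering-based hypothesis ranking, assuming a 3DGS render function and an image-distance function are passed in (both are stand-ins, not the actual hypothesis_validator API):

```python
import torch

def select_best_hypothesis(query_image, pose_hypotheses, render_fn, distance_fn):
    """Rank sampled pose hypotheses by re-rendering the scene and comparing each
    rendering to the query image; return the closest match.

    pose_hypotheses: (N, 6) tensor of sampled poses.
    render_fn(pose) -> rendered image tensor; distance_fn(a, b) -> scalar tensor
    (e.g. a photometric or feature-space distance).
    """
    scores = torch.stack([distance_fn(query_image, render_fn(p)) for p in pose_hypotheses])
    best_idx = torch.argmin(scores)
    return pose_hypotheses[best_idx], best_idx
```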
The semantic extension incorporates semantic segmentation maps $S \in \{1, \dots, C\}^{H \times W}$ into the training pipeline, where $C$ is the number of semantic classes.

Semantic-Aware Synthesis: Scene modifications target specific semantic regions:

$$I' = \mathcal{T}_c(I, S),$$

where $\mathcal{T}_c$ applies an appearance or geometric transformation only to pixels whose semantic label equals the target class $c$.

Adversarial Hard Negative Mining: Synthetic scenes are generated to maximize prediction error:

$$I^{\text{hard}} = \arg\max_{I' \in \mathcal{A}(I)} \left\| f_\theta(I') - \mathbf{p}^* \right\|,$$

where $\mathcal{A}(I)$ is the set of admissible semantic manipulations of $I$ at the current difficulty.

Curriculum Learning: Training difficulty increases gradually:

$$d_t = \min\!\left( d_0 + \alpha t, \; d_{\max} \right),$$

where $d_0$ is the initial difficulty, $\alpha$ the increment rate, $d_{\max}$ the maximum difficulty, and $t$ the training progress (see the --curriculum_* options).
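A minimal sketch of the linear curriculum schedule implied by the --curriculum_* options (class and attribute names are illustrative):

```python
class CurriculumScheduler:
    """Linear curriculum schedule d_t = min(d0 + alpha * t, d_max)."""

    def __init__(self, initial_difficulty=0.1, increment=0.1, max_difficulty=1.0):
        self.d0 = initial_difficulty
        self.alpha = increment
        self.d_max = max_difficulty

    def difficulty(self, step):
        # Difficulty grows linearly with training progress and saturates at d_max.
        return min(self.d0 + self.alpha * step, self.d_max)
```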
Architecture: RAPNet backbone + uncertainty head

- Backbone: Standard RAPNet feature extractor $f_\theta: \mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^D$
- Uncertainty Head: $g_\phi: \mathbb{R}^D \rightarrow \mathbb{R}^6$ with architecture:
  - Linear layer: $D \rightarrow 128$
  - ReLU activation
  - Linear layer: $128 \rightarrow 6$ (outputs $\log \boldsymbol{\sigma}^2$)

Parameters:

- feature_dim: Dimension $D$ of backbone features (default: from RAPNet)
- Output: Pose prediction $\hat{\mathbf{p}} \in \mathbb{R}^{12}$ (flattened $3 \times 4$ matrix) + log-variance $\log \boldsymbol{\sigma}^2 \in \mathbb{R}^6$
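A minimal PyTorch sketch of an uncertainty head with this layout (the module name and its integration with RAPNet are assumptions):

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Small MLP matching the layout above: D -> 128 -> 6, returning the
    per-dimension log-variance of the pose."""

    def __init__(self, feature_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 6),  # log sigma^2 for 3 translation + 3 rotation dims
        )

    def forward(self, features):
        return self.net(features)
```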
Architecture: RAPNet backbone + MDN head

- Backbone: Standard RAPNet feature extractor
- MDN Head: $h_\psi: \mathbb{R}^D \rightarrow \mathbb{R}^{K \cdot (1 + 6 + 6)}$ outputting:
  - Mixture weights: $\boldsymbol{\pi} \in \mathbb{R}^K$ (logits)
  - Means: $\boldsymbol{\mu} \in \mathbb{R}^{K \times 6}$
  - Log-stddevs: $\log \boldsymbol{\sigma} \in \mathbb{R}^{K \times 6}$

Parameters:

- num_gaussians: Number of mixture components $K$ (default: 5)
- Output: Mixture distribution (torch.distributions.MixtureSameFamily)
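A minimal PyTorch sketch of an MDN head built on torch.distributions.MixtureSameFamily, as noted above; layer sizes and clamping bounds mirror the notes in this README, but the actual implementation may differ:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MDNHead(nn.Module):
    """Maps backbone features to a K-component Gaussian mixture over 6-DoF poses."""

    def __init__(self, feature_dim, num_gaussians=5, pose_dim=6):
        super().__init__()
        self.K, self.pose_dim = num_gaussians, pose_dim
        self.fc = nn.Linear(feature_dim, num_gaussians * (1 + 2 * pose_dim))

    def forward(self, features):
        out = self.fc(features)
        logits, mu, log_sigma = torch.split(
            out, [self.K, self.K * self.pose_dim, self.K * self.pose_dim], dim=-1)
        mu = mu.view(-1, self.K, self.pose_dim)
        # Clamp log-stddevs for numerical stability (see implementation notes below).
        log_sigma = log_sigma.view(-1, self.K, self.pose_dim).clamp(-5, 5)
        mixture = D.Categorical(logits=logits)
        components = D.Independent(D.Normal(mu, log_sigma.exp()), 1)
        return D.MixtureSameFamily(mixture, components)

# Training minimizes the NLL of the ground-truth pose (log-sum-exp is handled internally):
# loss = -mdn_head(features).log_prob(gt_pose).mean()
```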
Architecture: RAPNet backbone + semantic fusion (optional)
- Backbone: Standard RAPNet feature extractor
- Semantic Integration: (To be implemented) Fuse semantic features with image features
Parameters:
- num_semantic_classes: Number of semantic classes $C$ (default: 19)
- --trainer_type: Model type (baseline, uaas, probabilistic, semantic)
- --batch_size: Batch size for training (default: 1)
- --learning_rate: Initial learning rate (default: 1e-4)
- --epochs: Number of training epochs
- --amp: Enable automatic mixed precision training
- --compile_model: Enable torch.compile optimization
- --freeze_batch_norm: Freeze batch normalization layers
- --loss_weights: Weights for loss components [pose, feature, rvs_pose, adversarial]
- --uncertainty_threshold: Threshold $\tau_{\text{uncertainty}}$ for uncertainty-guided sampling
- --mc_samples: Number of Monte Carlo samples $M$ for epistemic uncertainty (default: 10)
- --num_gaussians: Number of mixture components $K$ (default: 5)
- --hypothesis_samples: Number of hypotheses to sample for validation (default: 10)
- --num_semantic_classes: Number of semantic classes $C$ (default: 19)
- --curriculum_initial_difficulty: Initial difficulty $d_0$ (default: 0.1)
- --curriculum_increment: Difficulty increment rate $\alpha$ (default: 0.1)
- --curriculum_max_difficulty: Maximum difficulty $d_{\max}$ (default: 1.0)
- --loss_learnable: Use learnable loss weights (default: False)
- --loss_norm: Norm for pose error (2 for L2, 1 for L1)
- --s_x: Weight for translation loss (default: 1.0)
- --s_q: Weight for rotation loss (default: 1.0)
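For reference, a minimal sketch of how learnable translation/rotation weighting is commonly implemented (following Kendall et al.); the exact form behind --loss_learnable, --s_x, and --s_q may differ:

```python
import torch
import torch.nn as nn

class WeightedPoseLoss(nn.Module):
    """Translation/rotation loss with optionally learnable weights s_x and s_q,
    following the common homoscedastic-weighting formulation."""

    def __init__(self, s_x=1.0, s_q=1.0, learnable=False, norm=2):
        super().__init__()
        self.norm = norm
        self.s_x = nn.Parameter(torch.tensor(s_x), requires_grad=learnable)
        self.s_q = nn.Parameter(torch.tensor(s_q), requires_grad=learnable)

    def forward(self, pred_t, gt_t, pred_q, gt_q):
        l_t = torch.norm(pred_t - gt_t, p=self.norm, dim=-1).mean()
        l_q = torch.norm(pred_q - gt_q, p=self.norm, dim=-1).mean()
        # exp(-s) weighting with a +s regularizer keeps the learnable weights bounded.
        return l_t * torch.exp(-self.s_x) + self.s_x + l_q * torch.exp(-self.s_q) + self.s_q
```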
Translation Error:

$$E_t = \left\| \hat{\mathbf{t}} - \mathbf{t} \right\|_2$$

Rotation Error:

$$E_r = \arccos\!\left( \frac{\operatorname{tr}\!\left( \hat{\mathbf{R}}^\top \mathbf{R} \right) - 1}{2} \right),$$

where $\hat{\mathbf{t}}, \hat{\mathbf{R}}$ are the predicted translation and rotation and $\mathbf{t}, \mathbf{R}$ are the ground truth.

Success Rates:

- 5cm/5deg: Percentage of poses with $E_t < 0.05$ m and $E_r < 5°$
- 2cm/2deg: Percentage of poses with $E_t < 0.02$ m and $E_r < 2°$
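A minimal sketch of how these metrics can be computed (NumPy; helper names are illustrative):

```python
import numpy as np

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth camera positions (metres)."""
    return np.linalg.norm(t_pred - t_gt)

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between predicted and ground-truth rotation matrices (degrees)."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def success_rate(errors_t, errors_r, t_thresh=0.05, r_thresh=5.0):
    """Fraction of test poses within both thresholds, e.g. 5cm/5deg."""
    errors_t, errors_r = np.asarray(errors_t), np.asarray(errors_r)
    return float(np.mean((errors_t < t_thresh) & (errors_r < r_thresh)))
```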
Uncertainty Calibration:
- Expected Calibration Error (ECE): Measures correlation between predicted uncertainty and actual error
- Uncertainty-Error Correlation: Pearson correlation coefficient between $\sigma_{\text{total}}$ and prediction error
Monte Carlo Dropout: For epistemic uncertainty estimation, dropout is enabled during inference:
- Dropout probability: 0.1 (standard)
- Number of samples: 10 (configurable via --mc_samples)
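A minimal sketch of Monte Carlo Dropout inference under these settings (assumes the model's forward pass returns a pose tensor):

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, image, num_samples=10):
    """Epistemic uncertainty via Monte Carlo Dropout: keep dropout active at
    inference, run several stochastic forward passes, and take the variance of
    the pose predictions."""
    model.train()  # enables dropout; in practice only dropout layers should be in train mode
    preds = torch.stack([model(image) for _ in range(num_samples)])  # (M, B, 6)
    model.eval()
    return preds.mean(dim=0), preds.var(dim=0)  # mean pose, epistemic variance
```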
MDN Training: Numerical stability is ensured by:
- Clamping log-variance: $\log \boldsymbol{\sigma}^2 \in [-10, 10]$
- Using the log-sum-exp trick for NLL computation
Semantic Synthesis: Semantic regions are manipulated via:
- Appearance modification: Color shifts, brightness adjustments
- Occlusion: Masking semantic regions
- Geometric transformation: Perturbing semantic region geometry
These modifications are applied with magnitude proportional to the curriculum difficulty $d_t$.
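A toy sketch of a difficulty-scaled appearance perturbation restricted to one semantic class; this is an illustrative stand-in for the 3DGS-based synthesizer, not the actual semantic_synthesizer implementation:

```python
import torch

def perturb_semantic_region(image, seg_map, target_class, difficulty):
    """Apply a simple colour/brightness shift to one semantic class, with magnitude
    scaled by the curriculum difficulty d_t. Occlusion and geometric perturbations
    would follow the same masking pattern.

    image: (3, H, W) in [0, 1]; seg_map: (H, W) integer class labels.
    """
    mask = (seg_map == target_class).unsqueeze(0).float()   # (1, H, W)
    color_shift = difficulty * (torch.rand(3, 1, 1) - 0.5)   # stronger shifts at higher difficulty
    perturbed = image + mask * color_shift
    return perturbed.clamp(0.0, 1.0)
```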
If you use RAP-ID in your research, please cite both this work and the original RAP paper:
@misc{bhardwaj2025rapid,
title={RAP-ID: Robust Absolute Pose Regression with Improvements \& Extensions},
author={Shivam Bhardwaj},
year={2025},
howpublished={\url{https://github.com/Shivam-Bhardwaj/RAP}}
}
@inproceedings{Li2025unleashing,
title={Unleashing the Power of Data Synthesis},
author={Sihang Li and Siqi Tan and Bowen Chang and Jing Zhang and Chen Feng and Yiming Li},
year={2025},
booktitle={International Conference on Computer Vision (ICCV)}
}

Related work:

- Uncertainty Estimation in Deep Learning:
  - Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? NeurIPS.
  - Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation. ICML.
- Mixture Density Networks:
  - Bishop, C. M. (1994). Mixture density networks. Technical report.
- Adversarial Training:
  - Goodfellow, I., et al. (2014). Generative adversarial nets. NeurIPS.
- Curriculum Learning:
  - Bengio, Y., et al. (2009). Curriculum learning. ICML.
This work is built upon the RAP system by Sihang Li, Siqi Tan, Bowen Chang, Jing Zhang, Chen Feng, and Yiming Li. The original RAP paper was accepted to ICCV 2025.
The original RAP repository is built on Gaussian-Wild, Deblur-GS, and DFNet.
- Original RAP Repository: https://github.com/ai4ce/RAP
- Original Paper: https://ai4ce.github.io/RAP/static/RAP_Paper.pdf
- Project Page: https://ai4ce.github.io/RAP/
The following section contains setup instructions from the original RAP repository for reference.
The current implementation stores the entire training set in memory, requiring approximately 16GB of RAM (training 3DGS on the office scene with depth regularization may require more than 64GB).
Running RAP-ID on a single device requires a CUDA-compatible GPU, with 24GB VRAM recommended.
We also support training RAPNet and rendering 3DGS in parallel on different devices. If you choose to do so, ensure that args.device != args.render_device. This is the default behavior when using the BRISQUE score to filter out low-quality rendered images (i.e., args.brisque_threshold != 0). The current implementation calculates the BRISQUE score on the CPU due to its better SVM model, which, despite optimizations achieving ~70 FPS, remains slower than a GPU version.
In theory, RAPNet can be trained on devices other than NVIDIA GPUs, but this has not been tested. Still, rendering 3DGS requires a CUDA-compatible GPU.
Post-refinement requires a CUDA-compatible GPU with at least 6GB of VRAM.
- Clone the repository in recursive mode as it contains submodules:

git clone https://github.com/Shivam-Bhardwaj/RAP.git --recursive

- Make sure you have an environment with Python 3.11+ and the CUDA Toolkit, with the nvcc compiler accessible from the command line. If you are on Windows, you need to install Visual Studio with the MSVC C++ SDK first, and then install the CUDA Toolkit.

- Make sure PyTorch 2.0 or later is installed in your environment. We recommend PyTorch 2.6+. The CUDA version of PyTorch should match the version used by nvcc (check with nvcc -V) and should not exceed the version supported by your GPU driver (check with nvidia-smi).

  We use torch.compile for acceleration and to reduce memory, and its triton backend only supports Linux. When torch.compile is enabled for a module, it may appear stuck for a while during its first and last forward pass in the first epoch of training and validation, depending on your CPU's single-core performance. Windows and older PyTorch versions might work if you set args.compile_model = False and make sure args.compile = False when running the code, but this may be buggy, slower, and consume more memory, so it is not recommended.

- Install packages. This might take a while as it involves compiling two CUDA extensions.

pip install -r requirements.txt
The original diff-gaussian-rasterizer is needed if you want to use inverted depth maps for supervision. Use the following command to build and install:

pip install "git+https://github.com/graphdeco-inria/diff-gaussian-rasterization.git@dr_aa"

pytorch3d is needed if you want to use Bezier interpolation when training deblurring Gaussians. Use the following command to build and install:

pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
python gs.py -s /path/to/colmap/data -m /path/to/output

Useful arguments include: white_background, eval, train_fraction, antialiasing, use_masks, iterations, position_lr_final, position_lr_max_steps, percent_dense, densify_until_iter, densify_grad_threshold, use_depth_loss, depth_is_inverted, deblur, prune_more.
If you want to use depth supervision for datasets that do not come with metric depths, please follow the instructions provided here. Training will be slower, and we do not observe much benefit in our subsequent APR.
Note that for 3DGS-related arguments, only -s, --source_path and -m, --model_path will be taken from the command line when running render.py, rap.py, and refine.py. Other arguments will be loaded from cfg_args in the 3DGS model directory. If you want to change some arguments, you may simply edit the cfg_args file or assign values in the code.
python rap.py -c configs/actual_config_file.txt -m /path/to/3dgs

See arguments/options.py for argument usage.
Due to uncontrollable randomness, the computation order of floating-point numbers may vary across different devices, batch sizes, and whether the model is compiled, potentially leading to results that differ from those reported in the paper.
python refine.py -c configs/actual_config_file.txt -m /path/to/3dgs

Post-refinement is more CPU-intensive than other tasks.





