Fast panorama stitching of two images.
Used for stitching two high-resolution (5K) GoPro videos into a single panorama video of roughly 8500x3300 pixels. This repo delivers the high-performance stitching code. Iterating over the frames of the videos and stitching them is not currently part of this repo, which focuses mainly on the code that stitches a single frame.
Speed on RTX4090 with 6 laplacian levels: 340 fps
Speed on Jetson with 6 levels: 8 fps
Speed on Jetson with 0 levels: 20 fps
On the Jetson, scaling the pano config down to a width of around 6000 pixels brings the speed up to 50 fps.
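To relate these throughput figures to a per-frame time budget, the conversion is simply 1000 / fps. The sketch below uses only the numbers quoted above; the 30 fps source rate and the helper names are assumptions made for illustration, not values taken from the repo.

```python
# Throughput figures quoted in this README (frames per second).
MEASURED_FPS = {
    "RTX4090, 6 levels": 340,
    "Jetson, 6 levels": 8,
    "Jetson, 0 levels": 20,
    "Jetson, ~6000 px wide pano": 50,
}


def frame_time_ms(fps: float) -> float:
    """Average time spent on one stitched frame, in milliseconds."""
    return 1000.0 / fps


def keeps_up_with(fps: float, source_fps: float) -> bool:
    """True if stitching is at least as fast as the source frame rate."""
    return fps >= source_fps


if __name__ == "__main__":
    SOURCE_FPS = 30.0  # assumed camera frame rate, for illustration only
    for name, fps in MEASURED_FPS.items():
        status = "real time" if keeps_up_with(fps, SOURCE_FPS) else "slower than source"
        print(f"{name}: {frame_time_ms(fps):.1f} ms/frame ({status})")
```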
Procedure:
Install Hugin and Enblend:
sudo apt-get install hugin hugin-tools enblend
Install bazelisk if it is not already available
./scripts/install_bazelisk.sh
Build the code
bazelisk build //...
Put the left and right videos into some directory.
Run the configuration script to configure the stitching. It writes some project and mapping files (along with keypoint matching illustrations) to the directory where the videos reside.
python scripts/create_control_points.py <path to left video> <path to right video>
Run the stitching test
./bazel-bin/tests/test_cuda_blend --show --perf --output=myframe.png --directory=<directory where the video was/project files were saved>
Stitched Panorama (CUDA Kernels, hard seam or laplacian blending)
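As a rough guide to what the number of Laplacian levels means for image sizes, the sketch below lists the dimensions of each pyramid level. It assumes that each level halves width and height (rounding up) and that `levels` counts the reduced levels below the full-resolution image; both are common conventions, but the repo's kernels may differ. `pyramid_sizes` is a hypothetical helper, not part of the repo.

```python
def pyramid_sizes(width: int, height: int, levels: int) -> list[tuple[int, int]]:
    """Return (width, height) for the base image followed by each reduced level.

    Assumption: every level halves both dimensions, rounding up.
    """
    sizes = [(width, height)]
    for _ in range(levels):
        width, height = (width + 1) // 2, (height + 1) // 2
        sizes.append((width, height))
    return sizes


if __name__ == "__main__":
    # Approximate panorama size quoted at the top of this README.
    for level, (w, h) in enumerate(pyramid_sizes(8500, 3300, 6)):
        print(f"level {level}: {w}x{h}")
```

With `levels=0` the list contains only the full-resolution image, which corresponds to the hard-seam case where no pyramid is built.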

The commands below are tested on the provided assets using the Python from the ubuntu conda env and the built Bazel binaries.
- Generate control points and Hugin mappings (writes into assets/):
python3 scripts/create_control_points.py --left assets/left.png --right assets/right.png --max-control-points 200
- Run the CUDA stitcher with 6 Laplacian levels (use --levels=0 for a hard seam, which is fastest):
./bazel-bin/tests/test_cuda_blend --levels=6 --directory=assets --output=assets/pano_left_right.png
- View the result:
viewnior assets/pano_left_right.png &
- Result preview (clickable path):
assets/pano_left_right.png
Example outputs and diagnostics written to assets/:
- Matches: assets/matches.png
- Keypoints: assets/keypoints.png
- Seam file (from enblend): assets/seam_file.png
- Mappings: assets/mapping_000{0,1}_[xy].tif
Prepare a small working folder and convert the inputs to the naming the demos expect (image0.png, image1.png, image2.png):
mkdir -p assets/three
ffmpeg -y -loglevel error -i assets/weir_1.jpg assets/three/image0.png
ffmpeg -y -loglevel error -i assets/weir_2.jpg assets/three/image1.png
ffmpeg -y -loglevel error -i assets/weir_3.jpg assets/three/image2.png
Create the Hugin project, find control points with LightGlue, optimize, and export remap files (mapping_000i_[xy].tif) and warped layers (mapping_000i.tif):
cd assets/three
python3 ../../scripts/create_control_points_multi.py --inputs image0.png image1.png image2.png --max-control-points 200
Generate the seam mask with multiblend (recommended for 3+). This uses the Bazel external @multiblend:
bazelisk run @multiblend//:multiblend -- --save-seams=seam_file.png -o panorama.tif mapping_????.tif
Alternatively, you can use enblend (also available as Bazel external @enblend) and convert to a 3‑class paletted mask automatically:
python3 ../../scripts/generate_seam.py --directory=$(pwd) --num-images=3 --seam enblend
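Whichever tool produces the seam, the resulting mask is expected to contain one class per image. A quick sanity check could look like the sketch below. `seam_class_report` is a hypothetical helper that works on a flat sequence of palette indices; how those indices are read from seam_file.png is left to the caller.

```python
def seam_class_report(indices, num_images):
    """Return (missing, unexpected) class indices for a seam mask.

    indices: iterable of per-pixel palette indices.
    num_images: number of stitched images; classes 0..num_images-1 are expected.
    """
    present = set(indices)
    expected = set(range(num_images))
    return sorted(expected - present), sorted(present - expected)


if __name__ == "__main__":
    # One possible way to obtain the indices, assuming Pillow is installed
    # and the PNG is stored in paletted ("P") mode:
    #   from PIL import Image
    #   indices = list(Image.open("seam_file.png").getdata())
    good = [0, 0, 1, 1, 2, 2]
    bad = [0, 0, 1, 1, 3]
    print(seam_class_report(good, 3))  # no missing or unexpected classes
    print(seam_class_report(bad, 3))   # class 2 missing, class 3 unexpected
```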
Stitch using the 3-image path (6 levels) and view it:
cd -
./bazel-bin/tests/test_cuda_blend3 --levels=6 --directory=assets/three --output=assets/three/pano_three_3way.png
viewnior assets/three/pano_three_3way.png &
Alternatively, the generic N-image path works with the same folder (num-images=3):
./bazel-bin/tests/test_cuda_blend_n --levels=6 --num-images=3 --directory=assets/three --output=assets/three/pano_three_n.png
Input previews:
assets/three/image0.png
assets/three/image1.png
assets/three/image2.png
Output previews:
assets/three/pano_three_3way.png
assets/three/pano_three_n.png
You can stitch 2–8 images using the N-image path.
Build:
bazelisk build //tests:test_cuda_blend_n
Run (soft seam example with 4 inputs):
./bazel-bin/tests/test_cuda_blend_n \
--num-images=4 --levels=6 \
--directory=<data_dir> \
--output=out.png --show
Run (hard seam, no pyramid):
./bazel-bin/tests/test_cuda_blend_n \
--num-images=3 --levels=0 \
--directory=<data_dir> \
--output=out.png
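Conceptually, a hard seam takes every output pixel from exactly one input, chosen by the class stored in the seam mask. The toy sketch below illustrates that idea with nested lists standing in for images that are already warped onto a common canvas. It is an illustration of the concept only and is not the repo's CUDA kernel; `hard_seam_composite` is a hypothetical name.

```python
def hard_seam_composite(layers, seam):
    """Pick each output pixel from the layer named by the seam mask.

    layers: list of images (each a list of rows) already aligned on one canvas.
    seam:   image of the same size holding a layer index per pixel.
    """
    out = []
    for y, seam_row in enumerate(seam):
        # For every pixel, look up which layer owns it and copy that value.
        out.append([layers[cls][y][x] for x, cls in enumerate(seam_row)])
    return out


if __name__ == "__main__":
    left = [["L"] * 4 for _ in range(2)]
    right = [["R"] * 4 for _ in range(2)]
    seam = [
        [0, 0, 1, 1],
        [0, 1, 1, 1],
    ]
    for row in hard_seam_composite([left, right], seam):
        print("".join(row))
```

A Laplacian blend replaces this per-pixel selection with a weighted mix at several resolutions, which hides the seam at the cost of extra work per frame.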
Expected files under <data_dir> for N images (0..N-1):
- Input frames: image0.png, image1.png, ..., image{N-1}.png
- Remaps (CV_16U): mapping_000i_x.tif, mapping_000i_y.tif for each i
- Positions (TIFF tags): mapping_000i.tif (used to derive canvas placement)
- Seam mask (indexed 8-bit paletted): seam_file.png with classes [0..N-1] (one class per image)
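Before launching the stitcher it can help to confirm that the data directory is complete. The sketch below builds the file names listed above and reports any that are missing. It assumes the index in mapping_000i is zero-padded to four digits, which matches the names shown here; `expected_files` and `missing_files` are hypothetical helpers, not part of the repo.

```python
from pathlib import Path


def expected_files(num_images: int) -> list[str]:
    """File names the N-image path expects for images 0..N-1."""
    names = []
    for i in range(num_images):
        stem = f"mapping_{i:04d}"  # assumed four-digit zero padding
        names += [f"image{i}.png", f"{stem}_x.tif", f"{stem}_y.tif", f"{stem}.tif"]
    names.append("seam_file.png")
    return names


def missing_files(data_dir: str, num_images: int) -> list[str]:
    """Return the expected files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected_files(num_images) if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("assets/three", 3)
    print("all files present" if not missing else f"missing: {missing}")
```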
Two runnable apps use OpenCV to decode/encode entire videos while stitching each frame with the CUDA pano:
- //tests:stitch_two_videos
- //tests:stitch_three_videos
Build:
bazelisk build //tests:stitch_two_videos //tests:stitch_three_videos
Usage (two videos):
./bazel-bin/tests/stitch_two_videos \
--left=<left.mp4> --right=<right.mp4> \
--control=<dir_with_mapping_and_seam> \
--output=stitched_two.mp4 \
--levels=6 --gpu-decode=1 --gpu-encode=1 \
--show \
--show-scaled=0.5
Usage (three videos):
./bazel-bin/tests/stitch_three_videos \
--left=<video0.mp4> --middle=<video1.mp4> --right=<video2.mp4> \
--control=<dir_with_mapping_and_seam> \
--output=stitched_three.mp4 \
--levels=6 --gpu-decode=1 --gpu-encode=1
Notes:
- --control must point to a folder containing the Hugin-generated remaps and seam: mapping_000{i}_{x,y}.tif, mapping_000{i}.tif, and seam_file.png (2 files for two-video, 3 files for three-video).
- The apps attempt GPU decode/encode via OpenCV cudacodec if available; otherwise they fall back to CPU.
- Output codec defaults to mp4v for broad compatibility; override with --fourcc=avc1 or --fourcc=hevc if supported.
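When stitching many video pairs, it may be convenient to assemble the command line from a script. The sketch below only builds the argument list from the flags shown above and leaves execution to the caller. `stitch_two_videos_cmd` is a hypothetical helper, and passing `0` to --gpu-decode / --gpu-encode to disable them is an assumption (only `=1` appears in the usage above).

```python
def stitch_two_videos_cmd(left, right, control, output, levels=6,
                          use_gpu_codec=True, fourcc=None,
                          binary="./bazel-bin/tests/stitch_two_videos"):
    """Build the argument list for the two-video stitching app."""
    codec_flag = "1" if use_gpu_codec else "0"  # "0" is assumed to disable
    cmd = [
        binary,
        f"--left={left}",
        f"--right={right}",
        f"--control={control}",
        f"--output={output}",
        f"--levels={levels}",
        f"--gpu-decode={codec_flag}",
        f"--gpu-encode={codec_flag}",
    ]
    if fourcc:
        # Only add the override when requested; the app defaults to mp4v.
        cmd.append(f"--fourcc={fourcc}")
    return cmd


if __name__ == "__main__":
    cmd = stitch_two_videos_cmd("left.mp4", "right.mp4", "assets", "stitched_two.mp4",
                                fourcc="avc1")
    print(" ".join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```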
The C++ stitcher supports three backend ports with a shared public pano/kernel API and shared test targets.
The CUDA port is the default backend on NVIDIA systems.
- Build: bazelisk build --config=cuda //src/... //tests/...
- Test: bazelisk test --config=cuda //src/... //tests/...
- CUDA builds use NVCC and compile fatbins for all GPU architectures supported by the installed CUDA toolchain.
The HIP/ROCm port targets AMD GPUs using the same kernel and pano interfaces.
- Prerequisites:
  - ROCm installation with hipcc, HIP headers, and libamdhip64
  - either ROCM_PATH/HIP_PATH exported or a standard ROCm install path
- Build: bazelisk build --config=rocm //src/... //tests/...
- Test: bazelisk test --config=rocm //src/... //tests/...
Notes:
- CUDA/HIP runtime differences are bridged via cupano/gpu/gpu_runtime.h and cupano/gpu/gpu_gl_interop.h.
- HIP kernel compilation is provided by a genrule that invokes hipcc (tagged manual).
- Explicit HIP kernel build target: bazelisk build --config=rocm //src/cuda:cuda_blend_cuda_lib_hip
The Vulkan port keeps CUDA-facing kernel APIs stable so pano code and tests run unchanged under --config=vulkan.
- Build: bazelisk build --config=vulkan //src/... //tests/...
- Test: bazelisk test --config=vulkan //src/... //tests/...
Backend selection:
- --config=cuda, --config=rocm, and --config=vulkan set the corresponding backend port.
- You can also set GPU_BACKEND=cuda|rocm|vulkan directly (for Vulkan this matches --define=backend=vulkan).
A Python port of the two-image and generic N-image stitchers now lives under cupano/.
It mirrors the C++ control-mask loaders and the cudaPano / cudaPanoN orchestration, but runs on PyTorch tensors in BHWC layout.
Key points:
- cupano.CudaStitchPano: two-image path with hard seam or Laplacian blend.
- cupano.CudaStitchPanoN: generic 2..8 image path, including the minimized soft-seam ROI path.
- cupano.ControlMasks / cupano.ControlMasksN: Python loaders for the same Hugin/enblend mapping outputs used by the C++ code.
- Backend selection is exposed as backend="auto" | "triton".
- There is no separate public python-torch mode anymore; auto selects Triton for supported GPU tensors and otherwise falls back internally.
- The implementation targets portability across CUDA and ROCm through PyTorch, with Triton kernels for ROI copy, remap, mask/image pyramid construction, Laplacian generation, reconstruction, and N-way blend when Triton is installed.
- The Python stitchers reuse persistent blend scratch buffers, and on CUDA they can optionally capture steady-state execution with CUDA graphs.
Example:
import cv2
import torch
from cupano import ControlMasks, CudaStitchPano

# Load the Hugin/enblend control masks (remaps, positions, seam) from assets/.
masks = ControlMasks("assets")

def load_bgra(path):
    # Read as BGR, add an alpha channel, and return a float BHWC tensor on the GPU.
    bgra = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2BGRA)
    return torch.from_numpy(bgra).float().cuda().unsqueeze(0)

left = load_bgra("assets/image0.png")
right = load_bgra("assets/image1.png")
stitcher = CudaStitchPano(batch_size=1, num_levels=6, control_masks=masks, backend="auto", enable_cuda_graphs=True)
out = stitcher.process(left, right)

Tests:
pytest -q tests/python/test_pytorch_pano.py
pytest -q tests/python/test_triton_backend.py
Real-image parity harness:
python scripts/compare_pano_parity.py \
--left assets/left.png \
--right assets/right.png \
--levels 0 \
--backend triton \
--work-dir /tmp/cupano_parity \
--build-if-needed
This compares the Python stitcher against the existing C++ binary on the same control-mask directory and writes:
- C++ output
- Python output
- a diff heatmap
- JSON metrics (max_abs, mean_abs, rmse, psnr)
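For reference, the four metric names have conventional definitions, sketched below on flat sequences of pixel values. This is a plain-Python illustration under the assumption of an 8-bit peak value of 255 for PSNR; the exact definitions used by scripts/compare_pano_parity.py are not confirmed here, and `parity_metrics` is a hypothetical helper.

```python
import math


def parity_metrics(a, b, peak=255.0):
    """Compare two equally sized pixel sequences.

    Returns max_abs, mean_abs, rmse, and psnr (in dB, relative to `peak`).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    mse = sum(d * d for d in diffs) / len(diffs)
    rmse = math.sqrt(mse)
    return {
        "max_abs": max(diffs),
        "mean_abs": sum(diffs) / len(diffs),
        "rmse": rmse,
        # Identical inputs have zero error, so PSNR is unbounded.
        "psnr": math.inf if rmse == 0 else 20.0 * math.log10(peak / rmse),
    }


if __name__ == "__main__":
    cpp_pixels = [10, 20, 30, 40]
    py_pixels = [10, 22, 30, 36]
    print(parity_metrics(cpp_pixels, py_pixels))
```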
Synthetic performance benchmark:
python scripts/benchmark_pano.py \
--sizes 768x384 1024x512 1536x768 \
--levels 0 1 6 \
--output-json /tmp/cupano_bench.json
This benchmarks the existing two-image C++ binary and the Python Triton backend on the same synthetic control directories and prints a summary table plus JSON.
Pass --disable-cuda-graphs to benchmark the eager Python path instead of the CUDA-graph replay path.
Note on CuTe DSL:
- NVIDIA CuTe DSL is CUDA-only, so it cannot satisfy a single-source CUDA+ROCm requirement.
- The Python port therefore prioritizes a portable PyTorch implementation rather than introducing a CUDA-only CuTe dependency.
- Triton is the cross-vendor kernel path in this repo; on ROCm it depends on a ROCm-capable Triton/PyTorch installation, and on CUDA it uses the regular Triton GPU backend.



