GitHub - cjolivier01/hm-cupano: HockeyMom Fast CUDA Pano Stitching

Fast panorama stitching of two images.

Used for stitching two high-resolution (5K) GoPro videos into a single panorama video of roughly 8500x3300 pixels. This repo delivers the high-performance stitching code. Iterating over the frames of the videos and stitching each one is not currently part of this repo, which focuses on stitching a single frame.

Speed on RTX4090 with 6 laplacian levels:     340 fps
Speed on Jetson with 6 levels:                  8 fps
Speed on Jetson with 0 levels:                 20 fps

For the Jetson, scaling the pano config down to a width of around 6000 pixels brings it up to 50 fps.

Procedure:

Install Hugin and Enblend:

sudo apt-get install hugin hugin-tools enblend

Install bazelisk if it is not already available:

./scripts/install_bazelisk.sh

Build the code

bazelisk build //...

Put the left and right videos into some directory. Run the configuration script to configure the stitching. It will write some project and mapping files (along with keypoint matching illustrations) to the directory where the videos reside.

python scripts/create_control_points.py <path to left video> <path to right video>

Run the stitching test

./bazel-bin/tests/test_cuda_blend --show --perf --output=myframe.png --directory=<directory where the video was/project files were saved>

Left frame:

Right frame:

Key points (SuperPoint)

Matches (LightGlue)

Stitched Panorama (CUDA Kernels, hard seam or laplacian blending)

Working Examples

The commands below are tested on the provided assets using the Python from the ubuntu conda env and the built Bazel binaries.

1) Two-Image Stitch (assets/left.png + assets/right.png)

Generate control points and Hugin mappings (writes into assets/):

python3 scripts/create_control_points.py --left assets/left.png --right assets/right.png --max-control-points 200

Run the CUDA stitcher with 6 Laplacian levels (pass --levels=0 instead for a hard seam, which is fastest):

./bazel-bin/tests/test_cuda_blend --levels=6 --directory=assets --output=assets/pano_left_right.png

View the result:

viewnior assets/pano_left_right.png &

Result preview (clickable path): assets/pano_left_right.png

Example outputs and diagnostics written to assets/:

Matches: assets/matches.png
Keypoints: assets/keypoints.png
Seam file (from enblend): assets/seam_file.png
Mappings: assets/mapping_000{0,1}_[xy].tif

2) Three-Image Stitch (assets/weir_1,2,3 → assets/three/)

Prepare a small working folder and convert the inputs to the naming the demos expect (image0.png, image1.png, image2.png):

mkdir -p assets/three
ffmpeg -y -loglevel error -i assets/weir_1.jpg assets/three/image0.png
ffmpeg -y -loglevel error -i assets/weir_2.jpg assets/three/image1.png
ffmpeg -y -loglevel error -i assets/weir_3.jpg assets/three/image2.png

Create the Hugin project, find control points with LightGlue, optimize, and export remap files (mapping_000i_[xy].tif) and warped layers (mapping_000i.tif):

cd assets/three
python3 ../../scripts/create_control_points_multi.py --inputs image0.png image1.png image2.png --max-control-points 200

Generate the seam mask with multiblend (recommended for 3+). This uses the Bazel external @multiblend:

bazelisk run @multiblend//:multiblend -- --save-seams=seam_file.png -o panorama.tif mapping_????.tif

Alternatively, you can use enblend (also available as Bazel external @enblend) and convert to a 3‑class paletted mask automatically:

python3 ../../scripts/generate_seam.py --directory=$(pwd) --num-images=3 --seam enblend

Stitch using the 3-image path (6 levels) and view it:

cd -
./bazel-bin/tests/test_cuda_blend3 --levels=6 --directory=assets/three --output=assets/three/pano_three_3way.png
viewnior assets/three/pano_three_3way.png &

Alternatively, the generic N-image path works with the same folder (num-images=3):

./bazel-bin/tests/test_cuda_blend_n --levels=6 --num-images=3 --directory=assets/three --output=assets/three/pano_three_n.png

Input previews:

assets/three/image0.png
assets/three/image1.png
assets/three/image2.png

Output previews:

assets/three/pano_three_3way.png
assets/three/pano_three_n.png

N-Image Stitching (Arbitrary N)

You can stitch 2–8 images using the N-image path.

Build:

bazelisk build //tests:test_cuda_blend_n

Run (soft seam example with 4 inputs):

./bazel-bin/tests/test_cuda_blend_n \
  --num-images=4 --levels=6 \
  --directory=<data_dir> \
  --output=out.png --show

Run (hard seam, no pyramid):

./bazel-bin/tests/test_cuda_blend_n \
  --num-images=3 --levels=0 \
  --directory=<data_dir> \
  --output=out.png
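
To make the --levels flag more concrete, here is a minimal 1D sketch of seam-driven Laplacian pyramid blending in plain Python. It is an illustration only, not the repo's CUDA kernels: it assumes a box filter for downsampling and sample repetition for upsampling (real implementations usually use Gaussian filtering), and the names down, up, laplacian_pyramid, weight_pyramid, and blend_n are hypothetical. With levels=0 it reduces to the hard-seam case, where every output sample is copied from the input named by the seam class.

```python
def down(x):
    """Halve the resolution by averaging neighbouring pairs (box filter)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def up(x):
    """Double the resolution by repeating every sample."""
    return [v for v in x for _ in range(2)]


def laplacian_pyramid(signal, levels):
    """Return `levels` detail bands followed by the coarsest residual."""
    bands, current = [], [float(v) for v in signal]
    for _ in range(levels):
        smaller = down(current)
        bands.append([c - u for c, u in zip(current, up(smaller))])
        current = smaller
    return bands + [current]


def weight_pyramid(mask, levels):
    """Progressively smoothed copies of a 0/1 mask, one per pyramid level."""
    pyramid = [[float(v) for v in mask]]
    for _ in range(levels):
        pyramid.append(down(pyramid[-1]))
    return pyramid


def blend_n(signals, seam, levels):
    """Blend N equal-length signals using a seam of class indices 0..N-1."""
    if len(seam) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    count = len(signals)
    bands = [laplacian_pyramid(s, levels) for s in signals]
    # One weight pyramid per input, built from the one-hot seam classes.
    weights = [
        weight_pyramid([1.0 if c == i else 0.0 for c in seam], levels)
        for i in range(count)
    ]
    mixed = []
    for level in range(levels + 1):
        width = len(bands[0][level])
        mixed.append([
            sum(weights[i][level][k] * bands[i][level][k] for i in range(count))
            for k in range(width)
        ])
    # Collapse the blended pyramid from coarse to fine.
    out = mixed[-1]
    for band in reversed(mixed[:-1]):
        out = [u + d for u, d in zip(up(out), band)]
    return out


if __name__ == "__main__":
    inputs = [[10.0] * 8, [20.0] * 8, [30.0] * 8]
    seam = [0, 0, 0, 1, 1, 1, 2, 2]
    print("hard seam:", blend_n(inputs, seam, 0))
    print("2 levels: ", blend_n(inputs, seam, 2))
```

Because the one-hot weights sum to 1 at every level, the blend stays a convex combination of the inputs; more levels spread the transition over a wider region at the cost of more work per frame.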

Expected files under <data_dir> for N images (0..N-1):

Input frames: image0.png, image1.png, ..., image{N-1}.png
Remaps (CV_16U): mapping_000i_x.tif, mapping_000i_y.tif for each i
Positions (TIFF tags): mapping_000i.tif (used to derive canvas placement)
Seam mask (indexed 8-bit paletted): seam_file.png with classes [0..N-1] (one class per image)
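
A quick pre-flight check of <data_dir> can save a failed run. The sketch below lists the files named above and reports which are missing. The helper names expected_files and missing_files are hypothetical, and the four-digit zero padding of the index (mapping_0000, mapping_0001, ...) is an assumption inferred from the mapping_000i pattern.

```python
from pathlib import Path


def expected_files(num_images):
    """File names the N-image path is described as reading from <data_dir>."""
    names = ["seam_file.png"]
    for i in range(num_images):
        stem = f"mapping_{i:04d}"  # assumption: index zero-padded to 4 digits
        names += [f"image{i}.png", f"{stem}_x.tif", f"{stem}_y.tif", f"{stem}.tif"]
    return names


def missing_files(data_dir, num_images):
    """Return the expected file names that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in expected_files(num_images) if not (root / name).is_file()]


if __name__ == "__main__":
    for name in missing_files("assets/three", 3):
        print("missing:", name)
```

This only checks that the files exist; it does not validate the TIFF contents or the palette of the seam mask.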

Video Stitching Apps

Two runnable apps use OpenCV to decode/encode entire videos while stitching each frame with the CUDA pano:

//tests:stitch_two_videos
//tests:stitch_three_videos

Build:

bazelisk build //tests:stitch_two_videos //tests:stitch_three_videos

Usage (two videos):

./bazel-bin/tests/stitch_two_videos \
  --left=<left.mp4> --right=<right.mp4> \
  --control=<dir_with_mapping_and_seam> \
  --output=stitched_two.mp4 \
  --levels=6 --gpu-decode=1 --gpu-encode=1 \
  --show \
  --show-scaled=0.5

Usage (three videos):

./bazel-bin/tests/stitch_three_videos \
  --left=<video0.mp4> --middle=<video1.mp4> --right=<video2.mp4> \
  --control=<dir_with_mapping_and_seam> \
  --output=stitched_three.mp4 \
  --levels=6 --gpu-decode=1 --gpu-encode=1

Notes:

--control must point to a folder containing the Hugin-generated remaps and seam: mapping_000{i}_{x,y}.tif and mapping_000{i}.tif for each input i, plus seam_file.png (two sets of mapping files for two-video, three sets for three-video).
The apps attempt GPU decode/encode via OpenCV cudacodec if available; otherwise they fall back to CPU.
Output codec defaults to mp4v for broad compatibility; override with --fourcc=avc1 or --fourcc=hevc if supported.
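
When scripting batch runs, it can help to assemble the command line programmatically. The sketch below builds the argument list for stitch_two_videos using only the flags shown above. The helper name stitch_two_videos_cmd is hypothetical, and passing 0 to --gpu-decode / --gpu-encode to force the CPU path is an assumption based on the =1 form in the usage example.

```python
import shlex


def stitch_two_videos_cmd(left, right, control, output, levels=6, gpu=True, fourcc=None):
    """Build the argv list for the two-video app from the documented flags."""
    flag = 1 if gpu else 0  # assumption: 0 disables GPU decode/encode
    cmd = [
        "./bazel-bin/tests/stitch_two_videos",
        f"--left={left}",
        f"--right={right}",
        f"--control={control}",
        f"--output={output}",
        f"--levels={levels}",
        f"--gpu-decode={flag}",
        f"--gpu-encode={flag}",
    ]
    if fourcc is not None:
        cmd.append(f"--fourcc={fourcc}")  # e.g. avc1 or hevc, if supported
    return cmd


if __name__ == "__main__":
    cmd = stitch_two_videos_cmd("left.mp4", "right.mp4", "assets", "stitched_two.mp4", fourcc="avc1")
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```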

C++ GPU Backend Ports

The C++ stitcher supports three backend ports with a shared public pano/kernel API and shared test targets.

CUDA Port (Default)

The CUDA port is the default backend on NVIDIA systems.

Build: bazelisk build --config=cuda //src/... //tests/...
Test: bazelisk test --config=cuda //src/... //tests/...
CUDA builds use NVCC and compile fatbins for all GPU architectures supported by the installed CUDA toolchain.

HIP/ROCm Port (Experimental)

The HIP/ROCm port targets AMD GPUs using the same kernel and pano interfaces.

Prerequisites:

ROCm installation with hipcc, HIP headers, and libamdhip64
Either ROCM_PATH / HIP_PATH exported or a standard ROCm install path

Build: bazelisk build --config=rocm //src/... //tests/...
Test: bazelisk test --config=rocm //src/... //tests/...

Notes:

CUDA/HIP runtime differences are bridged via cupano/gpu/gpu_runtime.h and cupano/gpu/gpu_gl_interop.h.
HIP kernel compilation is provided by a genrule that invokes hipcc (tagged manual).
Explicit HIP kernel build target: bazelisk build --config=rocm //src/cuda:cuda_blend_cuda_lib_hip.

Vulkan Port (Experimental)

The Vulkan port keeps CUDA-facing kernel APIs stable so pano code and tests run unchanged under --config=vulkan.

Build: bazelisk build --config=vulkan //src/... //tests/...
Test: bazelisk test --config=vulkan //src/... //tests/...

Backend selection:

--config=cuda, --config=rocm, and --config=vulkan set the corresponding backend port.
You can also set GPU_BACKEND=cuda|rocm|vulkan directly (for Vulkan this matches --define=backend=vulkan).
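
For CI scripts that cycle through the ports, the build and test invocations above can be generated from the backend name. This is a small sketch; the helper bazel_command is hypothetical and only reproduces the --config form of the commands listed in this section.

```python
BACKENDS = ("cuda", "rocm", "vulkan")


def bazel_command(action, backend, targets=("//src/...", "//tests/...")):
    """Return the bazelisk argv for building or testing one backend port."""
    if action not in ("build", "test"):
        raise ValueError(f"unsupported action: {action}")
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return ["bazelisk", action, f"--config={backend}", *targets]


if __name__ == "__main__":
    for backend in BACKENDS:
        print(" ".join(bazel_command("build", backend)))
```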

Python/PyTorch Port

A Python port of the two-image and generic N-image stitchers now lives under cupano/. It mirrors the C++ control-mask loaders and the cudaPano / cudaPanoN orchestration, but runs on PyTorch tensors in BHWC layout.

Key points:

cupano.CudaStitchPano: two-image path with hard seam or Laplacian blend.
cupano.CudaStitchPanoN: generic 2..8 image path, including the minimized soft-seam ROI path.
cupano.ControlMasks / cupano.ControlMasksN: Python loaders for the same Hugin/enblend mapping outputs used by the C++ code.
Backend selection is exposed as backend="auto" | "triton".
There is no separate public python-torch mode anymore; auto selects Triton for supported GPU tensors and otherwise falls back internally.
The implementation targets portability across CUDA and ROCm through PyTorch, with Triton kernels for ROI copy, remap, mask/image pyramid construction, Laplacian generation, reconstruction, and N-way blend when Triton is installed.
The Python stitchers reuse persistent blend scratch buffers, and on CUDA they can optionally capture steady-state execution with CUDA graphs.

Example:

import cv2
import torch

from cupano import ControlMasks, CudaStitchPano


def load_bgra(path):
    """Read an image as a float BGRA tensor on the GPU in BHWC layout (B=1)."""
    image = cv2.imread(path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(path)
    return torch.from_numpy(cv2.cvtColor(image, cv2.COLOR_BGR2BGRA)).float().cuda().unsqueeze(0)


# Load the Hugin/enblend mappings and seam mask from the control directory.
masks = ControlMasks("assets")
left = load_bgra("assets/image0.png")
right = load_bgra("assets/image1.png")

stitcher = CudaStitchPano(batch_size=1, num_levels=6, control_masks=masks, backend="auto", enable_cuda_graphs=True)
out = stitcher.process(left, right)

Tests:

pytest -q tests/python/test_pytorch_pano.py
pytest -q tests/python/test_triton_backend.py

Real-image parity harness:

python scripts/compare_pano_parity.py \
  --left assets/left.png \
  --right assets/right.png \
  --levels 0 \
  --backend triton \
  --work-dir /tmp/cupano_parity \
  --build-if-needed

This compares the Python stitcher against the existing C++ binary on the same control-mask directory and writes:

C++ output
Python output
a diff heatmap
JSON metrics (max_abs, mean_abs, rmse, psnr)
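
For reference, the four metric names have standard definitions, sketched below in dependency-free Python over flat sequences of pixel values. This is not the code from compare_pano_parity.py: the function name parity_metrics is hypothetical, and the peak value of 255 used for PSNR is an assumption (the script may normalize differently).

```python
import math


def parity_metrics(reference, candidate, peak=255.0):
    """Compute max_abs, mean_abs, rmse, and psnr between two pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(float(r) - float(c)) for r, c in zip(reference, candidate)]
    mse = sum(d * d for d in diffs) / len(diffs)
    rmse = math.sqrt(mse)
    return {
        "max_abs": max(diffs),
        "mean_abs": sum(diffs) / len(diffs),
        "rmse": rmse,
        # PSNR in dB; identical inputs give infinity.
        "psnr": math.inf if mse == 0 else 20.0 * math.log10(peak / rmse),
    }


if __name__ == "__main__":
    print(parity_metrics([0, 0, 0, 0], [0, 2, 0, 2]))
```

max_abs flags isolated outliers such as a single wrong seam pixel, while rmse and psnr summarize the overall agreement, so it is worth reading them together.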

Synthetic performance benchmark:

python scripts/benchmark_pano.py \
  --sizes 768x384 1024x512 1536x768 \
  --levels 0 1 6 \
  --output-json /tmp/cupano_bench.json

This benchmarks the existing two-image C++ binary and the Python Triton backend on the same synthetic control directories and prints a summary table plus JSON. Pass --disable-cuda-graphs to benchmark the eager Python path instead of the CUDA-graph replay path.

Note on CuTe DSL:

NVIDIA CuTe DSL is CUDA-only, so it cannot satisfy a single-source CUDA+ROCm requirement.
The Python port therefore prioritizes a portable PyTorch implementation rather than introducing a CUDA-only CuTe dependency.
Triton is the cross-vendor kernel path in this repo; on ROCm it depends on a ROCm-capable Triton/PyTorch installation, and on CUDA it uses the regular Triton GPU backend.
