This code was created during the CCP SyneRBI and CCPi Hackathon on Efficient integration of SIRF/STIR/CIL with Pytorch, held in April 2025: https://www.ccpsynerbi.ac.uk/events/hackathon-on-pytorch/.
This project drew inspiration from the blog post series by Chris Lattner on democratizing AI compute, particularly the discussion around Triton: https://www.modular.com/blog/democratizing-ai-compute-part-7-what-about-triton-and-python-edsls.
The core goal was to implement a computationally expensive projection operator used in iterative tomographic reconstruction – specifically the Joseph forward/back-projector – using Triton. In tomography, projection operations map between the image volume (patient anatomy) and the detector data (sinogram). The Joseph projector is one algorithm for performing this mapping. Due to the large number of rays and voxels, this operation is computationally intensive but also highly parallelisable as there are weak dependencies between data elements (e.g., calculating one ray's projection is largely independent of others), making it ideal for GPU acceleration. Traditionally, high-performance GPU kernels for such tasks are written in CUDA for Nvidia GPUs or potentially HIP for AMD (and Nvidia) GPUs. While mature and capable of extracting maximum performance, CUDA creates vendor lock-in, and maintaining separate codebases increases development effort.
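To make the operation concrete before turning to the GPU programming model, here is a highly simplified, single-ray, 2D PyTorch sketch of the idea behind Joseph's method (the function name and details are illustrative only; the projector in this repository is a full Triton GPU kernel):

```python
import torch

def joseph_line_integral_2d(img: torch.Tensor, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    # Approximate the line integral of `img` along the ray src -> dst (pixel coordinates):
    # step roughly one pixel at a time along the dominant ("driving") axis and linearly
    # interpolate the image along the other axis (the essence of Joseph's method).
    d = dst - src
    driving = 0 if d[0].abs() >= d[1].abs() else 1     # dominant axis of the ray
    other = 1 - driving
    n = int(d[driving].abs().item()) + 1               # roughly one sample per pixel plane
    t = torch.linspace(0.0, 1.0, n)
    pts = src[None, :] + t[:, None] * d[None, :]       # sample points along the ray
    k = pts[:, driving].round().long().clamp(0, img.shape[driving] - 1)
    u = pts[:, other]                                  # fractional coordinate to interpolate
    i0 = u.floor().long().clamp(0, img.shape[other] - 2)
    w = (u - i0.float()).clamp(0.0, 1.0)
    if driving == 0:
        vals = (1.0 - w) * img[k, i0] + w * img[k, i0 + 1]
    else:
        vals = (1.0 - w) * img[i0, k] + w * img[i0 + 1, k]
    return vals.sum() * d.norm() / max(n - 1, 1)       # scale by step length along the ray

# Example: a horizontal ray through a 128x128 image containing a bright square.
img = torch.zeros(128, 128)
img[32:96, 32:96] = 1.0
print(joseph_line_integral_2d(img, torch.tensor([0.0, 64.0]), torch.tensor([127.0, 64.0])))
```

Each detector bin holds one such line integral: the forward projection evaluates this for every ray in the acquisition, and the back projection is its adjoint, spreading sinogram values back into the image.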
We aimed to explore Triton as an alternative to CUDA that offers:
- Cross-Vendor Compatibility: The ability to write a single kernel targeting both Nvidia and AMD GPUs.
- Developer Productivity: A potentially simpler development experience using Python.
Triton https://triton-lang.org/main/index.html is a modern Embedded Domain-Specific Language (eDSL) and compiler designed for writing high-performance GPU kernels directly within Python. In the words of its documentation:
> Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware.
Key aspects relevant to this project include:
- Python Integration: Kernels are written in Python syntax. This leverages Python's ease of use and integrates seamlessly with libraries like PyTorch, allowing kernels to operate directly on `torch.Tensor` objects residing on the GPU. This avoids the explicit memory management (`cudaMalloc`, `cudaMemcpy`, etc.) often required in C++/CUDA, reducing complexity and potential bugs (see the minimal kernel sketch after this list).
- JIT Compilation: Triton kernels are compiled Just-In-Time (JIT) when first called. Triton's compiler analyzes the Python code and generates highly optimized machine code (e.g., PTX for Nvidia) specifically for the target GPU architecture.
- Performance Focus: While being a higher-level language, Triton is designed to generate code competitive with hand-written CUDA. It provides abstractions for managing GPU resources like shared memory and specifying tiling strategies, guiding the compiler to produce efficient parallel execution plans.
- Hardware Abstraction (via MLIR): Triton achieves its cross-vendor capability by leveraging MLIR as a compiler backend (see below). The goal is to write one Triton kernel and have the compiler generate efficient code for different GPU architectures.
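To give a flavour of what writing a Triton kernel looks like, below is a minimal sketch of the canonical vector-addition example (adapted from the Triton tutorials, and unrelated to the Joseph projector itself), showing the Python-embedded kernel, the JIT decorator, and a launch over a 1D grid of program instances:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The kernel operates directly on GPU-resident torch.Tensors: no cudaMalloc/cudaMemcpy.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # number of program instances
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)   # JIT-compiled on first call
    return out
```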
Triton doesn't translate the Python kernel code directly into PTX or other hardware-specific code. Instead, it compiles it into its own Triton IR, which is built on MLIR https://mlir.llvm.org/ and progressively lowered towards the target hardware.
MLIR is a modern compiler infrastructure project (originating from the LLVM family, spearheaded by Chris Lattner) designed to address the complexities of compiling diverse software (especially in AI/ML) for diverse hardware. As Lattner envisioned in the blog post linked above:
> Could we build a unified representation that could support every AI framework, every hardware backend, and every kind of optimization—from algebraic simplification to polyhedral analysis?
How MLIR helps Triton:
- Unified Infrastructure: MLIR provides a common framework with multiple levels of abstraction ("dialects") to represent code. Triton uses MLIR dialects to represent the kernel's computation and parallelism.
- Optimisation: Common optimization passes can be developed within the MLIR framework and applied before generating final code for the hardware.
- Hardware Targeting: MLIR handles the complex process of "lowering" the high-level representation through various intermediate steps down to something like LLVM IR. LLVM IR can then be compiled into the final machine code for specific backends.
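As an aside, the intermediate stages of this lowering can be inspected from Python. The snippet below builds on the `add_kernel` sketch above and assumes a recent Triton release in which a kernel launch returns a compiled-kernel handle exposing an `.asm` dictionary; the exact API is version-dependent, so treat it as illustrative only:

```python
# Assumes add_kernel, x, y, out and n from the vector-add sketch above, on a CUDA device.
# NOTE: version-dependent inspection API; shown purely for illustration.
handle = add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
print(handle.asm.keys())     # typically 'ttir', 'ttgir', 'llir' and 'ptx' on Nvidia hardware
print(handle.asm["ttir"])    # the MLIR-based Triton IR generated for this kernel
```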
Essentially, Triton provides the productive Python front-end, while MLIR provides the compiler infrastructure that enables optimisation and retargeting to different hardware, facilitating Triton's goal of democratising high-performance GPU programming. This project serves as a practical exploration of this toolchain for a computationally intensive tomographic image reconstruction operation.
We ran some small tests: validation comparisons of forward/back projections (and of reconstructions after 10,000 MLEM iterations), timing tests at float32 precision for forward/back projection and MLEM steps, and an adjointness test at different precisions.
Comparison of forward and back projections of the Shepp-Logan phantom against ParallelProj.
Reconstructions after 10,000 iterations of MLEM.
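For reference, MLEM uses the standard multiplicative update x <- x * A^T(y / (A x)) / (A^T 1). A minimal PyTorch sketch of such a loop, where `forward` and `backward` are hypothetical stand-ins for the Triton-based projector pair:

```python
import torch

def mlem(sino: torch.Tensor, forward, backward, n_iter: int = 10_000) -> torch.Tensor:
    # forward(x): image -> sinogram (A x); backward(y): its adjoint (A^T y).
    x = torch.ones_like(backward(sino))         # start from a uniform image
    sens = backward(torch.ones_like(sino))      # sensitivity image A^T 1
    eps = 1e-12                                 # guard against division by zero
    for _ in range(n_iter):
        ratio = sino / (forward(x) + eps)       # measured / estimated projections
        x = x * backward(ratio) / (sens + eps)  # multiplicative MLEM update
    return x
```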
| Precision | Relative adjointness error | Timing (ms) |
|---|---|---|
| float16 | 5.71e-6 | 15.4 |
| float32 | 9.10e-8 | 16.1 |
| float64 | 0.00 | 76.1 |
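An adjointness test checks that <A x, y> ≈ <x, A^T y> for the forward and back projectors. A minimal sketch of one way to compute a relative adjointness error, again with `forward` and `backward` as hypothetical stand-ins for the projector pair:

```python
import torch

def relative_adjointness_error(forward, backward, img_shape, sino_shape,
                               dtype=torch.float32, device="cuda"):
    # Compare <A x, y> with <x, A^T y> for random test tensors x and y.
    x = torch.rand(img_shape, dtype=dtype, device=device)
    y = torch.rand(sino_shape, dtype=dtype, device=device)
    lhs = torch.dot(forward(x).flatten(), y.flatten())    # <A x, y>
    rhs = torch.dot(x.flatten(), backward(y).flatten())   # <x, A^T y>
    return ((lhs - rhs).abs() / lhs.abs()).item()
```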
This repository uses a dev container to containerise the full development environment (see https://containers.dev/). We install on top of a PyTorch image; the details can be found in `.devcontainer`.


