A high-performance GPU parallel primitives library built on LuisaCompute, providing thread-level to device-level parallel operations. LCPP is a header-only C++20 library inspired by CUDA CUB (Cooperative primitives for CUDA C++).
LCPP provides a comprehensive set of parallel primitives organized in a hierarchical architecture:
```
Device Level  →  DeviceReduce, DeviceScan, DeviceRadixSort, DeviceSegmentReduce
      ↓
Agent Layer   →  Algorithm policy management (e.g., OneSweepSmallKeyTunedPolicy)
      ↓
Block Level   →  BlockReduce, BlockScan, BlockLoad, BlockStore, BlockRadixRank
      ↓
Warp Level    →  WarpReduce, WarpScan, WarpExchange (32 threads)
      ↓
Thread Level  →  ThreadReduce, ThreadScan
```
- Header-only library - Easy integration without compilation
- Multi-backend support - Leverages LuisaCompute's backend abstraction (CUDA, Metal, DirectX, Vulkan)
- Lazy compilation - Shaders are compiled on demand using the `lazy_compile` macro
- Flexible algorithms - Multiple algorithm implementations (SHARED_MEMORY, WARP_SHUFFLE)
- Type-safe - Modern C++20 with strong type checking
- Performance-focused - Optimized policies and tuning parameters
- C++20 compatible compiler
- CMake 3.26+ or XMake 3.0+
- LuisaCompute (automatically fetched if not found)
```bash
# Configure and build with XMake
xmake f -m release
xmake

# Build tests
xmake build -g test

# Run tests
xmake run block_level_test
xmake run warp_level_test
xmake run device_reduce_test
xmake run device_scan_test
xmake run device_segment_reduce
xmake run device_radix_sort_one_sweep
```

```bash
# Configure and build with CMake
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Run tests
ctest

# Or run individual tests
./tests/block_level_test
./tests/device_scan_test
```

```cpp
#include <lcpp/parallel_primitive.h>
using namespace luisa::parallel_primitive;
// Create device and stream
Context ctx{argv[0]};
Device device = ctx.create_device("cuda");
Stream stream = device.create_stream();
// Device-level reduce example
DeviceReduce<> device_reduce;
device_reduce.create(device);
Buffer<int> input = device.create_buffer<int>(1024);
Buffer<int> output = device.create_buffer<int>(1);
CommandList cmdlist;
device_reduce.Sum(cmdlist, stream, input, output, 1024);
stream << cmdlist.commit() << synchronize();
```
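DeviceScan follows the same create-then-dispatch pattern. Below is a minimal sketch, assuming DeviceScan mirrors the DeviceReduce interface above with a CUB-style `ExclusiveSum` entry point; check the headers in `src/lcpp/device/` for the exact signature.

```cpp
// Sketch only: the ExclusiveSum name and argument order are assumptions
// based on CUB and the DeviceReduce example above.
DeviceScan<> device_scan;
device_scan.create(device);

Buffer<int> scan_in = device.create_buffer<int>(1024);
Buffer<int> scan_out = device.create_buffer<int>(1024);

CommandList scan_cmdlist;
device_scan.ExclusiveSum(scan_cmdlist, stream, scan_in, scan_out, 1024);
stream << scan_cmdlist.commit() << synchronize();
```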
- ThreadReduce - Thread-level reduction operations
- ThreadScan - Thread-level scan (prefix sum) operations
- WarpReduce - Warp-level reduction (Sum, Min, Max, custom operators)
- WarpScan - Warp-level inclusive/exclusive scan
- WarpExchange - Data exchange patterns within a warp
- BlockReduce - Block-level reduction with SHARED_MEMORY and WARP_SHUFFLE algorithms (see the sketch after this list)
- BlockScan - Block-level inclusive/exclusive scan with prefix callback support
- BlockLoad - Efficient block-wide data loading (DIRECT, STRIPED modes)
- BlockStore - Efficient block-wide data storing
- BlockExchange - Data rearrangement within a block
- BlockRadixRank - Ranking operations for radix sort
- BlockDiscontinuity - Flag head/tail discontinuities in sequences
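Block primitives run inside LuisaCompute kernels rather than from the host. The sketch below is hypothetical and modeled on CUB's block-level interface: the `BlockReduce` template signature and in-kernel calling convention are assumptions, so consult `src/lcpp/block/` for LCPP's real API. The LuisaCompute DSL calls (`set_block_size`, `dispatch_x`, `thread_x`, `block_x`, `$if`) are standard.

```cpp
// Hypothetical sketch (CUB-style interface assumed, not LCPP's verified
// API): each 256-thread block reduces one value per thread, and thread 0
// writes the block's partial sum.
Kernel1D block_sum_kernel = [&](BufferVar<int> data, BufferVar<int> block_sums) {
    set_block_size(256u);
    BlockReduce<int, 256> block_reduce;      // assumed template parameters
    Var<int> value = data.read(dispatch_x());
    Var<int> sum = block_reduce.Sum(value);  // result valid in thread 0, as in CUB
    $if (thread_x() == 0u) {
        block_sums.write(block_x(), sum);
    };
};
```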
- DeviceReduce - Device-wide reduction (Sum, Min, Max, custom operators)
- DeviceScan - Device-wide inclusive/exclusive scan with decoupled look-back
- DeviceRadixSort - Radix sort with OneSweep algorithm (SortKeys, SortPairs; see the sketch after this list)
- DeviceSegmentReduce - Segmented reduction operations
- DeviceHistogram - Histogram computation
- DeviceFor - Parallel for-loop utilities
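Sorting follows the same host-side dispatch pattern as the quick-start example. A minimal sketch, assuming `DeviceRadixSort` exposes a CUB-style `SortKeys` method with the argument order used by `DeviceReduce::Sum` above (the actual signatures live in `src/lcpp/device/`):

```cpp
// Sketch only: SortKeys name and argument order are assumptions based on
// CUB and the DeviceReduce quick-start example.
DeviceRadixSort<> radix_sort;
radix_sort.create(device);

constexpr uint32_t n = 1u << 20u;
Buffer<uint32_t> keys_in = device.create_buffer<uint32_t>(n);
Buffer<uint32_t> keys_out = device.create_buffer<uint32_t>(n);

CommandList sort_cmdlist;
radix_sort.SortKeys(sort_cmdlist, stream, keys_in, keys_out, n);
stream << sort_cmdlist.commit() << synchronize();
```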
This section tracks missing features compared to the CUDA CUB library. Contributions are welcome!
- `DeviceSelect::Flagged` - Select items based on selection flags
- `DeviceSelect::If` - Select items based on predicate
- `DeviceSelect::Unique` - Select unique items from input sequence
- `DeviceSelect::UniqueByKey` - Select unique keys from key-value pairs
- `DevicePartition::Flagged` - Partition based on selection flags
- `DevicePartition::If` - Partition based on predicate
- `DeviceReduce::ReduceByKey` - Reduce segments defined by keys
- `DeviceReduce::ArgMin` - Find minimum value and its index
- `DeviceReduce::ArgMax` - Find maximum value and its index
- `DeviceScan::InclusiveScanByKey` - Segmented inclusive scan with keys
- `DeviceScan::ExclusiveScanByKey` - Segmented exclusive scan with keys
- Complete key-based scan implementation (partial implementation exists)
- `DeviceMergeSort::SortKeys` - Stable merge sort for keys
- `DeviceMergeSort::SortPairs` - Stable merge sort for key-value pairs
- `DeviceMergeSort::StableSortKeys` - Guaranteed stable sort
- `DeviceMergeSort::StableSortPairs` - Guaranteed stable sort for pairs
- `DeviceSegmentedRadixSort` - Radix sort within segments
- `DeviceSegmentedSort` - General sorting within segments
- `BlockMergeSort::Sort` - Block-level stable merge sort
- `BlockMergeSort::SortBlockedToStriped` - Sort with layout transformation
- `DeviceRunLengthEncode::Encode` - Run-length encoding
- `DeviceRunLengthEncode::NonTrivialRuns` - Encode runs with length > 1
- `DeviceAdjacentDifference::SubtractLeft` - In-place left subtraction
- `DeviceAdjacentDifference::SubtractLeftCopy` - Copy with left subtraction
- `DeviceAdjacentDifference::SubtractRight` - In-place right subtraction
- `DeviceAdjacentDifference::SubtractRightCopy` - Copy with right subtraction
- `DeviceSpmv::CsrMV` - Sparse matrix-vector multiplication (CSR format)
- `DeviceCopy::Batched` - Batched memory copy operations
- `BlockRadixSort::Sort` - Complete radix sort (not just ranking)
- `BlockRadixSort::SortDescending` - Descending radix sort
- `BlockRadixSort::SortBlockedToStriped` - Sort with layout change
- `BlockHistogram::Composite` - Block-level histogram computation
- `BlockHistogram::Init` - Initialize histogram bins
- `BlockAdjacentDifference::SubtractLeft` - Block-level left subtraction
- `BlockAdjacentDifference::SubtractRight` - Block-level right subtraction
- `BlockAdjacentDifference::FlagHeads` - Flag heads of segments
- `BlockAdjacentDifference::FlagTails` - Flag tails of segments
- `BlockRunLengthDecode::RunLengthDecode` - Decode run-length encoded data
- `BlockShuffle::Up` - Shuffle data up within block
- `BlockShuffle::Down` - Shuffle data down within block
- `BlockShuffle::Offset` - Shuffle data by offset
- `WarpMergeSort::Sort` - Warp-level stable merge sort
- `WarpLoad` - Optimized warp-level data loading
- `WarpStore` - Optimized warp-level data storing
- `WarpScan::Broadcast` - Broadcast value across warp
- `WarpReduce::HeadSegmentedReduce` - Segmented reduce with head flags
- `WarpReduce::TailSegmentedReduce` - Segmented reduce with tail flags
- `ThreadLoad` - Thread-level load with cache modifiers (CG, CA, CS, CV)
- `ThreadStore` - Thread-level store with cache modifiers (CG, CS, WT, WB)
- Add SHARED_MEMORY implementations for Warp operations (currently only WARP_SHUFFLE)
- Add TRANSPOSE and WARP_TRANSPOSE modes for BlockLoad
- Add VECTORIZE mode for BlockLoad/BlockStore
- Optimize policies for different GPU architectures
- Add support for 64-bit indexing for large data
- LuisaModule Base Class - All parallel operations inherit from `LuisaModule`, which provides unified type definitions and the shader compilation interface
- Lazy Shader Compilation - Shaders are compiled on first use via the `lazy_compile` macro
- Policy-Based Configuration - Algorithm strategies are configured through Policy template parameters (see the sketch after this list)
- Multiple Algorithm Support - Block operations support both `SHARED_MEMORY` and `WARP_SHUFFLE` algorithms
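To illustrate the policy pattern, here is a hypothetical sketch; the constant names and the assumption that the defaulted template parameter of `DeviceReduce<>` is such a policy are illustrative, not LCPP's actual identifiers (see `src/lcpp/agent/policy.h` for the real ones).

```cpp
// Hypothetical policy sketch: names and structure are assumptions,
// not LCPP's real tuning policies (see src/lcpp/agent/policy.h).
struct MyReducePolicy {
    static constexpr uint32_t BLOCK_THREADS    = 256u;  // threads per block
    static constexpr uint32_t ITEMS_PER_THREAD = 4u;    // grain size per thread
};

// A policy-parameterized primitive reads its kernel configuration from
// these compile-time constants:
DeviceReduce<MyReducePolicy> tuned_reduce;
tuned_reduce.create(device);
```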
```
src/lcpp/
├── device/          # Device-level operations
│   ├── details/     # Implementation details
│   └── device_*.h   # Public interfaces
├── agent/           # Algorithm policies and tuning
├── block/           # Block-level operations (256 threads)
│   └── detail/      # Implementation details
├── warp/            # Warp-level operations (32 threads)
│   └── details/     # Implementation details
├── thread/          # Thread-level operations
├── common/          # Common utilities and type traits
└── runtime/         # LuisaModule base and core definitions
```
- `src/lcpp/parallel_primitive.h` - Main header including all primitives
- `src/lcpp/runtime/core.h` - `LuisaModule` base class and `lazy_compile` macro
- `src/lcpp/agent/policy.h` - Tuning policies and parameters
- Standard: LLVM style with modifications (see `.clang-format`)
- Line Width: 100 characters
- Indentation: 4 spaces (no tabs)
- Language: C++20
- LuisaCompute - GPU compute framework (automatically fetched via CMake FetchContent)
- boost_ut - Unit testing framework
- cpptrace - Stack tracing utility
For detailed API documentation, see the header files in `src/lcpp/`. Each primitive class is documented with:
- Template parameters
- Member functions
- Algorithm options
- Usage examples
Contributions are welcome! Please feel free to:
- Pick an item from the TODO list
- Implement the feature following existing code patterns
- Add tests in the `tests/` directory
- Submit a pull request
- Follow the existing class hierarchy (inherit from `LuisaModule`)
- Use `lazy_compile` for shader compilation
- Support multiple algorithm variants when applicable
- Add comprehensive tests
- Document public APIs
[Specify your license here]
- Built on LuisaCompute by LuisaGroup
- Inspired by NVIDIA's CUB library