A high-performance GPU parallel primitives library built on LuisaCompute, providing thread-level to device-level parallel operations. LCPP is a header-only C++20 library inspired by CUDA CUB (Cooperative primitives for CUDA C++).
LCPP provides a comprehensive set of parallel primitives organized in a hierarchical architecture:
```
Device Level  →  DeviceReduce, DeviceScan, DeviceRadixSort, DeviceSegmentReduce
      ↓
Agent Layer   →  Algorithm policy management (e.g., OneSweepSmallKeyTunedPolicy)
      ↓
Block Level   →  BlockReduce, BlockScan, BlockLoad, BlockStore, BlockRadixRank
      ↓
Warp Level    →  WarpReduce, WarpScan, WarpExchange (32 threads)
      ↓
Thread Level  →  ThreadReduce, ThreadScan
```
- Header-only library - Easy integration without compilation
- Multi-backend support - Leverages LuisaCompute's backend abstraction (CUDA, Metal, DirectX, Vulkan)
- Lazy compilation - Shaders are compiled on demand using the `lazy_compile` macro
- Flexible algorithms - Multiple algorithm implementations (SHARED_MEMORY, WARP_SHUFFLE)
- Type-safe - Modern C++20 with strong type checking
- Performance-focused - Optimized policies and tuning parameters
- C++20 compatible compiler
- CMake 3.26+ or XMake 3.0+
- LuisaCompute (automatically fetched if not found)
```bash
# Configure and build with XMake
xmake f -m release
xmake

# Build tests
xmake build -g test

# Run tests
xmake run block_level_test
xmake run warp_level_test
xmake run device_reduce_test
xmake run device_scan_test
xmake run device_segment_reduce
xmake run device_radix_sort_one_sweep
```

```bash
# Configure and build with CMake
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Run tests
ctest

# Or run individual tests
./tests/block_level_test
./tests/device_scan_test
```

```cpp
#include <lcpp/parallel_primitive.h>
using namespace luisa::parallel_primitive;
// Create device and stream
Context ctx{argv[0]};
Device device = ctx.create_device("cuda");
Stream stream = device.create_stream();
// Device-level reduce example
DeviceReduce<> device_reduce;
device_reduce.create(device);
Buffer<int> input = device.create_buffer<int>(1024);
Buffer<int> output = device.create_buffer<int>(1);
CommandList cmdlist;
device_reduce.Sum(cmdlist, stream, input, output, 1024);
stream << cmdlist.commit() << synchronize();
```
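DeviceScan follows the same create-then-dispatch pattern. Below is a minimal sketch, assuming DeviceScan mirrors the DeviceReduce interface above with a CUB-style `ExclusiveSum` entry point; check the headers in `src/lcpp/device/` for the exact signature.

```cpp
// Sketch only: the ExclusiveSum name and argument order are assumptions
// based on CUB and the DeviceReduce example above.
DeviceScan<> device_scan;
device_scan.create(device);

Buffer<int> scan_in = device.create_buffer<int>(1024);
Buffer<int> scan_out = device.create_buffer<int>(1024);

CommandList scan_cmdlist;
device_scan.ExclusiveSum(scan_cmdlist, stream, scan_in, scan_out, 1024);
stream << scan_cmdlist.commit() << synchronize();
```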
- ThreadReduce - Thread-level reduction operations
- ThreadScan - Thread-level scan (prefix sum) operations
- WarpReduce - Warp-level reduction (Sum, Min, Max, custom operators)
- WarpScan - Warp-level inclusive/exclusive scan
- WarpExchange - Data exchange patterns within a warp
- BlockReduce - Block-level reduction with SHARED_MEMORY and WARP_SHUFFLE algorithms (see the sketch after this list)
- BlockScan - Block-level inclusive/exclusive scan with prefix callback support
- BlockLoad - Efficient block-wide data loading (DIRECT, STRIPED modes)
- BlockStore - Efficient block-wide data storing
- BlockExchange - Data rearrangement within a block
- BlockRadixRank - Ranking operations for radix sort
- BlockDiscontinuity - Flag head/tail discontinuities in sequences
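Block primitives run inside LuisaCompute kernels rather than from the host. The sketch below is hypothetical and modeled on CUB's block-level interface: the `BlockReduce` template signature and in-kernel calling convention are assumptions, so consult `src/lcpp/block/` for LCPP's real API. The LuisaCompute DSL calls (`set_block_size`, `dispatch_x`, `thread_x`, `block_x`, `$if`) are standard.

```cpp
// Hypothetical sketch (CUB-style interface assumed, not LCPP's verified
// API): each 256-thread block reduces one value per thread, and thread 0
// writes the block's partial sum.
Kernel1D block_sum_kernel = [&](BufferVar<int> data, BufferVar<int> block_sums) {
    set_block_size(256u);
    BlockReduce<int, 256> block_reduce;      // assumed template parameters
    Var<int> value = data.read(dispatch_x());
    Var<int> sum = block_reduce.Sum(value);  // result valid in thread 0, as in CUB
    $if (thread_x() == 0u) {
        block_sums.write(block_x(), sum);
    };
};
```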
- DeviceReduce - Device-wide reduction (Sum, Min, Max, custom operators)
- DeviceScan - Device-wide inclusive/exclusive scan with decoupled look-back
- DeviceRadixSort - Radix sort with OneSweep algorithm (SortKeys, SortPairs; see the sketch after this list)
- DeviceSegmentReduce - Segmented reduction operations
- DeviceHistogram - Histogram computation
- DeviceFor - Parallel for-loop utilities
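Sorting follows the same host-side dispatch pattern as the quick-start example. A minimal sketch, assuming `DeviceRadixSort` exposes a CUB-style `SortKeys` method with the argument order used by `DeviceReduce::Sum` above (the actual signatures live in `src/lcpp/device/`):

```cpp
// Sketch only: SortKeys name and argument order are assumptions based on
// CUB and the DeviceReduce quick-start example.
DeviceRadixSort<> radix_sort;
radix_sort.create(device);

constexpr uint32_t n = 1u << 20u;
Buffer<uint32_t> keys_in = device.create_buffer<uint32_t>(n);
Buffer<uint32_t> keys_out = device.create_buffer<uint32_t>(n);

CommandList sort_cmdlist;
radix_sort.SortKeys(sort_cmdlist, stream, keys_in, keys_out, n);
stream << sort_cmdlist.commit() << synchronize();
```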
This section tracks missing features compared to the CUDA CUB library. Contributions are welcome!
- `DeviceSelect::Flagged` - Select items based on selection flags
- `DeviceSelect::If` - Select items based on predicate
- `DeviceSelect::Unique` - Select unique items from input sequence
- `DeviceSelect::UniqueByKey` - Select unique keys from key-value pairs
- `DevicePartition::Flagged` - Partition based on selection flags
- `DevicePartition::If` - Partition based on predicate
- `DeviceReduce::ReduceByKey` - Reduce segments defined by keys
- `DeviceReduce::ArgMin` - Find minimum value and its index
- `DeviceReduce::ArgMax` - Find maximum value and its index
- `DeviceScan::InclusiveScanByKey` - Segmented inclusive scan with keys
- `DeviceScan::ExclusiveScanByKey` - Segmented exclusive scan with keys
- Complete key-based scan implementation (partial implementation exists)
- `DeviceMergeSort::SortKeys` - Stable merge sort for keys
- `DeviceMergeSort::SortPairs` - Stable merge sort for key-value pairs
- `DeviceMergeSort::StableSortKeys` - Guaranteed stable sort
- `DeviceMergeSort::StableSortPairs` - Guaranteed stable sort for pairs
- `DeviceSegmentedRadixSort` - Radix sort within segments
- `DeviceSegmentedSort` - General sorting within segments
- `BlockMergeSort::Sort` - Block-level stable merge sort
- `BlockMergeSort::SortBlockedToStriped` - Sort with layout transformation
- `DeviceRunLengthEncode::Encode` - Run-length encoding
- `DeviceRunLengthEncode::NonTrivialRuns` - Encode runs with length > 1
- `DeviceAdjacentDifference::SubtractLeft` - In-place left subtraction
- `DeviceAdjacentDifference::SubtractLeftCopy` - Copy with left subtraction
- `DeviceAdjacentDifference::SubtractRight` - In-place right subtraction
- `DeviceAdjacentDifference::SubtractRightCopy` - Copy with right subtraction
- `DeviceSpmv::CsrMV` - Sparse matrix-vector multiplication (CSR format)
- `DeviceCopy::Batched` - Batched memory copy operations
- `BlockRadixSort::Sort` - Complete radix sort (not just ranking)
- `BlockRadixSort::SortDescending` - Descending radix sort
- `BlockRadixSort::SortBlockedToStriped` - Sort with layout change
- `BlockHistogram::Composite` - Block-level histogram computation
- `BlockHistogram::Init` - Initialize histogram bins
- `BlockAdjacentDifference::SubtractLeft` - Block-level left subtraction
- `BlockAdjacentDifference::SubtractRight` - Block-level right subtraction
- `BlockAdjacentDifference::FlagHeads` - Flag heads of segments
- `BlockAdjacentDifference::FlagTails` - Flag tails of segments
- `BlockRunLengthDecode::RunLengthDecode` - Decode run-length encoded data
- `BlockShuffle::Up` - Shuffle data up within block
- `BlockShuffle::Down` - Shuffle data down within block
- `BlockShuffle::Offset` - Shuffle data by offset
- `WarpMergeSort::Sort` - Warp-level stable merge sort
- `WarpLoad` - Optimized warp-level data loading
- `WarpStore` - Optimized warp-level data storing
- `WarpScan::Broadcast` - Broadcast value across warp
- `WarpReduce::HeadSegmentedReduce` - Segmented reduce with head flags
- `WarpReduce::TailSegmentedReduce` - Segmented reduce with tail flags
- `ThreadLoad` - Thread-level load with cache modifiers (CG, CA, CS, CV)
- `ThreadStore` - Thread-level store with cache modifiers (CG, CS, WT, WB)
- Add SHARED_MEMORY implementations for Warp operations (currently only WARP_SHUFFLE)
- Add TRANSPOSE and WARP_TRANSPOSE modes for BlockLoad
- Add VECTORIZE mode for BlockLoad/BlockStore
- Optimize policies for different GPU architectures
- Add support for 64-bit indexing for large data
- LuisaModule Base Class - All parallel operations inherit from `LuisaModule`, which provides unified type definitions and the shader compilation interface
- Lazy Shader Compilation - Shaders are compiled on first use via the `lazy_compile` macro
- Policy-Based Configuration - Algorithm strategies are configured through Policy template parameters (see the sketch after this list)
- Multiple Algorithm Support - Block operations support both `SHARED_MEMORY` and `WARP_SHUFFLE` algorithms
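To illustrate the policy pattern, here is a hypothetical sketch; the constant names and the assumption that the defaulted template parameter of `DeviceReduce<>` is such a policy are illustrative, not LCPP's actual identifiers (see `src/lcpp/agent/policy.h` for the real ones).

```cpp
// Hypothetical policy sketch: names and structure are assumptions,
// not LCPP's real tuning policies (see src/lcpp/agent/policy.h).
struct MyReducePolicy {
    static constexpr uint32_t BLOCK_THREADS    = 256u;  // threads per block
    static constexpr uint32_t ITEMS_PER_THREAD = 4u;    // grain size per thread
};

// A policy-parameterized primitive reads its kernel configuration from
// these compile-time constants:
DeviceReduce<MyReducePolicy> tuned_reduce;
tuned_reduce.create(device);
```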
```
src/lcpp/
├── device/          # Device-level operations
│   ├── details/     # Implementation details
│   └── device_*.h   # Public interfaces
├── agent/           # Algorithm policies and tuning
├── block/           # Block-level operations (256 threads)
│   └── detail/      # Implementation details
├── warp/            # Warp-level operations (32 threads)
│   └── details/     # Implementation details
├── thread/          # Thread-level operations
├── common/          # Common utilities and type traits
└── runtime/         # LuisaModule base and core definitions
```
- `src/lcpp/parallel_primitive.h` - Main header including all primitives
- `src/lcpp/runtime/core.h` - `LuisaModule` base class and `lazy_compile` macro
- `src/lcpp/agent/policy.h` - Tuning policies and parameters
- Standard: LLVM style with modifications (see `.clang-format`)
- Line Width: 100 characters
- Indentation: 4 spaces (no tabs)
- Language: C++20
- LuisaCompute - GPU compute framework (automatically fetched via CMake FetchContent)
- boost_ut - Unit testing framework
- cpptrace - Stack tracing utility
For detailed API documentation, see the header files in `src/lcpp/`. Each primitive class is documented with:
- Template parameters
- Member functions
- Algorithm options
- Usage examples
Contributions are welcome! Please feel free to:
- Pick an item from the TODO list
- Implement the feature following existing code patterns
- Add tests in the `tests/` directory
- Submit a pull request
- Follow the existing class hierarchy (inherit from `LuisaModule`)
- Use `lazy_compile` for shader compilation
- Support multiple algorithm variants when applicable
- Add comprehensive tests
- Document public APIs
[Specify your license here]
- Built on LuisaCompute by LuisaGroup
- Inspired by NVIDIA's CUB library