This document provides guidelines for building, testing, and contributing to the CCCL repository. It is primarily written for agentic AIs, but the information is also useful for CCCL developers.
CCCL is a collection of CUDA C++ libraries and Python packages:
- libcudacxx — CUDA C++ Standard Library
- CUB — Block-level primitives
- Thrust — High-level parallel algorithms
- cudax — Experimental features
- C Parallel Library — C bindings for CCCL algorithms
- Python CCCL packages (`cuda-cccl`) — Python bindings for parallel and cooperative primitives
The repository uses CMake with the Ninja generator and provides standardized presets for consistent builds.
For a given task, you should:
- Research. Search the web, read existing code, look up system/dependency headers / implementations of related functionality. Figure out best practices and common pitfalls. Look for existing tests of the functionality; if none exist, plan a new test that integrates with the relevant existing testing frameworks.
- Plan. Create a high-level plan to implement the requested feature.
- Review and Refine plan. Look for pitfalls and find ways to smooth out rough edges. Verify any assumptions, edge cases, or identified pitfalls. Repeat until the plan is solid.
- Gather consistency context. Look at similar code (sibling classes if possible, otherwise just related source files) to learn the style and patterns used in the project. Consistency is important -- similar features should be organized and implemented similarly. Naming conventions should be followed.
- If requested: Present the plan. Only do this if the user asks for a plan to do something -- if they just ask you to implement something without requesting a plan, skip this step.
- Draft. Implement the requested task to the best of your ability.
- Review and Refine. Read through your changes. Verify that API calls are correct. Assess clarity, performance, and readability. Iterate as needed.
- Style check. Ensure that your changes follow style and naming conventions.
- Build and test. Once you're confident that your changes are functionally and stylistically correct, start build, test, and iterate cycles. If you don't have permission to run these, ask the user to run specific build/test commands for you.
Codex cloud instances cannot:
- Run Docker containers with devcontainer scripts
- Access GPUs or run GPU-dependent tests
All CCCL subprojects are computationally expensive to build and test. Use the provided helper scripts to minimize work and target only what you need.
Presets are defined in `CMakePresets.json`. Names follow a `<project>` or `<project>-cpp<std>` format, such as `cub-cpp20`, `thrust-cpp17`, or `libcudacxx`. Use `cmake --list-presets` to view available options. Build trees are placed under `build/${CCCL_BUILD_INFIX}/${PRESET}`.
`.devcontainer/launch.sh` launches a container configured with a CUDA Toolkit and host compiler. First startup may take time, but cached environments are faster. In agent environments, container launches may not be supported. To check whether you are already inside a container, verify that `CCCL_BUILD_INFIX` is set.
Common options:
- `-d`, `--docker` — Run without VSCode (required for agents)
- `--cuda <version>` — Select CUDA Toolkit (optional)
- `--cuda-ext` — Use a docker image with extended CTK libraries
- `--host <compiler>` — Select host compiler (optional)
- `--gpus <request>` — GPU devices to add to the container (use `all` to pass all GPUs)
- `-e`/`--env`, `-v`/`--volume` — Environment variables / volume mounts
- `-- <script>` — Run script inside container after setup
Example:
```shell
.devcontainer/launch.sh -d --cuda 13.1 --host gcc14 -- <script> [args...]
```

`ci/util/build_and_test_targets.sh` configures, builds, and tests selected Ninja, CTest, or lit targets. Many tests require GPUs. Options that generally work without GPUs include `--preset`, `--cmake-options`, `--configure-override`, `--build-targets`, `--lit-precompile-tests`, and `--custom-test-cmd`.
Key options:
- `--preset <name>` — Use a CMake preset
- `--cmake-options <str>` — Extra CMake arguments
- `--configure-override <cmd>` — Custom configuration command
- `--build-targets "<targets>"` — Space-separated Ninja targets
- `--ctest-targets "<regex>"` — Regex for CTest targets (may fail without GPUs)
- `--lit-precompile-tests "<paths>"` — Precompile specified libcudacxx lit tests (paths are relative to `libcudacxx/test/libcudacxx/`)
- `--lit-tests "<paths>"` — Run specified libcudacxx lit tests (also relative to `libcudacxx/test/libcudacxx/`)
- `--custom-test-cmd "<cmd>"` — Run arbitrary command after tests
Wraps git bisect with the build/test helper. Useful for identifying regression commits. Can take a very long time—minimize scope by restricting build/test targets.
Extra options:
- `--good-ref <rev>` — Known good commit/tag, or `-Nd` for origin/main N days ago (default: latest release)
- `--bad-ref <rev>` — Known bad commit/tag, or `-Nd` (default: origin/main)
See docs/cccl/development/build_and_bisect_tools.rst for details.
Always prefer targeted builds and tests, as full builds are time-consuming. If required tools or hardware are unavailable, note this in the PR but run as many relevant tests as possible.
- CUB (`cub/`):

  ```shell
  ci/util/build_and_test_targets.sh \
    --preset cub-cpp20 \
    --build-targets "cub.cpp20.test.iterator" \
    --ctest-targets "cub.cpp20.test.iterator"
  ```

- Thrust (`thrust/`):

  ```shell
  ci/util/build_and_test_targets.sh \
    --preset thrust-cpp20 \
    --build-targets "thrust.cpp20.test.reduce" \
    --ctest-targets "thrust.cpp20.test.reduce"
  ```

- libcudacxx (`libcudacxx/`): Avoid the expensive `libcudacxx.cpp20.precompile.lit`. Instead, precompile and run a small set of lit tests:

  ```shell
  ci/util/build_and_test_targets.sh \
    --preset libcudacxx \
    --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
    --lit-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
  ```

- CUDA Experimental (`cudax/`):

  ```shell
  ci/util/build_and_test_targets.sh \
    --preset cudax \
    --build-targets "cudax.cpp20.test.async_buffer" \
    --ctest-targets "cudax.cpp20.test.async_buffer"
  ```

- C Parallel API (`c/parallel/`):

  ```shell
  ci/util/build_and_test_targets.sh \
    --preset cccl-c-parallel \
    --build-targets "cccl.c.test.reduce" \
    --ctest-targets "cccl.c.test.reduce"
  ```
⚠️ Important: Full builds are costly. Always allow 60+ minutes for builds and 30+ minutes for tests. Do not cancel once started.
Use scripts like:
```shell
./ci/build_cub.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
./ci/build_thrust.sh [-cxx clang++] [-std 17] [-arch "75;80;90;120"]
./ci/build_libcudacxx.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
./ci/build_cudax.sh [-cxx g++] [-std 20] [-arch "75;80;90;120"]
./ci/build_cccl_c_parallel.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
./ci/build_cuda_cccl_python.sh -py-version 3.10
```

Supported `-arch` values:

- `<XX>` — Generate PTX and SASS
- `<XX-real>` — Generate only SASS
- `<XX-virtual>` — Generate only PTX
- `native` — Detect host GPU
- `all-major-cccl` — Default for PR builds
⚠️ Requires an NVIDIA GPU. Tests take 15+ minutes. Use targeted testing whenever possible.
Examples:
```shell
./ci/test_cub.sh -cxx g++ -std 17 -arch "75;80;90;120"
./ci/test_thrust.sh -cxx g++ -std 17 -arch "75;80;90;120"
./ci/test_libcudacxx.sh -cxx g++ -std 17 -arch "75;80;90;120"
./ci/test_cudax.sh -cxx g++ -std 20 -arch "75;80;90;120"
ctest --preset=cub-cpp17
```

Options:

- `-compute-sanitizer-memcheck` — Run with memory checking or other compute-sanitizer tools (not all projects support this)
Python components require different parameters than C++ builds. Use `-py-version` instead of compiler flags.
Supported versions: 3.10, 3.11, 3.12, 3.13
- cuda.compute — Device-level algorithms, iterators, custom GPU types
- cuda.coop — Block/warp-level primitives
- cuda.cccl.headers — Programmatic access to headers
From PyPI:
```shell
pip install cuda-cccl[cu13]  # or [cu12] for CTK 12.X
```

From conda-forge:

```shell
conda install -c conda-forge cccl-python
```

From source:

```shell
git clone https://github.com/NVIDIA/cccl.git
cd cccl/python/cuda_cccl
pip install -e .[test-cu13]  # or [test-cu12] for CTK 12.X
```

Requirements:
- Python 3.10+
- CUDA Toolkit 12.x or 13.x
- NVIDIA GPU (CC 6.0+)
- Base dependencies: `numba>=0.60.0`, `numpy`, `cuda-pathfinder>=1.2.3`, `cuda-core`, `typing_extensions`
- CUDA extras: `cuda-bindings` + `cuda-toolkit` + `numba-cuda` via `cuda-cccl[cu12]` or `cuda-cccl[cu13]`
```python
import cuda.compute

result = cuda.compute.reduce_into(input_array, output_scalar, init_val, binary_op)
```

```python
from cuda import coop

@cuda.jit
def kernel(data):
    coop.block.reduce(data, binary_op)
```

```python
import cuda.cccl.headers as headers

include_paths = headers.get_include_paths()
```

Build and test:

```shell
./ci/build_cuda_cccl_python.sh -py-version 3.10
./ci/test_cuda_compute_python.sh -py-version 3.10
./ci/test_cuda_coop_python.sh -py-version 3.10
./ci/test_cuda_cccl_headers_python.sh -py-version 3.10
./ci/test_cuda_cccl_examples_python.sh -py-version 3.10
```

Test organization:

- `tests/compute` — Algorithms and iterators
- `tests/coop` — Cooperative primitives
- `tests/headers` — Header integration
- `test_examples.py` — Runs compute/coop examples
Use this test when asked to check for SASS changes between commits, branches, or a local changeset.
Detect relevant changes in generated CUDA machine code (i.e. SASS) while filtering noise from addresses, symbols, metadata, etc. Any non-trivial change must be detected.
Required inputs:
- Compiled binary under test
- The CUDA SM architectures to compile for. Try to detect this from the code and offer the user a list of suggestions. The user must confirm or provide this list.
- Baseline disassembly (from the previous commit/branch, or the current commit without the changes in the working copy).
- Comparison disassembly (from the current commit/branch, or the current commit with the changes in the working copy).
- By default, prefer `cuobjdump -sass` to inspect SASS changes. Use `cuobjdump -ptx` if the request is to check for PTX changes instead.
Apply these transforms to both baseline and candidate listings before diffing. Write the normalized listings to separate files.
- Remove addresses/offsets/hex location prefixes.
- Remove build IDs, timestamps, absolute paths, temp directories, and compiler banners.
- Normalize whitespace and alignment to single spaces.
- Remove empty lines and purely comment lines.
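The transforms above can be sketched as a small normalization pass. This is a hypothetical helper, not something shipped in the repository, and the regexes are illustrative assumptions that would need tuning against real `cuobjdump` output:

```python
import re

def normalize_sass(listing: str) -> str:
    """Normalize a disassembly listing so that only meaningful
    instruction-level differences survive a textual diff.

    Only a subset of the documented transforms is shown (address
    prefixes, comments, whitespace, empty lines); build IDs,
    timestamps, and absolute paths would need additional patterns.
    """
    lines = []
    for line in listing.splitlines():
        # Remove hex address/offset prefixes such as "/*0040*/" or "0x0010:".
        line = re.sub(r"/\*[0-9a-fA-F]+\*/", "", line)
        line = re.sub(r"^\s*0x[0-9a-fA-F]+:?", "", line)
        # Strip trailing comments (banners, metadata).
        line = re.sub(r"//.*$", "", line)
        # Normalize whitespace and alignment to single spaces.
        line = re.sub(r"\s+", " ", line).strip()
        # Drop lines that are now empty (including comment-only lines).
        if line:
            lines.append(line)
    return "\n".join(lines) + "\n"
```

Write the normalized baseline and comparison listings to separate files and diff those, rather than diffing the raw dumps.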
Ignore as trivial:
- Register renaming with identical instruction sequence and operands.
- Pure label renumbering or reordering of identical basic blocks.
- Formatting-only differences or reordered symbol tables.
Report:
- If any non-trivial changes were detected, the top 5 regions where they occurred, including the name of the kernel each appeared in.
- A short summary of the diff type (opcode change, memory access size change, size delta, control-flow, etc.).
- Explicitly state if only noise was detected after normalization.
- If you are unsure whether the differences are impactful, show them and ask the user for guidance.
- Keep the disassembly dumps available for reference and show the command to the user to generate a diff.
See ci-overview.md for detailed examples and troubleshooting guidance.
CCCL's CI is built on GitHub Actions and relies on a dynamically generated job matrix plus several helper scripts.
- `ci/matrix.yaml`
  - Declares build and test jobs for the `pull_request`, `nightly`, and `weekly` workflows.
  - Pull request (PR) runs typically spawn ~250 jobs.
  - To reduce overhead, you can add an override matrix in `workflows.override`. This limits the PR CI run to a targeted subset of jobs. Overrides are recommended when:
    - Changes touch high-dependency areas (e.g. top-level CI/devcontainers, libcudacxx, thrust, CUB). See `ci/inspect_changes.py` for dependency information.
    - A smaller subset of jobs is enough to validate the change (e.g. infra changes, targeted fixes).
  - Important rules:
    - PR merges are blocked while an override matrix is active.
    - The override must be reset to empty (not removed) before merging.
    - Only add overrides when starting a new draft that qualifies; never remove one without being asked.
- `.github/actions/workflow-build/`
  - Runs `build-workflow.py`.
  - Reads `ci/matrix.yaml` and prunes jobs using `ci/inspect_changes.py`.
  - Calls `prepare-workflow-dispatch.py` to produce a formatted job matrix for dispatch.
- `.github/actions/workflow-run-job-{linux,windows}/`
  - Runs a single matrix job inside a devcontainer.
- `.github/actions/workflow-results/`
  - Aggregates artifacts and results.
  - Marks the workflow as failed if any job fails or an override matrix is present.
- `.github/workflows/ci-workflow-{pull-request,nightly,weekly}.yml`
  - Top-level GitHub Actions workflows invoking CI.
- `ci/inspect_changes.py`
  - Detects which subprojects changed between commits.
  - Defines internal dependencies between CCCL projects. If a project is marked dirty, all dependent projects are also marked dirty and tested.
  - Allows `build-workflow.py` to skip unaffected jobs.
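The dirty-propagation behavior described above can be sketched as a transitive closure over a reverse-dependency map. The map below is an illustrative assumption (reflecting that Thrust builds on CUB and libcudacxx, and cudax on all three); the authoritative dependency information lives in `ci/inspect_changes.py`:

```python
# Hypothetical reverse-dependency map: project -> projects that depend on it.
DEPENDENTS = {
    "libcudacxx": ["cub", "thrust", "cudax"],
    "cub": ["thrust", "cudax"],
    "thrust": ["cudax"],
}

def mark_dirty(changed: set[str]) -> set[str]:
    """Propagate dirtiness to all transitive dependents of the changed projects."""
    dirty = set(changed)
    stack = list(changed)
    while stack:
        project = stack.pop()
        for dependent in DEPENDENTS.get(project, []):
            if dependent not in dirty:
                dirty.add(dependent)
                stack.append(dependent)
    return dirty
```

For example, a change to `cub/` would mark CUB, Thrust, and cudax dirty, while a change to `cudax/` affects only cudax.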
Tags appended to the commit summary (case-sensitive) control CI behavior:
- `[skip-matrix]`: Skip CCCL project build/test jobs. (Docs, devcontainers, and third-party builds still run.)
- `[skip-vdc]`: Skip "Verify Devcontainer" jobs. Safe unless CI or devcontainer infra is modified.
- `[skip-docs]`: Skip doc tests/previews. Safe if docs are unaffected.
- `[skip-third-party-testing]` / `[skip-tpt]`: Skip third-party smoke tests (MatX, PyTorch, RAPIDS).
- `[skip-matx]`: Skip building the MatX third-party smoke test.
- `[skip-pytorch]`: Skip building the PyTorch third-party smoke test.
- `[skip-rapids]`: Skip building the RAPIDS third-party smoke test.
⚠️ All of these tags block merging until removed and a full CI run (with no overrides) succeeds.
Use these tags for early iterations to save resources. Remove them before review/merge.
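Since the tags are case-sensitive substrings of the commit summary, the check amounts to a simple match. This is a hypothetical illustration, not a repository utility; only the tag strings themselves come from the documentation above:

```python
# Hypothetical helper; the tag strings are the ones documented for CCCL's CI.
SKIP_TAGS = [
    "[skip-matrix]",
    "[skip-vdc]",
    "[skip-docs]",
    "[skip-third-party-testing]",
    "[skip-tpt]",
    "[skip-matx]",
    "[skip-pytorch]",
    "[skip-rapids]",
]

def active_skip_tags(commit_summary: str) -> list[str]:
    """Return the CI skip tags present in a commit summary (case-sensitive)."""
    return [tag for tag in SKIP_TAGS if tag in commit_summary]
```

A non-empty result means merging is blocked until the tags are removed and a clean, override-free CI run succeeds.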
⚠️ Always run before committing. CI will fail otherwise.
```shell
pip install pre-commit
pre-commit install
pre-commit run --all-files
pre-commit run --files <file1> <file2>
```

- Validate changes with builds/tests; report results.
- Run `pre-commit` before committing.
- Review `CONTRIBUTING.md` and `ci-overview.md` before starting work.
- Use development containers with `sccache` (CCCL team only).
- Limit architectures to reduce compile time (e.g. `-arch "native"` or `"80"` if no GPU).
- Build with Ninja for fast, parallel builds.
```text
cccl/
├── .github/             # Workflows
├── .devcontainer/       # Dev containers
├── libcudacxx/          # CUDA C++ Standard Library
├── cub/                 # CUB primitives
├── thrust/              # Thrust algorithms
├── cudax/               # Experimental features
├── c/                   # C Parallel library
├── python/cuda_cccl/    # Python bindings
├── ci/                  # Build/test scripts
├── examples/            # Usage examples
└── CMakePresets.json    # Preset configurations
```
Python package layout:
```text
python/cuda_cccl/
├── cuda/
│   ├── compute/
│   ├── coop/
│   └── cccl/
│       ├── parallel/
│       ├── cooperative/
│       └── headers/
├── tests/
├── benchmarks/
└── pyproject.toml
```