[None][chore] Convert cubins in repository to compressed archives#13542

Open

tongyuantongyu wants to merge 3 commits into NVIDIA:main from tongyuantongyu:ytong/cubin-clean

Conversation

@tongyuantongyu (Member) commented Apr 28, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented tarball-based cubin binary archiving (.cubin.tar.zst) with direct embedded linking
  • Chores

    • Updated kernel build pipelines to use new cubin archive distribution format
    • Migrated cubin embedding from C++ hex-arrays to direct binary embedding with namespace-scoped symbol linking

Description

Convert cubin sources from C arrays to compressed raw binaries. This drastically reduces their LFS-stored size (3.7 GB -> 155 MB).

Also unified the meaning of the EXCLUDE_SM macros across the whole codebase.

Test Coverage

No functional change. Existing tests verify nothing is broken.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@tongyuantongyu tongyuantongyu requested a review from a team as a code owner April 28, 2026 03:12
@tongyuantongyu tongyuantongyu self-assigned this Apr 28, 2026
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Unify the meaning of these macros to avoid conflict

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
@tongyuantongyu (Member, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #45840 [ run ] triggered by Bot. Commit: fd46e41

@coderabbitai (Bot, Contributor) commented Apr 28, 2026

📝 Walkthrough

This PR introduces a tarball-based cubin distribution pipeline for embedding GPU kernel binaries into TensorRT-LLM. It replaces the legacy git-lfs pointer + xxd hex-array approach with per-cubin .tar.zst archives, new ABI namespace configuration, INCBIN-based C++ embedding infrastructure, and CMake build-time extraction/linking logic.
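
To ground the embedding side, here is a minimal, self-contained sketch of the .incbin technique the new cubinIncbin.h header is described as using. This is not the actual TLLM_INCBIN implementation: the symbol and file names are illustrative, and the real macros additionally apply Itanium ABI mangling and namespace scoping (the build passes -Wa,-I so the assembler can locate the extracted cubin).

// Illustrative .incbin embedding sketch (hypothetical names, GNU asm in C++).
// The assembler copies the raw cubin bytes into .rodata and defines
// begin/end symbols that C++ code can reference directly.
asm(R"(
    .section .rodata
    .global demo_cubin
    .global demo_cubin_end
demo_cubin:
    .incbin "demo.cubin"
demo_cubin_end:
    .previous
)");

// C++ sees the embedded bytes through matching extern declarations.
extern "C" unsigned char const demo_cubin[];
extern "C" unsigned char const demo_cubin_end[];

// The byte count is the distance between the begin and end linker symbols.
inline unsigned int demo_cubin_len()
{
    return static_cast<unsigned int>(demo_cubin_end - demo_cubin);
}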

Changes

Cohort / File(s) — Summary

  • CMake Infrastructure (.cmake-format.json, cpp/cmake/modules/cuda_configuration.cmake, cpp/cmake/modules/tllm_cubin_archive.cmake): New tllm_cubin_archive CMake module with a tllm_add_cubin_archive_sources() function for extracting/embedding cubins; updated filter_source_cuda_architectures() to remove the TARGET argument, extend .cubin.tar.zst filtering, and add family-architecture exclusion handling.
  • C++ ABI & Embedding (cpp/CMakeLists.txt, cpp/include/tensorrt_llm/common/cubinIncbin.h): Added the TRTLLM_ABI_NAMESPACE CMake variable and preprocessor macro; new public header with TLLM_INCBIN() and TLLM_INCBIN_NS() macros for inline-assembly cubin embedding with Itanium ABI symbol mangling and namespace scoping.
  • Build System Updates (cpp/kernels/fmha_v2/Makefile, cpp/kernels/fmha_v2/setup.py, cpp/kernels/xqa/gen_cubins.py): Replaced xxd-based C++ byte-array generation with tarball archival (deterministic .tar.zst with fixed entry mtime); updated namespace handling and SM-based exclusion macros; changed cubin symbol types from unsigned long long* to unsigned char*.
  • CMakeLists.txt Kernel Updates (cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/CMakeLists.txt, cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/CMakeLists.txt): Removed the TARGET argument from filter_source_cuda_architectures() and integrated tllm_add_cubin_archive_sources() for embedding prebuilt cubins from cubin/ directories.
  • LFS Pointer Removals (.gitattributes, cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp, cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/*.cubin.cpp): Added .cubin.tar.zst to git-lfs tracking; deleted git-lfs pointer metadata from 150+ cubin artifact files, replacing it with tarball-based distribution.

Sequence Diagram

sequenceDiagram
    participant CMake as CMake Build System
    participant TarGlob as Tarball Glob/Filter
    participant Extract as Archive Extract (Build-Time)
    participant IncBin as C++ INCBIN Header
    participant Linker as Linker/Assembler

    CMake->>TarGlob: Glob *.cubin.tar.zst, filter by CUDA SM
    TarGlob->>Extract: Pass selected tarballs
    Extract->>Extract: Extract <stem>.cubin to build tree
    Extract->>IncBin: Prepare extracted cubin paths
    CMake->>IncBin: Generate aggregator .cpp with TLLM_INCBIN_NS() macros
    IncBin->>Linker: Emit inline asm .incbin directives with mangled linker symbols
    Linker->>Linker: Embed cubin bytes in .rodata, create _end/_len symbols
    Linker->>CMake: Link symbols into target

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

VisualGen

Suggested reviewers

  • Wanli-Jiang
  • niukuo
  • PerkzZheng
  • mzweilz
  • yuxianq
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 28.57%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check — ✅ Passed: The title clearly describes the main change: converting cubins to compressed archives, which aligns with the substantial refactoring shown across the codebase.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Description check — ✅ Passed: The PR description clearly explains the main objectives: converting cubins from C arrays to compressed raw binaries and unifying the EXCLUDE_SM macros.



Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/kernels/xqa/gen_cubins.py (1)

330-352: ⚠️ Potential issue | 🟠 Major

Propagate cubin-generation failures instead of emitting partial metadata.

When nvcc or archive_cubin() fails, this worker just prints stderr and returns. That leaves cubin_size as None or emits metadata for a tarball that was never created, which turns the real failure into a much harder-to-debug header or link error later. Re-raise here so the pool fails fast.

Suggested fix
     except subprocess.CalledProcessError as e:
-        print(e.stderr)
+        print(e.stderr, file=sys.stderr)
+        raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/kernels/xqa/gen_cubins.py` around lines 330 - 352, The code currently
swallows failures in the nvcc subprocess.run and archive_cubin() call, leaving
cubin_size None and returning partial metadata; change the exception handling in
the block around subprocess.run / archive_cubin / os.remove so errors propagate:
instead of printing e.stderr and returning, re-raise the CalledProcessError (and
any exception from archive_cubin or os.remove) after logging if needed, ensuring
the function (the logic around build_commands, construct_name, archive_cubin and
the cubin_size variable) does not return successful metadata when generation or
archiving fails; remove the bare swallow and let the exception bubble up to fail
the pool fast.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.h (1)

37-37: Use east-const for mCubin.

Please keep this declaration consistent with the repo style: unsigned char const* mCubin;.

As per coding guidelines "Use east-const style in C++: place const to the right of the type it qualifies (e.g., int const x rather than const int x)."
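
For illustration, a minimal before/after of the declaration in question (the surrounding struct name is hypothetical, not the actual type in xqa_kernel_cubin.h):

struct KernelMeta // hypothetical container, for illustration only
{
    // const unsigned char* mCubin;  // west-const, flagged by the guideline
    unsigned char const* mCubin;     // east-const, repository style
};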

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.h`
at line 37, Change the declaration of the kernel cubin pointer to east-const
style: replace the current `const unsigned char* mCubin;` with `unsigned char
const* mCubin;` so the const qualifier is to the right of the type it qualifies;
update the declaration wherever `mCubin` is defined (e.g., in the class/struct
containing `mCubin`) to match repository coding guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/cmake/modules/tllm_cubin_archive.cmake`:
- Around line 89-93: The file discovery using file(GLOB _ARCHIVES RELATIVE
"${ARCHIVE_DIR}" "${ARCHIVE_DIR}/*.cubin.tar.zst") is missing CONFIGURE_DEPENDS,
so CMake won't re-run configure when archives are added/removed; update the
file(GLOB ...) invocation that populates _ARCHIVES (and keep the existing
list(SORT _ARCHIVES)) to include CONFIGURE_DEPENDS for the pattern
"${ARCHIVE_DIR}/*.cubin.tar.zst" so archive discovery becomes reconfigure-safe
(this will ensure changes from gen_cubins.py or branch switches are picked up
automatically).
- Around line 160-167: This CMake module uses POSIX-only behaviors (e.g., touch
-r on "${_ARCHIVE_PATH}" and assembler flags like -Wa,-I) but is included
unconditionally; add an upfront platform guard that checks WIN32 at module
initialization and emits a FATAL_ERROR if on Windows so configuration fails
fast; modify tllm_cubin_archive.cmake to detect WIN32 and call
message(FATAL_ERROR ...) with a clear explanation (mentioning the POSIX-only
touch -r and assembler -Wa,-I usages and the cubinIncbin.h limitation) before
any use of variables like _ARCHIVE_PATH, _EXTRACTED, or EXTRACT_DIR, preventing
the module from being processed on Windows.

In `@cpp/include/tensorrt_llm/common/cubinIncbin.h`:
- Around line 57-59: The OS guard in cubinIncbin.h is too broad (it checks
__unix__ and thus allows non-Linux ELF-incompatible systems) so change the
preprocessor check to only allow Linux: replace the current conditional that
uses (defined(__linux__) || defined(__unix__)) with a simple check for __linux__
(i.e., use `#if` !defined(__linux__) ... `#endif`) so the ELF-specific asm block
(symbols in the file like the asm directives .section .rodata, .type, .size,
.previous) is rejected at preprocessing on non-Linux targets.
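
A minimal sketch of the guard suggested in the item above, assuming the header's asm block is ELF/Linux-only as described:

// Hedged sketch: reject non-Linux targets before any ELF-specific asm.
#if !defined(__linux__)
#error "cubinIncbin.h relies on ELF-specific inline assembly (.section .rodata, .type, .size) and is only supported on Linux."
#endif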

In `@cpp/kernels/xqa/gen_cubins.py`:
- Around line 405-415: Update the generated extern declarations in gen_cubins.py
so they match TLLM_INCBIN_NS signatures: change the data symbol to "extern
unsigned char const {cubin_variable_name}[]" and the length symbol to "extern
unsigned int const {cubin_variable_name}_len" (replace current uses that write
to cubin_data_array and cubin_length_array which currently emit non-const and
uint32_t variants); ensure you update the string templates that append to
cubin_data_array and cubin_length_array to include the const qualifiers and use
unsigned int for the length symbol.
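
For reference, the generated extern declarations would then take the following shape (the variable name is a placeholder, not an actual generated symbol):

// Target shape of the declarations emitted by gen_cubins.py, per the item above.
extern unsigned char const some_kernel_cubin[];  // data symbol
extern unsigned int const some_kernel_cubin_len; // length symbol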

In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/CMakeLists.txt`:
- Around line 91-97: The target still includes legacy cubin stubs because
SRC_CPP is populated via file(GLOB_RECURSE SRC_CPP *.cpp) and later added to
decoder_attention_src; before invoking tllm_add_cubin_archive_sources remove or
filter out any cubin/*_cubin.cpp (or matching pattern *_cubin.cpp) entries from
SRC_CPP so the legacy translation unit is not compiled alongside the
INCBIN-generated source; update the CMake logic that builds SRC_CPP to exclude
cubin/*_cubin.cpp or explicitly remove those paths from the list prior to adding
${SRC_CPP} to decoder_attention_src and before calling
tllm_add_cubin_archive_sources.


@pengbowang-nv (Collaborator) left a comment

I think we can reduce a lot of changes if we ignore xqa for now

@@ -356,7 +356,7 @@ class XQAKernelList
TKernelMeta const* mKernelMeta;
unsigned int mKernelMetaCount;
unsigned int mSM;
std::unordered_map<unsigned long long const*, CUlibrary> mCuLibs;
Collaborator:

Did you re-export all the cubins, or did you just compress them? I'm asking because the XQA pre-compiled kernels are built from an ancient version, and I doubt that exporting from the current branch is going to work. Can we keep XQA as-is for now, since it only contains ~100 kernels and will be removed later? Thanks!

@tongyuantongyu (Member, Author) replied:

In this PR, all cubins are extracted from the cubin.cpp files and compressed. I just verified the fmha_v2 and xqa scripts can generate new cubins but didn't ship the updated cubins.

Comment thread cpp/CMakeLists.txt

project(tensorrt_llm LANGUAGES CXX)

# Single source of truth for the inline ABI namespace. Read by both the C++ side
Collaborator:

This might be a collateral change brought by changing XQA; we can remove these if we just ignore the XQA/DecoderMaskedAttention kernels for now.

@tongyuantongyu (Member, Author) replied Apr 28, 2026:

Actually, this is required to determine the mangled symbol name for all cubins, not limited to XQA.
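
For context, a hedged sketch of why the ABI namespace matters for every embedded cubin: the inline namespace participates in Itanium name mangling, so the symbol emitted by the .incbin assembly must match it exactly. The names below are illustrative, not the actual TensorRT-LLM symbols.

namespace tensorrt_llm
{
inline namespace abi_v1 // placeholder for the TRTLLM_ABI_NAMESPACE value
{
// Mangles to _ZN12tensorrt_llm6abi_v110demo_cubinE; the inline asm that
// embeds the cubin must define exactly this symbol or the link fails.
extern unsigned char const demo_cubin[];
} // namespace abi_v1
} // namespace tensorrt_llm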

@tongyuantongyu (Member, Author) commented:

/bot cancel

@github-actions commented:

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@tongyuantongyu (Member, Author) commented:

/bot kill

@tensorrt-cicd (Collaborator) commented:

PR_Github #45849 [ kill ] triggered by Bot. Commit: fd46e41

@tensorrt-cicd (Collaborator) commented:

PR_Github #45840 [ run ] completed with state ABORTED. Commit: fd46e41

@tensorrt-cicd (Collaborator) commented:

PR_Github #45849 [ kill ] completed with state SUCCESS. Commit: fd46e41
Successfully killed previous jobs for commit fd46e41


Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
@tongyuantongyu (Member, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #45867 [ run ] triggered by Bot. Commit: 4deeb4c

@tensorrt-cicd (Collaborator) commented:

PR_Github #45867 [ run ] completed with state FAILURE. Commit: 4deeb4c
/LLM/main/L0_MergeRequest_PR pipeline #36044 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@tongyuantongyu (Member, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #46046 [ run ] triggered by Bot. Commit: 4deeb4c

@tensorrt-cicd (Collaborator) commented:

PR_Github #46046 [ run ] completed with state ABORTED. Commit: 4deeb4c


@tongyuantongyu (Member, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #46309 [ run ] triggered by Bot. Commit: 4deeb4c
