[None][chore] Convert cubins in repository to compressed archives #13542
tongyuantongyu wants to merge 3 commits into NVIDIA:main from
Conversation
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Unify the meaning of these macros to avoid conflict
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Force-pushed from 651e895 to fd46e41
/bot run

PR_Github #45840 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This PR introduces a tarball-based cubin distribution pipeline for embedding GPU kernel binaries into TensorRT-LLM. It replaces the legacy git-lfs pointer + xxd hex-array approach with per-cubin `.cubin.tar.zst` archives.

Changes

Sequence Diagram
sequenceDiagram
participant CMake as CMake Build System
participant TarGlob as Tarball Glob/Filter
participant Extract as Archive Extract (Build-Time)
participant IncBin as C++ INCBIN Header
participant Linker as Linker/Assembler
CMake->>TarGlob: Glob *.cubin.tar.zst, filter by CUDA SM
TarGlob->>Extract: Pass selected tarballs
Extract->>Extract: Extract <stem>.cubin to build tree
Extract->>IncBin: Prepare extracted cubin paths
CMake->>IncBin: Generate aggregator .cpp with TLLM_INCBIN_NS() macros
IncBin->>Linker: Emit inline asm .incbin directives with mangled linker symbols
Linker->>Linker: Embed cubin bytes in .rodata, create _end/_len symbols
Linker->>CMake: Link symbols into target
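The first step in the diagram (glob `*.cubin.tar.zst`, filter by CUDA SM) can be sketched in Python. The filename pattern `..._sm<NN>.cubin.tar.zst` and the `enabled_sms` set below are assumptions for illustration, not the actual CMake logic:

```python
import re

def filter_tarballs_by_sm(archives, enabled_sms):
    """Keep only cubin tarballs whose filename encodes an enabled SM.

    Assumes (hypothetically) that archive names embed the target
    architecture as '..._sm<NN>.cubin.tar.zst'.
    """
    pattern = re.compile(r"_sm(\d+)\.cubin\.tar\.zst$")
    selected = []
    for name in archives:
        m = pattern.search(name)
        # Archives without an SM tag are treated as architecture-independent.
        if m is None or int(m.group(1)) in enabled_sms:
            selected.append(name)
    return selected

archives = [
    "xqa_fp16_sm80.cubin.tar.zst",
    "xqa_fp16_sm90.cubin.tar.zst",
    "helper.cubin.tar.zst",
]
print(filter_tarballs_by_sm(archives, {90, 100}))
# → ['xqa_fp16_sm90.cubin.tar.zst', 'helper.cubin.tar.zst']
```

The surviving tarballs are then extracted at build time and handed to the INCBIN aggregator.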
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/kernels/xqa/gen_cubins.py (1)
330-352: ⚠️ Potential issue | 🟠 Major
Propagate cubin-generation failures instead of emitting partial metadata.
When `nvcc` or `archive_cubin()` fails, this worker just prints stderr and returns. That leaves `cubin_size` as `None` or emits metadata for a tarball that was never created, which turns the real failure into a much harder-to-debug header or link error later. Re-raise here so the pool fails fast.

Suggested fix:

```diff
 except subprocess.CalledProcessError as e:
-    print(e.stderr)
+    print(e.stderr, file=sys.stderr)
+    raise
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/kernels/xqa/gen_cubins.py` around lines 330 - 352, The code currently swallows failures in the nvcc subprocess.run and archive_cubin() call, leaving cubin_size None and returning partial metadata; change the exception handling in the block around subprocess.run / archive_cubin / os.remove so errors propagate: instead of printing e.stderr and returning, re-raise the CalledProcessError (and any exception from archive_cubin or os.remove) after logging if needed, ensuring the function (the logic around build_commands, construct_name, archive_cubin and the cubin_size variable) does not return successful metadata when generation or archiving fails; remove the bare swallow and let the exception bubble up to fail the pool fast.
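The suggested fix can be exercised in isolation. The sketch below mimics a worker that runs one build command via `subprocess` and re-raises on failure so a process pool fails fast; the function name and the deliberately failing command are illustrative, not the actual gen_cubins.py code:

```python
import subprocess
import sys

def build_one_cubin(cmd):
    """Run one build command; log stderr but let failures propagate."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(e.stderr, file=sys.stderr)
        raise  # fail fast: never return metadata for a cubin that wasn't built

# A deliberately failing "compiler" invocation to show propagation.
try:
    build_one_cubin([sys.executable, "-c", "import sys; sys.exit(2)"])
except subprocess.CalledProcessError as e:
    print("pool sees failure, returncode =", e.returncode)
# → pool sees failure, returncode = 2
```

With the bare `raise`, a `multiprocessing` pool surfaces the exception at the call site instead of silently collecting partial metadata.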
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.h (1)
37-37: Use east-const for `mCubin`.
Please keep this declaration consistent with the repo style: `unsigned char const* mCubin;`.
As per coding guidelines: "Use east-const style in C++: place `const` to the right of the type it qualifies (e.g., `int const x` rather than `const int x`)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.h` at line 37, Change the declaration of the kernel cubin pointer to east-const style: replace the current `const unsigned char* mCubin;` with `unsigned char const* mCubin;` so the const qualifier is to the right of the type it qualifies; update the declaration wherever `mCubin` is defined (e.g., in the class/struct containing `mCubin`) to match repository coding guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/cmake/modules/tllm_cubin_archive.cmake`:
- Around line 89-93: The file discovery using file(GLOB _ARCHIVES RELATIVE
"${ARCHIVE_DIR}" "${ARCHIVE_DIR}/*.cubin.tar.zst") is missing CONFIGURE_DEPENDS,
so CMake won't re-run configure when archives are added/removed; update the
file(GLOB ...) invocation that populates _ARCHIVES (and keep the existing
list(SORT _ARCHIVES)) to include CONFIGURE_DEPENDS for the pattern
"${ARCHIVE_DIR}/*.cubin.tar.zst" so archive discovery becomes reconfigure-safe
(this will ensure changes from gen_cubins.py or branch switches are picked up
automatically).
- Around line 160-167: This CMake module uses POSIX-only behaviors (e.g., touch
-r on "${_ARCHIVE_PATH}" and assembler flags like -Wa,-I) but is included
unconditionally; add an upfront platform guard that checks WIN32 at module
initialization and emits a FATAL_ERROR if on Windows so configuration fails
fast; modify tllm_cubin_archive.cmake to detect WIN32 and call
message(FATAL_ERROR ...) with a clear explanation (mentioning the POSIX-only
touch -r and assembler -Wa,-I usages and the cubinIncbin.h limitation) before
any use of variables like _ARCHIVE_PATH, _EXTRACTED, or EXTRACT_DIR, preventing
the module from being processed on Windows.
In `@cpp/include/tensorrt_llm/common/cubinIncbin.h`:
- Around line 57-59: The OS guard in cubinIncbin.h is too broad (it checks
__unix__ and thus allows non-Linux ELF-incompatible systems) so change the
preprocessor check to only allow Linux: replace the current conditional that
uses (defined(__linux__) || defined(__unix__)) with a simple check for __linux__
(i.e., use `#if` !defined(__linux__) ... `#endif`) so the ELF-specific asm block
(symbols in the file like the asm directives .section .rodata, .type, .size,
.previous) is rejected at preprocessing on non-Linux targets.
In `@cpp/kernels/xqa/gen_cubins.py`:
- Around line 405-415: Update the generated extern declarations in gen_cubins.py
so they match TLLM_INCBIN_NS signatures: change the data symbol to "extern
unsigned char const {cubin_variable_name}[]" and the length symbol to "extern
unsigned int const {cubin_variable_name}_len" (replace current uses that write
to cubin_data_array and cubin_length_array which currently emit non-const and
uint32_t variants); ensure you update the string templates that append to
cubin_data_array and cubin_length_array to include the const qualifiers and use
unsigned int for the length symbol.
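The const-qualified declarations this comment asks for can be emitted from a small template. The sketch below only shows the expected shapes; the variable name passed in is hypothetical, and this is not the actual gen_cubins.py template code:

```python
def emit_extern_decls(cubin_variable_name):
    """Return the const-qualified extern declarations for one embedded cubin,
    matching the data/length symbol shapes described above."""
    data_decl = f"extern unsigned char const {cubin_variable_name}[];"
    len_decl = f"extern unsigned int const {cubin_variable_name}_len;"
    return data_decl, len_decl

# Hypothetical symbol name, for illustration only.
data_decl, len_decl = emit_extern_decls("xqa_kernel_sm90_cubin")
print(data_decl)  # → extern unsigned char const xqa_kernel_sm90_cubin[];
print(len_decl)   # → extern unsigned int const xqa_kernel_sm90_cubin_len;
```

The point of the review comment is that both declarations must be byte-for-byte type-compatible with what the `TLLM_INCBIN_NS()` macro defines, or the mangled symbols will not match at link time.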
In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/CMakeLists.txt`:
- Around line 91-97: The target still includes legacy cubin stubs because
SRC_CPP is populated via file(GLOB_RECURSE SRC_CPP *.cpp) and later added to
decoder_attention_src; before invoking tllm_add_cubin_archive_sources remove or
filter out any cubin/*_cubin.cpp (or matching pattern *_cubin.cpp) entries from
SRC_CPP so the legacy translation unit is not compiled alongside the
INCBIN-generated source; update the CMake logic that builds SRC_CPP to exclude
cubin/*_cubin.cpp or explicitly remove those paths from the list prior to adding
${SRC_CPP} to decoder_attention_src and before calling
tllm_add_cubin_archive_sources.
---
Outside diff comments:
In `@cpp/kernels/xqa/gen_cubins.py`:
- Around line 330-352: The code currently swallows failures in the nvcc
subprocess.run and archive_cubin() call, leaving cubin_size None and returning
partial metadata; change the exception handling in the block around
subprocess.run / archive_cubin / os.remove so errors propagate: instead of
printing e.stderr and returning, re-raise the CalledProcessError (and any
exception from archive_cubin or os.remove) after logging if needed, ensuring the
function (the logic around build_commands, construct_name, archive_cubin and the
cubin_size variable) does not return successful metadata when generation or
archiving fails; remove the bare swallow and let the exception bubble up to fail
the pool fast.
---
Nitpick comments:
In `@cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.h`:
- Line 37: Change the declaration of the kernel cubin pointer to east-const
style: replace the current `const unsigned char* mCubin;` with `unsigned char
const* mCubin;` so the const qualifier is to the right of the type it qualifies;
update the declaration wherever `mCubin` is defined (e.g., in the class/struct
containing `mCubin`) to match repository coding guidelines.
pengbowang-nv left a comment
I think we can reduce a lot of changes if we ignore xqa for now
@@ -356,7 +356,7 @@ class XQAKernelList
    TKernelMeta const* mKernelMeta;
    unsigned int mKernelMetaCount;
    unsigned int mSM;
    std::unordered_map<unsigned long long const*, CUlibrary> mCuLibs;
Did you re-export all the cubins, or did you just compress them? I'm asking because the XQA pre-compiled kernels were built from an ancient version, and I doubt cubins exported from the current branch would work. Can we keep XQA as-is for now, since it only contains ~100 kernels and will be removed later? Thanks!
In this PR, all cubins are extracted from the cubin.cpp files and compressed. I just verified the fmha_v2 and xqa scripts can generate new cubins but didn't ship the updated cubins.
project(tensorrt_llm LANGUAGES CXX)

# Single source of truth for the inline ABI namespace. Read by both the C++ side
This might be a collateral change brought by changing XQA; we can remove these if we just ignore the XQA/DecoderMaskedAttention kernels for now.
Actually, this is required to determine the mangled symbol name for all cubins, not limited to XQA.
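Why the inline ABI namespace matters for every cubin: the linker symbol the `.incbin` aggregator emits must match the Itanium-mangled name the C++ side expects, and namespace components (inline or not) are part of that mangling. A minimal sketch of the mangling for a variable in nested, non-template namespaces (simplified; the namespace layout shown is hypothetical, not TensorRT-LLM's actual one):

```python
def mangle_variable(namespaces, name):
    """Itanium-ABI mangling for a variable in nested (non-template) namespaces.

    Simplified: each component is encoded as <length><name>, wrapped in _ZN...E.
    Inline namespaces mangle the same way as ordinary ones, which is why the
    namespace string must be known when generating the asm symbol name.
    """
    parts = "".join(f"{len(c)}{c}" for c in list(namespaces) + [name])
    return f"_ZN{parts}E"

# Hypothetical namespace layout, for illustration only.
print(mangle_variable(["tensorrt_llm", "kernels"], "xqa_cubin"))
# → _ZN12tensorrt_llm7kernels9xqa_cubinE
```

If the inline ABI namespace string used by the generator drifted from the one the headers declare, every embedded cubin symbol would fail to resolve at link time, which is why it is centralized in one place.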
/bot cancel
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with the pull request.

skip
Skip testing for the latest commit on the pull request.

reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.
/bot kill

PR_Github #45849 [ kill ] triggered by Bot. Commit:

PR_Github #45840 [ run ] completed with state

PR_Github #45849 [ kill ] completed with state
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
/bot run

PR_Github #45867 [ run ] triggered by Bot. Commit:

PR_Github #45867 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #46046 [ run ] triggered by Bot. Commit:

PR_Github #46046 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #46309 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Release Notes
New Features
Cubins are now stored as compressed archives (`.cubin.tar.zst`) with direct embedded linking
Description
Convert cubin sources from C arrays to compressed raw binaries. This drastically reduces their LFS-stored size (3.7 GB -> 155 MB).
Also unified the `EXCLUDE_SM` macro meaning across the whole codebase.

Test Coverage

No functional change. Existing tests verify nothing is broken.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.