Merge NV 4.2.1 to SYCL-TLA Main #592
* bwd GQA init
* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu
* ref kernel type conversion fix
Co-authored-by: Haicheng Wu <[email protected]>
change version number to 4.2
* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm
* Update 03_tensor.md: fix link typo, change path to relative path
* Update 03_tensor.md
Co-authored-by: Haicheng Wu <[email protected]>
* Rebase to latest
* update
* upd
* Update fmha_fusion.hpp
* Update fmha_fusion.hpp: fixed flipped logic for isQBegin
* Update fmha_fusion.hpp
* Avoid use of booleans (the current expression is confusing)
* fmt
* Update fmha_fusion.hpp. Reproduce error/fix with:
  ./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend
* add test, format
Co-authored-by: Richard Cai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
format change
update version to 4.2.1
Pull Request Overview
This PR merges NVIDIA CUTLASS version 4.2.1 into the SYCL-TLA main branch. The primary purpose is to update the version number from 4.1.0 to 4.2.1 and incorporate upstream changes from the NVIDIA CUTLASS 4.2.1 release.
Key changes include:
- Version bump from 4.1.0 to 4.2.1 across Python and C++ components
- Addition of SM100/SM103/SM120/SM121 architecture support and related utilities
- Bug fixes in existing code (matrix operations, cross product calculation)
- Renaming of "BlockScaled" terminology to "Blockwise" for FP8 kernels
- Addition of new Python test utilities for SM100 shapes and instantiation levels
Reviewed Changes
Copilot reviewed 73 out of 73 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp | Refactored alignment calculation to handle ElementA and ElementB separately |
| test/python/cutlass/interface/gemm_interface.py | Updated compute capability checks and schedule validation |
| python/setup_pycute.py | Version bump to 4.2.1 |
| python/cutlass_library/sm90_utils.py | Added blockwise schedule support |
| python/cutlass_library/sm100_utils.py | New file: SM100 kernel generation utilities |
| python/cutlass_library/sm100_shapes.py | New file: SM100 MMA and cluster shapes |
| python/cutlass_library/manifest.py | Renamed method for broader architecture support |
| python/cutlass_library/library.py | Added new kernel schedule types and SM100 support |
| python/cutlass_library/gemm_operation.py | Refactored SM103 FP4 ultra kernel schedule checks |
| python/cutlass_library/generator.py | Replaced manual instantiation with SM100 utilities |
| include/cutlass/matrix.h | Fixed unary negation operator bugs |
| include/cutlass/version.h | Version bump to 4.2.1 |
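The testbed row above mentions computing alignment for ElementA and ElementB separately. The following is only an illustrative sketch of why that matters for mixed-precision operands, not the code from the PR. It assumes alignment is expressed in elements per 128-bit access; both the 128-bit figure and the helper name `alignment_in_elements` are assumptions made for this example.

```python
ACCESS_BITS = 128  # assumed vector access width; not taken from the PR


def alignment_in_elements(element_bits, access_bits=ACCESS_BITS):
    """Return how many elements of the given bit width fit in one access.

    Hypothetical helper for illustration only.
    """
    if element_bits <= 0 or access_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the access width")
    return access_bits // element_bits


# With mixed operand types, one shared alignment value is wrong for one side:
alignment_a = alignment_in_elements(8)   # e.g. an 8-bit ElementA
alignment_b = alignment_in_elements(16)  # e.g. a 16-bit ElementB
print(alignment_a, alignment_b)
```

Under these assumptions an 8-bit A operand and a 16-bit B operand need different alignments, which is the situation a per-operand calculation handles.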
Comments suppressed due to low confidence (5)
test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp:1 - Trailing whitespace should be removed.
media/docs/cpp/profiler.md:1 - Corrected spelling of 'defination' to 'definition'.
media/docs/cpp/pipeline.md:1 - This change replaces "consumer threads" with "producer threads". The `producer_acquire` operation blocks producer threads, not consumer threads, so the fix is correct.
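The blocking behavior described in this comment can be illustrated with a toy bounded pipeline. This is a minimal sketch built on Python semaphores, not the CUTLASS pipeline implementation. The class `TinyPipeline` is hypothetical, and its method names are chosen here only to mirror the `producer_acquire` terminology in the comment.

```python
import threading


class TinyPipeline:
    """Toy bounded pipeline for illustration; not the CUTLASS API."""

    def __init__(self, stages):
        self._empty = threading.Semaphore(stages)  # free slots
        self._full = threading.Semaphore(0)        # filled slots

    def producer_acquire(self, timeout=None):
        # Blocks the *producer* until a slot is free (or the timeout expires).
        return self._empty.acquire(timeout=timeout)

    def producer_commit(self):
        # Marks one slot as filled so a consumer can proceed.
        self._full.release()

    def consumer_wait(self, timeout=None):
        # Blocks the *consumer* until a filled slot is available.
        return self._full.acquire(timeout=timeout)

    def consumer_release(self):
        # Frees the slot, which unblocks a waiting producer.
        self._empty.release()


pipe = TinyPipeline(stages=1)
first = pipe.producer_acquire(timeout=0.01)   # succeeds: one free slot
pipe.producer_commit()
second = pipe.producer_acquire(timeout=0.01)  # times out: no free slot
print(first, second)
```

In this sketch the second `producer_acquire` cannot succeed until a consumer calls `consumer_release`, which is the sense in which the operation blocks producers rather than consumers.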
media/docs/cpp/cute/02_layout_algebra.md:1 - The stride calculation was corrected from `(3*w,6*x,2*x,2*z)` to `(72*w,24*x,4*y,2*z)`. The original had an error using `2*x` instead of the correct `4*y`.
media/docs/cpp/cute/02_layout_algebra.md:1 - The composition calculation was corrected from `(3*w,3*x,y,z)` to `(9*w,3*x,y,z)`. The first stride should be `9*w`, not `3*w`.
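Both corrected stride tuples are consistent with one rule: each mode's stride is scaled by whatever part of the divisor is left after the earlier modes have absorbed their share. The sketch below checks that reading. It assumes the documented example uses shape `(3,6,2,8)` with symbolic strides `(w,x,y,z)` and divisors 72 and 9; neither the shape nor the divisors appear in the review comments, so treat them as assumptions. `divide_out` is a hypothetical helper, not a CuTe function, and it returns the integer multiplier applied to each symbolic stride.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def divide_out(shape, divisor):
    """Divide `divisor` out of `shape`, left to right.

    Returns (new_shape, stride_multipliers). Hypothetical helper used only
    to check the corrected tuples; not part of CuTe.
    """
    new_shape = []
    multipliers = []
    rest = divisor
    for extent in shape:
        multipliers.append(rest)                 # stride is scaled by what is left
        new_shape.append(ceil_div(extent, rest))
        rest = ceil_div(rest, extent)            # this mode absorbs up to `extent`
    return tuple(new_shape), tuple(multipliers)


shape = (3, 6, 2, 8)  # assumed example shape
print(divide_out(shape, 72))  # ((1, 1, 1, 4), (72, 24, 4, 2))
print(divide_out(shape, 9))   # ((1, 2, 2, 8), (9, 3, 1, 1))
```

Read against strides `(w,x,y,z)`, the multipliers `(72,24,4,2)` and `(9,3,1,1)` correspond to the corrected `(72*w,24*x,4*y,2*z)` and `(9*w,3*x,y,z)`.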
No description provided.