Merge NV 4.2.1 to SYCL-TLA Main #592

anamikac-intel · 2025-10-30T16:44:57Z

No description provided.

* Rebase to latest
* update
* upd Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
* Update fmha_fusion.hpp
* Update fmha_fusion.hpp fixed flipped logic for isQBegin
* Update fmha_fusion.hpp
* Avoid use of booleans The current expression is confusing
* fmt
* Update fmha_fusion.hpp Reproduce error/fix with: ./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend
* add test, format

---------

Co-authored-by: Richard Cai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>

Copilot

Pull Request Overview

This PR merges NVIDIA CUTLASS version 4.2.1 into the SYCL-TLA main branch. The primary purpose is to update the version number from 4.1.0 to 4.2.1 and incorporate upstream changes from the NVIDIA CUTLASS 4.2.1 release.

Key changes include:

* Version bump from 4.1.0 to 4.2.1 across Python and C++ components
* Addition of SM100/SM103/SM120/SM121 architecture support and related utilities
* Bug fixes in existing code (matrix operations, cross product calculation)
* Renaming of "BlockScaled" terminology to "Blockwise" for FP8 kernels
* Addition of new Python test utilities for SM100 shapes and instantiation levels

Reviewed Changes

Copilot reviewed 73 out of 73 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp` | Refactored alignment calculation to handle ElementA and ElementB separately |
| `test/python/cutlass/interface/gemm_interface.py` | Updated compute capability checks and schedule validation |
| `python/setup_pycute.py` | Version bump to 4.2.1 |
| `python/cutlass_library/sm90_utils.py` | Added blockwise schedule support |
| `python/cutlass_library/sm100_utils.py` | New file: SM100 kernel generation utilities |
| `python/cutlass_library/sm100_shapes.py` | New file: SM100 MMA and cluster shapes |
| `python/cutlass_library/manifest.py` | Renamed method for broader architecture support |
| `python/cutlass_library/library.py` | Added new kernel schedule types and SM100 support |
| `python/cutlass_library/gemm_operation.py` | Refactored SM103 FP4 ultra kernel schedule checks |
| `python/cutlass_library/generator.py` | Replaced manual instantiation with SM100 utilities |
| `include/cutlass/matrix.h` | Fixed unary negation operator bugs |
| `include/cutlass/version.h` | Version bump to 4.2.1 |

Comments suppressed due to low confidence (5)

test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp:1

Trailing whitespace should be removed.

media/docs/cpp/profiler.md:1

Corrected spelling of 'defination' to 'definition'.

media/docs/cpp/pipeline.md:1

This change replaces "consumer threads" with "producer threads". The producer_acquire operation blocks producer threads, not consumer threads, so this is a correct fix.

media/docs/cpp/cute/02_layout_algebra.md:1

The stride calculation was corrected from (3*w,6*x,2*x,2*z) to (72*w,24*x,4*y,2*z). The original had an error using 2*x instead of the correct 4*y.

media/docs/cpp/cute/02_layout_algebra.md:1

The composition calculation was corrected from (3*w,3*x,y,z) to (9*w,3*x,y,z). The first stride should be 9*w, not 3*w.
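
The two corrected stride tuples in `02_layout_algebra.md` can be checked by hand with a small sketch of the "dividing out" step. The shape `(3, 6, 2, 8)` below is an assumption: it is consistent with the stride factors quoted above but is not stated on this page. The helper `divide_out` is a hypothetical name for illustration, not a CuTe API. It walks the modes left to right, scales each stride by the divisor still remaining, and shrinks each mode by that divisor.

```python
def ceil_div(a, b):
    """Integer division rounded up, for positive integers."""
    return -(-a // b)


def divide_out(shape, divisor):
    """Divide `divisor` out of `shape`, mode by mode from the left.

    Returns (new_shape, stride_factors). A stride factor f in mode i means
    the original stride d_i becomes f * d_i. Hypothetical helper, not part
    of CuTe; it only handles the case where, in every mode, the remaining
    divisor and the mode size divide one another.
    """
    new_shape, factors = [], []
    rest = divisor
    for size in shape:
        if size % rest != 0 and rest % size != 0:
            raise ValueError(f"cannot divide {rest} out of mode of size {size}")
        new_shape.append(ceil_div(size, rest))  # what is left of this mode
        factors.append(rest)                    # stride is scaled by the remaining divisor
        rest = ceil_div(rest, size)             # divisor carried into the next mode
    return tuple(new_shape), tuple(factors)


shape = (3, 6, 2, 8)  # assumed shape, with symbolic strides (w, x, y, z)

print(divide_out(shape, 72))  # ((1, 1, 1, 4), (72, 24, 4, 2))
print(divide_out(shape, 9))   # ((1, 2, 2, 8), (9, 3, 1, 1))
```

Under that assumed shape, dividing out 72 gives stride factors `(72, 24, 4, 2)`, that is `(72*w,24*x,4*y,2*z)`, and dividing out 9 gives `(9, 3, 1, 1)`, that is `(9*w,3*x,y,z)`. Both agree with the corrected values in the comments above.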

andralex and others added 21 commits September 4, 2025 16:55

Fix bugs in matrix.h (NVIDIA#2598)

2288c0c

Fix Copy_Atom type mismatch in sgemm_sm80.cu (NVIDIA#2582)

b6ccf34

Fix comment in mma_atom.hpp (NVIDIA#2579)

d98e7bf

Fix incorrect shapes in copy_atom doc comments. (NVIDIA#2575)

76c96b0

ex77 backwards GQA (NVIDIA#2556)

56f0718

* bwd GQA init
* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu
* ref kernel type conversion fix

---------

Co-authored-by: Haicheng Wu <[email protected]>

v4.2 tag release. (NVIDIA#2638)

6a35b4d

Update version.h

e7e0add

change version number to 4.2

doc change for 4.2 (NVIDIA#2639)

57e3cfb

* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm

Remove old-version dsl examples (NVIDIA#2645)

a49f806

Fix doc cute 03_tensor.md link typo (NVIDIA#2627)

df3923b

* Update 03_tensor.md fix link typo change path to relative path
* Update 03_tensor.md

---------

Co-authored-by: Haicheng Wu <[email protected]>

Fix: a calculation error in the example of dividing out in the 02_lay…

ebf5e5e

…out_algebra doc (NVIDIA#2635)

Fxied a typo in pipeline descript docs. (NVIDIA#2623)

6b73aed

add support matrix

59b61c6

v4.2.1 update. (NVIDIA#2667)

ee914c3

4.2.1 update

4260d4a

Rename python/cutlass to python/cutlass_cppgen (NVIDIA#2652)

177a82e

Update CHANGELOG.md

a8749e6

format change

Update pyproject.toml

f3fde58

update version to 4.2.1

Merge NV 4.2.1 to SYCL-TLA

7e4d678

Added missed change

534d48e

rolandschulz mentioned this pull request Oct 30, 2025

Merge NV main to SYCL-TLA main #588

Closed

anamikac-intel marked this pull request as ready for review October 30, 2025 18:18

kausikmaiti requested review from Antonyvance, amitchawla1, rolandschulz and tdeng5 October 31, 2025 05:06

tdeng5 approved these changes Oct 31, 2025

anamikac-intel and others added 2 commits October 31, 2025 14:28

Merge branch 'main' into anamikac/merge_nv_v4.2.1

5d9cdcc

Replace data_cutlass_cppgen with data_cutlass

f1a3514

anamikac-intel added 3 commits November 2, 2025 13:17

Merge branch 'main' into anamikac/merge_nv_v4.2.1

3475e40

Update library_defaults.py

6774ae7

Update gemm_testbed.py

33eaa42

Antonyvance approved these changes Nov 3, 2025

Merge branch 'main' into anamikac/merge_nv_v4.2.1

428bdc0

Antonyvance requested a review from Copilot November 5, 2025 06:59

Copilot AI reviewed Nov 5, 2025

14 participants
