Conversation

@anamikac-intel
No description provided.

andralex and others added 21 commits September 4, 2025 16:55
* bwd GQA init

* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu

* ref kernel type conversion fix

---------

Co-authored-by: Haicheng Wu <[email protected]>
change version number to 4.2
* doc change

* fix broken links

* ragged gemm doc update

* move around texts about moe gemm
* Update 03_tensor.md fix link typo

change path to relative path

* Update 03_tensor.md

---------

Co-authored-by: Haicheng Wu <[email protected]>
* Rebase to latest

* update

* upd

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Update fmha_fusion.hpp

* Update fmha_fusion.hpp

fixed flipped logic for isQBegin

* Update fmha_fusion.hpp

* Avoid use of booleans

The current expression is confusing

* fmt

* Update fmha_fusion.hpp

Reproduce error/fix with: 
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend

* add test, format

---------

Co-authored-by: Richard Cai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
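
The `isQBegin` fix and the reproduction command above concern how a causal mask is aligned when the query and key lengths differ. The sketch below is a minimal illustration and not the code in fmha_fusion.hpp. It assumes that "qbegin" aligns the first query with the first key and that "qend" aligns the last query with the last key; the function name `causal_mask` and its parameters are hypothetical.

```python
def causal_mask(q_len, k_len, q_begin=True):
    """Return a q_len x k_len grid; True means the query may attend to the key.

    Assumed semantics (not confirmed by this conversation):
      q_begin=True  -> query 0 lines up with key 0 (top-left aligned)
      q_begin=False -> last query lines up with last key (bottom-right aligned)
    """
    # With q-end alignment every query row is shifted right by the length gap.
    offset = 0 if q_begin else k_len - q_len
    return [[k <= q + offset for k in range(k_len)] for q in range(q_len)]


# Sizes taken from the reproduction command: --q=1013 --k=1024
q_len, k_len = 1013, 1024
visible_qbegin = sum(causal_mask(q_len, k_len, q_begin=True)[0])
visible_qend = sum(causal_mask(q_len, k_len, q_begin=False)[0])
print(visible_qbegin, visible_qend)  # 1 12
```

Under these assumptions the two alignments differ by 11 key positions per row for this problem size, so swapping them changes which scores are masked in every row.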
format change
update version to 4.2.1
@anamikac-intel anamikac-intel marked this pull request as ready for review October 30, 2025 18:18
@Antonyvance Antonyvance requested a review from Copilot November 5, 2025 06:59
Copilot AI left a comment

Pull Request Overview

This PR merges NVIDIA CUTLASS version 4.2.1 into the SYCL-TLA main branch. The primary purpose is to update the version number from 4.1.0 to 4.2.1 and incorporate upstream changes from the NVIDIA CUTLASS 4.2.1 release.

Key changes include:

  • Version bump from 4.1.0 to 4.2.1 across Python and C++ components
  • Addition of SM100/SM103/SM120/SM121 architecture support and related utilities
  • Bug fixes in existing code (matrix operations, cross product calculation)
  • Renaming of "BlockScaled" terminology to "Blockwise" for FP8 kernels
  • Addition of new Python test utilities for SM100 shapes and instantiation levels

Reviewed Changes

Copilot reviewed 73 out of 73 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp | Refactored alignment calculation to handle ElementA and ElementB separately |
| test/python/cutlass/interface/gemm_interface.py | Updated compute capability checks and schedule validation |
| python/setup_pycute.py | Version bump to 4.2.1 |
| python/cutlass_library/sm90_utils.py | Added blockwise schedule support |
| python/cutlass_library/sm100_utils.py | New file: SM100 kernel generation utilities |
| python/cutlass_library/sm100_shapes.py | New file: SM100 MMA and cluster shapes |
| python/cutlass_library/manifest.py | Renamed method for broader architecture support |
| python/cutlass_library/library.py | Added new kernel schedule types and SM100 support |
| python/cutlass_library/gemm_operation.py | Refactored SM103 FP4 ultra kernel schedule checks |
| python/cutlass_library/generator.py | Replaced manual instantiation with SM100 utilities |
| include/cutlass/matrix.h | Fixed unary negation operator bugs |
| include/cutlass/version.h | Version bump to 4.2.1 |
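
The first entry in the per-file summary mentions computing alignment separately for ElementA and ElementB. A minimal sketch of why that matters is shown here, assuming alignment is expressed as the number of elements that fit in one fixed-width memory access. The 128-bit access width and the helper name `alignment_in_elements` are assumptions for illustration, not taken from the testbed.

```python
def alignment_in_elements(element_bits, access_bits=128):
    """Number of elements covered by one access of `access_bits` bits.

    The 128-bit default is an assumption for this sketch.
    """
    if element_bits <= 0 or access_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the access width")
    return access_bits // element_bits


# Mixed-precision operands need different alignments: here A is 8-bit
# and B is 16-bit, so one shared value would be wrong for one of them.
alignment_a = alignment_in_elements(8)
alignment_b = alignment_in_elements(16)
print(alignment_a, alignment_b)  # 16 8
```

When A and B have the same element width the two values coincide, which is why a single shared alignment can go unnoticed until operand types differ.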
Comments suppressed due to low confidence (5)

test/unit/gemm/device/gemm_testbed_3x_ptr_array.hpp:1

  • Trailing whitespace should be removed.

media/docs/cpp/profiler.md:1

  • Corrected spelling of 'defination' to 'definition'.

media/docs/cpp/pipeline.md:1

  • This change corrects "consumer threads" to "producer threads". The producer_acquire operation blocks producer threads, not consumer threads, so the fix is correct.

media/docs/cpp/cute/02_layout_algebra.md:1

  • The stride calculation was corrected from (3*w,6*x,2*x,2*z) to (72*w,24*x,4*y,2*z). The original had an error using 2*x instead of the correct 4*y.

media/docs/cpp/cute/02_layout_algebra.md:1

  • The composition calculation was corrected from (3*w,3*x,y,z) to (9*w,3*x,y,z). The first stride should be 9*w, not 3*w.
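
The two corrected stride tuples can be sanity-checked with a small sketch. It assumes they come from dividing a layout of shape (3,6,2,8) with symbolic strides (w,x,y,z) by 72 and by 9 respectively; that shape is an assumption, since it is not stated in this conversation. The helper `shape_div_layout` is hypothetical and returns the integer multiplier applied to each symbolic stride, on the further assumption that each mode's stride is scaled by the divisor remaining when that mode is reached.

```python
def shape_div_layout(shape, divisor):
    """Divide a shape by an integer, mode by mode, left to right.

    Returns (new_shape, stride_multipliers). Each mode's stride is scaled
    by the divisor still remaining when that mode is reached.
    """
    new_shape, multipliers = [], []
    remaining = divisor
    for extent in shape:
        new_shape.append(-(-extent // remaining))  # ceil(extent / remaining)
        multipliers.append(remaining)
        remaining = -(-remaining // extent)        # ceil(remaining / extent)
    return tuple(new_shape), tuple(multipliers)


# Assumed source layout: shape (3,6,2,8) with strides (w,x,y,z).
print(shape_div_layout((3, 6, 2, 8), 72))  # ((1, 1, 1, 4), (72, 24, 4, 2))
print(shape_div_layout((3, 6, 2, 8), 9))   # ((1, 2, 2, 8), (9, 3, 1, 1))
```

Read against (w,x,y,z), the multipliers give (72*w,24*x,4*y,2*z) and (9*w,3*x,y,z), which agree with the corrected values quoted in the comments above.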
