Conversation

@jacobhinkle (Collaborator)

No description provided.

@github-actions

Description

  • Support epilogue inputs like bias and beta in CUTLASS kernel

  • Fix input/output ordering and dtype handling in Sm90 compute

  • Add proper argument passing for EVT nodes with scalar and tensor inputs

  • Implement Sm90AuxLoad and Sm90SrcFetch for bias tensor handling


Changes walkthrough 📝

Relevant files
Enhancement
codegen.cpp
Move and add fusion position helper functions                       

csrc/cutlass/codegen.cpp

  • Added fusionInputPosition and fusionOutputPosition helper functions
  • Moved helper functions from gemm.cpp to enable reuse
  • Includes fusion and IR headers for Val and Fusion access
  • +16/-2   
evt.cpp
Support epilogue inputs with proper EVT node arguments

csrc/cutlass/evt.cpp

  • Added getPointerCode to generate input/output pointer casting
  • Implemented makeAuxLoadNode for Sm90AuxLoad node creation
  • Enhanced argument handling with key-value pairs in EVT nodes
  • Added input validation for alpha, beta, and bias contiguity
  • +120/-42

gemm.cpp
Enable bias support in CUTLASS GEMM kernel

csrc/cutlass/gemm.cpp

  • Added bias tensor handling in kernel configuration
  • Introduced EpilogueTileShape and EpilogueScheduleType
  • Set ElementC based on bias dtype when present
  • Updated argument passing to include bias pointer
  • +50/-20

codegen.h
Declare fusion position utility functions

csrc/cutlass/codegen.h

  • Declared fusionInputPosition and fusionOutputPosition
  • Added Val forward declaration
  • Header now supports new helper functions
  • +9/-0

evt.h
Update EVT node to support multiple arguments

csrc/cutlass/evt.h

  • Replaced argument with arguments vector of key-value pairs
  • Supports multiple named arguments per EVT node
  • Enables proper scalar and tensor argument passing
  • +2/-2

cutlass.h
Add epilogue tiling parameters

csrc/scheduler/cutlass.h

  • Added epilogue_tile parameter for tiling control
  • Introduced epilogue_stages for circular buffering
  • Default tile size set to 64x64
  • +7/-0

Bug fix

cutlass_compiled_kernel.cpp
Remove fixed argument count assumption

csrc/runtime/cutlass_compiled_kernel.cpp

  • Removed outdated argument count validation
  • Generalized tensor argument handling
  • Prepares for dynamic input ordering
  • +0/-10

Tests

test_cutlass_scheduler.cpp
Add test for bias epilogue in CUTLASS

tests/cpp/test_cutlass_scheduler.cpp

  • Added test for bias + beta epilogue with ReLU
  • Created reference fusion for validation
  • Tests proper ordering and computation
  • Skips on unsupported GPU architectures
  • +154/-4

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Possible Issue

The function getPointerCode computes the index for fusion outputs as fusion_->inputs().size() + fusionOutputPosition(fusion_, tv), but it does not validate that the result lies within the bounds of the combined inputs-and-outputs vector. If the output position is wrong, or outputs are not ordered as assumed, the generated code could perform an out-of-bounds access.

    std::string getPointerCode(TensorView* tv) {
      int64_t index = -1;
      if (tv->isFusionInput()) {
        index = fusionInputPosition(fusion_, tv);
      } else if (tv->isFusionOutput()) {
        index = fusion_->inputs().size() + fusionOutputPosition(fusion_, tv);
      } else {
        NVF_THROW(
            "Cannot get pointer for TV ",
            tv->toString(),
            " which is not a fusion input or output");
      }
      return "static_cast<" + dtypeToCutlass(tv->dtype()) + "*>(inputs.at(" +
          std::to_string(index) + ").data_ptr)";
    }
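One possible mitigation, sketched here with a hypothetical FakeFusion stand-in rather than the PR's actual Fusion API: compute the flat argument index and assert it is in range before emitting the cast.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the fusion's input and output lists.
struct FakeFusion {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Sketch: compute the flat argument index for an output and validate it
// against the combined inputs+outputs size before generating any code.
std::string pointerCodeForOutput(
    const FakeFusion& fusion,
    int64_t output_position,
    const std::string& dtype) {
  const int64_t index =
      static_cast<int64_t>(fusion.inputs.size()) + output_position;
  const int64_t total =
      static_cast<int64_t>(fusion.inputs.size() + fusion.outputs.size());
  // Guard against emitting an out-of-bounds access in the generated code.
  assert(index >= 0 && index < total && "argument index out of range");
  return "static_cast<" + dtype + "*>(inputs.at(" + std::to_string(index) +
      ").data_ptr)";
}
```

Validating at code-generation time turns a silent out-of-bounds read in the compiled kernel into an immediate, debuggable failure.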
Possible Issue

The code adds params.epilogue_tile dimensions to the KernelTraits struct, but uses hardcoded _64, _64 values for EpilogueTileShape. If params.epilogue_tile.m or .n is not 64, the epilogue tiling will be incorrect; the shape should be derived from the parameters to keep them consistent.

    using EpilogueTileShape = Shape<_64, _64>;
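Because CUTLASS tile shapes are compile-time types, honoring a runtime params.epilogue_tile would require dispatching the runtime value to a fixed set of instantiations. A minimal sketch of that dispatch pattern, using a plain template in place of cute::Shape (all names here are illustrative, not the PR's code):

```cpp
#include <stdexcept>
#include <utility>

// Stand-in for a compile-time epilogue tile shape such as
// cute::Shape<cute::Int<M>, cute::Int<N>>.
template <int M, int N>
struct EpilogueTile {
  static constexpr int m = M;
  static constexpr int n = N;
};

// Dispatch a runtime (m, n) request to one of the supported
// compile-time tile shapes; reject anything else.
template <typename Fn>
auto dispatchEpilogueTile(int m, int n, Fn&& fn) {
  if (m == 64 && n == 64) {
    return fn(EpilogueTile<64, 64>{});
  }
  if (m == 64 && n == 32) {
    return fn(EpilogueTile<64, 32>{});
  }
  throw std::runtime_error("unsupported epilogue tile shape");
}
```

The callback receives the chosen shape as a type, so the rest of the kernel traits can be instantiated from it; unsupported sizes fail loudly instead of silently tiling wrong.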
Performance Issue

The check_input lambda validates the contiguity of the alpha, beta, and bias tensors. Note that the lambda is defined once and invoked three times, and these checks run during code generation rather than at kernel runtime, so their overhead is negligible; the more useful review question is whether the contiguity requirement is correct for all three inputs.

    auto check_input = [](TensorView* inp) {
      if (inp == nullptr) {
        // Allow null
        return;
      }
      // Check that input is contiguous
      const std::vector<std::optional<bool>>& contig = inp->getContiguity();
      NVF_ERROR(
          std::all_of(
              contig.begin(),
              contig.end(),
              [](const std::optional<bool>& c) {
                return !c.has_value() || c.value();
              }),
          "Expected all inputs to ScaledMmaOp to be contiguous but found ",
          inp->toString());
    };
    check_input(alpha);
    check_input(beta);
    check_input(bias);
    

jacobhinkle added a commit that referenced this pull request Oct 30, 2025

Previously, we supported a single `TensorView*` argument for each EVT node. This PR instead allows an argument list, provided as a simple list of key-value string pairs. This added flexibility is needed to support EVT nodes that require multiple parameters.

This is needed for #5441 and #5440
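The key-value argument list described in the commit message might be modeled roughly as follows; this is a sketch of the idea, not the actual definition in csrc/cutlass/evt.h.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch: an EVT node carrying named arguments as key-value string
// pairs, rather than a single TensorView* argument.
struct EvtNode {
  std::string name;
  std::vector<std::pair<std::string, std::string>> arguments;
};

// Render the arguments as a brace-enclosed list of "key: value" entries,
// e.g. "{beta: beta_ptr, bias: bias_ptr}".
std::string renderArguments(const EvtNode& node) {
  std::string out = "{";
  for (size_t i = 0; i < node.arguments.size(); ++i) {
    if (i > 0) {
      out += ", ";
    }
    out += node.arguments[i].first + ": " + node.arguments[i].second;
  }
  return out + "}";
}
```

Using string pairs keeps the node representation uniform: scalar parameters and tensor pointers are both just named entries, so nodes like Sm90AuxLoad that need several parameters fit the same scheme as single-argument nodes.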