`GMMA::ScaleOut::Zero` Not Equivalent to `clear()` ? #2284

XieXiating · 2025-05-07T11:11:02Z

I'm trying to develop a GEMM example utilizing TMA and WGMMA on NVIDIA Hopper GPUs.

In the CUTLASS examples, the accumulator is initialized using:

    tiled_mma.accumulate_ = GMMA::ScaleOut::Zero;

This sets the scale_d flag in the underlying wgmma.mma_async instruction to zero, ensuring the computation follows D = A @ B rather than D = A @ B + D.

However, in my implementation, this approach doesn't work as expected.

For example, if K = 256 and all input values are 1.0, the correct result should be 256. However, the output produced by this code is 144.

Here's a simplified snippet of the code.

    // ...
    // Allocate the accumulators
    Tensor accum = partition_fragment_C(tiled_mma, take<0, 2>(TileShapeMNK{})); // (MMA,MMA_M,MMA_N)
    tiled_mma.accumulate_ = GMMA::ScaleOut::Zero; // Init mma accumulate_

    CUTE_NO_UNROLL
    for (int k_tile_idx = 0; k_tile_idx < k_tile_count; ++k_tile_idx) {
        // ... copy A, B and sync

        warpgroup_fence_operand(accum);
        warpgroup_arrive();
        // (V,M,K) x (V,N,K) => (V,M,N)
        cute::gemm(tiled_mma, tCrA, tCrB, accum);
        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
        warpgroup_commit_batch(); // wgmma.commit_group
        warpgroup_wait<0>();      // wgmma.wait_group, Wait for all MMAs in a K_TILE to complete
        warpgroup_fence_operand(accum);
    }

Conversely, explicitly invoking clear(accum); as follows yields correct results:

    // ...
    // Allocate the accumulators
    Tensor accum = partition_fragment_C(tiled_mma, take<0, 2>(TileShapeMNK{})); // (MMA,MMA_M,MMA_N)
    clear(accum);

    CUTE_NO_UNROLL
    for (int k_tile_idx = 0; k_tile_idx < k_tile_count; ++k_tile_idx) {
        // ... copy A, B and sync

        warpgroup_fence_operand(accum);
        warpgroup_arrive();
        // (V,M,K) x (V,N,K) => (V,M,N)
        cute::gemm(tiled_mma, tCrA, tCrB, accum);
        warpgroup_commit_batch(); // wgmma.commit_group
        warpgroup_wait<0>();      // wgmma.wait_group, Wait for all MMAs in a K_TILE to complete
        warpgroup_fence_operand(accum);
    }

I would appreciate any insights into why tiled_mma.accumulate_ = GMMA::ScaleOut::Zero; doesn't behave as intended in this context. Is there a specific usage pattern or prerequisite I'm missing?

For reference, I'm utilizing the SM90_64x256x16_F32F16F16_SS MMA atom.
Any guidance or suggestions would be greatly appreciated!

thakkarV · 2025-05-07T12:06:59Z

Collaborator

You're setting it to zero for the entire first k tile, which is more than one MMA. You have to set it to zero only for the first k block, i.e. the first MMA along K, and switch it to one immediately after that. See the CUTLASS mainloops, which unroll the K iteration of cute::gemm themselves so that they can set the scale value to one after the first MMA.
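To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU or CUTLASS required) that models a single accumulator element under both flag-update patterns. It is only a model of the behaviour described in the answer above, not CUTLASS code: the names `mma`, `per_tile_flag`, `per_block_flag`, and `kMmaK` are made up for illustration. The K tile size of 128 is also an assumption, since the post does not state it; it is used here because, with the K = 16 atom and all-ones inputs, it reproduces the reported value of 144 (the last MMA of the first tile contributes 16, the second tile adds 128).

```cpp
// Host-side model of a single accumulator element; no GPU or CUTLASS needed.
// All names here are illustrative and are not CUTLASS APIs.
enum class ScaleOut { Zero, One };

constexpr int kMmaK = 16;  // K extent of one SM90_64x256x16 MMA

// One MMA with all-ones inputs: D = A*B + scale_d * D, where A*B == kMmaK.
constexpr float mma(float d, ScaleOut scale_d) {
    return (scale_d == ScaleOut::One ? d : 0.0f) + static_cast<float>(kMmaK);
}

// Pattern from the question: the flag changes only after a whole K tile,
// so every MMA in the first tile overwrites the accumulator.
constexpr float per_tile_flag(int K, int tile_k) {
    float d = -1.0f;                  // stand-in for an uninitialized register
    ScaleOut scale = ScaleOut::Zero;
    for (int k_tile = 0; k_tile < K / tile_k; ++k_tile) {
        for (int k_block = 0; k_block < tile_k / kMmaK; ++k_block) {
            d = mma(d, scale);        // models the K loop inside cute::gemm
        }
        scale = ScaleOut::One;        // too late: the first tile is already done
    }
    return d;
}

// Pattern from the answer: the K-block loop is written out explicitly,
// so the flag can switch to One right after the very first MMA.
constexpr float per_block_flag(int K, int tile_k) {
    float d = -1.0f;                  // stand-in for an uninitialized register
    ScaleOut scale = ScaleOut::Zero;
    for (int k_tile = 0; k_tile < K / tile_k; ++k_tile) {
        for (int k_block = 0; k_block < tile_k / kMmaK; ++k_block) {
            d = mma(d, scale);
            scale = ScaleOut::One;    // only the first MMA overwrites
        }
    }
    return d;
}

// Compile-time checks for K = 256 with the assumed K tile of 128.
static_assert(per_tile_flag(256, 128) == 144.0f, "matches the reported wrong result");
static_assert(per_block_flag(256, 128) == 256.0f, "matches the expected result");
```

In the real kernel, the equivalent change would presumably be to loop over the K mode of tCrA and tCrB explicitly, call cute::gemm on one k block per iteration, and assign GMMA::ScaleOut::One inside that inner loop, as the mainloops mentioned above do.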

2 replies

XieXiating May 7, 2025
Author

Thanks!

XieXiating May 7, 2025
Author

As a follow-up question: is there any performance difference between using clear() and setting tiled_mma.accumulate_ = GMMA::ScaleOut::Zero?
