[CK Tile] Grouped GEMM aquant mode and non-persistent kernel #3337

ErwinTerpstra · 2025-12-02T10:09:51Z

Proposed changes

This closes internal ticket LWPCK-4126.

The grouped GEMM quantized example already uses a persistent kernel, but this is hard-coded; it should become one of the GemmConfig options. Non-persistent support should also be added to the kernel, which currently contains a static_assert requiring a persistent configuration.
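
The persistency change described above can be sketched roughly as follows. This is a minimal illustration, not the CK Tile implementation: `ExampleConfig`, `ExampleKernel`, and `tiles_for_block` are hypothetical names, and the tile-to-block mapping is simplified to a grid-stride loop for the persistent case and one tile per block otherwise.

```cpp
#include <cstddef>

// Hypothetical config: persistency is an option rather than a hard-coded choice.
template <bool Persistent_>
struct ExampleConfig
{
    static constexpr bool Persistent = Persistent_;
};

// Hypothetical kernel skeleton. It returns how many output tiles a given
// block would process, which is enough to contrast the two strategies.
template <typename Config>
struct ExampleKernel
{
    static constexpr std::size_t tiles_for_block(std::size_t block_id,
                                                 std::size_t grid_size,
                                                 std::size_t total_tiles)
    {
        if constexpr(Config::Persistent)
        {
            // Persistent: a fixed-size grid walks all tiles with a grid-sized stride.
            std::size_t count = 0;
            for(std::size_t tile = block_id; tile < total_tiles; tile += grid_size)
                ++count;
            return count;
        }
        else
        {
            // Non-persistent: one block per tile; surplus blocks do no work.
            return block_id < total_tiles ? 1 : 0;
        }
    }
};
```

Branching with `if constexpr` instead of a static_assert lets both configurations compile from the same kernel template, at the cost of maintaining two scheduling paths.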

The example also supports only grouped B quantization; support for A quantization should be added. Note that there currently seems to be no pipeline for A quantization with B preshuffle. If that is needed, it should be a follow-up issue.
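
As a rough reference for what grouped A quantization computes, here is a small sketch. It assumes that each row of A carries one scale per group of consecutive K elements and that the scale is applied to the partial sum of that group; the names `KDim`, `GroupSize`, and `aquant_dot` are made up for illustration and do not come from CK Tile.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t KDim      = 4; // illustrative reduction length
constexpr std::size_t GroupSize = 2; // illustrative quantization group size

// Dot product of one row of A with one column of B, where A has one
// scale per group of GroupSize consecutive K elements (assumption).
constexpr float aquant_dot(const std::array<int, KDim>& a_row,
                           const std::array<float, KDim / GroupSize>& a_scales,
                           const std::array<int, KDim>& b_col)
{
    float acc = 0.0f;
    for(std::size_t g = 0; g < KDim / GroupSize; ++g)
    {
        // Accumulate the quantized products of this group first...
        int partial = 0;
        for(std::size_t k = g * GroupSize; k < (g + 1) * GroupSize; ++k)
            partial += a_row[k] * b_col[k];
        // ...then apply the group's A scale once to the partial sum.
        acc += a_scales[g] * static_cast<float>(partial);
    }
    return acc;
}
```

Grouped B quantization would follow the same shape with the scales attached to B instead of A.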

Changes:

- Implemented non-persistent support for the quantized grouped GEMM kernel
- Fixed some issues in the AQuant pipeline
- Added a kernel persistency configuration value to the existing example
- Added tests for the non-persistent kernel to the existing test suite
- Added AQuantGrouped mode to the existing example
- Added tests for AQuantGrouped mode

Note that there is still a problem in the AQuant pipeline when MRepeat > 1 and TransposeC == false (there seems to be a bug in the use of ds_bpermute). The example and tests now restrict MRepeat to 1 in those cases.
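
A compile-time restriction of this kind could look like the sketch below. It is only an illustration: `MTileSketch` and its values are hypothetical, and it assumes the block tile size along M is the product of MRepeat, the number of warps along M, and the warp tile size along M.

```cpp
// Hypothetical tile selection: limit MRepeat to 1 only for the
// combination that is affected (AQuant without TransposeC).
template <bool AQuant, bool TransposeC>
struct MTileSketch
{
    static constexpr int M_Warp      = 2;  // warps along M (illustrative)
    static constexpr int M_Warp_Tile = 16; // rows per warp-level GEMM (illustrative)

    // One repeat for the affected case, several repeats otherwise.
    static constexpr int MRepeat = (AQuant && !TransposeC) ? 1 : 4;

    // Assumed relation between the block tile and the repeat count.
    static constexpr int M_Tile = MRepeat * M_Warp * M_Warp_Tile;
};
```

Keeping the condition in one place makes a workaround like this easy to delete once the underlying pipeline issue is resolved.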

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

ThomasNing

Thanks for the contribution. LGTM overall, except for the inline comments.

ThomasNing · 2025-12-03T05:37:22Z

example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

    static constexpr ck_tile::index_t K_Warp_Tile =
        get_k_from_preshuffled_warp_tile<PrecType, M_Warp_Tile>();

+    static constexpr bool TransposeC       = false;


We do not need to add this; TransposeC is already false in the base config.
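
The point about the base config can be shown with a small sketch. The struct names below are hypothetical stand-ins, not the example's real config types; the sketch only demonstrates that a static member set in a base struct is visible through a derived config without being redeclared.

```cpp
// Hypothetical stand-in for the base config with its defaults.
struct BaseConfigSketch
{
    static constexpr bool TransposeC  = false;
    static constexpr bool PreshuffleB = false;
};

// Derived config that declares only what differs from the base.
struct DerivedConfigSketch : public BaseConfigSketch
{
    static constexpr int K_Warp_Tile = 16; // illustrative value only
};

// The inherited default is reachable through the derived type.
static_assert(DerivedConfigSketch::TransposeC == false,
              "TransposeC comes from the base config");
```

Redeclaring the member in the derived struct would shadow the base value, which is harmless here but hides later changes to the base default.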

ThomasNing · 2025-12-03T05:38:40Z

example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

 };

+template <typename PrecType>
+struct GemmConfig_Aquant : public GemmConfigBase


We do not need to add a specific GemmConfig for AQuant. It can use GemmConfigComputeV3_2 directly.

It used to have different settings for AQuant, but those are indeed no longer needed. I removed it.

ThomasNing · 2025-12-03T05:39:14Z

example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

    static constexpr ck_tile::index_t NumWaveGroups = 1;
    static constexpr bool DoubleSmemBuffer          = false;
    static constexpr bool PreshuffleB               = false;
+    static constexpr bool Persistent                = false;


Should we add a Persistent Gemm Config?

I made it into a template parameter and added it to the command-line parameters so that both can easily be used.
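
Mapping a command-line value onto a template parameter usually comes down to a small dispatch step like the one sketched here. `ConfigSketch`, `run_example`, and `dispatch` are hypothetical names for illustration; the real example's argument parsing and launch code are not shown in this thread.

```cpp
#include <string_view>

// Hypothetical config template: persistency is a compile-time parameter.
template <bool Persistent_>
struct ConfigSketch
{
    static constexpr bool Persistent = Persistent_;
};

// Stand-in for running the example with a given config.
template <typename Config>
constexpr std::string_view run_example()
{
    return Config::Persistent ? "persistent" : "non-persistent";
}

// Map the runtime command-line value onto the two instantiations.
constexpr std::string_view dispatch(bool persistent_arg)
{
    return persistent_arg ? run_example<ConfigSketch<true>>()
                          : run_example<ConfigSketch<false>>();
}
```

Both instantiations are compiled, so the binary contains each kernel variant and the flag only selects between them at launch time.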

ThomasNing · 2025-12-03T06:02:11Z

include/ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp

 template<> struct Dispatcher<fp8_t, bf8_t, float, 32, 32,  32, false> { using Type = WarpGemmMfma_f32_32x32x32_fp8_bf8; };
 template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32,  16, false> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8; };
 template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32,  16,  true> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8_CTransposed; };
+template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32,  32,  true> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8_CTransposed; };


We do not need the 32x32x32 variant for the 8-bit warp GEMM scenario.

These were instantiated with the previous config/pipeline, but they indeed do not seem necessary anymore. I removed them.
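
For readers unfamiliar with the pattern, a dispatcher of this kind is a set of explicit template specializations keyed on tile shape and flags. The sketch below uses hypothetical tag types and assumes the primary template is left undefined, so an unlisted combination is simply unavailable; whether the real dispatcher behaves this way is not confirmed by this thread.

```cpp
#include <type_traits>

// Hypothetical tag types standing in for warp GEMM implementations.
struct WarpGemm_K16 {};
struct WarpGemm_K16_CTransposed {};

// Primary template declared but not defined (assumption): only the
// combinations listed below can be used.
template <int M, int N, int K, bool TransposeC>
struct DispatcherSketch;

template <> struct DispatcherSketch<32, 32, 16, false> { using Type = WarpGemm_K16; };
template <> struct DispatcherSketch<32, 32, 16, true>  { using Type = WarpGemm_K16_CTransposed; };

// Helper that reports whether a specialization has a definition.
template <typename T, typename = void>
struct is_complete : std::false_type {};

template <typename T>
struct is_complete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};
```

With this layout, removing an unused specialization shrinks the supported set without touching any call site that does not request it.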

ThomasNing · 2025-12-03T06:13:51Z

test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp


    struct GroupedGemKernelParam_Mfma
    {
+        // HACK: There's a bug in the AQuant pipeline that causes MRepeat > 1 to be incorrect


We have already solved this problem. Please sync with the develop branch :)

Thanks, I removed the workaround.

ThomasNing · 2025-12-04T06:09:57Z

@ErwinTerpstra Thanks for the change. Please merge with the develop branch, then I can kick off the official CI run.

ErwinTerpstra · 2025-12-04T06:17:54Z

Done!

krithalith requested review from ex-rzr and wj-laskowski December 2, 2025 12:39

krithalith added the organization: streamhpc label Dec 2, 2025

ThomasNing requested changes Dec 3, 2025

ErwinTerpstra added 11 commits December 3, 2025 07:54

wip: add aquant to grouped gemm quant example

0f38066

fix: properly handle hot loop count in aquant pipeline

5d4a91a

fix: add separate GemmConfig structs for AQuant, automatically select…

f740922

… the correct one

feat: finish support for a non-persistent kernel invocation for group…

0cb77e5

…ed gemm quant, and add support code to example

refactor: cleaned up grouped gemm quant example a bit by reusing pipe…

9a01e4a

…line selection logic

chore: add warp gemm dispatchers for a couple of TransposeC K=32 vari…

b3fd4d3

…ants

feat: add quant grouped gemm tests cases for aquant (regular and tran…

83f7832

…spose C) and non-persistent kernel

fix: update base pipeline classes according to changes in develop branch

e10df0a

Revert "chore: add warp gemm dispatchers for a couple of TransposeC K…

76897c5

…=32 variants" This reverts commit b3fd4d3.

feat: remove aquant config from grouped gemm quant example, update to…

e78bcbb

… add persistency as runtime parameter

chore: removed work-around for aquant bug that has been fixed

8985b70

ErwinTerpstra force-pushed the eterpstr/190-ck-tile-grouped-gemm-aquant-and-non-persistent-kernel branch from 6d8f465 to 8985b70 Compare December 3, 2025 08:57

ErwinTerpstra requested a review from ThomasNing December 3, 2025 09:02

Merge branch 'develop' into 190-ck-tile-grouped-gemm-aquant-and-non-p…

db21e62

…ersistent-kernel
