Skip to content

Conversation

@ErwinTerpstra
Copy link
Contributor

Proposed changes

This closes internal ticket LWPCK-4126.

The grouped GEMM quantized example already includes using a persistent kernel, but this is hard-coded and should be added to the GemmConfig options. Non-persistent kernel support should be added to the kernel which currently contains a static_assert that requires the kernel to be configured as persistent.

The example also only does grouped B quantization. Support for A quantization should be added. Note that it seems there is currently no pipeline for A quantization with B preshuffle, if needed that should be a follow-up issue.

Changes:

  • Implemented non-persistent support for the quantized grouped gemm kernel
  • Fixed some issues in the AQuant pipeline
  • Added kernel persistency configuration value to the existing example
  • Added tests for non-persistent kernel to existing test suite
  • Added AQuantGrouped mode to the existing example
  • Added tests for AQuantGrouped mode

Note that there's still a problem in the AQuant pipeline with MRepeat > 1 and TransposeC == false (there seems to be a bug using ds_bpermute). The example and tests now conditionally only use MRepeat == 1 in those cases.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Copy link
Contributor

@ThomasNing ThomasNing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. LGTM overall except the above comments.

static constexpr ck_tile::index_t K_Warp_Tile =
get_k_from_preshuffled_warp_tile<PrecType, M_Warp_Tile>();

static constexpr bool TransposeC = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to add this the base TransposeC is already false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it

};

template <typename PrecType>
struct GemmConfig_Aquant : public GemmConfigBase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to add a specific Gemm Config Aquant. It could directly use the GemmConfigComputeV3_2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used to have different settings for Aquant, but indeed not needed anymore. Removed it.

static constexpr ck_tile::index_t NumWaveGroups = 1;
static constexpr bool DoubleSmemBuffer = false;
static constexpr bool PreshuffleB = false;
static constexpr bool Persistent = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a Persistent Gemm Config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it into a template parameter and added it to the command-line parameters so that both can easily be used.

template<> struct Dispatcher<fp8_t, bf8_t, float, 32, 32, 32, false> { using Type = WarpGemmMfma_f32_32x32x32_fp8_bf8; };
template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32, 16, false> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8; };
template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32, 16, true> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8_CTransposed; };
template<> struct Dispatcher<bf8_t, fp8_t, float, 32, 32, 32, true> { using Type = WarpGemmMfma_f32_32x32x16_bf8_fp8_CTransposed; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need the 32x32x32 for 8bit warp gemm scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were instantiated with the previous config/pipeline. But indeed don't seem necessary anymore. I removed them.


struct GroupedGemKernelParam_Mfma
{
// HACK: There's a bug in the AQuant pipeline that causes MRepeat > 1 to be incorrect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already solved the problem. Please sync up with the develop :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I removed the workaround.

@ErwinTerpstra ErwinTerpstra force-pushed the eterpstr/190-ck-tile-grouped-gemm-aquant-and-non-persistent-kernel branch from 6d8f465 to 8985b70 Compare December 3, 2025 08:57
@ThomasNing
Copy link
Contributor

@ErwinTerpstra Thanks for the change. Please merge with the develop then I could kick off the official CI run.

@ErwinTerpstra
Copy link
Contributor Author

@ErwinTerpstra Thanks for the change. Please merge with the develop then I could kick off the official CI run.

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants