
Triton -> CuteDSL dim0 quantization kernel with RCEIL scaling that writes scales directly to ((32,4),4) layout for tcgen05 mma #4052

@danielvegamyhre

Description


For shape (128000, 7168):

  • Current: ~5.5 TB/s
  • Goal: ~6.4 TB/s
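
For a rough sense of what these bandwidth numbers mean in kernel time, here is a back-of-the-envelope sketch. It assumes a bf16 input (2 bytes/element), an fp8 output (1 byte/element), and one 1-byte scale per 32 elements; none of these are stated in this issue, so treat the dtypes, the block size, and the helper name `kernel_time_us` as assumptions.

```python
# Assumptions (not stated in the issue): bf16 in, fp8 out, 1-byte scale per 32 elements.
ROWS, COLS = 128000, 7168
BLOCK_SIZE = 32  # assumed scaling block size

n_elems = ROWS * COLS
bytes_read = n_elems * 2                       # bf16 input
bytes_written = n_elems + n_elems // BLOCK_SIZE  # fp8 data + 1-byte scales
total_bytes = bytes_read + bytes_written


def kernel_time_us(bandwidth_tb_per_s: float) -> float:
    """Time in microseconds to move total_bytes at the given bandwidth (1 TB = 1e12 B)."""
    return total_bytes / (bandwidth_tb_per_s * 1e12) * 1e6


current_us = kernel_time_us(5.5)
goal_us = kernel_time_us(6.4)
print(f"total bytes moved: {total_bytes}")
print(f"current: {current_us:.1f} us, goal: {goal_us:.1f} us")
```

Under these assumptions the gap between current and goal is on the order of tens of microseconds per call, which is about the scale of a separate lightweight layout-conversion kernel launch.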

Currently, the kernel writes scales in row-major order, which requires an additional lightweight kernel to convert them to the per-group blocked layout. We should write the scales directly to the blocked layout instead.
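
To make the target layout concrete, here is a minimal NumPy sketch of the index mapping a fused kernel would need. It assumes the ((32,4),4) layout means scales are tiled in 128x4 blocks, with each tile stored as 32 rows of 16 values (4 row-groups of 32 rows interleaved, 4 scale columns each). It also assumes the scale matrix is already padded to multiples of 128 rows and 4 columns. The function names `blocked_offset` and `to_blocked_reference` are hypothetical and only for illustration; this is not the actual kernel code.

```python
import numpy as np

TILE_ROWS, TILE_COLS = 128, 4  # assumed scale tile shape


def blocked_offset(row: int, col: int, n_col_tiles: int) -> int:
    """Flat offset of scale (row, col) in the assumed blocked layout."""
    tile_row, r = divmod(row, TILE_ROWS)
    tile_col, c = divmod(col, TILE_COLS)
    tile_base = (tile_row * n_col_tiles + tile_col) * TILE_ROWS * TILE_COLS
    # Within a tile: 32 inner rows, each holding 4 row-groups x 4 columns.
    return tile_base + (r % 32) * 16 + (r // 32) * 4 + c


def to_blocked_reference(scales: np.ndarray) -> np.ndarray:
    """Reference row-major -> blocked conversion (the extra pass to be fused away)."""
    rows, cols = scales.shape
    assert rows % TILE_ROWS == 0 and cols % TILE_COLS == 0
    # Axes: (tile_row, r_outer, r_inner, tile_col, c)
    tiles = scales.reshape(rows // TILE_ROWS, 4, 32, cols // TILE_COLS, 4)
    # Reorder to (tile_row, tile_col, r_inner, r_outer, c) and flatten.
    return tiles.transpose(0, 3, 2, 1, 4).reshape(-1)


# Small self-check: the per-element offset agrees with the reference conversion.
scales = np.arange(256 * 8).reshape(256, 8)
blocked = to_blocked_reference(scales)
n_col_tiles = scales.shape[1] // TILE_COLS
mapping_ok = all(
    blocked[blocked_offset(r, c, n_col_tiles)] == scales[r, c]
    for r in range(scales.shape[0])
    for c in range(scales.shape[1])
)
print("offset mapping matches reference:", mapping_ok)
```

A kernel that writes directly to the blocked layout would compute something like `blocked_offset` for each scale it produces and store there, instead of storing at `row * n_cols + col` and running a second conversion pass.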

Also, we should try CuteDSL instead of Triton for this next iteration.
