
CUDA -> CuteDSL dim1 quantization kernel with RCEIL scaling that writes scales directly to ((32,4),4) layout for tcgen05 mma #4053

@danielvegamyhre

Description


Current: 5.9 TB/s
Goal: 6.4 TB/s

The current kernel writes scales in row-major order, which requires an extra kernel to rearrange them into the per-group blocked layout. This is suboptimal.
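The blocked layout the title refers to can be sketched in numpy. This is a minimal reference sketch, not the kernel itself: it assumes the standard 128-row x 4-column scale-factor tile whose CuTe layout is ((32,4),4) with strides ((16,4),1), so element (r, c) of a tile lands at linear offset (r % 32)*16 + (r // 32)*4 + c. The function name and the M % 128 == 0, G % 4 == 0 assumptions are illustrative.

```python
import numpy as np

def to_blocked_scale_layout(scales: np.ndarray) -> np.ndarray:
    """Rearrange row-major per-group scales of shape (M, G) into the
    ((32,4),4)-per-tile blocked layout consumed by tcgen05 block-scaled MMA.

    Assumes M % 128 == 0 and G % 4 == 0; tiles are emitted row-major,
    512 scale elements per tile.
    """
    M, G = scales.shape
    out = np.empty(M * G, dtype=scales.dtype)
    tile_cols = G // 4
    for tile_r in range(M // 128):
        for tile_c in range(tile_cols):
            tile = scales[tile_r * 128:(tile_r + 1) * 128,
                          tile_c * 4:(tile_c + 1) * 4]
            base = (tile_r * tile_cols + tile_c) * 512
            for r in range(128):
                for c in range(4):
                    # ((32,4),4):((16,4),1) -> offset within the tile
                    out[base + (r % 32) * 16 + (r // 32) * 4 + c] = tile[r, c]
    return out
```

Writing scales at these offsets directly from the quantization kernel is what removes the extra rearrangement kernel.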

The CUDA implementation is also hard to maintain: it must preserve ABI compatibility across PyTorch versions, and shipping prebuilt binaries is a hassle. We should use CuteDSL for the next iteration.
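For reference, the RCEIL scaling mentioned in the title can be sketched as a round-up power-of-two scale: pick the smallest 2^e such that the per-group amax, divided by 2^e, fits in the destination format's range. This is a hedged numpy sketch assuming an fp8 e4m3 destination (max magnitude 448); the function name and clamp constant are illustrative.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3

def rceil_scale(amax: np.ndarray) -> np.ndarray:
    """Round-up (RCEIL) power-of-two scale per group:
    2 ** ceil(log2(amax / FP8_E4M3_MAX))."""
    # Clamp amax away from zero to avoid log2(0)
    e = np.ceil(np.log2(np.maximum(amax, 1e-38) / FP8_E4M3_MAX))
    return np.exp2(e)
```

Rounding the exponent up (rather than toward nearest/down) guarantees the scaled values never exceed the fp8 range, at the cost of up to one extra bit of headroom.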
