Current: 5.9 TB/s
Goal: 6.4 TB/s
Currently writes scales in row-major order, so a separate kernel is needed to repack them into the per-group blocked layout; this extra pass is suboptimal.
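A minimal sketch of what that extra repacking pass does, using numpy and hypothetical shapes (M rows of scales, groups of G rows, N columns — none of these names are from the actual kernel):

```python
import numpy as np

# Hypothetical sizes for illustration only.
M, N, G = 8, 4, 2  # M rows, N cols, G rows per group

# Scales as the kernel currently writes them: plain row-major.
row_major = np.arange(M * N, dtype=np.float32).reshape(M, N)

# The extra repacking pass: one contiguous (G, N) block per group
# of rows, i.e. a per-group blocked layout.
blocked = np.ascontiguousarray(row_major.reshape(M // G, G, N))

# Same values, different memory layout: block g holds rows g*G..(g+1)*G.
assert np.array_equal(blocked[1], row_major[G:2 * G])
```

Fusing this repack into the producing kernel (writing blocks directly instead of row-major) would remove the extra pass.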
Also, raw CUDA is hard to maintain because of ABI compatibility across different PyTorch versions, and shipping prebuilt binaries is annoying. We should use CuTe DSL for the next iteration.