For shape (128000, 7168):
- Current: ~5.5 TB/s
- Goal: ~6.4 TB/s
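
For a rough sense of what these bandwidth numbers mean in kernel time, here is a minimal sketch. It assumes a 2-byte input element (e.g. bf16), a 1-byte quantized output, and one 1-byte scale per 32 elements; none of these sizes are stated above, so adjust the defaults to match the actual kernel. The helper names are hypothetical.

```python
def bytes_moved(rows, cols, in_bytes=2, out_bytes=1, block=32, scale_bytes=1):
    """Bytes read plus bytes written by one quantization pass.

    All element sizes and the block size are assumptions, not taken from the kernel.
    """
    elems = rows * cols
    return elems * in_bytes + elems * out_bytes + (elems // block) * scale_bytes


def kernel_time_us(rows, cols, tb_per_s):
    """Kernel time in microseconds implied by an effective bandwidth in TB/s."""
    return bytes_moved(rows, cols) / (tb_per_s * 1e12) * 1e6


current_us = kernel_time_us(128000, 7168, 5.5)
goal_us = kernel_time_us(128000, 7168, 6.4)
print(f"current: {current_us:.1f} us, goal: {goal_us:.1f} us")
```

Under those assumptions, the goal corresponds to cutting roughly 70 microseconds from a kernel that currently takes about half a millisecond.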
The kernel currently writes scales in row-major order, so an additional lightweight kernel is needed to convert them to the per-group blocked layout. We should write directly to the blocked layout instead.
Also, we should try CuteDSL instead of Triton for the next iteration.
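
To illustrate what writing scales directly to a blocked layout could look like, here is a minimal pure-Python sketch of the index math. It assumes the blocked layout is the 128x4 tiled scale layout (each tile stored as 32 rows of 16 values) commonly used for block-scaled GEMMs; the note above does not say which layout is meant, and the per-group padding is not modeled here. `blocked_offset` and `write_blocked` are hypothetical names.

```python
def blocked_offset(row, col, n_col_blocks):
    """Flat destination offset of scale (row, col) in the ASSUMED 128x4 blocked layout."""
    tile = (row // 128) * n_col_blocks + (col // 4)  # which 128x4 tile
    r, c = row % 128, col % 4                        # position inside the tile
    # Inside a tile, rows are interleaved in four groups of 32.
    return tile * 512 + (r % 32) * 16 + (r // 32) * 4 + c


def write_blocked(scales):
    """Scatter a row-major 2D list of scales into a flat, zero-padded blocked buffer."""
    rows, cols = len(scales), len(scales[0])
    n_row_blocks = -(-rows // 128)  # ceiling division
    n_col_blocks = -(-cols // 4)
    out = [0] * (n_row_blocks * n_col_blocks * 512)
    for i in range(rows):
        for j in range(cols):
            out[blocked_offset(i, j, n_col_blocks)] = scales[i][j]
    return out


# Small example: a 256x8 scale matrix with distinct values.
scales = [[i * 8 + j for j in range(8)] for i in range(256)]
blocked = write_blocked(scales)
```

In a fused kernel, each thread would compute the equivalent of `blocked_offset` for the scale it produces and store there directly, which removes the separate conversion pass.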