Current: 5.9 TB/s
Goal: 6.4 TB/s
Currently writes scales in row-major order, so a separate kernel is needed to repack them into the per-group blocked layout; this extra pass is suboptimal.
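A minimal sketch of what that extra repacking pass does, using numpy and hypothetical shapes (M rows of scales, groups of G rows, N columns — none of these names are from the actual kernel):

```python
import numpy as np

# Hypothetical sizes for illustration only.
M, N, G = 8, 4, 2  # M rows, N cols, G rows per group

# Scales as the kernel currently writes them: plain row-major.
row_major = np.arange(M * N, dtype=np.float32).reshape(M, N)

# The extra repacking pass: one contiguous (G, N) block per group
# of rows, i.e. a per-group blocked layout.
blocked = np.ascontiguousarray(row_major.reshape(M // G, G, N))

# Same values, different memory layout: block g holds rows g*G..(g+1)*G.
assert np.array_equal(blocked[1], row_major[G:2 * G])
```

Fusing this repack into the producing kernel (writing blocks directly instead of row-major) would remove the extra pass.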
Also, raw CUDA is hard to maintain because of ABI compatibility across different PyTorch versions, and shipping prebuilt binaries is annoying. We should use CuTe DSL for the next iteration.