Update MSLK Triton FP8 row quantization kernel to match CUDA arithmetic and delete the C++ quantize_fp8_per_row kernel (#224) #854

Job	Run time
generate-matrix / generate	4s
filter-matrix	6s
meta-pytorch/MSLK / build-manywheel-py3_10-rocm7_2	48m 49s
meta-pytorch/MSLK / build-manywheel-py3_10-rocm7_1	48m 54s
meta-pytorch/MSLK / build-manywheel-py3_10-cuda12_8	52m 45s
meta-pytorch/MSLK / build-manywheel-py3_10-cuda12_9	52m 43s
meta-pytorch/MSLK / build-manywheel-py3_10-cuda13_0	47m 47s
meta-pytorch/MSLK / build-manywheel-py3_10-cuda12_6	47m 47s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-cuda12_9	14s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-rocm7_1	15s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-cuda12_8	17s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-rocm7_2	12s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-cuda12_6	14s
meta-pytorch/MSLK / upload / upload-manywheel-py3_10-cuda13_0	16s
Total	5h 0m 23s
