Update MSLK Triton FP8 row quantization kernel to match CUDA arithmetic and delete the C++ quantize_fp8_per_row kernel (#224) #852