Suggestions for using FP8 tensor cores on H100

Thanks a lot for your work on QServe.
We now want to deploy QServe on H100 GPUs (which support FP8 tensor cores), taking both accuracy and throughput into account.
Do you have any suggestions for adapting QServe to this setup? Or are you considering any optimizations for H100?