How can I get stable fp16 inference with TensorRT?

Excuse me, I am trying a DiT-based model for speech (audio) generation. The results are good when running inference with fp32 weights and activations, but with fp16 most values overflow, and with bf16 the generated samples are polluted with noise. Does anyone have experience with post-training quantization of DiT-based models using the `pytorch->onnx->tensorrt` pipeline? Thanks.
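
For context, here is a minimal NumPy sketch of the two symptoms as I understand them. It does not use my model: the input values are made up for illustration, and `to_bf16` is a hypothetical helper that emulates bf16 by truncating a float32 to its upper 16 bits (an assumption; real hardware typically rounds to nearest).

```python
import numpy as np


def to_bf16(x):
    """Emulate bfloat16 by keeping only the upper 16 bits of each float32.

    Assumption: plain truncation, which is a simplification of what
    real bf16 hardware does (usually round-to-nearest-even).
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


# Illustrative values only: one large activation, one value close to 1.
x = np.array([1.0e5, 1.001], dtype=np.float32)

# fp16 has a 5-bit exponent, so anything above its max becomes inf.
with np.errstate(over="ignore"):
    as_fp16 = x.astype(np.float16)

# bf16 keeps the fp32 exponent range but has only 7 explicit mantissa bits.
as_bf16 = to_bf16(x)

print("fp16 max :", float(np.finfo(np.float16).max))
print("fp16     :", as_fp16)
print("bf16     :", as_bf16)
print("fp16 err :", abs(float(as_fp16[1]) - 1.001))
print("bf16 err :", abs(float(as_bf16[1]) - 1.001))
```

The large value overflows to `inf` in fp16 but stays finite in bf16, while the value near 1 loses more precision in bf16 than in fp16. That matches what I see (overflow with fp16, noise with bf16), although I have not confirmed which layers are responsible.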