Performance on H100 Slower Than A800 - A Question About FP64 Tensor Core Optimization #1296
Hello Quokka Team, I've been running some tests and noticed that the code runs approximately twice as slowly on an H100 GPU as on an A800. I was wondering whether Quokka is currently optimized to take advantage of the H100's FP64 Tensor Cores, as that could potentially explain the performance difference I'm observing. Could you share any information on this? Thank you for your time and help.
Replies: 3 comments 3 replies
---
Based on the available documentation and code, Quokka does not currently include optimizations that specifically leverage the H100's FP64 Tensor Cores. The build configuration enables general CUDA support but does not set architecture-specific flags or tuning for H100 or A800 GPUs, nor does it include code paths that direct compute kernels to the FP64 Tensor Cores. This likely explains the slower performance you're seeing on H100 compared to A800, since the code isn't taking advantage of the H100's hardware features for double-precision operations. For more on general GPU optimization strategies in Quokka, see the GPU performance guidelines.
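For reference, architecture-specific compilation in a CMake-based CUDA project is typically requested as sketched below. This is a generic illustration, not taken from Quokka's actual build files; the architecture codes are standard CMake/NVCC conventions (80 = Ampere, 90 = Hopper):

```shell
# Generic sketch: build fat binaries targeting both Ampere (A100/A800, sm_80)
# and Hopper (H100, sm_90). These are standard CMAKE_CUDA_ARCHITECTURES codes,
# not options specific to Quokka.
cmake -S . -B build \
  -DCMAKE_CUDA_ARCHITECTURES="80;90" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Without an explicit architecture list, the compiler may fall back to a default target and rely on JIT compilation at load time, which affects startup cost but should not by itself cause a 2x steady-state slowdown.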
---
Just adding to the bot's comments: we have not yet done any performance tuning for the H100 architecture, so I'm not surprised that we aren't getting great performance there. Unfortunately, no H100s are installed yet on any of the big supercomputers where we have been running. We certainly will do this tuning, but we haven't yet because we don't have access to a large-scale installation with the right hardware.
---
Quokka does not use Tensor Cores on any architecture. It would be difficult and time-consuming to write code that uses Tensor Cores in a hydro code, and I am not aware of any hydro code that does. We have never done tuning specific to newer NVIDIA models. We consistently see about 1.7x better performance each generation, which is approximately the ratio of HBM memory bandwidth between generations (the bump from V100 to A100 was slightly higher than this). I regularly run on H100s and see performance about 1.7x better than A100, as expected from the memory bandwidth ratio. I have never run on an A800, since that is a China-specific model, but based on the A800-to-H100 memory bandwidth ratio I would expect the H100 to be faster, so I am quite surprised by this result. I would need to see detailed profiling data to diagnose what's going on.
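The bandwidth-ratio reasoning above can be checked with a quick back-of-envelope calculation. The bandwidth figures below are nominal vendor specs and approximate; the A800 80GB shares the A100 80GB's memory subsystem (only NVLink is reduced), so a bandwidth-bound code should run *faster* on H100, not slower:

```python
# Back-of-envelope estimate: for a memory-bandwidth-bound hydro code, the
# expected generational speedup is roughly the ratio of HBM bandwidths.
# Figures are approximate nominal specs, not measured values.
bandwidth_gb_s = {
    "A100-SXM-80GB": 2039,  # HBM2e
    "A800-SXM-80GB": 2039,  # same memory subsystem as A100; NVLink reduced
    "H100-SXM": 3350,       # HBM3
}

def expected_speedup(new_gpu: str, old_gpu: str) -> float:
    """Predicted speedup assuming perfectly bandwidth-bound kernels."""
    return bandwidth_gb_s[new_gpu] / bandwidth_gb_s[old_gpu]

print(f"H100 vs A800: {expected_speedup('H100-SXM', 'A800-SXM-80GB'):.2f}x expected")
```

A measured 2x *slowdown* against a ~1.6x expected *speedup* points to something other than raw hardware capability, such as kernel launch configuration, JIT fallback, or an interconnect/host-transfer bottleneck, which is why profiling data is needed.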