Performance on H100 Slower Than A800 - A Question About FP64 Tensor Core Optimization #1296
Hello Quokka Team, I've been running some tests and noticed that the code runs approximately twice as slowly on an H100 GPU as on an A800. I was wondering whether Quokka is currently optimized to take advantage of the H100's FP64 Tensor Cores, as that could potentially explain the performance difference I'm observing. Could you share any information on this? Thank you for your time and help.
Replies: 3 comments 3 replies
---
Based on the available documentation and code, Quokka does not currently include optimizations that specifically leverage the H100's FP64 Tensor Cores. The build configuration enables general CUDA support but does not set architecture-specific flags or tuning for H100 or A800 GPUs, nor does it include code paths that direct compute kernels to the FP64 Tensor Cores. This likely explains the slower performance you're seeing on H100 compared to A800, since the code isn't taking advantage of the H100's hardware features for double-precision operations. For more on general GPU optimization strategies in Quokka, see the GPU performance guidelines.
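For reference, architecture-specific compilation in a CMake-based CUDA project is typically requested as sketched below. This is a generic illustration, not taken from Quokka's actual build files; the architecture codes are standard CMake/NVCC conventions (80 = Ampere, 90 = Hopper):

```shell
# Generic sketch: build fat binaries targeting both Ampere (A100/A800, sm_80)
# and Hopper (H100, sm_90). These are standard CMAKE_CUDA_ARCHITECTURES codes,
# not options specific to Quokka.
cmake -S . -B build \
  -DCMAKE_CUDA_ARCHITECTURES="80;90" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Without an explicit architecture list, the compiler may fall back to a default target and rely on JIT compilation at load time, which affects startup cost but should not by itself cause a 2x steady-state slowdown.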
---
Just adding to the bot's comments: we have not yet done any performance tuning for the H100 architecture, so I'm not surprised that we aren't getting great performance there. Unfortunately, no H100s are installed yet on any of the big supercomputers where we have been running. We certainly will do this tuning, but we haven't yet because we don't have access to a large-scale installation with the right hardware.
---
Quokka does not use Tensor Cores on any architecture. It would be difficult and time-consuming to write code that uses Tensor Cores in a hydro code, and I am not aware of any hydro code that does. We have never done tuning specific to newer NVIDIA models. We consistently see about 1.7x better performance each generation, which is approximately the ratio of HBM memory bandwidth between generations (the bump from V100 to A100 was slightly higher than this). I regularly run on H100s and see performance about 1.7x better than A100, as expected from the memory bandwidth ratio. I have never run on an A800, since that is a China-specific model, but based on the A800-to-H100 memory bandwidth ratio I would expect the H100 to be faster, so I am quite surprised by this result. I would need to see detailed profiling data to diagnose what's going on.
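The bandwidth-ratio reasoning above can be checked with a quick back-of-envelope calculation. The bandwidth figures below are nominal vendor specs and approximate; the A800 80GB shares the A100 80GB's memory subsystem (only NVLink is reduced), so a bandwidth-bound code should run *faster* on H100, not slower:

```python
# Back-of-envelope estimate: for a memory-bandwidth-bound hydro code, the
# expected generational speedup is roughly the ratio of HBM bandwidths.
# Figures are approximate nominal specs, not measured values.
bandwidth_gb_s = {
    "A100-SXM-80GB": 2039,  # HBM2e
    "A800-SXM-80GB": 2039,  # same memory subsystem as A100; NVLink reduced
    "H100-SXM": 3350,       # HBM3
}

def expected_speedup(new_gpu: str, old_gpu: str) -> float:
    """Predicted speedup assuming perfectly bandwidth-bound kernels."""
    return bandwidth_gb_s[new_gpu] / bandwidth_gb_s[old_gpu]

print(f"H100 vs A800: {expected_speedup('H100-SXM', 'A800-SXM-80GB'):.2f}x expected")
```

A measured 2x *slowdown* against a ~1.6x expected *speedup* points to something other than raw hardware capability, such as kernel launch configuration, JIT fallback, or an interconnect/host-transfer bottleneck, which is why profiling data is needed.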