Description
I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration.
The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). CP2K was compiled both with and without the -D__DBCSR_ACC flag.
Timings are in seconds, as per the CP2K output file:
|  | Total | dbcsr_multiply_generic |
|---|---|---|
| with `-D__DBCSR_ACC` | 891.327 | 294.900 |
| without `-D__DBCSR_ACC` | 608.230 | 18.406 |
With GPU acceleration enabled, the time spent in DBCSR increases by more than 15x. Profiling revealed that MPI communication is the main culprit.
I would appreciate any suggestions on how to solve this issue. What I have tried so far:
- Pretty much all keywords in the `&GLOBAL%DBCSR` input section of CP2K (see the sketch after this list): no noticeable difference.
- Mapping all DBCSR calls to DBM: this helps for this benchmark, but it is still slower than DBCSR on CPUs, and it slows down the `benchmarks/QS/H2O-XXX.inp` tests.
- Tuning new DBCSR kernels for the H100 GPU architecture (I am currently using kernels tuned for the A100): no noticeable difference.
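For reference, this is the kind of `&GLOBAL%DBCSR` block I experimented with. The keyword names below are only examples of the settings I varied, written from memory; they should be checked against the CP2K input reference for the version in use, and the values shown are illustrative, not a recommendation:

```
&GLOBAL
  ! DBCSR settings exposed through the CP2K input; values are illustrative only
  &DBCSR
    MM_DRIVER          XSMM    ! backend used for the small-block multiplications
    MM_STACK_SIZE      30000   ! number of block multiplications batched per stack
    USE_MPI_ALLOCATOR  .FALSE. ! toggle the MPI-based memory allocator
  &END DBCSR
&END GLOBAL
```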
Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be to allow disabling DBCSR acceleration at run time via a keyword in the input file (see the sketch below).
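A minimal sketch of what such a run-time switch could look like, assuming a new, currently non-existent logical keyword (here called `ACCELERATION`) inside the existing `&GLOBAL%DBCSR` section:

```
&GLOBAL
  &DBCSR
    ! Hypothetical keyword, not implemented today: fall back to the CPU code path
    ! for DBCSR multiplications even when CP2K is built with -D__DBCSR_ACC.
    ACCELERATION  .FALSE.
  &END DBCSR
&END GLOBAL
```

With such a switch, GPU acceleration could stay enabled for the inputs that benefit from it while being turned off per run for large-block cases like the RPA benchmark above.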