Description
I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration.
The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). CP2K was compiled both with and without the -D__DBCSR_ACC flag.
Timings are in seconds, as per the CP2K output file:
|  | Total | dbcsr_multiply_generic |
|---|---|---|
| with `-D__DBCSR_ACC` | 891.327 | 294.900 |
| without `-D__DBCSR_ACC` | 608.230 | 18.406 |
With GPU acceleration enabled, the time spent in DBCSR increases by more than 15x. Profiling revealed that MPI communication is the main culprit.
I would appreciate any suggestions on how to solve this issue. What I have tried so far:
- Pretty much all keywords in the `&GLOBAL%DBCSR` input section of CP2K (see the sketch after this list): no noticeable difference.
- Mapping all DBCSR calls to DBM: this helps for this benchmark, but it is still slower than DBCSR on CPUs, and it slows down the `benchmarks/QS/H2O-XXX.inp` tests.
- Tuning new DBCSR kernels for the H100 GPU architecture (I am currently using kernels tuned for the A100): no noticeable difference.
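For reference, this is the kind of `&GLOBAL%DBCSR` block I experimented with. The keyword names below are only examples of the settings I varied, written from memory; they should be checked against the CP2K input reference for the version in use, and the values shown are illustrative, not a recommendation:

```
&GLOBAL
  ! DBCSR settings exposed through the CP2K input; values are illustrative only
  &DBCSR
    MM_DRIVER          XSMM    ! backend used for the small-block multiplications
    MM_STACK_SIZE      30000   ! number of block multiplications batched per stack
    USE_MPI_ALLOCATOR  .FALSE. ! toggle the MPI-based memory allocator
  &END DBCSR
&END GLOBAL
```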
Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be to allow disabling DBCSR acceleration at run time via a keyword in the input file (see the sketch below).
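A minimal sketch of what such a run-time switch could look like, assuming a new, currently non-existent logical keyword (here called `ACCELERATION`) inside the existing `&GLOBAL%DBCSR` section:

```
&GLOBAL
  &DBCSR
    ! Hypothetical keyword, not implemented today: fall back to the CPU code path
    ! for DBCSR multiplications even when CP2K is built with -D__DBCSR_ACC.
    ACCELERATION  .FALSE.
  &END DBCSR
&END GLOBAL
```

With such a switch, GPU acceleration could stay enabled for the inputs that benefit from it while being turned off per run for large-block cases like the RPA benchmark above.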