DBCSR performs very poorly on GH200 when there are large blocks #795

@abussy

Description

I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration.
The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file, on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). CP2K was compiled once with and once without the -D__DBCSR_ACC flag.

Timings are in seconds, as per the CP2K output file:

                          Total     dbcsr_multiply_generic
with -D__DBCSR_ACC        891.327   294.900
without -D__DBCSR_ACC     608.230    18.406

With GPU acceleration enabled, the time spent in dbcsr_multiply_generic increases by more than 15x (294.9 s vs. 18.4 s). Profiling revealed that MPI communication is the main culprit.

I would appreciate any suggestions on how to solve this issue. What I have tried so far:

  • Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K (see the sketch after this list): no noticeable difference.
  • Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.
  • Tuning new DBCSR kernels for the H100 GPU architecture (I am currently using kernels tuned for the A100): no noticeable difference.
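
For reference, a minimal sketch of where these keywords live in a CP2K input file. MM_DRIVER and MM_STACK_SIZE are real &DBCSR keywords, but the values shown are purely illustrative; none of the settings I varied changed the GH200 timings noticeably:

    &GLOBAL
      &DBCSR
        ! Illustrative values only; I varied these (and the other
        ! &DBCSR keywords) without any noticeable effect.
        MM_DRIVER      XSMM
        MM_STACK_SIZE  30000
      &END DBCSR
    &END GLOBAL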

Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be to allow disabling DBCSR acceleration at run time, via a keyword in the input file (see the sketch below).
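
Such an input-file switch could look like the following. The keyword name is hypothetical and does not currently exist; it only illustrates the proposal:

    &GLOBAL
      &DBCSR
        ! Hypothetical keyword (does not currently exist): fall back to
        ! the CPU code path even in a -D__DBCSR_ACC build.
        USE_ACCELERATION  .FALSE.
      &END DBCSR
    &END GLOBAL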
