
Out of memory issue on Polaris due to CUDA pinned memory #5557

@ye-luo

Description


Describe the bug
Runs stopped with the following error:

cudaAssert: cudaErrorMemoryAllocation out of memory, file /home/yeluo/opt/qmcpack/src/Platforms/CUDA/MemManageCUDA.hpp, line 74

when calling cudaHostRegister, even though host memory usage was well below the available DDR capacity.

To Reproduce
Steps to reproduce the behavior:

  1. Any code release with DiracDeterminantBatched.
  2. NiO performance benchmark a64 with 2048 walkers per rank.
  3. Running 2-4 ranks per node fails; a single MPI rank works.
  4. Each MPI rank sees all 4 GPUs.

Expected behavior
The simulation should run with 1-4 ranks.

System:
ALCF Polaris

Additional context
I injected counters to track the peak number of registered host-memory (pinned) segments per rank.
1 MPI rank run completed with max at ~34k
2 MPI rank run hit error at max ~32k per rank
3 MPI rank run hit error at max ~21k per rank
4 MPI rank run hit error at max ~16k per rank

a. There is a cap around the magic number 65536. My guess is vm.max_map_count=65530. The per-rank maxima are consistent with this: 2x~32k, 3x~21k, and 4x~16k all put the node total near ~64k, as if the ranks on a node were sharing a single mapping budget.
b. It seems related to MPI (Cray MPICH), likely due to the notorious XPMEM.
c. As a workaround, exposing one GPU per rank made all cases run.
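Point (a) can be checked directly on a compute node. The sysctl and /proc paths below are standard Linux; the local-rank environment variable in the workaround sketch is launcher-dependent and shown here only as an assumption (PMI_LOCAL_RANK is common with Cray MPICH, but your launcher may use a different name):

```shell
# Per-process memory-map limit (defaults to 65530 on most Linux systems):
cat /proc/sys/vm/max_map_count

# Count the current number of mappings for a running process
# (PID 12345 is a placeholder):
# wc -l < /proc/12345/maps

# Workaround from (c): expose one GPU per rank, e.g. in a wrapper script
# launched by mpiexec. The local-rank variable name is an assumption:
# export CUDA_VISIBLE_DEVICES=$PMI_LOCAL_RANK
```

Counting the lines of /proc/<pid>/maps while a multi-rank job ramps up is a quick way to confirm whether the mapping count, rather than the byte count, is what hits the ceiling.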

Long-term solution on our side: we need to do bulk allocation/registration and hand out views, instead of allocating and registering per walker.
