Description
Describe the bug
Runs stopped with the following error when calling cudaHostRegister:
cudaAssert: cudaErrorMemoryAllocation out of memory, file /home/yeluo/opt/qmcpack/src/Platforms/CUDA/MemManageCUDA.hpp, line 74
However, host memory usage is well below the available DDR capacity.
To Reproduce
Steps to reproduce the behavior:
- all code releases with DiracDeterminantBatched
- NiO performance benchmark a64 with 2048 walkers per rank.
- Running 2-4 ranks per node fails; running 1 MPI rank per node works.
- Each MPI rank sees all 4 GPUs.
Expected behavior
The simulation should run with 1-4 ranks.
System:
ALCF Polaris
Additional context
I injected counters to track the peak number of registered host memory (pinned memory) segments:
- 1 MPI rank: run completed, peak at ~34k segments
- 2 MPI ranks: hit the error at peak ~32k per rank
- 3 MPI ranks: hit the error at peak ~21k per rank
- 4 MPI ranks: hit the error at peak ~16k per rank
a. There is a cap around the magic number 65536; my guess is vm.max_map_count=65530. The per-rank peaks are consistent with a per-node cap: 2 x ~32k, 3 x ~21k, and 4 x ~16k each total roughly 64-65k.
b. It seems MPI (Cray MPICH) related, likely due to the notorious XPMEM.
c. Workaround: exposing only one GPU per rank made all cases run.
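If the vm.max_map_count hypothesis is right, the cap can be checked from the shell, and the workaround in (c) can be scripted by binding one GPU per rank. A sketch only; the local-rank environment variable is launcher-specific (PMI_LOCAL_RANK is an assumption here, not confirmed for this setup):

```shell
# Check the per-process limit on memory mappings (commonly 65530 by default);
# each pinned-memory registration consumes map entries.
cat /proc/sys/vm/max_map_count

# Workaround (c): expose a single GPU per rank. The local-rank variable
# depends on the launcher; PMI_LOCAL_RANK is used here as a placeholder.
LOCAL_RANK=${PMI_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$((LOCAL_RANK % 4))
echo "rank $LOCAL_RANK -> GPU $CUDA_VISIBLE_DEVICES"
```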
Long-term solution on our side: we need to do bulk allocation/registration and use views instead of doing it per walker.