Hello!
We are using AMReX for our plasma physics particle code GEMPICX (not public yet, I am afraid). While trying to simulate a large spatial problem on our current GPU cluster, we ran into memory problems with AMReX's Redistribute function: after a couple of time steps, the code aborts with the following message:
amrex::Abort::0::Out of gpu memory. Free: 4435902464 Asked: 8294402048 !!!
This is a prohibitively (and, as far as we can tell, unnecessarily) large amount of memory to request for our simulation.
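For scale, here is a quick conversion of the byte counts from the abort message into GiB (a plain sanity check, not AMReX code):

```python
# Byte counts taken verbatim from the AMReX abort message.
free_bytes = 4435902464
asked_bytes = 8294402048

GiB = 1024 ** 3
free_gib = free_bytes / GiB    # memory reported as still free
asked_gib = asked_bytes / GiB  # size of the single failed allocation

print(f"free:  {free_gib:.2f} GiB")   # ~4.13 GiB
print(f"asked: {asked_gib:.2f} GiB")  # ~7.72 GiB
```

So the single allocation that Redistribute triggers is roughly twice what is reported as free at that point.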
The cluster that we simulate this problem on is the viper GPU cluster from MPCDF:
GPU nodes:
- 228 GPU nodes, 456 APUs
- Processor type: AMD Instinct MI300A APU
- Main memory (HBM3) per APU: 128 GB
- 24 CPU cores per APU
- 228 GPU compute units per APU
https://docs.mpcdf.mpg.de/doc/computing/viper-gpu-user-guide.html
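For completeness, the failing allocation goes through AMReX's GPU memory arena, whose behavior can be adjusted via runtime parameters. Below is a sketch of the arena-related inputs-file parameters we are aware of from the AMReX documentation (names per the docs; defaults vary by AMReX version, and whether tuning them helps here is exactly our question):

```
# Arena-related AMReX runtime parameters (sketch only)
amrex.the_arena_init_size = 8589934592   # initial size of The_Arena in bytes
amrex.abort_on_out_of_gpu_memory = 1     # abort instead of falling back when device memory runs out
amrex.the_arena_is_managed = 0           # use device rather than managed memory for The_Arena
```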
The following file is our main file, reduced to a particle push that uses our GEMPICX particle container and calls Redistribute in each step:
Running this file on viper GPU, we get the following output and error files:
Using the input and submission file
Because of the dependencies of the GEMPICX particle container, the following files are needed to compile the code:
GEMPIC_ComputationalDomain.cpp
GEMPIC_ComputationlDomain.H.txt
Side note: since viper GPU provides AMD GPUs, we also did some testing on the classical NVIDIA GPU cluster raven from MPCDF:
192 GPU-accelerated compute nodes, 768 GPUs, 30 TB HBM2, 14.6 PFlop/s theoretical peak performance (FP64).
https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html
Note that the following information was obtained not with the Reduced.cpp file but with our original main file:
Thank you for taking a look into this!