Memory using Redistribute on GPUs #4715

@martinapr

Hello!
We are using AMReX for our plasma physics particle code GEMPICX (not yet public, I am afraid). While trying to simulate a large spatial problem on our current GPU cluster, we ran into memory problems with AMReX's Redistribute function: after a couple of time steps, the code aborts with the following message:

amrex::Abort::0::Out of gpu memory. Free: 4435902464 Asked: 8294402048 !!!

This is a prohibitively (and unnecessarily) large amount of memory to request in our simulation.

The cluster on which we run this problem is the viper GPU cluster at MPCDF:

GPU nodes:
- 228 GPU nodes, 456 APUs
- Processor type: AMD Instinct MI300A APU
- Main memory (HBM3) per APU: 128 GB
- 24 CPU cores per APU
- 228 GPU compute units per APU

https://docs.mpcdf.mpg.de/doc/computing/viper-gpu-user-guide.html

The following file is our main file, reduced to a particle push that includes our GEMPICX particle container and a Redistribute call in each step:

Reduced.cpp

Running this file on viper GPU, we get the following output and error files:

job_viper.out.txt

Backtrace_viper.0.txt

We used the following input and submission files:

astro.input.txt

run_viper.sh

Because the reduced file still depends on the GEMPICX particle container, the following files are needed to compile the code:

GEMPIC_ComputationalDomain.cpp

GEMPIC_ComputationlDomain.H.txt

GEMPIC_Config.H.txt

GEMPIC_Parameters.H.txt

GEMPIC_Parameters.cpp

GEMPIC_ParticleGroups.H.txt

GEMPIC_Verbosity.H.txt

GEMPIC_Verbosity.cpp

Side note: since viper GPU provides AMD GPUs, we also ran some tests on raven, the classical Nvidia GPU cluster at MPCDF:

192 GPU-accelerated compute nodes, 768 GPUs, 30 TB HBM2, 14.6 PFlop/s theoretical peak performance (FP64).

https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html

Note that the following information was obtained not with the Reduced.cpp file but with our original main file:

raven.err.txt

run_raven.sh

Backtrace_raven.0.txt

Thank you for taking a look at this!
