Hello!
We are using AMReX for our plasma physics particle code GEMPICX (not public yet, I am afraid). While trying to simulate a large spatial problem on our current GPU cluster, we ran into memory problems with AMReX's Redistribute function: after a couple of time steps, the code aborts with the following message:
amrex::Abort::0::Out of gpu memory. Free: 4435902464 Asked: 8294402048 !!!
This is a prohibitively (and, as far as we can tell, unnecessarily) large amount of memory to request for our simulation.
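For scale, here is a quick conversion of the byte counts from the abort message into GiB (a plain sanity check, not AMReX code):

```python
# Byte counts taken verbatim from the AMReX abort message.
free_bytes = 4435902464
asked_bytes = 8294402048

GiB = 1024 ** 3
free_gib = free_bytes / GiB    # memory reported as still free
asked_gib = asked_bytes / GiB  # size of the single failed allocation

print(f"free:  {free_gib:.2f} GiB")   # ~4.13 GiB
print(f"asked: {asked_gib:.2f} GiB")  # ~7.72 GiB
```

So the single allocation that Redistribute triggers is roughly twice what is reported as free at that point.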
The cluster that we simulate this problem on is the viper GPU cluster from MPCDF:
GPU nodes:
- 228 GPU nodes, 456 APUs
- Processor type: AMD Instinct MI300A APU
- Main memory (HBM3) per APU: 128 GB
- 24 CPU cores per APU
- 228 GPU compute units per APU
https://docs.mpcdf.mpg.de/doc/computing/viper-gpu-user-guide.html
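For completeness, the failing allocation goes through AMReX's GPU memory arena, whose behavior can be adjusted via runtime parameters. Below is a sketch of the arena-related inputs-file parameters we are aware of from the AMReX documentation (names per the docs; defaults vary by AMReX version, and whether tuning them helps here is exactly our question):

```
# Arena-related AMReX runtime parameters (sketch only)
amrex.the_arena_init_size = 8589934592   # initial size of The_Arena in bytes
amrex.abort_on_out_of_gpu_memory = 1     # abort instead of falling back when device memory runs out
amrex.the_arena_is_managed = 0           # use device rather than managed memory for The_Arena
```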
The following file is our main file, reduced to a particle push that uses our GEMPICX particle container and calls Redistribute in each step:
Running this file on viper GPU, we get the following output and error files:
Using the input and submission file
Because of the dependencies of the GEMPICX particle container, the following files are needed to compile the code:
GEMPIC_ComputationalDomain.cpp
GEMPIC_ComputationlDomain.H.txt
Side note: since viper GPU provides AMD GPUs, we also did some testing on the classical NVIDIA GPU cluster raven from MPCDF:
192 GPU-accelerated compute nodes, 768 GPUs, 30 TB HBM2, 14.6 PFlop/s theoretical peak performance (FP64).
https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html
Note that the following information was obtained not with the Reduced.cpp file but with our original main file:
Thank you for taking a look into this!