Out of memory when compiling on versions after commit bf905144acde28 #5742

@archermarx

Description

I'm installing WarpX for the first time in a while and running into nvcc out-of-memory errors during the build. Specifically:

nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
gmake[2]: *** [CMakeFiles/lib_1d.dir/build.make:1547: CMakeFiles/lib_1d.dir/Source/FieldSolver/FiniteDifferenceSolver/MacroscopicProperties/MacroscopicProperties.cpp.o] Error 9
gmake[1]: *** [CMakeFiles/Makefile2:3754: CMakeFiles/lib_1d.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....

SLURM reports the job failure state as "OUT_OF_MEMORY".
This occurs even when I request a large amount of memory (>256 GB for a single process).
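To compare the requested memory with what the job actually used, Slurm's accounting data can be queried after the failure. The sketch below assumes accounting is enabled on the cluster. `job_mem_report` is a hypothetical helper name, and `MaxRSS` is sampled periodically, so it may under-report a short spike from `ptxas`.

```bash
#!/bin/bash
# Hypothetical helper: summarize state and memory for a finished Slurm job.
job_mem_report() {
    local jobid="$1"
    # ReqMem = memory requested, MaxRSS = peak resident memory recorded per step.
    sacct -j "${jobid}" --format=JobID,State,ReqMem,MaxRSS,Elapsed
}
```

Calling `job_mem_report <jobid>` with the ID of the failed install job prints one row per job step.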

If I remove the 1D version from the install, then the 3D version fails at the Python install step with the following error:

[ 76%] Built target pyamrex_pip_install
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
gmake[3]: *** [CMakeFiles/lib_3d.dir/build.make:2087: CMakeFiles/lib_3d.dir/Source/Particles/PhysicalParticleContainer.cpp.o] Error 9
gmake[2]: *** [CMakeFiles/Makefile2:3880: CMakeFiles/lib_3d.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:4055: CMakeFiles/pip_install.dir/rule] Error 2
gmake: *** [Makefile:699: pip_install] Error 2
slurmstepd: error: Detected 2 oom_kill events in StepId=24637574.batch. Some of the step tasks have been OOM Killed.

2D completes OK.
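To see at a glance which object files were being compiled when the process was killed, a saved build log can be filtered for make's "Error 9" lines. This is a minimal sketch: `killed_objects` is a hypothetical helper, and it assumes the log uses the same gmake message format as the excerpts above.

```bash
#!/bin/bash
# Hypothetical helper: list object files whose compile step ended with
# make's "Error 9" in a saved build log.
killed_objects() {
    local logfile="$1"
    # Matches: gmake[2]: *** [<makefile>:<line>: <object>.o] Error 9
    sed -n 's/^.*\*\*\* \[[^]]*: \(.*\.o\)] Error 9$/\1/p' "${logfile}"
}
```

Running it on a log captured with, for example, `cmake --build build 2>&1 | tee build.log` (the file name is an assumption) prints one object path per killed compile step.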

A short git bisect later, I found that this is the first bad commit (PR #5418):

commit bf905144acde28548a89ed1d415ace70e4d7d008
Author: Edoardo Zoni <[email protected]>
Date:   Tue Oct 29 10:50:14 2024 -0700

    AMReX/pyAMReX/PICSAR: weekly update (#5418)

    - Weekly update to latest AMReX:
    ```console
    ./Tools/Release/updateAMReX.py
    ```
    - Weekly update to latest pyAMReX:
    ```console
    ./Tools/Release/updatepyAMReX.py
    ```
    - Weekly update to latest PICSAR (no changes):
    ```console
    ./Tools/Release/updatePICSAR.py
    ```
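A bisect like the one that located this commit can be automated with `git bisect run`. This is a minimal sketch, not the exact procedure used here. `bisect_oom_build` is a hypothetical helper, the known-good revision must be supplied by the caller (this report does not name one), and reducing the build to a single 3D CUDA configuration is an assumption.

```bash
#!/bin/bash
# Hypothetical helper: bisect between a known-good and a known-bad revision,
# treating a failed build as "bad".
bisect_oom_build() {
    local good_ref="$1" bad_ref="$2"
    git bisect start "${bad_ref}" "${good_ref}"
    # git bisect run marks a commit good on exit status 0 and bad on 1-127 (except 125).
    # Assumption: a clean 3D CUDA build is enough to reproduce the failure.
    git bisect run sh -c \
        'rm -rf build && cmake -S . -B build -DWarpX_DIMS=3 -DWarpX_COMPUTE=CUDA && cmake --build build -j 1'
    # Return to the original checkout when done.
    git bisect reset
}
```

The first bad commit is printed by `git bisect run` before the final reset.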

My SLURM install script is below:

#!/bin/bash
#SBATCH --job-name=warpx-install
#SBATCH --account=###
#SBATCH --partition=###
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=20
#SBATCH --mem=256g
#SBATCH --time=4:00:00
#SBATCH --mail-type=END,FAIL

# Load required modules
source ~/warpx.profile

# Uncomment to remove the old build directory
# rm -rf build
# Remove stale wheels (-f: no error if none exist)
rm -f *.whl

# activate venv
source ~/sw/lighthouse/h100/venvs/warpx-h100/bin/activate

# Build warpx
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DWarpX_LIB=ON \
    -DWarpX_APP=ON \
    -DWarpX_MPI=ON \
    -DWarpX_DIMS="2;3" \
    -DWarpX_PYTHON=ON \
    -DWarpX_PRECISION=DOUBLE \
    -DWarpX_PARTICLE_PRECISION=SINGLE \
    -DWarpX_EB=OFF \
    -DWarpX_QED=OFF \
    -DWarpX_COMPUTE=CUDA

parallel=20

cmake --build build -j ${parallel}
cmake --build build --target pip_install -j ${parallel}

I've worked through this with @ax3l on Slack and tried the following, to no avail:

  1. Reducing the number of processes to one
  2. Using --exclusive to reserve a whole node

System information

  • Operating system (name and version):
    • Linux: e.g., Red Hat Enterprise Linux 8.8 (Ootpa)
  • Version of WarpX: several
  • Installation method:
    • From source with CMake
  • Computational resources:
    • CPU: Intel(R) Xeon(R) CPU E5-2680 v3
    • GPU: NVIDIA H100

If applicable, please add any additional information about your software environment:

  • CMake: e.g., 3.24.0
  • C++ compiler: gcc (GCC) 10.3.0 with nvcc v12.1.105
  • Python: e.g., CPython 3.11.5
