Closed
Labels: backend: cuda, component: Python, install
Description
I'm installing WarpX for the first time in a while and running into nvcc out-of-memory errors during the build. Specifically:
```console
nvcc error : 'ptxas' died due to signal 9 (Kill signal)
gmake[2]: *** [CMakeFiles/lib_1d.dir/build.make:1547: CMakeFiles/lib_1d.dir/Source/FieldSolver/FiniteDifferenceSolver/MacroscopicProperties/MacroscopicProperties.cpp.o] Error 9
gmake[1]: *** [CMakeFiles/Makefile2:3754: CMakeFiles/lib_1d.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
```
The failure reason SLURM reports for the job is "OUT_OF_MEMORY". This occurs even when I request quite a large amount of memory (>256 GB for a single process).
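For reference, SLURM's accounting can confirm the reported peak memory; a minimal check, using the job ID from the slurmstepd log below:

```console
# Compare peak resident memory (MaxRSS) against the requested memory (ReqMem)
sacct -j 24637574 --format=JobID,State,MaxRSS,ReqMem
```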
If I remove the 1D version from the install, the 3D version then fails at the Python install step with the following error:
```console
[ 76%] Built target pyamrex_pip_install
nvcc error : 'ptxas' died due to signal 9 (Kill signal)
gmake[3]: *** [CMakeFiles/lib_3d.dir/build.make:2087: CMakeFiles/lib_3d.dir/Source/Particles/PhysicalParticleContainer.cpp.o] Error 9
gmake[2]: *** [CMakeFiles/Makefile2:3880: CMakeFiles/lib_3d.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:4055: CMakeFiles/pip_install.dir/rule] Error 2
gmake: *** [Makefile:699: pip_install] Error 2
slurmstepd: error: Detected 2 oom_kill events in StepId=24637574.batch. Some of the step tasks have been OOM Killed.
```
The 2D build completes OK.
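For anyone reproducing this, rebuilding only the affected library target should be enough to trigger the failure; a sketch using the lib_1d target name from the log above (lib_3d behaves analogously):

```console
# Rebuild just the 1D library target that dies in ptxas
cmake --build build --target lib_1d -j 1
```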
A short git bisect later, I find that this is the first bad commit (PR #5418):
commit bf905144acde28548a89ed1d415ace70e4d7d008
Author: Edoardo Zoni <[email protected]>
Date: Tue Oct 29 10:50:14 2024 -0700
AMReX/pyAMReX/PICSAR: weekly update (#5418)
- Weekly update to latest AMReX:
```console
./Tools/Release/updateAMReX.py
```
- Weekly update to latest pyAMReX:
```console
./Tools/Release/updatepyAMReX.py
```
- Weekly update to latest PICSAR (no changes):
```console
./Tools/Release/updatePICSAR.py
```
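For completeness, the bisect ran along these lines (a sketch; `<last-known-good>` is a placeholder for an older commit that still built fine):

```console
git bisect start
git bisect bad HEAD                 # current tip fails to build
git bisect good <last-known-good>   # placeholder: a commit that built OK
# at each commit bisect checks out, rebuild and report the result
cmake --build build -j 20 && git bisect good || git bisect bad
```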
My SLURM install script is below:

```bash
#!/bin/bash
#SBATCH --job-name=warpx-install
#SBATCH --account=###
#SBATCH --partition=###
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=20
#SBATCH --mem=256g
#SBATCH --time=4:00:00
#SBATCH --mail-type=END,FAIL
# Load required modules
source ~/warpx.profile
# uncomment to uninstall old versions
#rm -rf build
rm -r *.whl
# activate venv
source ~/sw/lighthouse/h100/venvs/warpx-h100/bin/activate
# Build warpx
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DWarpX_LIB=ON \
-DWarpX_APP=ON \
-DWarpX_MPI=ON \
-DWarpX_DIMS="2;3" \
-DWarpX_PYTHON=ON \
-DWarpX_PRECISION=DOUBLE \
-DWarpX_PARTICLE_PRECISION=SINGLE \
-DWarpX_EB=OFF \
-DWarpX_QED=OFF \
-DWarpX_COMPUTE=CUDA
parallel=20
cmake --build build -j ${parallel}
cmake --build build --target pip_install -j ${parallel}
```
I've worked through this with @ax3l on Slack and tried the following, to no avail:
- Reducing the number of build processes to one (see the sketch after this list)
- Using --exclusive to reserve a whole node
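Concretely, reducing to a single process meant rebuilding serially, so only one nvcc/ptxas instance is resident at a time, roughly:

```console
# Serial build: a single compiler job, hence a single ptxas process
cmake --build build -j 1
cmake --build build --target pip_install -j 1
```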
System information
- Operating system (name and version):
  - Linux: Red Hat Enterprise Linux 8.8 (Ootpa)
- Version of WarpX: several
- Installation method:
  - From source with CMake
- Computational resources:
  - CPU: Intel(R) Xeon(R) CPU E5-2680 v3
  - GPU: NVIDIA H100

If applicable, please add any additional information about your software environment:
- CMake: e.g., 3.24.0
- C++ compiler: gcc (GCC) 10.3.0 with nvcc v12.1.105
- Python: e.g., CPython 3.11.5