Aragog uses the serial CVODE binding and never opens an MPI handle,
but shared HPC modules can introduce MPI libraries into the shell
that then collide with import-time MPI calls inside transitive
dependencies (mpi4py inside netCDF4 / scikits-odes-sundials).
Adds a new "Running on HPC clusters" section to docs/How-to/installation.md
covering the three failure modes that show up in practice:
* `MPI_Comm_dup` before `MPI_INIT` — the symptom Harrison hit. Fix: `module unload openmpi` (or `module purge`) before `aragog run`, plus an mpi4py reference for further reading.
* Mismatched conda vs system MPI — `netCDF4` / SUNDIALS fail to load `libmpi`. Fix: stick to the conda env's own `libmpi` (`nompi` builds) and clear `LD_LIBRARY_PATH` leaks via `module purge`.
* `mpiexec` / `srun` wrappers — Aragog must not be launched through them. Fix: call `aragog run` directly; parallelise across SLURM array tasks for ensembles.
No code change.
!!! info "Fallbacks for development without CVODE or JAX"

    If `scikits-odes-sundials` is not importable, Aragog issues a warning and falls back to SciPy's `Radau` or `BDF` solvers. If `jax` is not importable with `use_jax_jacobian = true`, it falls back to CVODE's finite-difference Jacobian, which is slower and noisier near the rheological transition. Both fallbacks are sufficient for short standalone tests but are not suitable for production-tolerance multi-Myr runs.
## Running on HPC clusters
Aragog uses the **serial** CVODE binding and never calls `MPI_Init` or any other MPI primitive. The solver is single-process; parallelism in PROTEUS-coupled runs is exposed through ensembles of independent runs (e.g. one SLURM job per planet, or `pytest -n auto` for the local test suite), not through MPI within a single integration.
This deliberate choice surfaces three failure modes on shared HPC systems where MPI modules are loaded by default. None of them are bugs in Aragog; all are environment-level conflicts that the local module stacks can introduce.
### `MPI_Comm_dup` before `MPI_INIT`
```text
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
```
Symptom: `aragog run …` aborts immediately with the message above. Cause: an OpenMPI module was loaded into the shell (often by a default `module load` block in `~/.bashrc` or by the cluster's login profile), and a transitive dependency of the Python stack (typically `mpi4py` imported lazily inside `netCDF4` or `scikits-odes-sundials`) tries to duplicate the world communicator before the user code calls `MPI_Init`. SUNDIALS' `mpi4py`-based wrapper does this on import in some builds.
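To diagnose this without risking the abort itself, you can check which MPI-adjacent packages are present in the environment *without* importing them. The sketch below is a diagnostic aid, not part of Aragog; it relies on `importlib.util.find_spec`, which locates a top-level module on disk without executing its import-time code:

```python
import importlib.util


def mpi_import_report(modules=("mpi4py", "netCDF4", "scikits_odes_sundials")):
    """Return {module_name: is_installed} without importing anything.

    find_spec() only locates a top-level module; it does not run the
    module's code, so probing mpi4py here cannot itself trigger any
    import-time MPI call.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in mpi_import_report().items():
        print(f"{name}: {'installed' if present else 'not installed'}")
```

If `mpi4py` shows up as installed and the abort occurs, the module-unload fix below applies.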
Fix: unload the MPI module before launching Aragog:

```shell
module unload openmpi   # or: module purge
aragog run config.toml --eos-dir "$ARAGOG_TEST_EOS_DIR"
```
For SLURM job scripts, prefer the same `module purge` + targeted load pattern at the top of the batch file; do not let the cluster's site-default `module load openmpi` propagate into the job environment. Reference for the underlying mpi4py initialisation behaviour: [mpi4py docs (3.0.1 PDF)](https://mpi4py.readthedocs.io/_/downloads/en/3.0.1/pdf/).
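A minimal batch-script skeleton following that pattern might look like the following (the module and environment names are placeholders for whatever your site provides):

```shell
#!/bin/bash
#SBATCH --job-name=aragog
#SBATCH --time=04:00:00

# Start from a clean module environment so no site-default MPI leaks in.
module purge
module load miniconda3        # placeholder: your site's conda module

conda activate aragog-env     # placeholder: your env name
aragog run config.toml
```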
### Mismatched conda + system MPI
Symptom: `import netCDF4` or `import scikits_odes_sundials.cvode` raises a dynamic-loader error mentioning a missing `libmpi.so.<N>` or version mismatch. Cause: the conda environment built `netCDF4` against `libmpi` from `conda-forge`, but `LD_LIBRARY_PATH` points at the system OpenMPI install loaded by `module load`.
Fix: keep the conda env self-contained. Either install `nompi` builds (`conda install -c conda-forge "netcdf4=*=nompi*"`) or run with the conda env's own `libmpi` first on the path (`module purge` typically restores this).
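To check whether a module-loaded MPI path is shadowing the conda env's libraries, you can inspect `LD_LIBRARY_PATH` directly. This is a diagnostic one-liner, not an Aragog command; the paths it would print depend on your site:

```shell
# List LD_LIBRARY_PATH entries that mention MPI; a system OpenMPI path
# appearing before the conda env's lib directory can shadow its libmpi.
echo "${LD_LIBRARY_PATH:-}" | tr ':' '\n' | grep -i mpi || echo "no MPI entries on LD_LIBRARY_PATH"
```

If a system OpenMPI directory shows up, `module purge` (or unsetting `LD_LIBRARY_PATH` in the job script) usually removes it.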
### `mpiexec` / `srun` wrappers
Aragog must NOT be launched through `mpiexec`, `mpirun`, or `srun --mpi=pmi2`. Doing so spawns multiple ranks each running the same serial integration, which is at best wasteful (every rank computes the same trajectory) and at worst gives nondeterministic NetCDF output if multiple ranks try to write to the same file. The job script should call `aragog run …` (or PROTEUS's `proteus start …`) directly.
If you need ensembles, parallelise across array tasks (`#SBATCH --array=0-99`) and let each task run a single serial Aragog integration.
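A sketch of that array pattern, assuming one config file per task named by index (the `configs/planet_<N>.toml` naming scheme is hypothetical):

```shell
#!/bin/bash
#SBATCH --job-name=aragog-ensemble
#SBATCH --array=0-99

module purge
# Each array task runs one independent, serial Aragog integration;
# SLURM supplies SLURM_ARRAY_TASK_ID to select the config.
aragog run "configs/planet_${SLURM_ARRAY_TASK_ID}.toml"
```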
## Equation-of-state tables
The entropy solver requires a directory of pressure-entropy (P-S) lookup tables in the SPIDER format. The files needed and their format are documented in [Reference: data](../Reference/data.md).