Restarting with mpiexec fails

I'm using Python 3.12, ultranest 4.4.0, and h5py 3.15.1, all installed from conda-forge.

I'm running it on a cluster with Slurm and Lustre. My usual workflow is this: I request the maximum wall time allowed, and once the job halts at the timeout, I resubmit it so that it restarts automatically from the saved state. The problem described here did not occur before I started using mpiexec.

Now I prepend `mpiexec -np $NCORES` or similar to run it in parallel. It works great until the job halts at the timeout. When I resubmit the job, it stops with the following error:

```
Traceback (most recent call last):
  File "{myfile}", line 1300, in <module>
    run_simulation(
  File "{myfile}", line 1039, in run_simulation
    result = sampler.run(
             ^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2459, in run
    for _result in self.run_iter(
                   ^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2839, in run_iter
    self._update_results(main_iterator, saved_logl, saved_nodeids)
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2938, in _update_results
    results = combine_results(
              ^^^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/netiter.py", line 907, in combine_results
    saved_logwt_bs = np.concatenate(recv_saved_logwt_bs, axis=1)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 0 dimension(s)
```
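For reference, the message itself can be reproduced with plain NumPy. This is only a sketch of what the error means, not of what ultranest does internally: `rank0` and `rank1` are hypothetical stand-ins for the entries of `recv_saved_logwt_bs`, and I'm assuming (from the `recv_` prefix) that each entry comes from one MPI process.

```python
import numpy as np


def try_concatenate(chunks):
    """Return None if the chunks concatenate along axis 1, else the error text."""
    try:
        np.concatenate(chunks, axis=1)
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical stand-ins for per-process contributions (not ultranest's real data).
rank0 = np.zeros((3, 4))   # an ordinary 2-D block
rank1 = np.array(0.0)      # a 0-dimensional array, e.g. built from a bare scalar

print(try_concatenate([rank0, rank0]))  # both 2-D: concatenation succeeds
print(try_concatenate([rank0, rank1]))  # 2-D mixed with 0-D: ValueError text
```

So the error seems to say that the second entry was a scalar-like value rather than a 2-D array, which would be consistent with one process having nothing (or something different) to contribute after the restart.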

From the error message, I get the feeling that the HDF5 file that stores the points is not closed properly (or not fully written) when the mpiexec run hits the timeout. That said, I had no problem opening the points.hdf5 file from an interactive session. Still, I'm not sure what to look for in that file, since I don't know its structure.
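Since I don't know the layout, a generic way to look inside the file would be to list every dataset with its shape and dtype. Below is a minimal sketch using h5py; `summarize_hdf5` is a name I made up, and nothing here assumes particular dataset names inside points.hdf5.

```python
import h5py


def summarize_hdf5(path):
    """Return (name, shape, dtype) for every dataset found in the HDF5 file."""
    rows = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep only datasets.
        if isinstance(obj, h5py.Dataset):
            rows.append((name, obj.shape, str(obj.dtype)))

    # Open read-only so the inspection cannot modify the checkpoint.
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return rows
```

Calling `summarize_hdf5("points.hdf5")` (with the real path to the file in the log directory) and printing the rows would at least show whether any dataset is empty or has an unexpected shape.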

Of course, I could try compiling my code against the system MPI and then running it with srun, where the signals are sent properly (or that is how I understand it after about 30 minutes of googling MPI; I have never used MPI before). But that would take too much effort compared to running it over and over again without mpiexec.
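One way to check the signal hypothesis without recompiling anything might be to log which signals each Python process actually receives. The following is a stdlib-only sketch under that assumption; `record_signal` and `install_handlers` are hypothetical helpers, not part of ultranest, and which signals Slurm and mpiexec deliver at the timeout is exactly what this would be testing.

```python
import os
import signal
import sys
import time

received = []  # (signal number, timestamp) pairs seen by this process


def record_signal(signum, frame):
    """Record that a signal arrived; keep the handler as small as possible."""
    received.append((signum, time.time()))
    sys.stderr.write(f"pid {os.getpid()} got signal {signum}\n")


def install_handlers(signals=(signal.SIGTERM, signal.SIGINT)):
    """Register record_signal for the given signals (main thread only)."""
    for sig in signals:
        signal.signal(sig, record_signal)
```

Calling `install_handlers()` at the start of the script replaces the default handling of those signals, so the process only logs them and keeps running until it is killed some other way. It is therefore meant for a short diagnostic job, not for production runs.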

Do you think this is due to an I/O error? Or is there an extra step I need to take to make ultranest run correctly under MPI?
