Restarting with mpiexec fails #176

@HojinCho

Using Python 3.12, ultranest 4.4.0 and h5py 3.15.1, all installed from conda-forge.

I'm running on a cluster with Slurm and Lustre. My usual workflow is this: I request the maximum allowed run time, and when the job is killed at the time limit, I resubmit it and the run resumes automatically. The problem described below did not occur when I ran without mpiexec.

Now I prepend mpiexec -np $NCORES (or similar) to run in parallel. It works well until the job is killed at the timeout. When I resubmit the job, it stops with the following error:

Traceback (most recent call last):
  File "{myfile}", line 1300, in <module>
    run_simulation(
  File "{myfile}", line 1039, in run_simulation
    result = sampler.run(
             ^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2459, in run
    for _result in self.run_iter(
                   ^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2839, in run_iter
    self._update_results(main_iterator, saved_logl, saved_nodeids)
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2938, in _update_results
    results = combine_results(
              ^^^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/netiter.py", line 907, in combine_results
    saved_logwt_bs = np.concatenate(recv_saved_logwt_bs, axis=1)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 0 dimension(s)
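
For reference, the final ValueError can be reproduced in isolation with NumPy alone. The sketch below is only an illustration of the failure mode: the two arrays are hypothetical stand-ins for what two MPI ranks might contribute, and it does not claim to show what ultranest actually gathers from each rank. It suggests that one rank supplied a 2-D array while another supplied a scalar-like (0-dimensional) value.

```python
import numpy as np


def concat_error_message(arrays, axis=1):
    """Return the ValueError message raised by np.concatenate, or None on success."""
    try:
        np.concatenate(arrays, axis=axis)
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical stand-ins for per-rank contributions (not ultranest's real data):
rank0 = np.zeros((5, 3))   # a proper 2-D array
rank1 = np.array(0.0)      # a 0-dimensional array, e.g. an empty or placeholder value

message = concat_error_message([rank0, rank1])
print(message)
```

If that reading is right, the thing to check after a restart is whether every rank ends up holding an array of the same dimensionality before the results are combined.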

From the error message, I suspect that the HDF5 file that stores the points is not closed properly (or not fully written) when the job hits the timeout under mpiexec. That said, I had no problem opening the points.hdf5 file from an interactive session. Still, I'm not sure what to look for in that file, since I don't know its structure.
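
One way to get a first look at the file without knowing its layout is to walk it with h5py and list every dataset with its shape and dtype. This is a minimal sketch: the function name `describe_hdf5` is made up for illustration, and it makes no assumption about which dataset names ultranest writes. It only shows that the file opens and what it contains, not that the last write completed.

```python
import h5py


def describe_hdf5(path):
    """Return {dataset_name: (shape, dtype_string)} for every dataset in the file."""
    summary = {}
    with h5py.File(path, "r") as f:

        def visit(name, obj):
            # visititems calls this for every group and dataset; keep datasets only.
            if isinstance(obj, h5py.Dataset):
                summary[name] = (obj.shape, str(obj.dtype))

        f.visititems(visit)
    return summary


# Example call (the path is a placeholder for wherever the run's points.hdf5 lives):
# for name, (shape, dtype) in describe_hdf5("points.hdf5").items():
#     print(name, shape, dtype)
```

Comparing the output before and after a timed-out MPI run might show whether the stored arrays have the expected number of rows and columns.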

Of course, I could try compiling my code against the system MPI and running it with srun, where signals are delivered properly (or at least that is my understanding after about 30 minutes of googling MPI; I have never used MPI before). But that would take too much effort compared with simply rerunning it without mpiexec.

Do you think this is due to an I/O error? Or is there an extra step I need to take to make ultranest run correctly under MPI?
