I'm using Python 3.12, ultranest 4.4.0, and h5py 3.15.1, all installed from conda-forge.
I'm running on a cluster with Slurm and a Lustre filesystem. My usual workflow is to request the maximum allowed wall time, let the job run until it is killed by the timeout, and then resubmit it so that the run resumes automatically. The problem described below never occurred when I ran without mpiexec.
Now I prepend mpiexec -np $NCORES (or similar) to the command to run in parallel. It works great until the job hits the timeout. When I resubmit the job, it stops with the following error:
Traceback (most recent call last):
  File "{myfile}", line 1300, in <module>
    run_simulation(
  File "{myfile}", line 1039, in run_simulation
    result = sampler.run(
             ^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2459, in run
    for _result in self.run_iter(
                   ^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2839, in run_iter
    self._update_results(main_iterator, saved_logl, saved_nodeids)
  File "{conda-env}/lib/python3.12/site-packages/ultranest/integrator.py", line 2938, in _update_results
    results = combine_results(
              ^^^^^^^^^^^^^^^^
  File "{conda-env}/lib/python3.12/site-packages/ultranest/netiter.py", line 907, in combine_results
    saved_logwt_bs = np.concatenate(recv_saved_logwt_bs, axis=1)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 0 dimension(s)
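To check that I am reading the error correctly, here is a minimal sketch that triggers the same kind of ValueError with plain NumPy. The two arrays are hypothetical stand-ins for what two MPI ranks might contribute; they are not ultranest's actual data, and I am only assuming that one rank handed back something 0-dimensional.

```python
import numpy as np

# Hypothetical stand-ins for the per-rank contributions; the shapes are chosen
# only to match the error message, not taken from ultranest itself.
from_rank0 = np.zeros((5, 3))   # 2-dimensional, as the concatenation expects
from_rank1 = np.array(0.0)      # 0-dimensional (scalar-like) value

try:
    np.concatenate([from_rank0, from_rank1], axis=1)
    error_message = None
except ValueError as exc:
    # NumPy refuses to concatenate arrays with different numbers of dimensions.
    error_message = str(exc)

print(error_message)
```

If this reading is right, the failure means that at least one rank had no proper 2-D weights array to contribute after the restart, rather than that the arrays merely differed in length.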
From the error message, I suspect that the HDF5 file that stores the points was not closed properly (or not fully written) when the mpiexec job hit the timeout. That said, I had no problem opening the points.hdf5 file from an interactive session. I'm not sure what to look for in that file, though, since I don't know its structure.
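For what it's worth, this is roughly how I would list the contents of the file from an interactive session. It is a generic h5py sketch that makes no assumption about how ultranest lays out the file; the helper name summarize_hdf5 is my own.

```python
import h5py

def summarize_hdf5(path):
    """Return a list of (name, shape, dtype) for every dataset in an HDF5 file."""
    entries = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, obj.shape, str(obj.dtype)))

    # Open read-only so that inspecting the file cannot modify it.
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries
```

Calling summarize_hdf5 on the path to points.hdf5 (wherever it lives in the run's output directory) and printing the entries would show whether any dataset is empty or has an unexpected shape.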
Of course, I could try compiling my code against the system MPI and running it with srun, where the signals are sent properly (or so I understand after about 30 minutes of googling MPI; I have never used MPI before). But that would take much more effort than simply rerunning the job without mpiexec.
Do you think this is due to an I/O error? Or is there an extra step I need to take to make ultranest run correctly under MPI?