
Can't perform independent write when MPI_File_sync is required by ROMIO driver. #1093


Description

@nahaharo

Hello.
I've recently been running MPI jobs for extensive data processing.
However, I'm getting the error "Can't perform independent write when MPI_File_sync is required by ROMIO driver." in the log.

The symptoms are as follows:

  1. It works fine on the local (master) machine.
  2. When run on the compute nodes, it fails with "Can't perform independent write when MPI_File_sync is required by ROMIO driver.".
  3. With dxpl_mpio=:collective, it gets stuck at the write.

The local machine is directly attached to the disk I write the HDF5 file to, while the remote nodes access their disk over NFS.

My question is: why does this error appear?
Does it appear because the file is on NFS?
If this is avoidable, how?
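
In case it matters, here is the kind of workaround I was imagining: passing ROMIO hints through the MPI.Info object that h5open already takes. This is an untested sketch; romio_ds_write and romio_cb_write are standard ROMIO hint names, but whether they have any effect on the MPI_File_sync requirement over NFS is purely an assumption on my part.

using MPI
using HDF5

MPI.Init()
comm = MPI.COMM_WORLD

# Untested sketch: set ROMIO hints on the Info object handed to h5open.
# romio_ds_write / romio_cb_write are standard ROMIO hints; whether they avoid
# the MPI_File_sync requirement on NFS is an assumption, not something I've verified.
info = MPI.Info()
info[:romio_ds_write] = "disable"   # disable data sieving for writes
info[:romio_cb_write] = "enable"    # force collective buffering for writes

ff = h5open("test.h5", "w", comm, info)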

Also, the article https://www.hdfgroup.org/2015/08/parallel-io-with-hdf5/ describes an H5Sselect_none operation for collective mode. Does HDF5.jl have similar functionality? If so, how can I use it?
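
For reference, this is roughly the pattern I mean, as a minimal untested sketch. HDF5.API.h5s_select_none is my assumption about how the low-level wrapper is spelled in HDF5.jl (the HDF5.API module mirrors the C names), and rank_has_data is just a placeholder:

using HDF5

# Untested sketch of the H5Sselect_none idea: a rank that owns no part of the
# dataset still joins the collective write, but with an empty file-space selection.
function file_selection(dset, rank_has_data::Bool)
    filespace = HDF5.dataspace(dset)        # file dataspace of the dataset
    if !rank_has_data
        # Assumed wrapper name, mirroring C's H5Sselect_none: select nothing,
        # but still participate in the collective operation afterwards.
        HDF5.API.h5s_select_none(filespace)
    end
    return filespace                        # to be used by the collective write
end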

Thanks.

Here is my test code.

using HDF5
using MPI

function main()
    @assert HDF5.has_parallel()

    MPI.Init()
    
    comm = MPI.COMM_WORLD
    info = MPI.Info()
    ff = h5open("test.h5", "w", comm, info)
    MPI.Barrier(comm)
    
    Nproc = MPI.Comm_size(comm)
    myrank = MPI.Comm_rank(comm)
    M = 10
    A = fill(myrank, M, 2)  # local data
    dims = (M, Nproc*2+1)    # dimensions of global data
    
    # Create dataset
    @show "Create dataset"
    dset = create_dataset(ff, "/data", datatype(eltype(A)), dataspace(dims), chunk=(M, 2), dxpl_mpio=:collective)
    @show "After dataset"
    
    # Write local data
    dset[:, 2*myrank + 1:2*myrank + 2] = A
    @show "After write dataset"

    close(ff)
    
    MPI.Finalize()
end

main()

And here is the output of MPIPreferences.use_system_binary():

julia> MPIPreferences.use_system_binary()
┌ Info: MPI implementation identified
│   libmpi = "libmpi"
│   version_string = "MPICH Version:      4.1.2\nMPICH Release date: Wed Jun  7 15:22:45 CDT 2023\nMPICH ABI:          15:1:3\nMPICH Device:       ch4:ofi\nMPICH configure:    --prefix=/home/---/tools/mpich --with-ucx=/home/---/tools/ucx\nMPICH CC:           /home/---/tools/gcc/bin/gcc    -O2\nMPICH CXX:          /home/hyunwook/tools/gcc/bin/g++   -O2\nMPICH F77:          /home/---/tools/gcc/bin/gfortran   -O2\nMPICH FC:           /home/---/tools/gcc/bin/gfortran   -O2\n"
│   imply = "MPICH"
│   version = v"4.1.2"
└   abi = "MPICH"
┌ Info: MPIPreferences unchanged
│   binary = "system"
│   libmpi = "libmpi"
│   abi = "MPICH"
│   pieces = "mpiexec"
│   preloads = Any[]
└   preloads_env_switch = nothing

Run script (for sbatch)

#!/bin/bash
#SBATCH -J hdf5_test
#SBATCH -o stdout_log.o%j
#SBATCH -N 1
#SBATCH -n 32

mpiexec.hydra -np $SLURM_NTASKS julia test.jl

My environment

  • CentOS 7.5
  • Slurm with Hydra
  • HDF5 1.14.1
  • GCC 13.2.0
  • MPICH 4.1.2
    (Yes, I built HDF5, GCC, and MPICH from source.)
