
File-based BP5 writer hang #1655

Open

Description

@guj

Describe the bug
The recent optimization breaks an MPI use case in file-based mode. A minimal reproducer is included below; running with 2 ranks is enough to see the effect.
In short, at the second flush rank 1 has nothing to contribute, so it does not call into BP5 while rank 0 does. Since the BP5 write is collective, rank 0 hangs waiting on the inactive rank 1.
If we use variable-based encoding instead, a flush to ADIOS appears to be forced on all ranks (by openPMD-api?), so it works.

To Reproduce
c++ example:

#include <openPMD/openPMD.hpp>
#include <mpi.h>
#include <iostream>
#include <memory>

using std::cout;
using namespace openPMD;

int main(int argc, char *argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int mpi_size;
    int mpi_rank;

    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

    auto const value = float(mpi_size*100+mpi_rank);
    std::vector<float> local_data(10 * 300, value);

    std::string filename = "ptl_%T.bp";
    // std::string filename = "ptl.bp"; // this is variable based and it works

    Series series = Series(filename, Access::CREATE, MPI_COMM_WORLD);

    Datatype datatype = determineDatatype<float>();

    auto myptl = series.writeIterations()[1].particles["ion"];
    Extent global_ptl = {10ul * mpi_size * 300};
    Dataset dataset_ptl = Dataset(datatype, global_ptl, "{}");
    myptl["charge"].resetDataset(dataset_ptl);

    series.flush();

    if (mpi_rank == 0) // only rank 0 adds data
        myptl["charge"].storeChunk(local_data, {0}, {3000});

    series.flush(); // hangs here
    MPI_Finalize();

    return 0;
}
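
For comparison, here is a sketch (not part of the failing case) of the same reproducer where every rank contributes a chunk, so that all ranks enter the collective BP5 write; by the reasoning above, this variant should not hang. The per-rank offsets are my own choice for illustration.

    // Sketch: replace the rank-0-only storeChunk above with one chunk per rank.
    // Each rank writes its 3000 local values into a disjoint slice of the
    // 3000 * mpi_size global extent, so every rank participates in the
    // collective BP5 write triggered by the second flush.
    myptl["charge"].storeChunk(local_data, {mpi_rank * 3000ul}, {3000ul});

    series.flush(); // expected to complete once all ranks contribute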

Software Environment

  • version of openPMD-api: latest
  • machine: Mac
  • version of ADIOS2: latest

Additional context

  • I used OPENPMD_ADIOS2_BP5_TypeAgg=EveryoneWritesSerial, but the hang occurs with any choice of aggregation (a configuration sketch follows this list).
  • Running with 2 ranks is enough to see that file-based encoding has the issue.
  • As far as I am aware, this use case worked fine not long ago.
  • It does not affect HDF5.
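
As an aside, here is a sketch of selecting the BP5 aggregation via the Series options string instead of the environment variable; the JSON keys below follow what I understand to be the adios2.engine.parameters layout of the backend configuration, and the engine type/parameter names are assumptions to check against the ADIOS2 docs. Either way of selecting aggregation still hangs in file-based mode.

    // Sketch only: request EveryoneWritesSerial through the backend options
    // string rather than OPENPMD_ADIOS2_BP5_TypeAgg (key names assumed).
    std::string const options = R"({
        "adios2": {
            "engine": {
                "type": "bp5",
                "parameters": { "AggregationType": "EveryoneWritesSerial" }
            }
        }
    })";
    Series series = Series("ptl_%T.bp", Access::CREATE, MPI_COMM_WORLD, options);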
