Intranode GPU communication crashes in MPI called from Cabana::Gather::apply() #106

@patrickb314

Description

CabanaMD with the standard in.lj test case crashes on both LLNL Lassen (Spectrum MPI or MVAPICH2) and LANL Chicoma (Cray MPICH) when communicating between GPUs on the same node. Inter-node communication works, though I expect that is only because MPI's error checking for network sends is less strict than in the RMA routines it uses for intra-node communication. I've enabled GPU-aware communication in all cases.
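For comparison, the failing path should be equivalent to a plain GPU-aware MPI send of a device buffer between two ranks on one node. A minimal sketch of that case is below; the buffer size and GPUs-per-node mapping are illustrative assumptions, not values taken from the CabanaMD run:

```cpp
// Minimal GPU-aware intra-node send/recv sketch (illustrative, not CabanaMD code).
// Requires a CUDA-aware MPI build.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                       // assumed message size
    cudaSetDevice(rank % 4);                     // assumes 4 GPUs per node
    double* dbuf = nullptr;
    cudaMalloc(&dbuf, n * sizeof(double));

    std::vector<double> host(n, rank == 0 ? 1.0 : 0.0);
    cudaMemcpy(dbuf, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // Device pointer handed directly to MPI, as with GPU-aware communication.
    if (size >= 2) {
        if (rank == 0)
            MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaMemcpy(host.data(), dbuf, n * sizeof(double), cudaMemcpyDeviceToHost);
    if (rank == 1) std::printf("rank 1 received %f\n", host[0]);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```

Running this with two ranks placed on the same node would exercise the same shared-memory/RMA path as the crashing case.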

The MPI_Send call invoked by Cabana::Gather::apply() (line 335 of Cabana_Halo.cpp) appears to be what is crashing. Here's the Lassen lwcore traceback from Spectrum MPI:

[email protected]:101
PAMI::Protocol::Get::GetRdma<PAMI::Device::Shmem::DmaModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
PAMI::Protocol::Get::CompositeRGet<PAMI::Protocol::Get::RGet,@libpami.so.3
PAMI::Context::rget_impl(pami_rget_simple_t*)@libpami.so.3
[email protected]
process_rndv_msg@mca_pml_pami.so
pml_pami_recv_rndv_cb@mca_pml_pami.so
PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
[email protected]
mca_pml_pami_progress_wait@mca_pml_pami.so
mca_pml_pami_send@mca_pml_pami.so
PMPI_Send@libmpi_ibm.so.3
Cabana::Gather<Cabana::Halo<Kokkos::Device<Kokkos::Cuda,@()
void@()
Comm<System<Kokkos::Device<Kokkos::Cuda,@()
CbnMD<System<Kokkos::Device<Kokkos::Cuda,@()
main@()
---STACK
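For context, the send in Cabana::Gather::apply() presumably hands a device pointer from a Kokkos-allocated buffer straight to MPI. The pattern is roughly the following; this is a simplified, illustrative sketch rather than Cabana's actual code, and the buffer names and sizes are assumptions:

```cpp
// Simplified sketch of the halo-gather send pattern (illustrative, not Cabana code).
#include <Kokkos_Core.hpp>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    Kokkos::initialize(argc, argv);
    {
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Hypothetical packed send/receive buffers in the default device memory
        // space (CudaSpace when Kokkos is built for Cuda).
        Kokkos::View<double*> send_buffer("send_buffer", 1 << 16);
        Kokkos::View<double*> recv_buffer("recv_buffer", 1 << 16);
        Kokkos::deep_copy(send_buffer, static_cast<double>(rank));

        // The raw device pointer from the View is passed to MPI_Send/MPI_Recv,
        // which is only valid with a GPU-aware MPI build.
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(send_buffer.data(), static_cast<int>(send_buffer.size()),
                         MPI_DOUBLE, 1, 1234, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(recv_buffer.data(), static_cast<int>(recv_buffer.size()),
                         MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    Kokkos::finalize();
    MPI_Finalize();
    return 0;
}
```

The traceback shows the receiving side going through PAMI's shared-memory rget path, which is consistent with an intra-node RMA transfer of that device buffer.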
