Open
Labels
bug: Something isn't working
Description
CabanaMD with the standard in.lj test case crashes on both LLNL Lassen (Spectrum MPI or MVAPICH2) and LANL Chicoma (Cray MPICH) when communicating between GPUs on the same node. Inter-node communication works, though I suspect that is only because the inter-node path checks the data being sent less strictly than the RMA routines MPI uses for intra-node communication. GPU-aware communication is enabled in all cases.
The crash appears to occur in the MPI_Send call invoked by Cabana::Gather::apply() (line 335 of Cabana_Halo.cpp). Here is the Lassen lwcore traceback from Spectrum MPI:
[email protected]:101
PAMI::Protocol::Get::GetRdma<PAMI::Device::Shmem::DmaModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
PAMI::Protocol::Get::CompositeRGet<PAMI::Protocol::Get::RGet,@libpami.so.3
PAMI::Context::rget_impl(pami_rget_simple_t*)@libpami.so.3
[email protected]
process_rndv_msg@mca_pml_pami.so
pml_pami_recv_rndv_cb@mca_pml_pami.so
PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
[email protected]
mca_pml_pami_progress_wait@mca_pml_pami.so
mca_pml_pami_send@mca_pml_pami.so
PMPI_Send@libmpi_ibm.so.3
Cabana::Gather<Cabana::Halo<Kokkos::Device<Kokkos::Cuda,@()
void@()
Comm<System<Kokkos::Device<Kokkos::Cuda,@()
CbnMD<System<Kokkos::Device<Kokkos::Cuda,@()
main@()
---STACK