Conversation

@ax3l (Member) commented Mar 11, 2025

  • mpi4py>=4.0 was released and supports (and requires) MPI 4 features, but Lassen does not support MPI 4 in the IBM Spectrum MPI rolling releases.
    Thus, limit the upper version of mpi4py for now (see the pip sketch after this list).

  • Make the compiler selection for h5py a bit more robust by using the Lassen-specific wrapper name. https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#Compilers

  • The Lassen TOSS4 upgrade was never shipped, so we can simplify our paths and names again. Some leftover confusion in these paths prevented a smooth install.
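A minimal sketch of the pin, assuming a pip-based build against the system Spectrum MPI (the `MPICC` value, version bounds, and flags below are illustrative placeholders, not the exact lines from the Lassen install scripts):

```bash
# Build mpi4py from source against Lassen's Spectrum MPI, staying below 4.0.
# MPICC and the version bounds are assumptions for illustration.
MPICC="mpicc" python3 -m pip install --no-cache-dir --no-binary mpi4py "mpi4py>=3.1,<4.0"
```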

cc @bzdjordje

Fix #5728 (Rebuilding on Lassen)

To Do

  • compiles
  • exe: runs without errors
  • python: runs without errors

@ax3l added the bug, install, component: third party, bug: affects latest release, and machine / system labels on Mar 11, 2025
@ax3l changed the title from "Lassen: No MPI4 Support" to "Lassen: No MPI 4+ Support" on Mar 11, 2025
@ax3l force-pushed the fix-lassen-no-mpi4 branch from 517e7d3 to a8181b7 on March 11, 2025 23:59
@ax3l force-pushed the fix-lassen-no-mpi4 branch from a8181b7 to d646054 on March 12, 2025 01:22
`mpi4py>=4.0` was released and supports MPI 4 features.
But Lassen does not support MPI 4 in the IBM Spectrum MPI
rolling releases.

Thus, limit the upper versions of `mpi4py` for now.

Also making the compilers for `h5py` a bit more robust, using
the Lassen-specific wrapper name (hey, thanks for being special).
https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#Compilers
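For reference, a hedged sketch of the kind of h5py build this enables, assuming the MPI compiler wrapper is exposed as `mpicc` and a parallel HDF5 install is pointed to via `HDF5_HOME` (both placeholders rather than the exact names from the Lassen scripts):

```bash
# Build h5py from source against parallel HDF5 with an explicit MPI compiler wrapper.
# CC=mpicc and HDF5_DIR=${HDF5_HOME} are assumptions; substitute the Lassen-specific values.
HDF5_MPI="ON" CC=mpicc HDF5_DIR=${HDF5_HOME} \
  python3 -m pip install --no-cache-dir --no-binary h5py h5py
```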
@ax3l force-pushed the fix-lassen-no-mpi4 branch 3 times, most recently from ed63d5f to 51c7ad7 on March 14, 2025 21:45
TOSS4 never arrived.
@ax3l force-pushed the fix-lassen-no-mpi4 branch from 51c7ad7 to ffe04a9 on March 14, 2025 21:59
ax3l commented Mar 15, 2025

Hm, I see segfaults of the form

 3: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI6Device5Shmem6PacketINS_4Fifo10FifoPacketILj64ELj4096EEEE12writePayloadERS5_Pvm+0xe8
) [0x200055f46fb8]
    ?? ??:0

 4: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI6Device9Interface11PacketModelINS0_5Shmem11PacketModelINS0_11ShmemDeviceINS_4Fifo8Wr
apFifoINS6_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSA_8IndirectINSA_6NativeEEENS3_9CMAShaddrELj256ELj512EEEEEE15postMultiPacketILj512EEEbRAT__hPFvPv
SQ_13pami_result_tESQ_mmSQ_mSQ_m+0x304) [0x200055f65ae4]
    ?? ??:0

 5: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo
8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE1EE11simple_i
mplEP11pami_send_t+0x434) [0x200055f78e74]
    ?? ??:0

 6: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send5EagerINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFi
foINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEENS3_3IBV14GpuPacketModelINSN_6DeviceELb0EEE
E9EagerImplILNS1_15configuration_tE1ELb1EE6simpleEP11pami_send_t+0x2c) [0x200055f7916c]
    ?? ??:0

 7: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(PAMI_Send+0x58) [0x200055ea5758]
    ?? ??:0

 8: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_send+0x6d8) [0x200055cdf6c8]
    ?? ??:0

 9: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_isend+0x568) [0x200055ce0658]
    ?? ??:0

10: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so.3(MPI_Isend+0x160) [0x200051fb4750]
    ?? ??:0

11: ./warpx.rz() [0x1079b6b8]
    amrex::ParallelDescriptor::Message amrex::ParallelDescriptor::Asend<char>(char const*, unsigned long, int, int, ompi_communicator_t*) at ??:?

12: ./warpx.rz() [0x10383bf8]
    void amrex::communicateParticlesStart<amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor>, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >, 0>(amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor> const&, amrex::ParticleCopyPlan&, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> > const&, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >&) [clone .isra.0] at tmpxft_00018e9c_00000000-6_MultiParticleContainer.cudafe1.cpp:?

13: ./warpx.rz() [0x10385dcc]
    amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor>::RedistributeGPU(int, int, int, int, bool) at ??:?

ax3l commented Mar 15, 2025

and

3: 1: warpx.rz: /__SMPI_build_dir_______________________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.

ax3l commented Mar 15, 2025

They go away if I remove the `-M "-gpu"` argument from the jsrun line.

  -M, --smpiargs=<SMPI args> Quoted argument list meaningful for Spectrum MPI
                             applications.

Causes segfaults
ax3l commented Mar 15, 2025

@bzdjordje this fixes it for me. Maybe all you need to do is update your jsrun line.
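For illustration, a minimal sketch of a jsrun line without the problematic `-M "-gpu"` argument; the resource-set layout and input file name are assumptions for a 4-GPU Lassen node, not the exact line from the WarpX job templates:

```bash
# One resource set per GPU; no extra Spectrum MPI arguments (-M) passed.
# The -r/-a/-g/-c/-b values are placeholders for a single Lassen node.
jsrun -r 4 -a 1 -g 1 -c 10 -b packed:10 ./warpx.rz inputs_rz
```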

ax3l commented Mar 15, 2025

Merging to show a fully working example to users in the live docs & mainline scripts.
https://warpx.readthedocs.io/en/latest/install/hpc/lassen.html

@ax3l merged commit bdcb685 into BLAST-WarpX:development on Mar 15, 2025
30 of 36 checks passed
@ax3l deleted the fix-lassen-no-mpi4 branch on March 15, 2025 01:20