Skip to content

Conversation

@JStewart28
Copy link
Contributor

@JStewart28 JStewart28 commented Aug 22, 2025

Classes used for MPI communication have been restructured to support the addition of different communication backends. Mpi is currently the only communication backend. The Mpi backend contains the code already used in Cabana for communication. The restructuring was done as discussed in previous emails.

Changes:

  • Core:
    1. Cabana_CommunicationPlan.hpp renamed to Cabana_CommunicationPlanBase.hpp.
    2. New type tag Cabana::CommSpace::Mpi added to Cabana_CommunicationPlanBase.hpp.
    3. CommunicationPlan and CommunicationData renamed to CommunicationPlanBase and CommunicationDataBase.
      • Base class contains variables and functions needed for all communication backends.
      • Classes that implement a specific communication backend are derived from the base class and located in impl/Cabana_CommunicationPlan_{backend_name}.hpp.
      • The default communication space is CommSpace::Mpi to maintain backwards compatibility.
    4. In Cabana_Halo.hpp, the Halo class is now typed on CommSpaceType, which is CommSpace::Mpi by default. The Gather and Scatter classes extract the CommSpaceType from the HaloType.
      • In the Gather and Scatter classes, the correct applyImpl function, which performs the gather or scatter operation, is enabled based on the communication space the Halo is created in.
      • Implementations of the applyImpl functions for Gather and Scatter are located in impl/Cabana_Halo_{backend_name}.hpp.
    5. In Cabana_Distributor.hpp, the Distributor class is now typed on CommSpaceType, which is CommSpace::Mpi by default.
      • The communication parts of the migrate functions have been moved to Impl::migrateData and Impl::migrateSlice functions. These functions are typed on the communication space the Distributor is created in. Specific implementations are in impl/Cabana_Migrate_{backend_name}.hpp.
    6. The test frameworks of the CommunicationPlan, Distributor, and Halo tests have been modified to support testing all communication backends. Currently, CommSpace::Mpi is the only backend tested. (Because it is the only backend that exists)
    7. Added Cabana_Tags.hpp where communication backend and import/export type tags are stored.
    8. All #includes and CMake files have been modified to have the new file names.
  • Grid:
    2. Cabana_Grid_Halo.hpp renamed to Cabana_Grid_HaloBase.hpp.
    - Base class contains variables and functions needed for all communication backends.
    - Classes that implement a specific communication backend are derived from the base class and located in impl/Cabana_Grid_Halo_{backend_name}.hpp.
    - The default communication space is Grid::CommSpace::Mpi to maintain backwards compatibility.
    3. The test frameworks of the halo2d and halo3d tests have been modified to support testing all communication backends. Currently, Mpi is the only backend tested. (Because it is the only backend that exists)
    4. All #includes and CMake files have been modified to have the new file names.

Other notes and questions:

  • The CommSpace::Mpi tag is declared in both Core and in Grid. This may get confusing and there is currently no way to globally define a tag for both Grid and Core. What do you all think? - Update: See Changes->Core->item vii.
  • SparseHalo has not been modified to support various backends in this pull request but may be modified in the future.
  • The first planned backend after Mpi is MpiAdvance which implements locality-aware communication.

Copy link
Member

@streeve streeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great - first round of clarifications/suggestions

@JStewart28
Copy link
Contributor Author

Just a few more small comments - trying to get through all of this even though a lot of lines are just moved. One thing I wanted to point out is the usage of comm inside Cabana_Grid_GlobalParticleComm.hpp and Cabana_Grid_ParticleDistributor.hpp which will need some extra tags/defaults

I added a CommSpaceType=Mpi template parameter to the relevant functions.

@JStewart28
Copy link
Contributor Author

@streeve any updates on this?

@JStewart28 JStewart28 requested a review from streeve October 20, 2025 20:30
@streeve
Copy link
Member

streeve commented Oct 24, 2025

@JStewart28 sorry for the delay again - I just fixed conflicts with main (we fixed a serious Gather class issue in #849 that overlapped with these changes). I'm seeing a memory error that I can't figure out - I'll take another look, but maybe it's more obvious to you

@JStewart28
Copy link
Contributor Author

JStewart28 commented Oct 24, 2025

@JStewart28 sorry for the delay again - I just fixed conflicts with main (we fixed a serious Gather class issue in #849 that overlapped with these changes). I'm seeing a memory error that I can't figure out - I'll take another look, but maybe it's more obvious to you

Thanks, @streeve. I fixed the CI errors by editing the struct MySoA in core/src/unit_test/tstAoSoA.hpp:

template <int VLEN, int D1, int D2, int D3>
struct MySoA
{
    float m0[D1][D2][D3][VLEN];
    int m1[VLEN];
    double m2[D1][VLEN];
    double m3[D1][D2][VLEN];
    
    KOKKOS_INLINE_FUNCTION
    MySoA(){};
};

Adding the constructor fixes the Ubuntu/OpenSUSE warnings and adding KOKKOS_INLINE_FUNCTION fixes a "calling a host function on the device" error when building with HIP.

@streeve
Copy link
Member

streeve commented Oct 28, 2025

Thanks @JStewart28 that's been blocking a few different things - do you mind splitting the CI fix into a separate PR? I built Cabana locally with -DWITH_ASAN=ON and see a memory error for just the Halo tests. I haven't been able to track down any more detail, maybe because I'm not very familiar with the typed tests yet

@JStewart28
Copy link
Contributor Author

Thanks @JStewart28 that's been blocking a few different things - do you mind splitting the CI fix into a separate PR? I built Cabana locally with -DWITH_ASAN=ON and see a memory error for just the Halo tests. I haven't been able to track down any more detail, maybe because I'm not very familiar with the typed tests yet

Yes. Opened PR #858. I'll look into the memory error and let you know if I can figure it out.

@JStewart28
Copy link
Contributor Author

@streeve can you provide more details about the memory error? The AddressSanitizer is saying there is a double-free, but it says that for any test I run, including tests such as DeepCopy and CartesianGrid. I also get the double-free message if I run the Halo test (or any other test) from the master branch of Cabana. This is on WSL/Windows.

Copy link
Member

@streeve streeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question/potential change and I'm ready to merge!

@streeve
Copy link
Member

streeve commented Nov 3, 2025

@streeve can you provide more details about the memory error? The AddressSanitizer is saying there is a double-free, but it says that for any test I run, including tests such as DeepCopy and CartesianGrid. I also get the double-free message if I run the Halo test (or any other test) from the master branch of Cabana. This is on WSL/Windows.

Sounds like we need to investigate this further, but that we shouldn't block this PR any further

@streeve streeve merged commit 8fb4b3e into ECP-copa:master Nov 4, 2025
34 checks passed
@streeve streeve deleted the add-commspaces branch November 4, 2025 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants