ParaCoToUl/noarr-mpi

Layout-Agnostic MPI Abstraction for Modern C++

This repository contains the proof-of-concept implementation accompanying the paper Layout-Agnostic MPI Abstraction for Modern C++, together with an evaluation of the proposed abstraction on a distributed GEMM kernel.

When using our work, please cite the paper:

@inproceedings{klepl2026layout,
  author="Klepl, Ji{\v{r}}{\'i} and Kruli{\v{s}}, Martin and Brabec, Maty{\'a}{\v{s}}",
  title="Layout-Agnostic MPI Abstraction for Distributed Computing in Modern C++",
  booktitle="Recent Advances in the Message Passing Interface",
  year="2026",
  publisher="Springer Nature Switzerland",
  pages="36--53",
  doi="10.1007/978-3-032-07194-1_3",
  isbn="978-3-032-07194-1"
}

About

Message Passing Interface (MPI) has been a well-established technology in the domain of distributed high-performance computing for several decades. However, one of its greatest drawbacks is a rather ancient pure-C interface. It lacks many useful features of modern languages (namely C++), like basic type-checking or support for generic code design. In this paper, we propose a novel abstraction for MPI, which we implemented as an extension of the C++ Noarr library. It follows Noarr paradigms (first-class layout and traversal abstraction) and offers layout-agnostic design of MPI applications. We also implemented a layout-agnostic distributed GEMM kernel as a case study to demonstrate the usability and syntax of the proposed abstraction. We show that the abstraction achieves performance comparable to the state-of-the-art MPI C++ bindings while allowing for a more flexible design of distributed applications.

Library

The project contains a header-only extension of the Noarr library that provides a layout-agnostic C++ abstraction for MPI. To use it, add the following include directive (this assumes that an MPI implementation is available on the system and that the library's include directory is on the include path):

#include <noarr/mpi.hpp>

The library provides a set of abstractions for MPI operations, including:

GEMM kernel implementations

To showcase the proposed abstraction, we implemented a distributed GEMM kernel using the proposed Noarr MPI abstraction and compared it with other libraries. All these implementations are included in the examples directory.

  • examples/noarr/gemm.cpp - implementation of the distributed GEMM kernel using the Noarr library and the proposed Noarr MPI abstraction.
  • examples/boost/gemm.cpp - implementation of the distributed GEMM kernel using the Boost.MPI library and serialization. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.
  • examples/boostP2P/gemm.cpp - implementation of the distributed GEMM kernel using the Boost.MPI library and point-to-point communication of matrices serialized into input/output archives. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.
  • examples/kokkosComm/gemm.cpp - implementation of the distributed GEMM kernel using the Kokkos library and the KokkosComm abstraction that enables communication over MPI.
  • examples/mpi/gemm.cpp - implementation of the distributed GEMM kernel using the MPI interface directly. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.

All implementations are based on the GEMM kernel implementation from the PolyBench/C 4.2.1 benchmark suite by Louis-Noël Pouchet et al. The original code is available at https://sourceforge.net/projects/polybench/files.
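For orientation, the PolyBench GEMM kernel computes C = alpha * A * B + beta * C. The following is a minimal Python sketch of what "layout-agnostic" means for such a kernel. It is not the code from the examples directory, and the helper names (row_major, col_major, pack, unpack, gemm) are made up for this illustration. The kernel touches memory only through index functions, so the same loop nest works for any combination of layouts.

```python
def row_major(rows, cols):
    """Index function for a rows x cols matrix stored row by row."""
    return lambda i, j: i * cols + j


def col_major(rows, cols):
    """Index function for a rows x cols matrix stored column by column."""
    return lambda i, j: j * rows + i


def pack(matrix, at):
    """Copy a nested-list matrix into a flat buffer using the layout `at`."""
    rows, cols = len(matrix), len(matrix[0])
    buf = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            buf[at(i, j)] = float(matrix[i][j])
    return buf


def unpack(buf, at, rows, cols):
    """Read a flat buffer back into a nested list using the layout `at`."""
    return [[buf[at(i, j)] for j in range(cols)] for i in range(rows)]


def gemm(alpha, beta, C, c_at, A, a_at, B, b_at, ni, nj, nk):
    """In-place C = alpha * A * B + beta * C on flat buffers.

    C is ni x nj, A is ni x nk, B is nk x nj. The physical layout of each
    matrix is hidden behind its index function (c_at, a_at, b_at).
    """
    for i in range(ni):
        for j in range(nj):
            C[c_at(i, j)] *= beta
        for k in range(nk):
            for j in range(nj):
                C[c_at(i, j)] += alpha * A[a_at(i, k)] * B[b_at(k, j)]
```

Calling gemm with, say, a column-major C and row-major A and B yields the same logical matrix as an all-row-major call; only the index functions passed in change. The C++ implementations listed above get the same effect through Noarr structures, mdspan, or Kokkos views.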

How to build

To build the project, you need to have CMake and a C++ compiler installed. The project assumes support for C++20 or later and requires MPI to be installed on your system (and the mpi.h header file must be available). The other dependency, the Noarr library, is retrieved automatically by CMake.

To build the project, run the following commands:

# Get the source code from GitHub
git clone https://github.com/jiriklepl/noarr-mpi
cd noarr-mpi

# Configure the project (also retrieves external dependencies)
./configure.sh

# Build the project
./build.sh

The script configure.sh creates a build directory and runs CMake to configure the project. The script build.sh builds the executables for all GEMM variants and configurations; the configurations differ in the dataset size and in the major dimensions of the privatized sub-matrices (tiles) used in the distributed GEMM kernel. For each dataset size, there are eight configurations of the GEMM kernel in total. Each executable is named gemm-<framework>-<dataset-size>-<C-tile-major-dim>-<A-tile-major-dim>-<B-tile-major-dim>, where:

  • <framework> is the name of the framework used (e.g., noarr, boost, boostP2P, kokkosComm, mpi);
  • <dataset-size> is the size of the dataset (MINI, SMALL, MEDIUM, LARGE, EXTRALARGE);
  • <C-tile-major-dim>, <A-tile-major-dim>, and <B-tile-major-dim> are the major dimensions of the privatized C, A, and B tiles, respectively.

The dataset sizes are defined in examples/include/gemm.hpp.
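The naming template can be expanded mechanically. The sketch below assumes that the eight configurations come from two candidate major dimensions for each of the three tiles (2 x 2 x 2). The dimension labels I, J, and K are hypothetical placeholders, so check the build directory for the names the build actually produces.

```python
from itertools import product


def executable_names(framework, dataset, c_dims, a_dims, b_dims):
    """List executable names following the template
    gemm-<framework>-<dataset-size>-<C-tile-major-dim>-<A-tile-major-dim>-<B-tile-major-dim>.
    """
    return [
        f"gemm-{framework}-{dataset}-{c}-{a}-{b}"
        for c, a, b in product(c_dims, a_dims, b_dims)
    ]


# Hypothetical dimension labels (assumption, not taken from the repository).
names = executable_names("noarr", "MINI", ("I", "J"), ("I", "K"), ("K", "J"))
for name in names:
    print(name)
```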

How to run

The compare.sh script runs each of the GEMM variants using mpirun. To start it, run the following command:

./compare.sh

Each execution performs an unmeasured warm-up run followed by 20 measurement runs of a given GEMM kernel implementation, dataset size, and configuration. The execution then reports the average execution time and the standard deviation of the measurements in seconds. The output of the script is in CSV format with the following columns:
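The measurement protocol described above (one unmeasured warm-up, then 20 timed runs, then mean and standard deviation) can be sketched as follows. This is an illustration in Python, not the benchmark's own timing code. The function name measure is made up, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics
import time


def measure(kernel, runs=20):
    """Run `kernel` once as an unmeasured warm-up, then time `runs` executions.

    Returns (mean, standard deviation) of the wall-clock times in seconds.
    """
    kernel()  # warm-up run, not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    # Assumption: sample standard deviation (statistics.stdev); the population
    # variant would be statistics.pstdev.
    return statistics.mean(samples), statistics.stdev(samples)
```

In a distributed run, the timed region would also have to be coordinated across MPI ranks, which this single-process sketch ignores.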

algorithm,framework,dataset,datatype,c_tile,a_tile,b_tile,i_tiles,mean_time,sd_time,valid

  • algorithm - the name of the algorithm (always gemm).
  • framework - the name of the framework used (e.g., noarr, boost, boostP2P, kokkosComm, mpi).
  • dataset - the size of the dataset (MINI, SMALL, MEDIUM, LARGE, EXTRALARGE).
  • datatype - the scalar datatype used in the GEMM kernel (for the default configuration of the project, it is always FLOAT).
  • c_tile, a_tile, b_tile - the major dimensions of the privatized sub-matrices used in the distributed GEMM kernel.
  • i_tiles - the number of tiles in the i dimension of the C matrix; the number of tiles in the j dimension is determined by the number of MPI processes.
  • mean_time - the average execution time of the GEMM kernel in seconds.
  • sd_time - the standard deviation of the execution time in seconds.
  • valid - a flag indicating whether the result of the GEMM kernel is valid (1) or not (0).
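Given these columns, the output can be post-processed with a few lines of Python. The sketch below parses the CSV text and keeps, for each framework and dataset, the valid configuration with the lowest mean time. The function names are made up for illustration. The rows in SAMPLE are synthetic placeholders rather than measurements, and their tile labels are hypothetical.

```python
import csv
import io


def load_results(text):
    """Parse compare.sh CSV output into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["i_tiles"] = int(row["i_tiles"])
        row["mean_time"] = float(row["mean_time"])
        row["sd_time"] = float(row["sd_time"])
        row["valid"] = row["valid"] == "1"
        rows.append(row)
    return rows


def fastest_valid(rows):
    """For each (framework, dataset), keep the valid row with the lowest mean time."""
    best = {}
    for row in rows:
        if not row["valid"]:
            continue  # skip runs whose result failed validation
        key = (row["framework"], row["dataset"])
        if key not in best or row["mean_time"] < best[key]["mean_time"]:
            best[key] = row
    return best


# Synthetic placeholder rows, NOT measurements; tile labels are hypothetical.
SAMPLE = """algorithm,framework,dataset,datatype,c_tile,a_tile,b_tile,i_tiles,mean_time,sd_time,valid
gemm,noarr,MINI,FLOAT,I,I,K,2,2.0,0.1,1
gemm,noarr,MINI,FLOAT,J,I,K,2,1.0,0.1,1
gemm,mpi,MINI,FLOAT,I,I,K,2,0.5,0.1,0
gemm,mpi,MINI,FLOAT,J,K,J,2,3.0,0.2,1
"""

best = fastest_valid(load_results(SAMPLE))
for (framework, dataset), row in sorted(best.items()):
    print(framework, dataset, row["c_tile"], row["a_tile"], row["b_tile"], row["mean_time"])
```

Filtering on the valid flag first matters: in the sample, the fastest mpi row is discarded because its result did not validate.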

Slurm

To run the same experiment on a Slurm cluster (such as https://gitlab.mff.cuni.cz/mff/hpc/clusters), modify the command as follows (replace YOUR_ACCOUNT and YOUR_PARTITION with your Slurm account and partition):

USE_SLURM=1 NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ACCOUNT=YOUR_ACCOUNT PARTITION=YOUR_PARTITION ./compare.sh

data/compare1.csv shows a possible output of the script when run on a Slurm cluster with the specified parameters.

Visualization

Create a Python virtual environment and activate it:

python3 -m venv .venv
. .venv/bin/activate

Upgrade pip, install the requirements, and generate the plots:

pip install --upgrade pip
pip install -r requirements.txt
python3 gen_plots.py PATH_TO_CSV_FILE

where PATH_TO_CSV_FILE is the path to the CSV file containing the results of the GEMM kernel execution (e.g., data/compare1.csv).

Reproducing the results reported in the paper

To reproduce the results reported in the paper, run the following sequence of commands:

./configure.sh
./build.sh

mkdir -p data

# May run for over an hour; add USE_SLURM=1 to run on a Slurm cluster (and specify ACCOUNT and PARTITION)
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compare.sh > data/compare1.csv 2> data/compare1.err

# Perform further experiments for more stable results
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compare.sh > data/compare2.csv 2> data/compare2.err
# ...

# (Optional) Do the same with the ./compareScatter.sh script that was used to empirically determine the optimal logical dimension ordering
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compareScatter.sh > data/compareScatter1.csv 2> data/compareScatter1.err
# ...

./gen_plots.sh # Requires the Python virtual environment to be set up as described above
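The repeated compare.sh invocations above can also be driven from a small script. The following Python sketch is one possible way to do so and is not part of the repository. The helper names compare_env and run_repetitions are made up, while the environment variables are the ones used in the commands above.

```python
import os
import subprocess


def compare_env(num_tasks, num_nodes, i_tiles, account=None, partition=None):
    """Build the environment-variable overrides passed to compare.sh."""
    env = {
        "NUM_TASKS": str(num_tasks),
        "NUM_NODES": str(num_nodes),
        "I_TILES": str(i_tiles),
    }
    if account is not None and partition is not None:
        # Slurm run: same variables as in the Slurm section.
        env.update(USE_SLURM="1", ACCOUNT=account, PARTITION=partition)
    return env


def run_repetitions(count, script="./compare.sh", prefix="data/compare", **kwargs):
    """Run `script` `count` times, writing <prefix>N.csv and <prefix>N.err."""
    os.makedirs(os.path.dirname(prefix) or ".", exist_ok=True)
    env = {**os.environ, **compare_env(**kwargs)}
    for n in range(1, count + 1):
        with open(f"{prefix}{n}.csv", "w") as out, open(f"{prefix}{n}.err", "w") as err:
            subprocess.run([script], env=env, stdout=out, stderr=err, check=True)
```

For example, run_repetitions(2, num_tasks=8, num_nodes=8, i_tiles=2) would produce data/compare1.csv and data/compare2.csv; passing account and partition as well switches the runs to Slurm.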

Testing

The tests verify the type safety and functionality of the Noarr MPI abstraction. To run them (requires ctest), use the following command:

./test.sh

The result of the command is a list of tests and their outcomes. All tests should pass.

License

This project is licensed under the MIT License - see the LICENSE file for details.
