This repository contains the proof-of-concept implementation accompanying the paper Layout-Agnostic MPI Abstraction for Distributed Computing in Modern C++, together with the evaluation of the proposed abstraction on a distributed GEMM kernel.
When using our work, please cite the paper:
@inproceedings{klepl2026layout,
author="Klepl, Ji{\v{r}}{\'i} and Kruli{\v{s}}, Martin and Brabec, Maty{\'a}{\v{s}}",
title="Layout-Agnostic MPI Abstraction for Distributed Computing in Modern C++",
booktitle="Recent Advances in the Message Passing Interface",
year="2026",
publisher="Springer Nature Switzerland",
pages="36--53",
doi="10.1007/978-3-032-07194-1_3",
isbn="978-3-032-07194-1"
}
Message Passing Interface (MPI) has been a well-established technology in the domain of distributed high-performance computing for several decades. However, one of its greatest drawbacks is a rather ancient pure-C interface. It lacks many useful features of modern languages (namely C++), like basic type-checking or support for generic code design. In this paper, we propose a novel abstraction for MPI, which we implemented as an extension of the C++ Noarr library. It follows Noarr paradigms (first-class layout and traversal abstraction) and offers layout-agnostic design of MPI applications. We also implemented a layout-agnostic distributed GEMM kernel as a case study to demonstrate the usability and syntax of the proposed abstraction. We show that the abstraction achieves performance comparable to the state-of-the-art MPI C++ bindings while allowing for a more flexible design of distributed applications.
The project contains a header-only extension of the Noarr library that provides a layout-agnostic C++ abstraction for MPI. The library can be included using the following include directive (assuming an MPI implementation is available on the system and the include directory is in the include path):
#include <noarr/mpi.hpp>
The library provides a set of abstractions for MPI operations, including:
- mpi_transform (include/noarr/mpi/transform.hpp) - a function that transforms Noarr structures to MPI datatypes.
- mpi_traverser_t (include/noarr/mpi/traverser.hpp) - a class that associates a Noarr traverser with an MPI communicator.
- scatter, gather, broadcast (include/noarr/mpi/algorithms.hpp) - functions that implement collective operations for Noarr structures.
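To give an intuition for what describing a layout as an MPI datatype means, the following minimal sketch models the parameters of the standard MPI_Type_vector call (count, block length, stride) in plain C++ and uses them to collect a tile from a row-major or a column-major matrix. It does not use MPI or Noarr and is not the implementation of mpi_transform; the names strided_blocks, row_major_tile, col_major_tile, and pack are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor mirroring the parameters of MPI_Type_vector:
// `count` blocks of `blocklength` elements whose starts lie `stride` elements apart.
struct strided_blocks {
    std::size_t count;
    std::size_t blocklength;
    std::size_t stride;
};

// A rows x cols tile of a row-major matrix with `ld` elements per row.
constexpr strided_blocks row_major_tile(std::size_t rows, std::size_t cols, std::size_t ld) {
    return {rows, cols, ld};
}

// The same tile of a column-major matrix with `ld` elements per column.
constexpr strided_blocks col_major_tile(std::size_t rows, std::size_t cols, std::size_t ld) {
    return {cols, rows, ld};
}

// Copies the described elements, starting at `offset`, into a contiguous
// buffer in the order in which they lie in memory.
template<class T>
std::vector<T> pack(const std::vector<T>& data, std::size_t offset, strided_blocks d) {
    std::vector<T> out;
    out.reserve(d.count * d.blocklength);
    for (std::size_t b = 0; b < d.count; ++b) {
        for (std::size_t e = 0; e < d.blocklength; ++e) {
            const std::size_t idx = offset + b * d.stride + e;
            assert(idx < data.size());
            out.push_back(data[idx]);
        }
    }
    return out;
}
```

The tile is the same logical object in both cases; only the descriptor changes with the layout. An abstraction that derives such descriptors from the structure definition lets the communication code stay independent of the layout.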
To showcase the proposed abstraction, we implemented a distributed GEMM kernel with it and, for comparison, with several other libraries. All implementations are included in the examples directory.
- examples/noarr/gemm.cpp - implementation of the distributed GEMM kernel using the Noarr library and the proposed Noarr MPI abstraction.
- examples/boost/gemm.cpp - implementation of the distributed GEMM kernel using the Boost.MPI library and serialization. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.
- examples/boostP2P/gemm.cpp - implementation of the distributed GEMM kernel using the Boost.MPI library and point-to-point communication of matrices serialized into input/output archives. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.
- examples/kokkosComm/gemm.cpp - implementation of the distributed GEMM kernel using the Kokkos library and the KokkosComm abstraction that enables communication over MPI.
- examples/mpi/gemm.cpp - implementation of the distributed GEMM kernel using the MPI interface directly. This implementation ensures the layout-agnostic design of the GEMM kernel via the mdspan abstraction that is part of the C++ standard.
All implementations are based on the GEMM kernel implementation from the PolyBench/C 4.2.1 benchmark suite by Louis-Noël Pouchet et al. The original code is available at https://sourceforge.net/projects/polybench/files.
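As an illustration of what a layout-agnostic GEMM kernel looks like, here is a minimal single-process sketch that follows the PolyBench loop structure (C := alpha * A * B + beta * C) and touches matrix elements only through an index mapping. It uses neither Noarr nor mdspan; the row_major, col_major, and matrix types are hypothetical helpers written for this example and do not come from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index mappings: the only place where the memory layout is visible.
struct row_major {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t r, std::size_t c) const { return r * cols + c; }
};

struct col_major {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t r, std::size_t c) const { return c * rows + r; }
};

// A dense matrix of floats whose layout is a template parameter.
template<class Layout>
class matrix {
public:
    matrix(std::size_t rows, std::size_t cols)
        : layout_{rows, cols}, data_(rows * cols, 0.0f) {}

    std::size_t rows() const { return layout_.rows; }
    std::size_t cols() const { return layout_.cols; }

    float& operator()(std::size_t r, std::size_t c) { return data_[layout_(r, c)]; }
    float operator()(std::size_t r, std::size_t c) const { return data_[layout_(r, c)]; }

private:
    Layout layout_;
    std::vector<float> data_;
};

// C := alpha * A * B + beta * C, written once for any combination of layouts.
template<class LC, class LA, class LB>
void gemm(float alpha, float beta, matrix<LC>& C, const matrix<LA>& A, const matrix<LB>& B) {
    assert(A.rows() == C.rows() && B.cols() == C.cols() && A.cols() == B.rows());
    for (std::size_t i = 0; i < C.rows(); ++i) {
        for (std::size_t j = 0; j < C.cols(); ++j)
            C(i, j) *= beta;
        for (std::size_t k = 0; k < A.cols(); ++k)
            for (std::size_t j = 0; j < C.cols(); ++j)
                C(i, j) += alpha * A(i, k) * B(k, j);
    }
}
```

Calling gemm with a matrix<row_major> A and a matrix<col_major> B compiles and computes the same result as any other combination; the kernel body never changes. The distributed implementations in the examples directory apply the same idea to the tiles held by each MPI process.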
To build the project, you need to have CMake and a C++ compiler installed. The project assumes support for C++20 or later and requires MPI to be installed on your system (and the mpi.h header file must be available). The other dependency, the Noarr library, is retrieved automatically by CMake.
To build the project, run the following commands:
# Get the source code from GitHub
git clone https://github.com/jiriklepl/noarr-mpi
cd noarr-mpi
# Configure the project (also retrieves external dependencies)
./configure.sh
# Build the project
./build.sh
The script configure.sh creates a build directory and runs CMake to configure the project. The script build.sh builds the executables for all GEMM variants and configurations, which differ in the dataset size and in the major dimensions of the privatized sub-matrices used in the distributed GEMM kernel. For each dataset size, there are eight configurations of the GEMM kernel in total, each named gemm-<framework>-<dataset-size>-<C-tile-major-dim>-<A-tile-major-dim>-<B-tile-major-dim>, where <framework> is the name of the framework used (e.g., noarr, boost, boostP2P, kokkosComm, mpi), <dataset-size> is the size of the dataset (MINI, SMALL, MEDIUM, LARGE, EXTRALARGE), and the three <...-tile-major-dim> fields give the major dimension of the privatized sub-matrix of C, A, and B, respectively. The dataset sizes are defined in examples/include/gemm.hpp.
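When post-processing results or scripting runs, it can help to split an executable name back into its fields. The sketch below does this for the naming scheme described above, assuming that no field contains a hyphen (true for the framework and dataset names listed). The gemm_variant struct and parse_variant function are hypothetical and not part of the repository, and the tile fields are kept as opaque strings because their possible values are not listed here.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Fields of gemm-<framework>-<dataset-size>-<C-tile-major-dim>-<A-tile-major-dim>-<B-tile-major-dim>.
struct gemm_variant {
    std::string framework;
    std::string dataset;
    std::string c_major;
    std::string a_major;
    std::string b_major;
};

// Returns the parsed fields, or std::nullopt if the name does not match the scheme.
std::optional<gemm_variant> parse_variant(const std::string& name) {
    std::vector<std::string> parts;
    std::istringstream in(name);
    for (std::string part; std::getline(in, part, '-');) {
        if (part.empty())
            return std::nullopt; // empty field, e.g. two consecutive hyphens
        parts.push_back(part);
    }
    if (parts.size() != 6 || parts[0] != "gemm")
        return std::nullopt;
    return gemm_variant{parts[1], parts[2], parts[3], parts[4], parts[5]};
}
```

For a name such as gemm-noarr-MINI-X-Y-Z (where X, Y, and Z are placeholders for the tile fields), the function yields the framework noarr and the dataset MINI; names with a different number of fields are rejected.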
To run the script that automatically runs each of the GEMM variants using mpirun, run the following command:
./compare.sh
Each execution performs an unmeasured warm-up run followed by 20 measurement runs of a given GEMM kernel implementation, dataset size, and configuration. The execution then reports the average execution time and the standard deviation of the measurements in seconds. The output of the script is in CSV format with the following columns:
algorithm,framework,dataset,datatype,c_tile,a_tile,b_tile,i_tiles,mean_time,sd_time,valid
- algorithm - the name of the algorithm (always gemm).
- framework - the name of the framework used (e.g., noarr, boost, boostP2P, kokkosComm, mpi).
- dataset - the size of the dataset (MINI, SMALL, MEDIUM, LARGE, EXTRALARGE).
- datatype - the scalar datatype used in the GEMM kernel (for the default configuration of the project, it is always FLOAT).
- c_tile, a_tile, b_tile - the major dimensions of the privatized sub-matrices used in the distributed GEMM kernel.
- i_tiles - the number of tiles in the i dimension of the C matrix; the number of tiles in the j dimension is determined by the number of MPI processes.
- mean_time - the average execution time of the GEMM kernel in seconds.
- sd_time - the standard deviation of the execution time in seconds.
- valid - a flag indicating whether the result of the GEMM kernel is valid (1) or not (0).
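The measurement procedure described above (one unmeasured warm-up run, then a fixed number of timed runs summarized by their mean and standard deviation) can be sketched as follows. This is a single-process illustration using std::chrono, not the timing code of the repository: the names timing, summarize, and measure are hypothetical, the sketch does not synchronize MPI processes, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of a series of measurements, in seconds.
struct timing {
    double mean;
    double sd;
};

// Mean and sample standard deviation (assumption: n - 1 in the denominator).
inline timing summarize(const std::vector<double>& samples) {
    assert(samples.size() >= 2);
    const double n = static_cast<double>(samples.size());
    double sum = 0.0;
    for (double x : samples)
        sum += x;
    const double mean = sum / n;
    double squares = 0.0;
    for (double x : samples)
        squares += (x - mean) * (x - mean);
    return {mean, std::sqrt(squares / (n - 1.0))};
}

// Runs `kernel` once without measuring it, then times `runs` further calls.
template<class Kernel>
timing measure(Kernel&& kernel, std::size_t runs = 20) {
    kernel(); // unmeasured warm-up run
    std::vector<double> seconds;
    seconds.reserve(runs);
    for (std::size_t r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return summarize(seconds);
}
```

With the default of 20 runs, the kernel is invoked 21 times in total, and the two fields of the returned timing correspond to the mean_time and sd_time columns.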
To run the same experiment on a Slurm cluster (such as https://gitlab.mff.cuni.cz/mff/hpc/clusters), modify the command as follows (replace YOUR_ACCOUNT and YOUR_PARTITION with your Slurm account and partition):
USE_SLURM=1 NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ACCOUNT=YOUR_ACCOUNT PARTITION=YOUR_PARTITION ./compare.sh
The file data/compare1.csv shows a possible output of the script when run on a Slurm cluster with the specified parameters.
To visualize the results, create a Python virtual environment and activate it:
python3 -m venv .venv
. .venv/bin/activate
Upgrade pip and install the requirements:
pip install --upgrade pip
pip install -r requirements.txt
Then generate the plots from a results file:
python3 gen_plots.py PATH_TO_CSV_FILE
where PATH_TO_CSV_FILE is the path to the CSV file containing the results of the GEMM kernel execution (e.g., data/compare1.csv).
To reproduce the results reported in the paper, run the following sequence of commands:
./configure.sh
./build.sh
mkdir -p data
# May run for over an hour; add USE_SLURM=1 to run on a Slurm cluster (and specify ACCOUNT and PARTITION)
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compare.sh > data/compare1.csv 2> data/compare1.err
# Perform further experiments for more stable results
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compare.sh > data/compare2.csv 2> data/compare2.err
# ...
# (Optional) Do the same with the ./compareScatter.sh script that was used to empirically determine the optimal logical dimension ordering
NUM_TASKS=8 NUM_NODES=8 I_TILES=2 ./compareScatter.sh > data/compareScatter1.csv 2> data/compareScatter1.err
# ...
./gen_plots.sh # Requires the Python virtual environment to be set up as described above
To run the tests that verify the type safety and functionality of the Noarr MPI abstraction (requires ctest), run the following command:
./test.sh
The command prints a list of tests and their outcomes. All tests should pass.
This project is licensed under the MIT License - see the LICENSE file for details.