prov/lnx: multi-rail messages delivered out of order #11723

@RaymondMichael

Describe the bug
When the LNX provider is using multiple NICs, messages can be delivered out of order. The lnx_select_send_endpoints() function round-robins between the NICs, so if message A is sent on NIC 0, the next message B is sent on NIC 1. Nothing prevents message B from arriving at the destination and being matched before message A, even though both were sent to the same peer with the same tag.
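
To make the failure mode concrete, here is a minimal, self-contained sketch of round-robin rail selection plus one possible way a receiver could restore send order using per-peer sequence numbers. This is not the LNX code and not a proposed patch: every name here (`sketch_sender`, `sketch_send`, `sketch_arrive`, `NUM_RAILS`, `WINDOW`) is hypothetical, and the reorder window is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RAILS 2	/* hypothetical: two rails, e.g. shm and cxi */
#define WINDOW    64	/* hypothetical reorder window size */

/* One message as seen on the wire in this sketch. */
struct sketch_msg {
	uint64_t seq;		/* per-peer sequence number set by the sender */
	uint32_t rail;		/* rail the sender picked */
	int payload;
};

/* Sender state: round-robin rail cursor and next sequence number. */
struct sketch_sender {
	uint32_t next_rail;
	uint64_t next_seq;
};

/* Receiver state: next sequence number to deliver and held arrivals. */
struct sketch_receiver {
	uint64_t expected_seq;
	bool held[WINDOW];
	int held_payload[WINDOW];
};

/* Pick the next rail round-robin and stamp the message with a sequence. */
static struct sketch_msg sketch_send(struct sketch_sender *s, int payload)
{
	struct sketch_msg m = {
		.seq = s->next_seq++,
		.rail = s->next_rail,
		.payload = payload,
	};

	s->next_rail = (s->next_rail + 1) % NUM_RAILS;
	return m;
}

/*
 * Record one arrival, then copy every payload that is now deliverable
 * in send order into out[]. Returns the number of payloads delivered,
 * or -1 if the message is a stale duplicate or beyond the window.
 */
static int sketch_arrive(struct sketch_receiver *r, struct sketch_msg m,
			 int *out, size_t out_cap)
{
	size_t n = 0;

	assert(out != NULL);

	if (m.seq < r->expected_seq || m.seq >= r->expected_seq + WINDOW)
		return -1;

	/* Hold the arrival until all earlier sequence numbers are in. */
	r->held[m.seq % WINDOW] = true;
	r->held_payload[m.seq % WINDOW] = m.payload;

	/* Drain the contiguous run starting at expected_seq. */
	while (n < out_cap && r->held[r->expected_seq % WINDOW]) {
		out[n++] = r->held_payload[r->expected_seq % WINDOW];
		r->held[r->expected_seq % WINDOW] = false;
		r->expected_seq++;
	}

	return (int)n;
}
```

In this sketch, if message B (sent second, on rail 1) reaches the receiver before message A (sent first, on rail 0), `sketch_arrive()` holds B and delivers nothing; when A arrives, both are delivered in send order. Without some ordering step of this kind, matching in arrival order would hand B to the application first, which is the behavior described above.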

To Reproduce
export FI_LNX_PROV_LINKS="shm+cxi"
Run the following MPI program on two nodes with one rank each. I ran this with Open MPI 5.0.8, but the version shouldn't matter.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#define TRANSFERS (1024 * 1024)

static int sdata[TRANSFERS];
static int rdata[TRANSFERS];

int
main(int argc, char *argv[])
{
	int rank, size, bad_count = 0;

	for (int i = 0; i < TRANSFERS; i++) {
		sdata[i] = i;
	}

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	MPI_Barrier(MPI_COMM_WORLD);

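	/*
	 * Every message uses the same source, tag, and communicator, so
	 * MPI's non-overtaking rule requires the receives to match in
	 * send order: rdata[i] must end up equal to i.
	 */
	for (int i = 0; i < TRANSFERS; i++) {
		if (rank == 0) {
			MPI_Send(&sdata[i], 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
		} else {
			MPI_Recv(&rdata[i], 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
				  MPI_STATUS_IGNORE);
		}
	}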

	if (rank == 1) {
		for (int i = 0; i < TRANSFERS; i++) {
			if (rdata[i] != i)
				bad_count++;
		}

		fprintf(stderr, "Bad count %d %f\n",
				bad_count, (float)bad_count / TRANSFERS);
	}

	MPI_Finalize();
	return 0;
}

On my current system, about 0.25% of the received values land in the wrong position, but results will vary.
