
MPI_Rank Object


What if the Rank were an Object

This document is a work in progress describing the open idea of turning the rank into a full MPI object. Note that it is very futuristic and only aims at playing with the MPI_Rank object idea.

This page is a work in progress and no idea here is finalized! Whether you dislike or like some of these ideas, feel free to join the HW topology working group to help develop them.

Rationale

In this wiki page, we explore the idea of replacing the current MPI rank, which is a plain integer, with a more complex object. The initial goal is to capture the topological nature of these ranks with respect to each other and to the whole computing system. As a side effect, we may also explore more futuristic ideas.

Currently, the rank is obtained as follows:

int rank;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );

We would like to explore the consequences of the following instead:

MPI_Rank rank;
MPI_Comm_rank( "URI", &rank );

We immediately see the following:

  • It is now possible to attach information to the MPI_Rank object in MPI_T (as outlined by Martin)
  • Ranks can be more complex than integers (64-bit at first!) and may carry extra metadata (a possible layout is sketched after this list; see also the later examples)
  • Is MPI bound to the rank object -- should the rank object expose the whole interface?
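
To make the second bullet concrete, here is a purely hypothetical sketch of what an MPI_Rank object could carry beyond a plain integer. None of these fields are proposed text; they only illustrate the kind of metadata that could hide behind an opaque handle.

/* Hypothetical sketch only -- not proposed standard text.  It illustrates
   the kind of metadata an MPI_Rank handle could hide behind an opaque type. */
#include <stdint.h>

typedef struct MPIX_Rank_s {
    uint64_t id1d;      /* flattened 1D identifier (64-bit, no longer an int) */
    int      ndims;     /* depth of the topological hierarchy */
    int     *coords;    /* coordinates, e.g. (node, socket, core) */
    void    *group;     /* handle to the owning group / session */
} MPIX_Rank;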

On the Semantics of the Source

In MPI, the source rank is omitted inside calls as it is implicitly the rank of the calling process in the communicator passed as a parameter. For this reason, in the Endpoints proposal, the handle differentiating between endpoints is a communicator hosting a single rank. It is, in fact, the only way to differentiate between multiple endpoints, as MPI does not have the notion of an MPI_Rank, which is practically an endpoint objectifying the local instance of MPI.

With an MPI_Rank object, one may instantiate the whole interface as part of the rank and issue topological queries:

MPIX_Rank mpi;
/* Get my rank in the *comm_world* group */
MPIX_Init( "mpi://comm_world", &mpi );

int size;
mpi.size(&size);

/* Get remote rank in the *comm_world* group using a 1D addressing */
MPIX_Rank remote;
mpi.resolve1D( mpi.id1D() + 1, &remote );

char data[100];
/* No communicator argument needed: it is hidden in the *remote* object */
mpi.Send(remote, data, 100, MPI_CHAR, 123 /* TAG */ );
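
For comparison, here is the same "talk to my +1 neighbour in a 1D layout" pattern expressed with today's integer ranks and a Cartesian communicator. This is plain, standard MPI, not the proposed interface, and it uses MPI_Sendrecv so the ring exchange cannot deadlock.

/* Today's equivalent with integer ranks: a periodic 1D Cartesian
   communicator where each rank exchanges with its +1 neighbour. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1]    = { size };
    int periods[1] = { 1 };          /* wrap around at the edges */
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);

    int left, right;
    MPI_Cart_shift(ring, 0, 1, &left, &right);

    char out[100] = { 0 }, in[100];
    MPI_Sendrecv(out, 100, MPI_CHAR, right, 123 /* TAG */,
                 in,  100, MPI_CHAR, left,  123,
                 ring, MPI_STATUS_IGNORE);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}

Note how both the communicator and the tag remain explicit here, whereas the MPI_Rank sketch above folds the communicator into the *remote* object.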

Rank as linked to the topology

As with IP addresses in IP networks, the rank may have multiple meanings, including outside of the MPI job. Currently, MPI allows you to manipulate ranks from COMM_WORLD or spawned communicators in a very rich manner. One of the main shortcomings is the external manipulation of these ranks, as they have no meaning outside of the job's boundaries -- despite PMI efforts.

This yields the idea of machine-wide ranks which depend on the batch manager topology (as outlined in the MPI_T query proposal). To illustrate this, think of your batch manager as a DHCP server: launching a job amounts to requesting a lease for a given set of resources.

Following the MPI_T split, consider that your Slurm installation has a 3D space to allocate:

  • Dimension 1: Nodes
  • Dimension 2: Sockets
  • Dimension 3: Cores (on each socket)

One can see that these coordinates are hierarchical. We have two nodes, two sockets per node and two cores per socket:

0                                  1
0          1                       0           1
0  1       0   1                   0  1        0  1  

Now consider that we allocate a four-process job with two cores per process. Each MPI_Rank would then have the following topological coordinates:

$ srun -N=2 -p=2 -c=2 ./a.out
0 - (0,0,*)
1 - (0,1,*)
2 - (1,0,*)
3 - (1,1,*)
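
A minimal sketch of the arithmetic behind these coordinates, assuming the layout above (two sockets per node, one process per socket, both cores owned by the process). The helper name is ours, not proposed API.

/* Sketch of the mapping used in the example above (assumption: 2 sockets
   per node, 1 process per socket, both cores owned by the process). */
void lease_coords(int process, int *node, int *socket)
{
    *node   = process / 2;   /* processes 0,1 on node 0; 2,3 on node 1 */
    *socket = process % 2;   /* alternating sockets within a node      */
    /* The core dimension stays "*": the process owns both cores.      */
}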

One can see that ranks are located in the middle of the hierarchy, which means in this case that the batch manager handles the hierarchy. If the batch manager were not handling the hierarchy (providing 1D leases instead), we could have ended up with a scalar rank, which is close to what MPI provides on COMM_WORLD, and it would not be possible to infer any distance -- it is up to the batch manager to define how it enumerates resources.

The rank is now bound to the topology tree and the endpoint semantics are easy to derive: the user can create an endpoint which is callable as long as it is covered by the hierarchy of the calling rank:

int endpoints_fn(MPI_Rank endpoint)
{
        /* Do work for the rank */
        return 0;
}


int main( ... )
{
        MPIX_Rank mpi;
        /* Get my rank in the *comm_world* group */
        MPIX_Init( "mpi://comm_world", &mpi );

        /* This will return 2 in the above example */
        int local_size;
        mpi.local_size(&local_size);

        /* This creates two pinned threads with new ranks
           running at their respective topological coordinates */
        mpi.local_spawn(endpoints_fn);
}
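
Roughly what local_spawn would have to do with today's APIs is sketched below: threads plus MPI_THREAD_MULTIPLE, with the actual core pinning (e.g. via hwloc) left out. The per-thread "rank" is faked with an integer because today the threads still share the process rank.

/* Rough emulation with today's MPI + pthreads (pinning omitted).
   Each thread stands in for one of the spawned local ranks. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *endpoint_fn(void *arg)
{
    int endpoint_id = *(int *)arg;
    /* Do work for this endpoint -- today it still shares the process rank */
    printf("endpoint %d running\n", endpoint_id);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int local_size = 2;               /* stand-in for mpi.local_size() */
    pthread_t th[2];
    int ids[2] = { 0, 1 };

    for (int i = 0; i < local_size; i++)
        pthread_create(&th[i], NULL, endpoint_fn, &ids[i]);
    for (int i = 0; i < local_size; i++)
        pthread_join(th[i], NULL);

    MPI_Finalize();
    return 0;
}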

One can see that local ranks may not span outside of the node dimension; how can MPI know this?

To achieve this, MPI may require a given topology in COMM_WORLD from the batch manager. This is inspired by the Fujitsu example and could be exposed in the program as follows:

MPIX_Rank mpi;
/* Get my rank in the *comm_world* group */
MPIX_Init( "mpi://comm_world", &mpi );

int size;
mpi.size(&size);

int dimension;
mpi.dimension(&dimension);

/* Dimension can be two for example */
MPIX_Rank remote;
mpi.resolveND( [ mpi.idND(dimension, 0) + 1, mpi.idND(dimension, 1) ],
               dimension, &remote );

char data[100];
/* No communicator argument needed: it is hidden in the *remote* object */
mpi.Send(remote, data, 100, MPI_CHAR, 123 /* TAG */ );
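
Today's closest analogue of resolveND() is MPI_Cart_rank, which turns a coordinate vector into an integer rank inside a Cartesian communicator. The snippet below builds a 2D periodic grid over COMM_WORLD purely for illustration.

/* Today's analogue: convert coordinates to a rank with MPI_Cart_rank
   (2D, periodic grid built over COMM_WORLD purely for illustration). */
#include <mpi.h>

void resolve_2d_neighbour(int *target_rank)
{
    int size, me;
    int dims[2] = { 0, 0 }, periods[2] = { 1, 1 };
    int coords[2], target_coords[2];
    MPI_Comm grid;

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

    MPI_Comm_rank(grid, &me);
    MPI_Cart_coords(grid, me, 2, coords);

    target_coords[0] = coords[0] + 1;   /* +1 in the first dimension */
    target_coords[1] = coords[1];       /* (periodic, so it wraps)   */
    MPI_Cart_rank(grid, target_coords, target_rank);

    MPI_Comm_free(&grid);
}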

Distance

With a rank object, the topology could be exposed either through dedicated functions or through MPI_T, which would then be capable of attaching data to a given rank.

Examples of topological queries over a rank object:

/* Get group dimension */
mpi.dimension(int *dimension);

/* Get remote rank according to group topology */
mpi.resolveND( int * coordinates, int dimension, MPI_Rank * rank );

/* Get nearest rank in current set */
mpi.nearest( MPI_Rank * rank );

/* Get Nth nearest */
mpi.knearest( int k, MPI_Rank * rank );

/* Get topological neighbors */
int neighbors_count;
mpi.neighbors_size(&neighbors_count);

MPI_Rank *neighbors = malloc(neighbors_count * sizeof(MPI_Rank));
mpi.neighbors( neighbors );

/* Compute distance to a remote rank */
double distance;
distance = mpi.distance( MPI_Rank remote );
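
A hedged sketch of what distance() could compute from the topological coordinates introduced earlier: a purely hierarchical metric where two ranks are farther apart the higher the level at which their coordinates first diverge. This is an assumption about one possible definition, not proposed text.

/* One possible (assumed) definition of distance(): the depth at which
   two coordinate vectors first diverge -- differing already at the node
   level costs more than differing only at the core level. */
int topo_distance(const int *a, const int *b, int ndims)
{
    int d;
    for (d = 0; d < ndims; d++)
        if (a[d] != b[d])
            return ndims - d;   /* diverge at level d: bigger distance */
    return 0;                   /* identical coordinates               */
}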

MPI_T approach

With MPI_T, it is possible to attach a variable to any object while returning an arbitrary number of elements of the type queried with MPI_T_PVAR_GET_INFO. It would then be straightforward to return an array describing the coordinates when querying, for example, the hardware address of a given rank.

Stolen from the standard:

MPI_T_pvar_handle_alloc(MPI_T_pvar_session session, int pvar_index,
                            void *obj_handle, MPI_T_pvar_handle *handle, int *count)

This routine binds the performance variable specified by the argument index to an
MPI object in the session identified by the parameter session. The object is passed in the
argument obj_handle as an address to a local variable that stores the object’s handle. The
argument obj_handle is ignored if the MPI_T_PVAR_GET_INFO call for this performance
variable returned MPI_T_BIND_NO_OBJECT in the argument bind. The handle allocated to
reference the variable is returned in the argument handle. Upon successful return, count
contains the number of elements (of the datatype returned by a previous
MPI_T_PVAR_GET_INFO call) used to represent this variable.

        Advice to users. The count can be different based on the MPI object to which the performance variable was bound. For example, variables bound to communicators
        could have a count that matches the size of the communicator.
        It is not portable to pass references to predefined MPI object handles, such as
        MPI_COMM_WORLD, to this routine, since their implementation depends on the MPI
        library. Instead, such an object handle should be stored in a local variable and the address of this local variable should be passed into MPI_T_PVAR_HANDLE_ALLOC.
        (End of advice to users.)

The value of index should be in the range 0 to num_pvar − 1, where num_pvar is the
number of available performance variables as determined from a prior call to
MPI_T_PVAR_GET_NUM. The type of the MPI object it references must be consistent
with the type returned in the bind argument in a prior call to MPI_T_PVAR_GET_INFO.
For all routines in the rest of this section that take both handle and session as IN
or INOUT arguments, if the handle argument passed in is not associated with the session
argument, MPI_T_ERR_INVALID_HANDLE is returned.
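
A sketch of how such a query could look, assuming a performance variable named "hw_coords" exposed by the implementation and a new bind type allowing a variable to be bound to a rank object. The variable name and the rank-object binding are assumptions; the MPI_T calls themselves are the standard ones quoted above.

/* Sketch: reading a hypothetical "hw_coords" pvar bound to a rank object.
   The variable name and the rank-object binding are assumptions; the
   MPI_T calls are the standard ones described above. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

void read_hw_coords(void *rank_obj /* hypothetical MPIX_Rank handle */)
{
    int provided, num;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num);

    for (int i = 0; i < num; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dt;
        MPI_T_enum   et;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dt, &et, desc, &desc_len, &bind,
                            &readonly, &continuous, &atomic);
        if (strcmp(name, "hw_coords") != 0)      /* hypothetical variable */
            continue;

        MPI_T_pvar_session session;
        MPI_T_pvar_handle  handle;
        int count;                               /* number of coordinates */
        MPI_T_pvar_session_create(&session);
        MPI_T_pvar_handle_alloc(session, i, rank_obj, &handle, &count);

        int coords[16];
        if (count <= 16) {
            MPI_T_pvar_read(session, handle, coords);
            printf("rank has %d topological coordinates\n", count);
        }

        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
    }
    MPI_T_finalize();
}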

Inter-Job message passing in MPI

One advantage of hardware IDs is that any execution stream is now identified in a unique manner. With the abstraction of groups (which are sessions), you can easily resolve (as with a DNS request) a given set of processes, even a remote one -- anywhere on the machine. This explores the connection with the "Session Idea". It poses the question of replacing services relying on sockets with MPI, and it is somewhat related to message-queue protocols.

Let's consider:

  • A "central" IO server running on MPI
  • A job acting as an IO proxy
  • A job doing some useful computation

If we use the idea of sessions with URIs, we can launch:

    srun -@ "mpi://private/io_proxy" -N 2 -n 2 -c 32 ./io_cache_engine
    srun -@ "mpi://private/simu" -N 16 -n 16 -c 32 ./my_simu

On the admin side someone ran:

    srun -@ "mpi://io" -N 16 -n 16 -c 32 ./my_simu

The proxy may communicate with the IO server:

MPIX_Rank mpi;
/* Get my rank in the io_proxy group;
   if we were not launched in this group, this fails */
MPIX_Init( "mpi://private/io_proxy", &mpi );

int size;
mpi.size(&size);

/* Now pick a random rank in the IO group;
   this succeeds as the IO group is in a global namespace */
MPIX_Rank remote;
mpi.pick( "mpi://io", &remote );

char data[100];
/* No communicator argument needed: it is hidden in the *remote* object */
mpi.Send(remote, data, 100, MPI_CHAR, 123 /* TAG */ );

How "my_simu" connects to the IO proxy:

MPIX_Rank mpi;
/* Get my rank in the simu group;
   if we were not launched in this group, this fails */
MPIX_Init( "mpi://private/simu", &mpi );

int size;
mpi.size(&size);

/* Now pick a random rank in the IO proxy group */
MPIX_Rank remote;
mpi.pick( "mpi://private/io_proxy", &remote );

char data[100];
/* No communicator argument needed: it is hidden in the *remote* object */
mpi.Send(remote, data, 100, MPI_CHAR, 123 /* TAG */ );
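
For contrast, today's MPI already lets two independently launched jobs find each other through the name-publishing interface; the sketch below (service name "io" is illustrative and error handling is omitted) shows the server and client sides that the URI-based pick() above would replace.

/* Today's inter-job connection: name publishing + connect/accept.
   The service name "io" is illustrative; error handling omitted. */
#include <mpi.h>

/* Server side (the central IO server) */
void io_server_accept(MPI_Comm *client)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("io", MPI_INFO_NULL, port);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, client);
}

/* Client side (the IO proxy) */
void io_proxy_connect(MPI_Comm *server)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("io", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, server);
}

Once connected, the ranks inside the new inter-communicator are still plain integers, which is exactly the limitation the topological MPI_Rank aims to lift.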

The topological rank could then be a way to address processes inside MPI Sessions which, to pursue our IP analogy, are the DNS aliases attached to a given group of topological ranks. One point to observe is that this supposes a close collaboration with the batch manager, which ends up being the cornerstone component addressing jobs and connecting them (it is already the case with PMI). It behaves like a big DHCP server.

With these semantics, it becomes easier to replace sockets with MPI, as it is now possible to expose machine-wide services through MPI by providing an alternative namespace (topological addresses instead of IP addresses) and addressing scheme (session URIs instead of ports). This way, high-performance network cards can be taken advantage of in a more regular "client-server" model such as the one used in Kubernetes. Note that to cover such use cases, MPI should provide convincing "protection domains" (https://kubernetes.io/docs/concepts/cluster-administration/networking/), and this may be provided reliably by the batch manager instead of relying on Layer-3+ abstractions.

Moreover, it opens the way for more orthogonal computations, as explored at length by the Sessions WG (in-situ, tools, ...).

Still, MPI may require some extensions to its client-server model (encouraged by the setup we present in this section), as it mainly focuses on two-sided messages (for performance). Part of this has been addressed with Mprobe, but with such a process layout, active messages become of interest in MPI (but this is another story).

