
Comments/suggestions for registering an nspace #58

@rhc54

Description


Looking at the issues list, it appears you may have been basing your work on the Slurm PMIx plugin. Understandable, but that is actually a poor choice: it (a) is missing things that OMPI and CH4 need/want, and (b) contains some outright errors. We've been trying for some time to get them to fix it, without success. Both OMPI and CH4 will run with that plugin - they just have to make assumptions that may reduce performance or require a greater exchange of information at startup.

FWIW: OMPI will not fall back to a data exchange. If it cannot get info via PMIx, it simply makes a lowest-common-denominator assumption. I don't know CH4's policy, but I suspect it does the same based on my last look at it.

The better "model" would be to use what is in PRRTE - see the code here. The Standard itself is also a good source for what procs are expecting to see (relevant section is here).

I know you don't currently support MPI Sessions, Spawn, or fault tolerance, so we can safely ignore the related values. The PMIx library will "backfill" some things - e.g., we will internally use the regex maps you provide to compute the local peers, local size, etc. if you don't give them to us. However, there are some things we cannot backfill for you that both programming libraries can use, if you have the ability to provide them.
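To make the backfill concrete, here is a minimal sketch of a server-side nspace registration that supplies only the regex maps and job size, assuming PMIx v4+. The nspace name, hostnames, and rank placement are hypothetical placeholders; a real server would generate them from its allocation and drive completion from its own event loop rather than the busy-wait shown here.

```c
/* Sketch: registering an nspace with just the maps PMIx needs to
 * backfill local peers, local size, etc. Assumes PMIx v4+; names and
 * placement strings are illustrative only. */
#include <pmix_server.h>

static void regcb(pmix_status_t status, void *cbdata)
{
    volatile int *active = (volatile int *)cbdata;
    *active = 0;   /* signal that registration completed */
}

int register_job(void)
{
    pmix_info_t info[3];
    volatile int active = 1;
    char *nodemap = NULL, *procmap = NULL;

    /* let PMIx compress the node list and rank placement into its
     * regex form - "0,1;2,3" means ranks 0,1 on nid0 and 2,3 on nid1 */
    PMIx_generate_regex("nid0,nid1", &nodemap);
    PMIx_generate_ppn("0,1;2,3", &procmap);

    uint32_t jobsize = 4;
    PMIX_INFO_LOAD(&info[0], PMIX_NODE_MAP, nodemap, PMIX_STRING);
    PMIX_INFO_LOAD(&info[1], PMIX_PROC_MAP, procmap, PMIX_STRING);
    PMIX_INFO_LOAD(&info[2], PMIX_JOB_SIZE, &jobsize, PMIX_UINT32);

    /* two of the four ranks are local to this daemon in this example */
    pmix_status_t rc = PMIx_server_register_nspace("myjob", 2, info, 3,
                                                   regcb, (void *)&active);
    if (PMIX_OPERATION_SUCCEEDED == rc) {
        active = 0;   /* completed immediately; callback will not fire */
        rc = PMIX_SUCCESS;
    } else if (PMIX_SUCCESS != rc) {
        return rc;
    }
    while (active) { /* a real server waits in its event loop instead */ }

    for (int i = 0; i < 3; i++) {
        PMIX_INFO_DESTRUCT(&info[i]);
    }
    return PMIX_SUCCESS;
}
```

From those three values PMIx can derive PMIX_LOCAL_PEERS, PMIX_LOCAL_SIZE, and friends for each process; anything not derivable from the maps has to be supplied explicitly in the same info array.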

For example, if you know the topology of each node, then you could pass down the cpuset of each process in the job and its device distances for NICs on its node (there is a PMIx API for computing the distances for you). This allows each process to determine which interface its peer will be using, which in turn allows for things like collective optimization (CH4 uses this - OMPI probably will soon as well).
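The distance computation mentioned above can be sketched as follows, assuming PMIx v4+. The function takes a node topology and one process's cpuset (both of which the host would obtain via hwloc or PMIx's own topology support) and asks PMIx for the distances to the node's NICs; everything here uses documented PMIx types, but the surrounding wiring is a hypothetical sketch.

```c
/* Sketch: computing NIC distances for one process's cpuset so the
 * result can be passed down with the job info. Assumes PMIx v4+. */
#include <pmix_server.h>

int distances_for_proc(pmix_topology_t *topo, pmix_cpuset_t *cpuset)
{
    pmix_device_distance_t *dist = NULL;
    size_t ndist = 0;

    /* restrict the query to fabric/network devices */
    pmix_info_t directive;
    pmix_device_type_t type = PMIX_DEVTYPE_NETWORK;
    PMIX_INFO_LOAD(&directive, PMIX_DEVICE_TYPE, &type, PMIX_DEVTYPE);

    pmix_status_t rc = PMIx_Compute_distances(topo, cpuset,
                                              &directive, 1,
                                              &dist, &ndist);
    PMIX_INFO_DESTRUCT(&directive);
    if (PMIX_SUCCESS != rc) {
        return rc;
    }

    /* dist[i].mindist / dist[i].maxdist give the distance range from
     * this cpuset to device i (identified by dist[i].osname/uuid);
     * the server would package these into the proc's job info here */
    PMIX_DEVICE_DISTANCE_FREE(dist, ndist);
    return PMIX_SUCCESS;
}
```

Passing the per-process cpuset plus these distances down is what lets a peer reason about which interface another rank will select.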

It is also good to let PMIx help you set up the application via the PMIx_server_setup_application API. This is where the support for Slingshot and friends is done. It needs to be executed by the equivalent of mpirun, and the returned data included in the "blob" sent to the daemons that will be hosting application processes. If you pass in the programming library name (e.g., "ompi" or "mpich"), PMIx will add whatever supporting info those libraries require/desire. In addition, we can assign network security keys and even assign endpoints (CH4 particularly uses that feature). There are a couple of APIs you would need to call on your backend daemons when fork/exec'ing the application processes - I can walk you through them if you want to pursue it.
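The mpirun-side half of that flow might look like the sketch below, assuming PMIx v4+. The nspace name is a placeholder, and the packing of the returned info into the launch message is left as a comment since it is host-specific; on each backend daemon the corresponding call is PMIx_server_setup_local_support with the unpacked info, before the app procs are fork/exec'd.

```c
/* Sketch: mpirun-equivalent side of application setup. The info array
 * handed to the callback is the "blob" to ship to the daemons.
 * Assumes PMIx v4+; "myjob" is a placeholder nspace. */
#include <pmix_server.h>

static void setupcb(pmix_status_t status, pmix_info_t info[], size_t ninfo,
                    void *provided_cbdata, pmix_op_cbfunc_t cbfunc,
                    void *cbdata)
{
    if (PMIX_SUCCESS == status) {
        /* pack info[0..ninfo-1] into the launch message sent to the
         * daemons hosting application processes (host-specific) */
    }
    if (NULL != cbfunc) {
        cbfunc(PMIX_SUCCESS, cbdata);   /* release PMIx's copy of info */
    }
}

int setup_app(void)
{
    /* naming the programming model lets PMIx add whatever that library
     * requires - security keys, endpoints, fabric info, etc. */
    pmix_info_t info;
    PMIX_INFO_LOAD(&info, PMIX_PROGRAMMING_MODEL, "ompi", PMIX_STRING);

    pmix_status_t rc = PMIx_server_setup_application("myjob", &info, 1,
                                                     setupcb, NULL);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```

Each daemon then feeds the received blob to PMIx_server_setup_local_support for the same nspace, which is what makes the Slingshot-style per-node setup happen before fork/exec.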

Again, I can point you to the relevant PRRTE areas if you want to use them as a guide - or I'm happy to provide advice. I suspect the "setup application" and distributing the returned info is the main thing that will impact you as otherwise you'd have to generate the info yourself that each library wants - and that can be a bit of a moving target, which is why we built it into PMIx so everyone got updated at the same time.
