figure out how to improve error messages in `flock_group_handle_create()` if communication fails

For some applications (such as `quintain-benchmark`), `flock_group_handle_create()` may be the first function to fail if a given process cannot communicate with a server.  It would be helpful to present a longer, user-friendly error message in this (possibly common) failure case.

One way this could be triggered is via the broader issue noted in https://github.com/mochi-hpc/mochi-margo/issues/301.  On Polaris, for example, you can reproduce it as follows:

* start bedrock on one compute node
* use mpiexec to start multiple client processes that span more than 1 compute node

If you don't use `--no-vni` or configure Mercury environment variables for VNI usage, then Mochi will attempt to use a VNI allocated exclusively by mpiexec for the client processes and will not be able to exchange RPCs with the bedrock server.  This currently produces an error code of 3 from `flock_group_handle_create()` with no description of the underlying Mercury communication problem.
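As a rough illustration of the kind of message that might help here, the sketch below formats a longer hint from the integer return code. `describe_group_handle_failure()` is a hypothetical helper, not part of Flock, and it assumes that a return value of 0 means success; the wording of the hint is taken from the scenario described above and is only one possible cause.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Flock API).
 * `ret` is the integer value returned by flock_group_handle_create();
 * 0 is assumed to mean success. `server_addr` may be NULL.
 * Writes a human-readable hint into `buf` and returns the snprintf() result,
 * or 0 (with an empty string) when `ret` indicates success. */
int describe_group_handle_failure(int ret, const char *server_addr,
                                  char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (ret == 0) {
        buf[0] = '\0';
        return 0;
    }

    return snprintf(
        buf, len,
        "flock_group_handle_create() failed with code %d while contacting %s.\n"
        "This may mean the client cannot exchange RPCs with the server.\n"
        "If the clients were launched with mpiexec across several nodes,\n"
        "check the VNI setup: try --no-vni or configure the Mercury\n"
        "environment variables for VNI usage.\n"
        "See https://github.com/mochi-hpc/mochi-margo/issues/301\n",
        ret, server_addr ? server_addr : "(unknown address)");
}
```

A caller could pass the return code to this helper and print the buffer to `stderr` before exiting. A real fix would more likely surface the underlying Mercury error inside Flock itself rather than guess at the cause on the client side.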

Not sure if this should actually be a Flock, Mercury, or Mochi issue, but I'm documenting it here because it is reproducible as something that ultimately manifests in Flock.
