For some applications (such as quintain-benchmark), flock_group_handle_create() may be the first function to fail if a given process cannot communicate with a server. It would be helpful to report a descriptive, user-friendly error message in this (possibly common) failure case.
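For context, here is a minimal sketch of the kind of call site where this surfaces. This is illustrative only, not quintain-benchmark's actual code: the Flock header names, the flock_client_init() pool argument, the 0 values for the provider id and mode arguments, and the "ofi+cxi" protocol string are assumptions from my reading of the client API, and Flock cleanup is omitted for brevity.

```c
#include <stdio.h>
#include <margo.h>
#include <flock/flock-client.h> /* header names assumed */
#include <flock/flock-group.h>

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <server-address>\n", argv[0]);
        return 1;
    }

    /* client-mode Margo instance on the same protocol as the server
     * (ofi+cxi on Polaris) */
    margo_instance_id mid = margo_init("ofi+cxi", MARGO_CLIENT_MODE, 0, 0);
    if (mid == MARGO_INSTANCE_NULL) return 1;

    /* address lookup is mostly local decoding for this transport, so it
     * can succeed even when the client and server VNIs don't match... */
    hg_addr_t svr_addr = HG_ADDR_NULL;
    if (margo_addr_lookup(mid, argv[1], &svr_addr) != HG_SUCCESS) {
        margo_finalize(mid);
        return 1;
    }

    flock_client_t client = NULL;
    flock_client_init(mid, ABT_POOL_NULL, &client);

    /* ...which makes this the first call that actually exchanges an RPC
     * with the server, and the first to fail in the scenario above */
    flock_group_handle_t gh = NULL;
    flock_return_t ret = flock_group_handle_create(
        client, svr_addr, 0 /* provider id */, 0 /* mode flags */, &gh);
    if (ret != FLOCK_SUCCESS) {
        /* all the user sees today is a bare numeric code (3 in the VNI
         * case), with no hint of the Mercury-level communication error */
        fprintf(stderr, "flock_group_handle_create failed: %d\n", (int)ret);
    }

    margo_addr_free(mid, svr_addr);
    margo_finalize(mid);
    return ret == FLOCK_SUCCESS ? 0 : 1;
}
```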
One way this can be triggered is via the broader issue noted in mochi-hpc/mochi-margo#301. On Polaris, for example, you can do the following:
- start bedrock on one compute node
- use mpiexec to start multiple client processes spanning more than one compute node
If you don't pass --no-vni or configure the Mercury environment variables for VNI usage, then Mochi will attempt to use a VNI that mpiexec allocated exclusively for the client processes, and the clients will not be able to exchange RPCs with the bedrock server. This currently produces an error code of 3 from flock_group_handle_create() with no description of the underlying Mercury communication problem.
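For what it's worth, Mercury already provides HG_Error_to_string() for turning an hg_return_t into text, so one option would be for Flock to log the Mercury-level error before collapsing it into a flock_return_t. The sketch below is hypothetical, not Flock's actual internals: the flock/flock-common.h header and the FLOCK_ERR_FROM_MERCURY name are assumptions.

```c
#include <margo.h>
#include <flock/flock-common.h> /* assumed header for flock_return_t */

/* Hypothetical helper sketching the kind of reporting requested above:
 * log the Mercury failure in human-readable form before collapsing it
 * into a Flock error code. HG_Error_to_string() and margo_error() are
 * existing helpers; the surrounding structure is illustrative only. */
static flock_return_t report_hg_failure(margo_instance_id mid,
                                        const char* what,
                                        hg_return_t hret)
{
    if (hret == HG_SUCCESS) return FLOCK_SUCCESS;
    margo_error(mid, "[flock] %s failed: %s (hg_return_t=%d)",
                what, HG_Error_to_string(hret), (int)hret);
    return FLOCK_ERR_FROM_MERCURY; /* name assumed */
}
```

With something like this in the handle-creation path, the VNI mismatch above would at least surface a Mercury-level message (e.g. a timeout or unreachable-address error) instead of a bare 3.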
Not sure if this should actually be a Flock, Mercury, or Mochi issue, but I'm documenting it here because it is reproducible and ultimately manifests in Flock.