figure out how to improve error messages in flock_group_handle_create() if communication fails #4

@carns

Description

For some applications (such as quintain-benchmark), flock_group_handle_create() may be the first function to fail if a given process cannot communicate with a server. It would be helpful to present a longer, user-friendly error message in this (possibly common) failure case.

One way this could be triggered is via the broader issue noted in mochi-hpc/mochi-margo#301. On Polaris, for example, you can do the following:

  • start bedrock on one compute node
  • use mpiexec to start multiple client processes that span more than 1 compute node

If you don't use --no-vni or configure Mercury environment variables for VNI usage, then Mochi will attempt to use a VNI allocated exclusively by mpiexec for the client processes and will not be able to exchange RPCs with the bedrock server. This currently produces an error code of 3 from flock_group_handle_create() with no description of the underlying Mercury communication problem.
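As a rough illustration of the kind of message that would help here, the sketch below formats a longer diagnostic from a nonzero return code. `explain_group_handle_failure()` and its parameters are hypothetical names invented for this example; they are not part of the Flock, Margo, or Mercury APIs, and the sketch makes no assumption about what error code 3 means internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (NOT part of Flock): turn a nonzero return code from
 * flock_group_handle_create() into a longer hint for the user.
 * Returns 0 and writes an empty string when code == 0; otherwise returns
 * the snprintf() result for the formatted message.
 */
static int explain_group_handle_failure(int code, const char *server_addr,
                                        char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (code == 0) {
        buf[0] = '\0'; /* success: nothing to explain */
        return 0;
    }

    return snprintf(buf, len,
        "flock_group_handle_create() failed with error code %d while "
        "contacting %s.\n"
        "This process may be unable to exchange RPCs with the server.\n"
        "If clients were launched with mpiexec, check whether --no-vni or "
        "the Mercury VNI environment variables are needed.\n",
        code, server_addr ? server_addr : "(unknown address)");
}
```

A caller could pass the return code and the server address string it tried to reach, then print the buffer to stderr when the code is nonzero. Where the hint should actually live (Flock, Margo, or Mercury) is the open question of this issue.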

Not sure if this should actually be a Flock, Mercury, or Mochi issue, but I'm documenting it here because it is reproducible as something that ultimately manifests in Flock.