Problems with ucx-conduit+PSHM in CI #7

Open
@bonachea

Background

This issue is forked from issue #6, where the GASNet-EX configure default of --enable-pshm was restored for most Realm build configurations in 7a073d3, thereby enabling GASNet's efficient shared-memory transport (PSHM), which provides huge speedups for intranode comms when running multiple processes per node.

Unfortunately, initial CI testing with ucx-conduit+PSHM led to some new failures, and as a result PSHM support was quickly re-disabled for the ucx-conduit configuration in f9d1a06. This issue exists to triage and hopefully solve the CI failures, so that the PSHM enable can be restored in configs/config.ucx.release.

It's worth noting that ucx-conduit currently remains an "experimental" conduit (and is likely to remain that way in the near term), for reasons of both stability and performance. As of the current GASNet v2022.9.0 release there are very few use cases where ucx-conduit might be preferable to either ibv-conduit (on InfiniBand systems) or ofi-conduit (on Slingshot-10 systems). Those production-quality conduits are currently both more robust and more performant than ucx-conduit in basically all our testing. So IMHO Legion users should never be using ucx-conduit in production, meaning this issue to polish Legion's use of ucx-conduit is probably low-priority.

Initial requests:

  1. The provided pipeline log reveals it was built against the (1.5-year-old) GASNet-EX version 2021.3.0. This is despite recent commit 973d1a5, which sets this repo's Makefile default GASNET_VERSION to the current GASNet release, so I'm guessing this is an accidental oversight in the CI scripting. There have been non-trivial improvements made to both ucx-conduit and PSHM internals since 2021.3.0, so can we please re-run against the current GASNet-EX 2022.9.0 release to avoid potentially wasting time triaging already-fixed defects?
  2. Can we please try re-runs using GASNet's --enable-debug (aka GASNET_DEBUG mode) to enable assertions, with envvar GASNET_BACKTRACE=1 set to get backtraces? This might help us narrow down what's happening (e.g. if Realm happens to be breaking any checkable preconditions on GASNet calls).

ucx-conduit/terra failure mode

The ucx+PSHM failure point on the two terra tests looks like this:

WARNING: ucx-conduit is experimental and should not be used for
          performance measurements.
          Please see `ucx-conduit/README` for more details.
[0 - 7f4da2147bc0]    0.223741 {6}{realm}: network still not quiescent after 10 attempts
[1 - 7fa8a5eddbc0]    0.223742 {6}{realm}: network still not quiescent after 10 attempts
Signal Signal 6 received by node 0, process 6 received by node 1, process 100718 (thread 7fa8a5eddbc0100717 (thread 7f4da2147bc0) - obtaining backtrace

Based on the message, I'm assuming something in the Realm logic decided to "give up" on test program exit quiescence, presumably based on some heuristic (of which I have no knowledge). Could someone explain how that works? In particular, does it use real wallclock time (does 0.223742 indicate it gave up after a roughly 200 ms timeout?), or does it rely primarily on the latency/overheads of GASNet AM (which differ wildly between the UCX and shared-memory transports, meaning the heuristic might just need adjustment)?

Recommendation: Investigate the quiescence heuristic, and in particular the time basis for the abort condition.
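
To make the "time basis" question concrete, here is a minimal C++ sketch of the two kinds of abort condition being contrasted. It is NOT Realm's actual implementation, which I have not looked at: the names QuiescenceResult, wait_by_attempts, wait_by_deadline and the probe callback are all hypothetical, and the probe simply stands in for whatever check decides the network is quiescent.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type for illustration only; not a Realm type.
struct QuiescenceResult {
  bool quiescent;  // true if the probe reported quiescence before giving up
  int attempts;    // number of probes performed
};

// Attempt-bounded heuristic: gives up after max_attempts probes, regardless
// of how little wallclock time has elapsed. If each probe is very cheap
// (e.g. over a fast shared-memory transport), the attempts can be exhausted
// long before in-flight traffic has had time to drain.
QuiescenceResult wait_by_attempts(const std::function<bool()>& probe,
                                  int max_attempts) {
  assert(max_attempts > 0);
  for (int i = 1; i <= max_attempts; ++i) {
    if (probe()) return {true, i};
  }
  return {false, max_attempts};
}

// Deadline-bounded heuristic: keeps probing until a wallclock budget is
// spent, so the abort condition does not depend on per-probe latency.
QuiescenceResult wait_by_deadline(const std::function<bool()>& probe,
                                  std::chrono::milliseconds budget) {
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + budget;
  int attempts = 0;
  do {
    ++attempts;
    if (probe()) return {true, attempts};
  } while (clock::now() < deadline);
  return {false, attempts};
}
```

With a probe that only succeeds on its 25th call, wait_by_attempts(probe, 10) gives up while wait_by_deadline(probe, std::chrono::milliseconds(1000)) succeeds. If the real heuristic resembles the first form, the abort threshold is effectively scaled by transport latency, which would be consistent with the failure only appearing once PSHM is enabled.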

CC: @streichler @elliottslaughter @PHHargrove
