Description
GASNet applications can hang when running with a large number of GPUs per node. This occurs because Realm's GASNet layer allocates a separate output buffer for each pair of GPUs that communicate. The hang depends on the precise application communication pattern, so it may occur with some applications and not with others, even on the same hardware.
The workaround for this issue is to set `-gex:obcount`. While the precise value required is application-specific, there is a formula you can use to compute the worst-case value:
(4 + 2 * gpus/node) * nodes
For example, if you have 4 GPUs/node, this simplifies to `12 * nodes`, because `(4 + 2 * 4) = 12`.
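As a rough sketch, the worst-case formula above can be computed with a small helper before launching a job. The function names here are hypothetical and not part of Realm or GASNet, and the `-gex:obcount <value>` spelling with a space-separated value is an assumption based on how other Realm command-line flags are usually passed.

```python
def worst_case_obcount(gpus_per_node: int, nodes: int) -> int:
    """Worst-case output buffer count: (4 + 2 * gpus/node) * nodes."""
    if gpus_per_node < 0 or nodes < 1:
        raise ValueError("gpus_per_node must be >= 0 and nodes must be >= 1")
    return (4 + 2 * gpus_per_node) * nodes


def obcount_flag(gpus_per_node: int, nodes: int) -> str:
    """Format the value as a command-line argument (assumed flag syntax)."""
    return f"-gex:obcount {worst_case_obcount(gpus_per_node, nodes)}"


if __name__ == "__main__":
    # 4 GPUs/node on 8 nodes: (4 + 2 * 4) * 8 = 12 * 8 = 96
    print(obcount_flag(4, 8))  # prints "-gex:obcount 96"
```

Because this is the worst case, a given application may run fine with a smaller value, depending on its communication pattern.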
My understanding is that the long-term solution to this issue will be to dynamically allocate output buffers to GPUs so that they can be reclaimed, but in the meantime this issue will serve to document the workaround.
Split from: StanfordLegion/legion#1508 (comment)