
GASNet: applications hang with large numbers of GPUs/node #239

@elliottslaughter

GASNet applications can hang when running with a large number of GPUs per node. This occurs because Realm's GASNet layer allocates a separate output buffer for each pair of GPUs that communicate. The hang depends on the precise application communication pattern, so it may occur with some applications and not with others, even on the same hardware.

The workaround for this issue is to set -gex:obcount. While the precise value required is application-specific, there is a formula you can use to compute the worst-case value:

(4 + 2 * gpus/node) * nodes

For example, if you have 4 GPUs/node, this simplifies to 12 * nodes, because (4 + 2 * 4) = 12.
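As a quick way to apply the formula above to a particular job size, here is a minimal Python sketch. The helper name `worst_case_obcount` and its parameter names are made up for illustration and are not part of Realm or GASNet; the function only evaluates the arithmetic given in this issue.

```python
def worst_case_obcount(gpus_per_node: int, nodes: int) -> int:
    """Evaluate the worst-case formula (4 + 2 * gpus/node) * nodes.

    Hypothetical helper for illustration only; not a Realm or GASNet API.
    """
    if gpus_per_node < 0 or nodes < 1:
        raise ValueError("expected gpus_per_node >= 0 and nodes >= 1")
    return (4 + 2 * gpus_per_node) * nodes


# 4 GPUs/node on 16 nodes: (4 + 2 * 4) * 16 = 12 * 16
print(worst_case_obcount(gpus_per_node=4, nodes=16))  # prints 192
```

The printed number is the value to supply for -gex:obcount. Since this is a worst-case bound, a given application may run fine with a smaller value.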

My understanding is that the long-term solution to this issue will be to dynamically allocate output buffers to GPUs so that they can be reclaimed, but in the meantime this issue will serve to document the workaround.

Split from: StanfordLegion/legion#1508 (comment)
