Explore different scheduling strategies or memory usage optimizations in shuffles with multiple streams #1115

Description

@ayushdg

Currently, using 1 stream during a shuffle potentially slows the shuffle down, but it ends up acting as back pressure, which reduces memory usage. Setting it to 16 streams gives better concurrency but can often increase memory usage, leading to OOMs.

Note: The following suggestions are from AI since I don't understand the internals well enough, but I'm dropping them in here in case they are valuable.

  1. Add an internal adaptive limiter for shuffle GPU work. It should hand out stream permits based on device headroom and in-flight bytes. When headroom is healthy, use up to 16 streams; when headroom drops, shrink active permits toward 1 without changing the global stream pool.
  2. Lazily allocate receive payload buffers. In TagMetadataPayloadExchange, the metadata receive currently allocates the payload buffer immediately, before the data receive is actually posted. That can admit a burst of full-size buffers. Instead, store only the metadata plus payload_size, then allocate right before comm_->recv(...).
  3. Cap active receive bytes and active send bytes, with a “one oversized message may proceed” escape hatch to avoid deadlock.
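
To make suggestions 1 and 3 a bit more concrete, here is a minimal Python sketch of the two policies. It is only an illustration, not the project's code: `stream_permits`, `ByteBudget`, and the 10%/50% headroom thresholds are all hypothetical, and `free_bytes`/`total_bytes` are assumed to come from whatever device memory query the shuffle already has access to. It does not touch TagMetadataPayloadExchange or the stream pool itself.

```python
import threading


def stream_permits(free_bytes: int, total_bytes: int, max_streams: int = 16,
                   low: float = 0.10, high: float = 0.50) -> int:
    """Map device headroom to a number of active stream permits.

    Hypothetical policy: 1 permit at or below `low` headroom, `max_streams`
    at or above `high`, linear in between. The thresholds are assumptions,
    not tuned values.
    """
    headroom = free_bytes / total_bytes
    if headroom <= low:
        return 1
    if headroom >= high:
        return max_streams
    frac = (headroom - low) / (high - low)
    return max(1, round(1 + frac * (max_streams - 1)))


class ByteBudget:
    """Cap in-flight bytes, letting one oversized message proceed alone."""

    def __init__(self, cap_bytes: int) -> None:
        self._cap = cap_bytes
        self._in_flight = 0
        self._cond = threading.Condition()

    def _admissible(self, nbytes: int) -> bool:
        # Escape hatch: when nothing is in flight, admit any message, even
        # one larger than the cap. Otherwise an oversized message would
        # wait forever and the shuffle could deadlock.
        return self._in_flight == 0 or self._in_flight + nbytes <= self._cap

    def try_acquire(self, nbytes: int) -> bool:
        """Non-blocking: reserve nbytes if admissible, else return False."""
        with self._cond:
            if self._admissible(nbytes):
                self._in_flight += nbytes
                return True
            return False

    def acquire(self, nbytes: int) -> None:
        """Blocking: wait until nbytes can be reserved."""
        with self._cond:
            self._cond.wait_for(lambda: self._admissible(nbytes))
            self._in_flight += nbytes

    def release(self, nbytes: int) -> None:
        """Return nbytes to the budget and wake any waiters."""
        with self._cond:
            self._in_flight -= nbytes
            self._cond.notify_all()
```

In this sketch, one `ByteBudget` would guard receives and a second one sends. A receive would call `acquire(payload_size)` right before allocating its buffer (which lines up with suggestion 2) and `release(payload_size)` once the buffer is handed off or freed. The permit count from `stream_permits` only limits how many of the pooled streams are used at a time; the pool size stays fixed.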
