Explore different scheduling strategies or memory usage optimizations in shuffles with multiple streams

Currently, using 1 stream during a shuffle can slow the shuffle down, but it also acts as back pressure, which reduces memory usage. Setting it to 16 streams gives better concurrency but often increases memory usage, which can lead to OOMs.

Note: The following suggestions are from AI, since I don't understand the internals well enough, but I'm dropping them in here in case they are valuable.

1. Add an internal adaptive limiter for shuffle GPU work. It should hand out stream permits based on device headroom and in-flight bytes. When headroom is healthy, use up to 16 streams; when headroom drops, shrink active permits toward 1 without changing the global stream pool.
2. Lazily allocate receive payload buffers. In `TagMetadataPayloadExchange`, metadata receive currently allocates the payload buffer immediately, before the data receive is actually posted. That can admit a burst of full-size buffers. Store only metadata plus `payload_size`, then allocate right before `comm_->recv(...)`.
3. Cap active receive bytes and active send bytes, with a “one oversized message may proceed” escape hatch to avoid deadlock.
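
To make suggestions 1 and 3 more concrete, here is a minimal, self-contained sketch. It is not based on the existing shuffle code: `stream_permits`, `ByteBudget`, the 10%/50% headroom watermarks, and the linear ramp between them are all hypothetical names and values chosen for illustration. The idea is that the permit count shrinks toward 1 as free device memory drops, and that a byte budget admits one oversized message only when nothing else is in flight, so a message larger than the cap cannot block forever.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not an existing API): map the fraction of free device
// memory to the number of stream permits to hand out. The watermarks are
// illustrative assumptions, not tuned values.
inline int stream_permits(double headroom_fraction, int max_streams = 16) {
    constexpr double low = 0.10;   // at or below this, fall back to 1 stream
    constexpr double high = 0.50;  // at or above this, allow max_streams
    if (headroom_fraction <= low) return 1;
    if (headroom_fraction >= high) return max_streams;
    // Linear ramp between the two watermarks, truncated to an integer.
    double t = (headroom_fraction - low) / (high - low);
    return 1 + static_cast<int>(t * (max_streams - 1));
}

// Hypothetical byte budget for active sends or receives.
class ByteBudget {
  public:
    explicit ByteBudget(std::size_t cap) : cap_(cap) {}

    // Returns true if `bytes` was admitted. A message is admitted when it
    // fits under the cap, or when nothing is in flight (the "one oversized
    // message may proceed" escape hatch).
    bool try_acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool fits = in_flight_ <= cap_ && bytes <= cap_ - in_flight_;
        bool alone = in_flight_ == 0;
        if (!fits && !alone) return false;
        in_flight_ += bytes;
        return true;
    }

    // Must be called once per successful try_acquire with the same size.
    void release(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(bytes <= in_flight_);
        in_flight_ -= bytes;
    }

  private:
    std::mutex mutex_;
    std::size_t cap_;
    std::size_t in_flight_ = 0;
};
```

A caller would check `try_acquire(payload_size)` before allocating and posting a receive, and call `release(payload_size)` once the buffer has been handed off. In this sketch a rejected request is simply retried later by the caller; a real implementation would probably queue it instead.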

Explore different scheduling strategies or memory usage optimizations in shuffles with multiple streams #1115

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Explore different scheduling strategies or memory usage optimizations in shuffles with multiple streams #1115

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions