Currently, using 1 stream during a shuffle potentially slows shuffles down, but it ends up acting as back pressure that reduces memory usage. Setting it to 16 streams leads to better concurrency but can often increase memory usage, leading to OOMs.
Note: The following suggestions are from AI since I don't understand the internals well enough, but I'm dropping them in here in case they are valuable.
- Add an internal adaptive limiter for shuffle GPU work. It should hand out stream permits based on device headroom and in-flight bytes. When headroom is healthy, use up to 16 streams; when headroom drops, shrink active permits toward 1 without changing the global stream pool.
- Lazily allocate receive payload buffers. In TagMetadataPayloadExchange, the metadata receive currently allocates the payload buffer immediately, before the data receive is actually posted. That can admit a burst of full-size buffers. Store only the metadata plus payload_size, then allocate right before comm_->recv(...).
- Cap active receive bytes and active send bytes, with a “one oversized message may proceed” escape hatch to avoid deadlock.
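
To make the byte cap and the permit idea a bit more concrete, here is a rough sketch of what they could look like. Everything in it is an assumption for illustration: `ByteBudget` and `permits_for_headroom` are hypothetical names that do not exist in the code base, the linear scaling policy is just one possible choice, and the real integration points (where to acquire and release, how headroom is measured) would need someone who knows the internals. It needs C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not part of the existing code base).
// Tracks bytes held by in-flight receives (or sends) against a fixed cap.
class ByteBudget {
  public:
    explicit ByteBudget(std::size_t cap_bytes) : cap_(cap_bytes) {}

    // Returns true if the transfer may start now. On false, the caller is
    // expected to retry later (for example on its next progress iteration).
    bool try_acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        // Written to avoid overflow in in_flight_ + bytes.
        bool const fits = bytes <= cap_ && in_flight_ <= cap_ - bytes;
        // Escape hatch: if nothing is in flight, admit the message even when
        // it alone exceeds the cap. Otherwise it could never be admitted.
        if (in_flight_ == 0 || fits) {
            in_flight_ += bytes;
            return true;
        }
        return false;
    }

    // Must be called once the transfer completes and its buffer is freed.
    void release(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(bytes <= in_flight_);
        in_flight_ -= bytes;
    }

    std::size_t in_flight() const {
        std::lock_guard<std::mutex> lock(mu_);
        return in_flight_;
    }

  private:
    mutable std::mutex mu_;
    std::size_t cap_;
    std::size_t in_flight_ = 0;
};

// Hypothetical policy: scale the number of active stream permits linearly
// with the fraction of free device memory, clamped to [1, max_permits].
inline int permits_for_headroom(std::size_t free_bytes,
                                std::size_t total_bytes,
                                int max_permits = 16) {
    max_permits = std::max(max_permits, 1);
    if (total_bytes == 0) { return 1; }
    double const frac =
        static_cast<double>(free_bytes) / static_cast<double>(total_bytes);
    int const permits = static_cast<int>(frac * max_permits);
    return std::clamp(permits, 1, max_permits);
}
```

With something like this, the lazy allocation suggestion would call `try_acquire(payload_size)` right before allocating the buffer and posting the data receive, then call `release(payload_size)` once the payload has been consumed. The permit function only changes how many streams are handed out at a time and leaves the global stream pool untouched.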