Description
Hi, this is a feature request for distributed to support collective-style tasks. MPI-style programming is widely used in machine learning for sample-based parallelism; examples are gradient boosting and neural networks, both of which use some form of allreduce to aggregate gradient information.
The feature request can be divided into two parts: the first is a notion of grouped tasks, and the second is an abstraction for obtaining worker-local data without OOM. Collective communication requires all workers to be present in the same communication group, which means the tasks should be launched and finished together. In addition, error handling needs to be synchronized: if one of the tasks fails, then all the other tasks should also be restarted. For the second part, collective tasks are usually aware of the workers, and each task processes the data residing on its local worker, so it would be nice to have an abstraction in dask or distributed for obtaining local partitions as iterators with data-spilling support.
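To make the request concrete, here is a strawman sketch; `gang_submit` and `local_partitions` do not exist in distributed and are only illustrative names for the two parts above.

```python
# Strawman only: ``gang_submit`` and ``local_partitions`` are hypothetical
# names invented for this sketch; neither exists in distributed today.
import dask.dataframe as dd
from distributed import Client

def train(rank, ddf):
    n_rows = 0
    # Part 2: iterate over the partitions stored on this worker one at a
    # time, so that partitions spilled to disk are not all pulled back
    # into memory at once.
    for partition in local_partitions(ddf):   # hypothetical helper
        n_rows += len(partition)              # stand-in for a training step
    return n_rows

client = Client()
ddf = dd.read_parquet("path/to/data.parquet").persist()
workers = list(client.scheduler_info()["workers"])

# Part 1: the tasks are launched together as one group, and if any of them
# fails the whole group is restarted together.
futures = client.gang_submit(train, range(len(workers)), ddf=ddf)  # hypothetical
print(client.gather(futures))
```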
The feature request does not require distributed to implement communication algorithms like barrier or allreduce; applications are likely to have their own communication channels, such as gloo or nccl.
Alternative
Currently, XGBoost specifies a unique worker address for each task and acquires a `MultiLock` to ensure all workers in the group are available during execution. This has the drawback of breaking the error-recovery code inside distributed.
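A simplified sketch of the pattern (not XGBoost's actual code; `train_task` is a stand-in for the per-worker training step):

```python
from distributed import Client, MultiLock

def train_task(rank):
    # Stand-in for the per-worker training step.
    return rank

client = Client()
workers = sorted(client.scheduler_info()["workers"])

# Hold a lock over every participating worker so the whole group stays
# available for the duration of the job, then pin one task to each worker.
with MultiLock(names=workers):
    futures = [
        client.submit(train_task, rank, workers=[addr], allow_other_workers=False)
        for rank, addr in enumerate(workers)
    ]
    results = client.gather(futures)
```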
As for local data, XGBoost simply collects the local partitions as numpy arrays or pandas dataframes, which forces all the data to be loaded into memory at once and disregards Dask's data spilling, leading to significant memory overhead.
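For illustration, the collection step boils down to something like the sketch below (simplified, not XGBoost's actual code). Reading a key from `worker.data` un-spills it, so the entire local shard ends up in memory at once:

```python
import pandas as pd
from distributed import get_worker

def collect_local_data(keys):
    # Runs inside a task on a worker. ``keys`` are the keys of the persisted
    # partitions that live on this worker. ``worker.data`` is the worker's
    # (possibly spilling) key/value store; indexing into it loads spilled
    # partitions back into memory, and the concat materializes the whole
    # local shard at once.
    worker = get_worker()
    parts = [worker.data[key] for key in keys]
    return pd.concat(parts)
```

An iterator-style abstraction would instead yield one partition at a time and let distributed keep the rest spilled.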
Related
- Resilience doesn't work if `workers` for `client.submit` is specified #8320: an issue about error handling with grouped tasks.
- The right way for a C extension to report CPU and GPU memory usage dask#10239 (comment): memory usage and reporting.
- Ensuring a group of tasks are scheduled together #4485: the feature request for the implementation of `MultiLock`.