Description
This has been discussed in a few other places, but I think it's important enough (and a separate enough task) to warrant its own issue for tracking purposes. Related to:
Once #5900 is done, when one worker requests data from another, and that data is currently spilled to disk, we should not require slurping all the data into memory just to send it. Ideally, we'd stream the serialized bytes from disk directly over the network socket, without copying into userspace at all, via something like `socket.sendfile`/`sendfile(2)`. If this is not deemed possible, copying into a small userspace buffer and doing incremental sends would still be an improvement (but not ideal).
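As a rough illustration, here's a minimal sketch of the sending side, assuming the spilled bytes for a value live in a plain file; the path and the blocking socket are hypothetical stand-ins (the real worker runs on non-blocking sockets under asyncio/Tornado, where `loop.sock_sendfile` would be the closer analogue):

```python
import socket


def send_spilled_value(sock: socket.socket, path: str) -> None:
    """Stream one spilled value from disk to a peer without loading it into memory."""
    with open(path, "rb") as f:
        # socket.sendfile() uses sendfile(2) where the platform supports it
        # (zero-copy: the kernel moves bytes from the page cache straight to
        # the socket), and otherwise falls back internally to a read()/send()
        # loop through a small buffer; i.e. exactly the two options above.
        # Note it requires a blocking socket, so async code would use
        # asyncio's loop.sock_sendfile() instead.
        sock.sendfile(f)
```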
(As a bonus, it would be nice if workers could also receive data directly to disk. That might let us keep fetching dependencies even when workers are under memory pressure, and could make computations complete faster in that situation; but it probably won't make the difference between succeeding and failing the way the sending side does, so it's less important for now.)
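For that receiving side, a sketch under the assumption that the comms layer tells us the payload length up front (the framing, path, and raw socket here are all hypothetical):

```python
import socket

CHUNK_SIZE = 64 * 1024  # small, fixed userspace buffer


def receive_to_disk(sock: socket.socket, path: str, nbytes: int) -> None:
    """Write an incoming payload straight to a spill file, chunk by chunk."""
    with open(path, "wb") as f:
        remaining = nbytes
        while remaining > 0:
            chunk = sock.recv(min(CHUNK_SIZE, remaining))
            if not chunk:
                raise ConnectionError("peer closed the connection mid-transfer")
            f.write(chunk)
            remaining -= len(chunk)
```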
The biggest challenge is that the comms interface doesn't currently support streams, only individual messages. The `get_data` response can also contain multiple keys, which further complicates the interface.
From a technical perspective, doing `sendfile` over a socket should be easy. The challenge is just making that feasible within our comms interface.
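To make the shape of the problem concrete, here's one hedged sketch of what a streaming extension might look like. Nothing in it exists today: `comm.write_stream`, the header message, and the per-key size framing are all invented for illustration, and handling the multi-key `get_data` response as a sizes header followed by each payload in order is just one possible design:

```python
import os
from typing import Dict


async def stream_get_data_response(comm, spilled: Dict[str, str]) -> None:
    """Hypothetical: send a multi-key get_data response as a header message
    plus streamed payloads. `spilled` maps each requested key to the on-disk
    path of its already-serialized bytes."""
    sizes = {key: os.path.getsize(path) for key, path in spilled.items()}
    # An ordinary (in-memory) message first, so the receiver knows which
    # keys are coming and how many bytes each payload occupies.
    await comm.write({"op": "get_data_stream", "sizes": sizes})
    for key, path in spilled.items():
        with open(path, "rb") as f:
            # Invented streaming primitive: hand the comm a file object and
            # let the transport move `sizes[key]` bytes via sendfile(2) (or
            # buffered incremental sends) without materializing them.
            await comm.write_stream(f, nbytes=sizes[key])
```

The receiver could then pair this with the receive-to-disk loop above, writing each payload straight into its own spill file.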
I believe this is one of the highest-impact changes we can make to improve memory performance and reduce out-of-memory failures. Spill-to-disk currently has a few flaws, but to me this is the biggest: transfer-heavy graphs basically work against spilling to disk, because workers end up un-spilling some or most of the data they just spilled in order to send it to their peers.