
Graph submission embeds buffers in a monolithic msgpack blob #8510

Open
@crusaderky

Description


While working on #8257, I noticed that Client.submit and Client.compute embed the whole dask graph into a single, monolithic msgpack blob, without extracting buffers. This can cause a substantial slowdown in the (fairly common) case where the graph embeds large-ish (>1 MiB) constants.

It should be straightforward to change this to send buffers out-of-band instead of embedding them in msgpack, like worker-to-worker comms do; see the sketch below.
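
A minimal sketch of that mechanism, using distributed.protocol.serialize / deserialize (the helpers behind worker-to-worker comms; exact header contents and frame layout may vary by version):

import numpy
from distributed.protocol import serialize, deserialize

data = numpy.zeros(2**20, dtype='u1')

# serialize() splits the object into a small metadata header plus a list
# of out-of-band frames; the 1 MiB buffer travels as a frame, not inside
# a msgpack blob
header, frames = serialize(data)
print([len(f) for f in frames])  # expect one ~1 MiB frame

# the receiving end reassembles the object from header + frames
roundtripped = deserialize(header, frames)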

Reproducer

data = b"0" * 2**20
c.submit(len, data)

but also

import numpy
import dask.array as da

data = numpy.zeros(2**20, dtype='u1')
da.from_array(data, chunks=-1).sum().compute()
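
To see the monolithic blob directly, one can push a graph-like message through the frame-level serializer. A hedged sketch, assuming distributed.protocol.dumps and to_serialize (the 'update-graph' message shape here is illustrative, not the real wire format, and the exact frame split may vary by version):

from distributed.protocol import dumps, to_serialize

# raw bytes embedded in the message are copied inline into a large msgpack frame
frames = dumps({'op': 'update-graph', 'payload': b"0" * 2**20})
print([len(f) for f in frames])

# wrapping the payload in to_serialize() extracts it as a separate frame,
# which comms can send without a deep copy
frames = dumps({'op': 'update-graph', 'payload': to_serialize(b"0" * 2**20)})
print([len(f) for f in frames])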

Expected behaviour

The client->scheduler comms temporarily create, and transfer over the network, a tiny msgpack object that embeds the pickled callable and little else, plus a 1 MiB buffer that is never deep-copied.
The same applies to the scheduler->worker comms.

Actual behaviour

The client->scheduler comms make a temporary deep copy of the whole 1 MiB constant on the client host, send it over the network as a monolithic 1 MiB msgpack stream, and then deep-copy it once again when unpacking msgpack on the scheduler.
I'm not sure about the scheduler->worker leg of the trip.
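
Workaround: until this is fixed, large constants can be pre-scattered so they travel through the buffer-aware comms path instead of being embedded in the graph (a sketch, reusing c and data from the reproducer above):

# scatter() ships the buffer out-of-band and leaves only a small key in the graph
future = c.scatter(data)
c.submit(len, future)  # the worker resolves the future back to the data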

Labels

enhancement (Improve existing functionality or make things work better), performance
