
Deserialization of compressed data is sluggish and causes memory flares #7433

Open
@crusaderky

Description

If lz4 is installed and a chunk can be compressed efficiently, it is compressed automatically.
This happens both in network comms (configurable) and upon spilling (not configurable).
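For reference, the comms side is toggled through dask's configuration (a sketch; at the time of writing `auto` means "use the best compressor installed"):

```yaml
# e.g. in ~/.config/dask/distributed.yaml, or via dask.config.set
distributed:
  comm:
    compression: auto   # or False to disable compression on the wire
```

As noted above, there is no equivalent knob for spilling.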

While working on coiled/benchmarks#629, I found out that deserializing a compressed frame is abysmally slow:

When deserialization is triggered by network comms, it is offloaded to a separate thread for any message larger than 10 MiB (distributed.comm.offload). Note that a single message will contain multiple small keys where possible.
I did not test whether the GIL is released while this happens.

When deserialization is triggered by unspilling, the whole event loop is temporarily blocked.

Microbenchmark

Serializing a 128 MiB compressible numpy array takes 9.96 ms; that is a modest overhead on top of the raw lz4 call, which takes 9.75 ms.
Deserializing the same takes a whopping 143 ms. The raw lz4 decompress alone already takes a huge 88 ms.

At the very least we should be able to trim down the overhead. The raw decompression speed is also suspiciously slow: typically decompression is much faster than compression, so I suspect something is going on upstream.

import io
import lz4.frame
import numpy
from distributed.protocol import deserialize_bytes, serialize_bytelist

a1 = numpy.random.random((4096, 4096))  # 128 MiB, uncompressible
a2 = numpy.ones((4096, 4096))  # 128 MiB, highly compressible

buf1 = io.BytesIO()
buf2 = io.BytesIO()
buf1.writelines(serialize_bytelist(a1))
buf2.writelines(serialize_bytelist(a2))
b1 = buf1.getvalue()
b2 = buf2.getvalue()
b3 = lz4.frame.compress(a2)

print("uncompressible size", len(b1))
print("compressible size", len(b2))
print("raw lz4 size", len(b3))
print()

print("serialize uncompressible")
%timeit serialize_bytelist(a1)
print("serialize compressible")
%timeit serialize_bytelist(a2)
print("raw lz4 compress")
%timeit lz4.frame.compress(a2)
print()

print("deserialize uncompressible")
%timeit deserialize_bytes(b1)
print("deserialize compressible")
%timeit deserialize_bytes(b2)
print("raw lz4 decompress")
%timeit lz4.frame.decompress(b3)

Output:

uncompressible size 134217969
compressible size 526635
raw lz4 size 566291

serialize uncompressible
89.9 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
serialize compressible
9.96 ms ± 78.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
raw lz4 compress
9.75 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

deserialize uncompressible
19.2 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
deserialize compressible
143 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raw lz4 decompress
88.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

serialize_bytelist and deserialize_bytes in the code above are exactly what distributed.spill uses. However, I also measured the timings in distributed.comm.tcp and got essentially the same number (134ms).
