Description
If lz4 is installed and a chunk can be compressed efficiently, it is compressed automatically.
This happens both in network comms (configurable) and upon spilling (not configurable).
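For reference, the network-side behaviour is driven by the dask config; a minimal sketch (the key name is taken from distributed's default config, and the exact accepted values may vary by version - the spill path has no equivalent knob):

import dask

# "auto" (the default) picks lz4 if it is installed; setting the key to
# None should disable compression on the wire entirely.
dask.config.set({"distributed.comm.compression": "lz4"})
dask.config.set({"distributed.comm.compression": None})  # no compression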
While working on coiled/benchmarks#629, I found out that deserializing a compressed frame is abysmally slow:
When deserialization is triggered by network comms, it is offloaded to a separate thread for any message larger than 10 MiB (distributed.comm.offload). Note that a message will contain multiple small keys if possible.
I did not test if the GIL is released while this happens.
When deserialization is triggered by unspilling, the whole event loop is temporarily blocked.
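The 10 MiB threshold mentioned above is itself configurable; a minimal sketch, assuming the key name from distributed's default config:

import dask

# Default is "10MiB"; messages above this size are (de)serialized on the
# offload thread pool instead of the event loop thread.
print(dask.config.get("distributed.comm.offload"))
dask.config.set({"distributed.comm.offload": "1MiB"})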
Microbenchmark
Serializing a 128 MiB compressible numpy array takes 9.96ms; there's a modest overhead on top of the raw call to lz4, which takes 9.75ms.
Deserializing the same takes a whopping 143ms. The raw lz4 decompress alone takes an already huge 88ms.
At the very least we should be able to trim down the overhead. The raw decompression speed is also suspiciously slow: typically decompression is much faster than compression, so I suspect that something is going on upstream.
import io
import lz4.frame
import numpy
from distributed.protocol import deserialize_bytes, serialize_bytelist
a1 = numpy.random.random((4096, 4096)) # 128 MiB, uncompressible
a2 = numpy.ones((4096, 4096)) # 128 MiB, highly compressible
buf1 = io.BytesIO()
buf2 = io.BytesIO()
buf1.writelines(serialize_bytelist(a1))
buf2.writelines(serialize_bytelist(a2))
b1 = buf1.getvalue()
b2 = buf2.getvalue()
b3 = lz4.frame.compress(a2)
print("uncompressible size", len(b1))
print("compressible size", len(b2))
print("raw lz4 size", len(b3))
print()
print("serialize uncompressible")
%timeit serialize_bytelist(a1)
print("serialize compressible")
%timeit serialize_bytelist(a2)
print("raw lz4 compress")
%timeit lz4.frame.compress(a2)
print()
print("deserialize uncompressible")
%timeit deserialize_bytes(b1)
print("deserialize compressible")
%timeit deserialize_bytes(b2)
print("raw lz4 decompress")
%timeit lz4.frame.decompress(b3)
Output:
uncompressible size 134217969
compressible size 526635
raw lz4 size 566291
serialize uncompressible
89.9 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
serialize compressible
9.96 ms ± 78.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
raw lz4 compress
9.75 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
deserialize uncompressible
19.2 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
deserialize compressible
143 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raw lz4 decompress
88.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
serialize_bytelist and deserialize_bytes in the code above are exactly what distributed.spill uses. However, I measured the timings in distributed.comm.tcp as well and got the same number (134ms).
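For completeness, the spill round-trip these timings apply to boils down to the following (a minimal sketch; the real code in distributed.spill goes through a zict mapping, but the serialization calls are the same as in the benchmark above):

from distributed.protocol import deserialize_bytes, serialize_bytelist

def spill(obj, path):
    # Write the frames produced by serialize_bytelist straight to disk,
    # mirroring what the io.BytesIO benchmark above does.
    with open(path, "wb") as f:
        f.writelines(serialize_bytelist(obj))

def unspill(path):
    # Read the file back and deserialize in one synchronous call; on a
    # worker this happens on the event loop thread.
    with open(path, "rb") as f:
        return deserialize_bytes(f.read())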