Description
If lz4 is installed and a chunk can be compressed efficiently, it is compressed automatically.
This happens both in network comms (configurable) and upon spilling (not configurable).
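For reference, the network-side behaviour is driven by the dask config; a minimal sketch (the key name is taken from distributed's default config, and the exact accepted values may vary by version - the spill path has no equivalent knob):

import dask

# "auto" (the default) picks lz4 if it is installed; setting the key to
# None should disable compression on the wire entirely.
dask.config.set({"distributed.comm.compression": "lz4"})
dask.config.set({"distributed.comm.compression": None})  # no compression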
While working on coiled/benchmarks#629, I found out that deserializing a compressed frame is abysmally slow:
When deserialization is triggered by network comms, it is offloaded to a separate thread for any message larger than 10 MiB (distributed.comm.offload). Note that a message will contain multiple small keys if possible.
I did not test if the GIL is released while this happens.
When deserialization is triggered by unspilling, the whole event loop is temporarily blocked.
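The 10 MiB threshold mentioned above is itself configurable; a minimal sketch, assuming the key name from distributed's default config:

import dask

# Default is "10MiB"; messages above this size are (de)serialized on the
# offload thread pool instead of the event loop thread.
print(dask.config.get("distributed.comm.offload"))
dask.config.set({"distributed.comm.offload": "1MiB"})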
Microbenchmark
Serializing a 128 MiB compressible numpy array takes 9.96ms; there's a modest overhead on top of the raw call to lz4, which takes 9.75ms.
Deserializing the same takes a whopping 143ms. The raw lz4 decompress alone takes an already huge 88ms.
At the very least we should be able to trim down the overhead. The raw decompression speed is also suspiciously slow: typically decompression is much faster than compression, so I suspect that something is going on upstream.
import io
import lz4.frame
import numpy
from distributed.protocol import deserialize_bytes, serialize_bytelist
a1 = numpy.random.random((4096, 4096)) # 128 MiB, uncompressible
a2 = numpy.ones((4096, 4096)) # 128 MiB, highly compressible
buf1 = io.BytesIO()
buf2 = io.BytesIO()
buf1.writelines(serialize_bytelist(a1))
buf2.writelines(serialize_bytelist(a2))
b1 = buf1.getvalue()
b2 = buf2.getvalue()
b3 = lz4.frame.compress(a2)
print("uncompressible size", len(b1))
print("compressible size", len(b2))
print("raw lz4 size", len(b3))
print()
print("serialize uncompressible")
%timeit serialize_bytelist(a1)
print("serialize compressible")
%timeit serialize_bytelist(a2)
print("raw lz4 compress")
%timeit lz4.frame.compress(a2)
print()
print("deserialize uncompressible")
%timeit deserialize_bytes(b1)
print("deserialize compressible")
%timeit deserialize_bytes(b2)
print("raw lz4 decompress")
%timeit lz4.frame.decompress(b3)
Output:
uncompressible size 134217969
compressible size 526635
raw lz4 size 566291
serialize uncompressible
89.9 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
serialize compressible
9.96 ms ± 78.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
raw lz4 compress
9.75 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
deserialize uncompressible
19.2 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
deserialize compressible
143 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raw lz4 decompress
88.3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
serialize_bytelist and deserialize_bytes in the code above are exactly what distributed.spill uses. However, I measured the timings in distributed.comm.tcp as well and got the same number (134ms).
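For completeness, the spill round-trip these timings apply to boils down to the following (a minimal sketch; the real code in distributed.spill goes through a zict mapping, but the serialization calls are the same as in the benchmark above):

from distributed.protocol import deserialize_bytes, serialize_bytelist

def spill(obj, path):
    # Write the frames produced by serialize_bytelist straight to disk,
    # mirroring what the io.BytesIO benchmark above does.
    with open(path, "wb") as f:
        f.writelines(serialize_bytelist(obj))

def unspill(path):
    # Read the file back and deserialize in one synchronous call; on a
    # worker this happens on the event loop thread.
    with open(path, "rb") as f:
        return deserialize_bytes(f.read())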