
Add callback for compressing updates in sqlite store #8


Merged · 15 commits into y-crdt:main · May 19, 2025

Conversation

Darshan808 (Contributor)

References jupyterlab/jupyter-collaboration#430

Summary

This PR introduces compression of yupdates using the lz4 package to reduce the size of the .ystore.db file created by pycrdt-store.

Problem

For frequently edited or large documents, the .ystore.db file grows excessively large. Users have repeatedly raised concerns about its size.

To address this, we currently have:

  • document_ttl
  • history_length or history_size in this PR

However, these approaches involve tradeoffs such as deleting history. A better long-term solution is to reduce the size of each update stored in the database.

Proposed Solution

This PR compresses each yupdate using lz4.frame.compress() before storing it in the database, and decompresses using lz4.frame.decompress() during retrieval.

Key details:

  • Uses lz4.frame.compress() with compression_level=0 for a balance between speed and compression ratio (see the sketch after this list).
  • Read and write performance is nearly unaffected, making it suitable for real-time editing.
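
A minimal sketch of the idea (the helper names are illustrative, not the actual pycrdt-store API):

```python
import lz4.frame

def compress_update(update: bytes) -> bytes:
    # compression_level=0 favours speed over compression ratio,
    # which suits frequent, real-time updates
    return lz4.frame.compress(update, compression_level=0)

def decompress_update(data: bytes) -> bytes:
    return lz4.frame.decompress(data)
```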

Benefits

While small frequent updates don’t compress significantly, large updates benefit greatly.

Example Results:

  • A .db file of size 192 MB was reduced to 64 MB using this compression method.
  • A project folder (which uses Jupyter Collaboration) with .db files totaling 1 GB was reduced to 200 MB after applying this change.

This allows:

  • Longer history retention
  • Lower disk usage

Migration Strategy

No changes are required from existing users: the store detects whether a stored update is compressed and handles both old (uncompressed) and new (compressed) updates.
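
One possible way to handle that detection (not necessarily exactly what this PR does) is to check for the LZ4 frame magic number at the start of each stored blob; the helper below is a hypothetical sketch:

```python
import lz4.frame

LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # little-endian encoding of 0x184D2204

def maybe_decompress(data: bytes) -> bytes:
    # Old rows were written uncompressed; new rows start with the LZ4 frame magic.
    if data.startswith(LZ4_FRAME_MAGIC):
        return lz4.frame.decompress(data)
    return data
```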

krassowski added the enhancement (New feature or request) label on May 7, 2025
@krassowski

Example Results:

  • A .db file of size 192 MB was reduced to 64 MB using this compression method.
  • A project folder (uses jupyter collaboration) with .db files totaling 1 GB was reduced to 200 MB after applying this change.

This sounds very promising! In addition, some comparison of the loading time would be interesting. I suspect that on slower hard drives the cost of decompression will be less than the cost of reading more data from the drive, but I wonder what the trade-off is on an SSD.

I think the crucial question would be whether adding a dependency on lz4 is worth it or if we should instead just use built-in zlib (or maybe even gzip). Could you compare the speed/size for the three approaches?

It looks like mypy complains. There is a PR upstream, python-lz4/python-lz4#295, so if we do want to keep lz4, for now maybe we could mark it as ignored in pyproject.toml (see the sketch after the error output below):

src/pycrdt/store/store.py:16: error: Skipping analyzing "lz4.frame": module is installed, but missing library stubs or py.typed marker  [import-untyped]
src/pycrdt/store/store.py:16: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
src/pycrdt/store/store.py:16: error: Skipping analyzing "lz4": module is installed, but missing library stubs or py.typed marker  [import-untyped]

Darshan808 (Contributor, Author) commented on May 8, 2025

I compared three different compression methods and compiled the results below.

Note: The .db file used in this test originates from a directory with notebooks containing large code cells, heavy outputs, and numerous visualizations. Due to the size and complexity of these yupdates, the compression yielded significant space savings.

However, these results may not be representative for lighter notebooks with smaller cells and minimal outputs. Compression efficiency will vary depending on the nature of the data.

Database Size Before Compression

  • Original Size: 1.51 GB

Compression Results

| Method | Compressed Size | Space Saved | Compression Time | Decompression Time |
| ------ | --------------- | ----------- | ---------------- | ------------------ |
| gzip   | 154.14 MB       | 89.8%       | 15.23 seconds    | 2.39 seconds       |
| lz4    | 214.63 MB       | 85.8%       | 8.81 seconds     | 1.41 seconds       |
| zstd   | 123.98 MB       | 91.8%       | 8.31 seconds     | 0.99 seconds       |

For smaller yupdates, all methods take negligible time to compress, but we can see some difference in speed when larger yupdates are compressed.

⏱ Per-Row Example (17.2 KB input)

| Method | Compressed Size | Time Taken |
| ------ | --------------- | ---------- |
| gzip   | 2.57 KB         | 0.21 ms    |
| lz4    | 3.79 KB         | 0.01 ms    |
| zstd   | 2.59 KB         | 0.04 ms    |

Conclusion

  • Best Compression Ratio: zstd (91.8%)
  • Fastest Compression: lz4 (best for low-latency use)

Based on the results, I recommend using zstd for compression, which would add a dependency on the zstandard library. It offers the best balance between compression ratio and speed.

However, if avoiding third-party dependencies is a priority, we could fall back to using the built-in gzip module. While it provides a good compression ratio, it is noticeably slower than zstd and lz4.
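
For context, numbers like the ones above can be produced with a simple script along these lines (a hypothetical sketch; the yupdates table and yupdate column names are assumptions about the ystore schema, and the timing code is not part of this PR):

```python
import gzip
import sqlite3
import time

import lz4.frame
import zstandard

def benchmark(db_path: str) -> None:
    # Read every stored update blob from the ystore database.
    rows = sqlite3.connect(db_path).execute("SELECT yupdate FROM yupdates").fetchall()
    original_size = sum(len(row[0]) for row in rows)
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "lz4": (lz4.frame.compress, lz4.frame.decompress),
        "zstd": (zstandard.ZstdCompressor().compress, zstandard.ZstdDecompressor().decompress),
    }
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        compressed = [compress(row[0]) for row in rows]
        middle = time.perf_counter()
        for blob in compressed:
            decompress(blob)
        end = time.perf_counter()
        saved = 1 - sum(map(len, compressed)) / original_size
        print(f"{name}: saved={saved:.1%} compress={middle - start:.2f}s decompress={end - middle:.2f}s")
```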

Darshan808 (Contributor, Author)

I found something interesting: when compressing very small updates, the compression algorithms can sometimes increase the size.

| Compression Algorithm | Original Size | Compressed Size |
| --------------------- | ------------- | --------------- |
| gzip                  | 26 bytes      | 37 bytes        |
| lz4                   | 26 bytes      | 48 bytes        |
| zstd                  | 26 bytes      | 32 bytes        |

These algorithms add metadata and headers to manage the compression process, and for very small data, this overhead can sometimes outweigh the compression benefit.
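
This is easy to reproduce with the standard library alone (a trivial illustration, not code from the PR):

```python
import gzip
import zlib

update = b"small 26-byte test update!"  # stand-in for a tiny yupdate
print(len(update))                 # 26
print(len(gzip.compress(update)))  # larger than the input: gzip's header and trailer dominate
print(len(zlib.compress(update)))  # also larger, though zlib's overhead is smaller
```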

@krassowski

Thank you for the comparison. I think python-zstandard is a reasonable dependency (it is an optional dependency of pandas), as is python-lz4, though I am not sure if we should introduce a hard dependency on either of them.

  1. Could the compression algorithm be user-configurable?
  2. Could you also test some more of the built-in Python compression libraries, in addition to gzip, to see if we could use any of them as a safe, dependency-free default?

krassowski commented on May 8, 2025

From discussion in jupyter-server meeting, some other libraries to consider:

  • https://github.com/google/brotli: the advantage here is that the Python bindings are maintained directly by Google (as compared to zstandard and lz4)
  • anything from the Rust ecosystem, since pycrdt already lives on top of the Rust ecosystem.

davidbrochart (Collaborator)

I feel that we are reinventing the wheel. Projects like Zarr have done a lot of work in this field; maybe we could just use numcodecs?

Darshan808 (Contributor, Author) commented on May 13, 2025

Additional Findings:

The following additional compression methods were tested on the same .db file.
Again, results may vary for lighter notebooks.


Compression Results for Full .db File

| Method                           | Compressed Size (MiB) | Space Saved (%) | Compression Time | Decompression Time |
| -------------------------------- | --------------------- | --------------- | ---------------- | ------------------ |
| lzma                             | 125.30                | 91.9            | 26.03 s          | 2.92 s             |
| bz2                              | 209.60                | 86.5            | 102.83 s         | 18.68 s            |
| zlib                             | 256.80                | 83.4            | 14.60 s          | 3.02 s             |
| brotli                           | 143.09                | 90.8            | 9.61 s           | 1.74 s             |
| numcodecs (Blosc zstd, clevel=5) | 146.89                | 90.4            | 16.21 s          | 0.81 s             |

⏱ Per-Row Example (17,566 bytes input)

| Method                           | Compressed (bytes) | Compressed (KiB) | Time Taken |
| -------------------------------- | ------------------ | ---------------- | ---------- |
| lzma                             | 2,704              | 2.64             | 0.60 ms    |
| bz2                              | 2,998              | 2.93             | 1.34 ms    |
| zlib                             | 3,079              | 3.01             | 0.12 ms    |
| brotli                           | 2,953              | 2.88             | 0.05 ms    |
| numcodecs (Blosc zstd, clevel=5) | 2,500              | 2.44             | 0.27 ms    |

Recommendation

  • For the best balance of compression and speed, zstd or brotli are recommended, both offering excellent performance.

Which one should be preferred?

Darshan808 (Contributor, Author)

Compression Results for Small Updates

| Compression Algorithm | Original Size | Compressed Size | Compression Time |
| --------------------- | ------------- | --------------- | ---------------- |
| numcodecs             | 28 bytes      | 44 bytes        | 0.01 ms          |
| brotli                | 28 bytes      | 32 bytes        | 0.01 ms          |
| zlib                  | 28 bytes      | 34 bytes        | 0.01 ms          |
| bz2                   | 28 bytes      | 75 bytes        | 0.01 ms          |
| lzma                  | 28 bytes      | 84 bytes        | 0.02 ms          |

Conclusion

For smaller data, brotli provides the smallest compressed size and minimal overhead.

krassowski commented on May 13, 2025

brotli does look very promising indeed. How much more work would it be to add decompression time to the tables above?

One more consideration is that numcodecs depends on numpy, which is a rather heavy dependency (a 5-16 MB wheel for numpy, depending on platform, plus 0.8-8 MB for numcodecs itself), and users will often have a different version installed/required. For comparison, brotli is 0.3-3 MB.

Darshan808 (Contributor, Author) commented on May 13, 2025

How much more work would it be to add decompression time to the tables above?

Not much; a simple script did the job. I’ve added all the decompression times to the tables above.

If no further discussion or research is needed, should I start switching the code from LZ4 to Brotli?

@krassowski

If no further discussion or research is needed, should I start switching the code from LZ4 to Brotli?

My remaining concern is how we can make this future-proof:

  • Should we store information about the compression algorithm used somewhere (probably not on a per-update basis, maybe per document)?
  • Should we have a fallback for when the compression library of choice is not available?
  • Should it be user-configurable?

Curious to hear from @davidbrochart too

davidbrochart (Collaborator)

I'm not sure compressing each row individually is the right solution. Maybe splitting the database into separate files and compressing these entire files would work better for us. The older files are less often accessed, so adding a decompression step before accessing them is fine, and this should yield better compression ratios.

davidbrochart (Collaborator)

In any case, it seems that we could entirely externalize row compression/decompression by allowing users to register the respective callbacks. That would keep pycrdt-store agnostic to compression algorithms and let people use what they want.
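
Roughly, the registration could look like this (an illustrative sketch only; the import path, class name, and subclassing approach are assumptions, though the _compress/_decompress hooks mirror the ones discussed in the review below):

```python
import lz4.frame
from pycrdt.store import SQLiteYStore  # assumed import path and class name

class CompressedYStore(SQLiteYStore):
    # Register compression/decompression callbacks; the store itself stays codec-agnostic.
    _compress = staticmethod(lambda update: lz4.frame.compress(update, compression_level=0))
    _decompress = staticmethod(lz4.frame.decompress)
```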

Comment on lines 337 to 338
_compress: Callable[[bytes], bytes] = staticmethod(lambda b: b) # type: ignore[assignment]
_decompress: Callable[[bytes], bytes] = staticmethod(lambda b: b) # type: ignore[assignment]
Collaborator

Maybe have a default value of None?

Darshan808 (Contributor, Author)

I did this to avoid having to check whether the methods are set on every call. But I’m open to using None as a default.
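
For reference, the None-default alternative would look roughly like this (a sketch, not code from the PR); it trades the identity-function default for an explicit check on every call:

```python
from typing import Callable, Optional

class Store:  # illustrative stand-in for the ystore class
    _compress: Optional[Callable[[bytes], bytes]] = None
    _decompress: Optional[Callable[[bytes], bytes]] = None

    def _apply_compress(self, update: bytes) -> bytes:
        # With a None default, every write needs this check;
        # the identity default used in the PR avoids it.
        return self._compress(update) if self._compress is not None else update
```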

Darshan808 requested a review from davidbrochart on May 16, 2025.
davidbrochart (Collaborator) left a comment:

Thanks @Darshan808.

davidbrochart changed the title from "Compress yupdates to reduce SQLite DB file size" to "Add callback for compressing updates in sqlite store" on May 19, 2025.
davidbrochart merged commit 0828c0b into y-crdt:main on May 19, 2025.
19 checks passed