
Add callback for compressing updates in sqlite store #8


Merged · 15 commits into y-crdt:main · May 19, 2025

Conversation

Darshan808 (Contributor)

References jupyterlab/jupyter-collaboration#430

Summary

This PR introduces compression of yupdates using the lz4 package to reduce the size of the .ystore.db file created by pycrdt-store.

Problem

For frequently edited or large documents, the .ystore.db file grows excessively large. Users have repeatedly raised concerns about its size.

To address this, we currently have:

  • document_ttl
  • history_length or history_size in this PR

However, these approaches involve tradeoffs such as deleting history. A better long-term solution is to reduce the size of each update stored in the database.

Proposed Solution

This PR compresses each yupdate using lz4.frame.compress() before storing it in the database, and decompresses using lz4.frame.decompress() during retrieval.

Key details:

  • Uses lz4.frame.compress() with compression_level=0 for a balance between speed and compression ratio (see the sketch after this list).
  • Read and write performance is nearly unaffected, making it suitable for real-time editing.
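
A minimal sketch of the idea (the helper names are illustrative, not the actual pycrdt-store API):

```python
import lz4.frame

def compress_update(update: bytes) -> bytes:
    # compression_level=0 favours speed over compression ratio,
    # which suits frequent, real-time updates
    return lz4.frame.compress(update, compression_level=0)

def decompress_update(data: bytes) -> bytes:
    return lz4.frame.decompress(data)
```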

Benefits

While small frequent updates don’t compress significantly, large updates benefit greatly.

Example Results:

  • A .db file of size 192 MB was reduced to 64 MB using this compression method.
  • A project folder (which uses Jupyter Collaboration) with .db files totaling 1 GB was reduced to 200 MB after applying this change.

This allows:

  • Longer history retention
  • Lower disk usage

Migration Strategy

No changes are required from existing users: the store detects whether a stored update is compressed and handles both old (uncompressed) and new (compressed) updates.
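
One possible way to handle that detection (not necessarily exactly what this PR does) is to check for the LZ4 frame magic number at the start of each stored blob; the helper below is a hypothetical sketch:

```python
import lz4.frame

LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # little-endian encoding of 0x184D2204

def maybe_decompress(data: bytes) -> bytes:
    # Old rows were written uncompressed; new rows start with the LZ4 frame magic.
    if data.startswith(LZ4_FRAME_MAGIC):
        return lz4.frame.decompress(data)
    return data
```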

krassowski added the enhancement (New feature or request) label on May 7, 2025
@krassowski

Example Results:

  • A .db file of size 192 MB was reduced to 64 MB using this compression method.
  • A project folder (uses jupyter collaboration) with .db files totaling 1 GB was reduced to 200 MB after applying this change.

This sounds very promising! In addition, some comparison of the loading time would be interesting. I suspect that on slower hard drives the cost of decompression will be less than the cost of reading more data from the drive, but I wonder what the trade-off is on an SSD.

I think the crucial question would be whether adding a dependency on lz4 is worth it or if we should instead just use built-in zlib (or maybe even gzip). Could you compare the speed/size for the three approaches?

It looks like mypy complains. There is a PR upstream, python-lz4/python-lz4#295, so if we do want to keep lz4, for now maybe we could mark it as ignored in pyproject.toml (see the sketch after the error output below):

src/pycrdt/store/store.py:16: error: Skipping analyzing "lz4.frame": module is installed, but missing library stubs or py.typed marker  [import-untyped]
src/pycrdt/store/store.py:16: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
src/pycrdt/store/store.py:16: error: Skipping analyzing "lz4": module is installed, but missing library stubs or py.typed marker  [import-untyped]

Darshan808 (Contributor, Author) commented on May 8, 2025

I compared three different compression methods and compiled the results below.

Note: The .db file used in this test originates from a directory with notebooks containing large code cells, heavy outputs, and numerous visualizations. Due to the size and complexity of these yupdates, the compression yielded significant space savings.

However, these results may not be representative for lighter notebooks with smaller cells and minimal outputs. Compression efficiency will vary depending on the nature of the data.

Database Size Before Compression

  • Original Size: 1.51 GB

Compression Results

| Method | Compressed Size | Space Saved | Compression Time | Decompression Time |
| ------ | --------------- | ----------- | ---------------- | ------------------ |
| gzip   | 154.14 MB       | 89.8%       | 15.23 seconds    | 2.39 seconds       |
| lz4    | 214.63 MB       | 85.8%       | 8.81 seconds     | 1.41 seconds       |
| zstd   | 123.98 MB       | 91.8%       | 8.31 seconds     | 0.99 seconds       |

For smaller yupdates, all methods take negligible time to compress, but we can see some difference in speed when larger yupdates are compressed.

⏱ Per-Row Example (17.2 KB input)

| Method | Compressed Size | Time Taken |
| ------ | --------------- | ---------- |
| gzip   | 2.57 KB         | 0.21 ms    |
| lz4    | 3.79 KB         | 0.01 ms    |
| zstd   | 2.59 KB         | 0.04 ms    |

Conclusion

  • Best Compression Ratio: zstd (91.8%)
  • Fastest Compression: lz4 (best for low-latency use)

Based on the results, I recommend using zstd for compression, which would add a dependency on the zstandard library. It offers the best balance between compression ratio and speed.

However, if avoiding third-party dependencies is a priority, we could fall back to using the built-in gzip module. While it provides a good compression ratio, it is noticeably slower than zstd and lz4.
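
For context, numbers like the ones above can be produced with a simple script along these lines (a hypothetical sketch; the yupdates table and yupdate column names are assumptions about the ystore schema, and the timing code is not part of this PR):

```python
import gzip
import sqlite3
import time

import lz4.frame
import zstandard

def benchmark(db_path: str) -> None:
    # Read every stored update blob from the ystore database.
    rows = sqlite3.connect(db_path).execute("SELECT yupdate FROM yupdates").fetchall()
    original_size = sum(len(row[0]) for row in rows)
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "lz4": (lz4.frame.compress, lz4.frame.decompress),
        "zstd": (zstandard.ZstdCompressor().compress, zstandard.ZstdDecompressor().decompress),
    }
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        compressed = [compress(row[0]) for row in rows]
        middle = time.perf_counter()
        for blob in compressed:
            decompress(blob)
        end = time.perf_counter()
        saved = 1 - sum(map(len, compressed)) / original_size
        print(f"{name}: saved={saved:.1%} compress={middle - start:.2f}s decompress={end - middle:.2f}s")
```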

Darshan808 (Contributor, Author)

I found something interesting: when compressing very small updates, the compression algorithms can sometimes increase the size.

| Compression Algorithm | Original Size | Compressed Size |
| --------------------- | ------------- | --------------- |
| gzip                  | 26 bytes      | 37 bytes        |
| lz4                   | 26 bytes      | 48 bytes        |
| zstd                  | 26 bytes      | 32 bytes        |

These algorithms add metadata and headers to manage the compression process, and for very small data, this overhead can sometimes outweigh the compression benefit.
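
This is easy to reproduce with the standard library alone (a trivial illustration, not code from the PR):

```python
import gzip
import zlib

update = b"small 26-byte test update!"  # stand-in for a tiny yupdate
print(len(update))                 # 26
print(len(gzip.compress(update)))  # larger than the input: gzip's header and trailer dominate
print(len(zlib.compress(update)))  # also larger, though zlib's overhead is smaller
```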

@krassowski

Thank you for the comparison. I think python-zstandard is a reasonable dependency (it is an optional dependency of pandas), as is python-lz4, though I am not sure if we should introduce a hard dependency on either of them.

  1. Could the compression algorithm be user-configurable?
  2. Could you also test some more of the built-in Python compression libraries, in addition to gzip, to see if we could use any of them as a safe, dependency-free default?

krassowski commented on May 8, 2025

From discussion in jupyter-server meeting, some other libraries to consider:

  • https://github.com/google/brotli: the advantage here is that the Python bindings are maintained directly by Google (as compared to zstandard and lz4)
  • anything from the Rust ecosystem, since pycrdt already lives on top of the Rust ecosystem.

davidbrochart (Collaborator)

I feel that we are reinventing the wheel. Projects like Zarr have done a lot of work in this field; maybe we could just use numcodecs?

Darshan808 (Contributor, Author) commented on May 13, 2025

Additional Findings:

The following additional compression methods were tested on the same .db file.
Again, results may vary for lighter notebooks.


Compression Results for Full .db File

| Method                           | Compressed Size (MiB) | Space Saved (%) | Compression Time | Decompression Time |
| -------------------------------- | --------------------- | --------------- | ---------------- | ------------------ |
| lzma                             | 125.30                | 91.9            | 26.03 s          | 2.92 s             |
| bz2                              | 209.60                | 86.5            | 102.83 s         | 18.68 s            |
| zlib                             | 256.80                | 83.4            | 14.60 s          | 3.02 s             |
| brotli                           | 143.09                | 90.8            | 9.61 s           | 1.74 s             |
| numcodecs (Blosc zstd, clevel=5) | 146.89                | 90.4            | 16.21 s          | 0.81 s             |

⏱ Per-Row Example (17,566 bytes input)

| Method                           | Compressed (bytes) | Compressed (KiB) | Time Taken |
| -------------------------------- | ------------------ | ---------------- | ---------- |
| lzma                             | 2,704              | 2.64             | 0.60 ms    |
| bz2                              | 2,998              | 2.93             | 1.34 ms    |
| zlib                             | 3,079              | 3.01             | 0.12 ms    |
| brotli                           | 2,953              | 2.88             | 0.05 ms    |
| numcodecs (Blosc zstd, clevel=5) | 2,500              | 2.44             | 0.27 ms    |

Recommendation

  • For the best balance of compression and speed, zstd or brotli are recommended, both offering excellent performance.

Which one should be preferred?

Darshan808 (Contributor, Author)

Compression Results for Small Updates

| Compression Algorithm | Original Size | Compressed Size | Compression Time |
| --------------------- | ------------- | --------------- | ---------------- |
| numcodecs             | 28 bytes      | 44 bytes        | 0.01 ms          |
| brotli                | 28 bytes      | 32 bytes        | 0.01 ms          |
| zlib                  | 28 bytes      | 34 bytes        | 0.01 ms          |
| bz2                   | 28 bytes      | 75 bytes        | 0.01 ms          |
| lzma                  | 28 bytes      | 84 bytes        | 0.02 ms          |

Conclusion

For smaller data, brotli provides the smallest compressed size and minimal overhead.

krassowski commented on May 13, 2025

brotli does look very promising indeed. How much more work would it be to add decompression time to the tables above?

One more consideration is that numcodecs depends on numpy, which is a rather heavy dependency (a 5-16 MB wheel for numpy, depending on platform, plus 0.8-8 MB for numcodecs itself), and users will often have a different version installed/required. For comparison, brotli is 0.3-3 MB.

Darshan808 (Contributor, Author) commented on May 13, 2025

How much more work would it be to add decompression time to the tables above?

Not much; a simple script did the job. I’ve added all the decompression times to the tables above.

If no further discussion or research is needed, should I start switching the code from LZ4 to Brotli?

@krassowski

If no further discussion or research is needed, should I start switching the code from LZ4 to Brotli?

My remaining concern is how we can make this future-proof:

  • Should we store information about the compression algorithm used somewhere (probably not on a per-update basis, maybe per document)?
  • Should we have a fallback for when the compression library of choice is not available?
  • Should it be user-configurable?

Curious to hear from @davidbrochart too

davidbrochart (Collaborator)

I'm not sure compressing each row individually is the right solution. Maybe splitting the database into separate files and compressing these entire files would work better for us. The older files are less often accessed, so adding a decompression step before accessing them is fine, and this should yield better compression ratios.

davidbrochart (Collaborator)

In any case, it seems that we could entirely externalize row compression/decompression by allowing users to register the respective callbacks. That would keep pycrdt-store agnostic to compression algorithms and let people use what they want.
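
Roughly, the registration could look like this (an illustrative sketch only; the import path, class name, and subclassing approach are assumptions, though the _compress/_decompress hooks mirror the ones discussed in the review below):

```python
import lz4.frame
from pycrdt.store import SQLiteYStore  # assumed import path and class name

class CompressedYStore(SQLiteYStore):
    # Register compression/decompression callbacks; the store itself stays codec-agnostic.
    _compress = staticmethod(lambda update: lz4.frame.compress(update, compression_level=0))
    _decompress = staticmethod(lz4.frame.decompress)
```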

Comment on lines 337 to 338
_compress: Callable[[bytes], bytes] = staticmethod(lambda b: b) # type: ignore[assignment]
_decompress: Callable[[bytes], bytes] = staticmethod(lambda b: b) # type: ignore[assignment]
Collaborator

Maybe have a default value of None?

Darshan808 (Contributor, Author)

I did this to avoid having to check whether the methods are set on every call. But I’m open to using None as a default.
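
For reference, the None-default alternative would look roughly like this (a sketch, not code from the PR); it trades the identity-function default for an explicit check on every call:

```python
from typing import Callable, Optional

class Store:  # illustrative stand-in for the ystore class
    _compress: Optional[Callable[[bytes], bytes]] = None
    _decompress: Optional[Callable[[bytes], bytes]] = None

    def _apply_compress(self, update: bytes) -> bytes:
        # With a None default, every write needs this check;
        # the identity default used in the PR avoids it.
        return self._compress(update) if self._compress is not None else update
```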

Darshan808 requested a review from davidbrochart on May 16, 2025.
davidbrochart (Collaborator) left a comment:

Thanks @Darshan808.

davidbrochart changed the title from "Compress yupdates to reduce SQLite DB file size" to "Add callback for compressing updates in sqlite store" on May 19, 2025.
davidbrochart merged commit 0828c0b into y-crdt:main on May 19, 2025.
19 checks passed