Add callback for compressing updates in sqlite store #8
Conversation
This sounds very promising! In addition, some comparison of the loading time would be interesting. I suspect that on slower hard drives the cost of decompression will be less than the cost of reading more data from the drive, but I wonder what the trade-off is on SSDs. I think the crucial question is whether adding a dependency on lz4 is worth it.
It looks like mypy complains. There is a PR upstream, python-lz4/python-lz4#295, so if we do want to keep lz4, for now maybe we can mark it as ignored in the mypy configuration.
|
I compared three different compression methods and compiled the results below. Note: these results may not be representative for lighter notebooks with smaller cells and minimal outputs; compression efficiency will vary depending on the nature of the data.
Database Size Before Compression
Compression Results
For smaller yupdates, all methods take negligible time to compress, but we can see some difference in speed when larger yupdates are compressed.
⏱ Per-Row Example (17.2 KB input)
Conclusion
Based on the results, I recommend using zstd for compression. However, if avoiding third-party dependencies is a priority, we could fall back to one of the built-in compression modules instead. |
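For reference, a minimal sketch of what the zstd route could look like; it assumes the zstandard package (which binding we would actually use is not settled) and an illustrative compression level:

```python
import zstandard

# Hypothetical helpers, not part of pycrdt-store.
_cctx = zstandard.ZstdCompressor(level=5)   # level chosen for illustration only
_dctx = zstandard.ZstdDecompressor()

def compress_update(update: bytes) -> bytes:
    return _cctx.compress(update)

def decompress_update(blob: bytes) -> bytes:
    return _dctx.decompress(blob)
```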
I found something interesting: when compressing very small updates, the compression algorithms can sometimes increase the size.
These algorithms add metadata and headers to manage the compression process, and for very small data, this overhead can sometimes outweigh the compression benefit. |
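A quick way to see this effect with the standard library's zlib (illustrative only; exact sizes vary slightly by codec and settings):

```python
import zlib

tiny_update = b"{}"                 # a very small yupdate-like payload
compressed = zlib.compress(tiny_update)

# For tiny inputs the container overhead dominates, so the "compressed"
# blob comes out larger than the original.
print(len(tiny_update), len(compressed))   # e.g. 2 vs ~10 bytes
```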
Thank you for the comparison. I think
|
From the discussion in the jupyter-server meeting, some other libraries to consider:
|
I feel that we are reinventing the wheel. Projects like Zarr have done a lot of work in that field; maybe we could just use numcodecs? |
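If we went the numcodecs route, the usage could look roughly like this (a sketch, assuming the Blosc zstd codec that appears in the measurements below):

```python
from numcodecs import Blosc

codec = Blosc(cname="zstd", clevel=5, shuffle=Blosc.NOSHUFFLE)

update = b"example yupdate bytes" * 100
compressed = codec.encode(update)
restored = bytes(codec.decode(compressed))   # decode may return a buffer, so normalize to bytes
assert restored == update
```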
Additional Findings: The following additional compression methods were tested on the same database.
Compression Results for the Full Database
|
| Method | Compressed Size (MiB) | Space Saved (%) | Compression Time | Decompression Time |
|---|---|---|---|---|
| lzma | 125.30 | 91.9% | 26.03 s | 2.92 s |
| bz2 | 209.60 | 86.5% | 102.83 s | 18.68 s |
| zlib | 256.80 | 83.4% | 14.60 s | 3.02 s |
| brotli | 143.09 | 90.8% | 9.61 s | 1.74 s |
| numcodecs (Blosc zstd, clevel=5) | 146.89 | 90.4% | 16.21 s | 0.81 s |
⏱ Per-Row Example (17,566 bytes input)
| Method | Compressed (bytes) | Compressed (KiB) | Time Taken |
|---|---|---|---|
| lzma | 2,704 | 2.64 KiB | 0.60 ms |
| bz2 | 2,998 | 2.93 KiB | 1.34 ms |
| zlib | 3,079 | 3.01 KiB | 0.12 ms |
| brotli | 2,953 | 2.88 KiB | 0.05 ms |
| numcodecs (Blosc zstd, clevel=5) | 2,500 | 2.44 KiB | 0.27 ms |
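For context, per-row numbers like the ones above can be gathered with a short script along these lines; the database path, table, and column names are assumptions and may not match the actual ystore schema:

```python
import bz2, lzma, sqlite3, time, zlib

CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

con = sqlite3.connect(".ystore.db")                    # path is an assumption
rows = con.execute("SELECT yupdate FROM yupdates")     # schema is an assumption

for (update,) in rows:
    for name, compress in CODECS.items():
        start = time.perf_counter()
        out = compress(update)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{name}: {len(update)} -> {len(out)} bytes in {elapsed_ms:.2f} ms")
```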
Recommendation
- For the best balance of compression and speed, zstd or brotli are recommended, both offering excellent performance. Which one should be preferred?
Compression Results for Small Updates
Conclusion
For smaller data, |
One more consideration is that |
Not much; a simple script did the job. I’ve added all the decompression times to the tables above. If no further discussion or research is needed, should I start switching the code from LZ4 to Brotli? |
My remaining concern is how we can make this future-proof:
Curious to hear from @davidbrochart too |
I'm not sure compressing each row individually is the right solution. Maybe splitting the database into separate files and compressing these entire files would work better for us. The older files are less often accessed, so adding a decompression step before accessing them is fine, and this should yield better compression ratios. |
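As a rough illustration of that idea, older and rarely accessed database segments could be compressed as whole files with the standard library; the file naming here is hypothetical:

```python
import lzma
from pathlib import Path

def archive_segment(path: Path) -> Path:
    """Compress an old database segment in one pass, e.g. notebook.ystore.1.db -> notebook.ystore.1.db.xz."""
    archived = path.with_name(path.name + ".xz")
    archived.write_bytes(lzma.compress(path.read_bytes(), preset=6))
    path.unlink()                      # keep only the compressed copy
    return archived
```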
In any case, it seems that we could entirely externalize row compression/decompression by allowing users to register the respective callbacks. That would keep |
src/pycrdt/store/store.py
Outdated
_compress: Callable[[bytes], bytes] = staticmethod(lambda b: b)  # type: ignore[assignment]
_decompress: Callable[[bytes], bytes] = staticmethod(lambda b: b)  # type: ignore[assignment]
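To illustrate how a caller might plug real codecs into these hooks, here is a sketch based on this now-outdated revision; the final API may differ, and the import path and class name are assumptions:

```python
import lz4.frame
from pycrdt.store import SQLiteYStore   # assumed import path

class Lz4YStore(SQLiteYStore):
    # Override the identity defaults shown above with real codec callables.
    _compress = staticmethod(lz4.frame.compress)
    _decompress = staticmethod(lz4.frame.decompress)
```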
Maybe have a default value of None?
I did this to avoid having to check whether the methods are set on every call. But I’m open to using None as a default.
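For comparison, a simplified sketch of the None-default alternative being discussed, where every call site needs an explicit check; the class and method names are made up for illustration:

```python
from typing import Callable, Optional

class StoreWithOptionalCallbacks:
    _compress: Optional[Callable[[bytes], bytes]] = None
    _decompress: Optional[Callable[[bytes], bytes]] = None

    def _encode(self, data: bytes) -> bytes:
        # With a None default, the check happens on every write...
        return self._compress(data) if self._compress is not None else data

    def _decode(self, data: bytes) -> bytes:
        # ...and on every read, which the identity-lambda default avoids.
        return self._decompress(data) if self._decompress is not None else data
```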
Co-authored-by: David Brochart <[email protected]>
Thanks @Darshan808.
References jupyterlab/jupyter-collaboration#430
Summary
This PR introduces compression of yupdates using the lz4 package to reduce the size of the .ystore.db file created by pycrdt-store.
Problem
For frequently edited or large documents, the .ystore.db file grows excessively large. Users have repeatedly raised concerns about its size. To address this we have:
- document_ttl
- history_length or history_size in this PR

However, these approaches involve tradeoffs such as deleting history. A better long-term solution is to reduce the size of each update stored in the database.
Proposed Solution
This PR compresses each yupdate using lz4.frame.compress() before storing it in the database, and decompresses using lz4.frame.decompress() during retrieval.
Key details:
- lz4.frame.compress() is used with compression_level=0 for a balance between speed and compression ratio.
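A minimal sketch of that call pattern (the payload here is synthetic):

```python
import lz4.frame

update = b"synthetic yupdate payload " * 1000

# compression_level=0 is lz4's fast mode, trading some ratio for speed.
stored = lz4.frame.compress(update, compression_level=0)
restored = lz4.frame.decompress(stored)

assert restored == update
print(f"{len(update)} -> {len(stored)} bytes")
```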
Benefits
While small frequent updates don’t compress significantly, large updates benefit greatly.
Example Results:
- A .db file of size 192 MB was reduced to 64 MB using this compression method.
- A setup (running jupyter collaboration) with .db files totaling 1 GB was reduced to 200 MB after applying this change.

This allows:
Migration Strategy
No changes are required from existing users because the system detects compression and handles both old (uncompressed) and new (compressed) updates.
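One way such detection can work is by checking the LZ4 frame magic bytes before decompressing; this is an illustrative sketch, not necessarily the exact logic used in the PR:

```python
import lz4.frame

_LZ4_MAGIC = b"\x04\x22\x4d\x18"   # LZ4 frame magic number (0x184D2204, little-endian)

def read_update(raw: bytes) -> bytes:
    """Return the yupdate bytes, decompressing only blobs written by the new code."""
    if raw[:4] == _LZ4_MAGIC:
        return lz4.frame.decompress(raw)
    # Legacy rows were stored uncompressed, so pass them through unchanged.
    return raw
```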