Add history_length and min_cleanup_interval configurables and pruning logic #2
base: main
Conversation
I initially took the idea of defining [...] Open to suggestions on this.
I think trimming or otherwise compressing once the history size for any notebook reaches 1 GB would be reasonable; it is rare to see notebooks larger than 500 MB.
I think setting a fixed large [...] Instead, I'd propose an adaptive cap strategy: [...] This way, smaller notebooks don't pay the penalty of large cap limits, and large notebooks get enough space to retain history without uncontrolled growth. I'm open to suggestions or alternative approaches here; keen to hear what others think!
Let's try and not talk about notebooks here, since [...]
I agree. I also think we should measure the size of each document. While we can only reliably measure the size these take up in RAM, that should be proportional to the size they take up when saved to disk in the database, so I would think this is a reasonable approach.
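As a rough illustration of the measuring idea (not part of this PR; the helper name is made up), the length of a document's encoded update can serve as a proxy for the space its current state would take in the database:

```python
from pycrdt import Doc

def estimated_doc_size(doc: Doc) -> int:
    # Size of the encoded state of the document: a rough proxy that should
    # scale with the space the document's state takes up on disk.
    # Note: this measures the current state, not the full stored history.
    return len(doc.get_update())
```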
I was thinking that when a user edits a document and the database file size has exceeded a certain threshold, we can reasonably assume that the currently edited document has also contributed to the increase. Based on that, we could prioritize squashing the history of the currently active document. While this may create a bias where initial notebooks might have greater history in the database, and later notebooks might not be able to store as much history, it helps ensure that the overall history size (db size) doesn't exceed a specified maximum.
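A minimal sketch of that idea, assuming a yupdates(path, yupdate, metadata, timestamp) table like the one SQLiteYStore maintains; MAX_DB_SIZE and maybe_squash_active_document are hypothetical names, not part of this PR:

```python
import os
import time

import aiosqlite
from pycrdt import Doc

MAX_DB_SIZE = 1 * 1024**3  # hypothetical 1 GiB cap on the database file


async def maybe_squash_active_document(db_path: str, path: str) -> None:
    """If the database file has grown past the cap, squash the history of
    the currently edited document into a single update (illustrative only)."""
    if os.path.getsize(db_path) <= MAX_DB_SIZE:
        return
    async with aiosqlite.connect(db_path) as db:
        cursor = await db.execute(
            "SELECT yupdate FROM yupdates WHERE path = ?", (path,)
        )
        rows = await cursor.fetchall()
        doc = Doc()
        for (update,) in rows:
            doc.apply_update(update)   # replay the stored history
        squashed = doc.get_update()    # one update encoding the same final state
        await db.execute("DELETE FROM yupdates WHERE path = ?", (path,))
        await db.execute(
            "INSERT INTO yupdates VALUES (?, ?, ?, ?)",
            (path, squashed, b"", time.time()),  # empty metadata, assumed OK here
        )
        await db.commit()
```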
A large number of updates does contribute to the overall size of the database, which in turn causes slower load times and degraded performance in JupyterLab and Notebook, particularly when used with jupyter-collaboration and pycrdt-store. We've also received feedback from users reporting that they run out of disk space more quickly. So I believe it's important to enforce a maximum cap on the database size. I've explored this issue in more depth and proposed some possible solutions, such as checkpointing and a compression algorithm for each document's updates, here. I'd really appreciate your thoughts and feedback on the approach.
Fixes y-crdt/pycrdt-websocket#117
Description
This PR adds two new configurables to SQLiteYStore:
- history_length: maximum age (in seconds) of history to retain (None = infinite)
- min_cleanup_interval: minimum number of seconds between automatic cleanups

Pruning logic: when the oldest stored history for a document is older than history_length + min_cleanup_interval, then remove everything older than now - history_length.
This approach is inspired by PR y-crdt/pycrdt-websocket#77, which appears to have been inactive for a while.
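Below is a minimal sketch of the pruning rule described above, assuming a yupdates table with path and timestamp columns like the one SQLiteYStore uses; the function and its signature are illustrative, not the PR's actual implementation:

```python
import time

import aiosqlite


async def prune_history(
    db: aiosqlite.Connection,
    path: str,
    history_length: float | None,
    min_cleanup_interval: float,
) -> None:
    if history_length is None:
        return  # infinite history: nothing to prune
    now = time.time()
    cursor = await db.execute(
        "SELECT MIN(timestamp) FROM yupdates WHERE path = ?", (path,)
    )
    (oldest,) = await cursor.fetchone()
    # Clean up only once the oldest entry has aged past
    # history_length + min_cleanup_interval, so cleanups stay infrequent.
    if oldest is not None and now - oldest > history_length + min_cleanup_interval:
        await db.execute(
            "DELETE FROM yupdates WHERE path = ? AND timestamp < ?",
            (path, now - history_length),
        )
        await db.commit()
```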
Note
This PR is based on the branch from PR #1 and depends on the changes introduced there.