
Add history_length and min_cleanup_interval configurables and pruning logic #2


Open · wants to merge 6 commits into main

Conversation

@Darshan808 (Contributor) commented Apr 30, 2025

Fixes y-crdt/pycrdt-websocket#117

Description

  • Introduce two new attributes on SQLiteYStore:
    • history_length: maximum age (in seconds) of history to retain (None = infinite)
    • min_cleanup_interval: minimum seconds between automatic cleanups
  • Update pruning logic to trigger once the oldest entry is older than history_length + min_cleanup_interval, then remove everything older than now - history_length
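For illustration, a minimal sketch of such a pruning pass, assuming a `yupdates (path, yupdate, metadata, timestamp)` table like the one SQLiteYStore keeps its updates in; the `_prune_history` helper and its explicit connection argument are hypothetical, not the actual implementation:

```python
import time

import aiosqlite


async def _prune_history(
    db: aiosqlite.Connection,
    path: str,
    history_length: float | None,
    min_cleanup_interval: float,
) -> None:
    """Drop updates older than `history_length` seconds for one document."""
    if history_length is None:
        # None means "keep infinite history": never prune.
        return
    cursor = await db.execute(
        "SELECT MIN(timestamp) FROM yupdates WHERE path = ?", (path,)
    )
    (oldest,) = await cursor.fetchone()
    now = time.time()
    # Only clean up once the oldest entry has aged past
    # history_length + min_cleanup_interval, so pruning does not run on every write.
    if oldest is None or now - oldest < history_length + min_cleanup_interval:
        return
    await db.execute(
        "DELETE FROM yupdates WHERE path = ? AND timestamp < ?",
        (path, now - history_length),
    )
    await db.commit()
```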

This approach is inspired by PR y-crdt/pycrdt-websocket#77, which has been inactive for a while.

Note

This PR is based on the branch from PR #1 and depends on the changes introduced there.

@Darshan808 (Contributor, Author)

I initially took the idea of defining history_length based on the time difference between the oldest and newest update from this PR.
However, after reconsidering, and prompted by comment y-crdt/pycrdt-websocket#77 (comment), I believe defining history_length in terms of the .db file size might be a more robust approach.
This way, we can prevent the SQLite database file from growing uncontrollably, regardless of how frequently notebooks are edited.
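For illustration only, a size-based trigger could be as simple as checking the file on disk; the `max_history_size` name is hypothetical:

```python
import os


def db_exceeds_cap(db_path: str, max_history_size: int) -> bool:
    """Return True once the SQLite file on disk has grown past the configured cap."""
    return os.path.exists(db_path) and os.path.getsize(db_path) > max_history_size
```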

Open to suggestions on this:

  • Should we maintain both history_length (based on time) and history_size (based on file size)?

  • Also, should we define a sensible default value for history_size?

CC @davidbrochart @krassowski

@krassowski

I think trimming or otherwise compressing once the history size for any notebook reaches 1 GB would be reasonable; it is rare to see notebooks larger than 500 MB.

@Darshan808 (Contributor, Author) commented May 2, 2025

I think setting a fixed, large max_history_size can actually hurt users who don't often work with large notebooks. For example, in PR jupyterlab/jupyter-collaboration#430 (comment), the author notes that even opening a notebook whose .db file has grown to 100 MB can feel sluggish.

Instead, I’d propose an adaptive cap strategy:

  1. Start with a reasonable base cap, say 25 MB.
  2. Whenever the .db file exceeds that cap:
    • Squash the history into a single update
    • Re-measure the .db size on disk (will be proportional to notebook size).
    • Set the new cap to twice that size, giving 100% headroom for further edits.

This way, smaller notebooks don’t pay the penalty of large cap limits, and large notebooks get enough space to retain history without uncontrolled growth.
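A rough sketch of this adaptive cap, assuming pycrdt's `Doc.apply_update`/`Doc.get_update` for squashing; the helper names and the way the cap is recomputed are illustrative only:

```python
import os

from pycrdt import Doc

BASE_CAP = 25 * 1024 * 1024  # 25 MB starting cap


def squash_updates(updates: list[bytes]) -> bytes:
    """Replay every stored update into a fresh Doc and return one merged update."""
    doc = Doc()
    for update in updates:
        doc.apply_update(update)
    return doc.get_update()


def next_cap(db_path: str) -> int:
    """After squashing (and e.g. VACUUM-ing) the database, set the new cap to
    twice the on-disk size, i.e. 100% headroom for further edits."""
    return max(BASE_CAP, 2 * os.path.getsize(db_path))
```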

I’m open to suggestions or alternative approaches here; keen to hear what others think!

@davidbrochart (Collaborator)

Let's try not to talk about notebooks here, since pycrdt-store is independent of Jupyter and this might confuse users.
Actually, I don't think we can trim the database based on its size alone. Because it holds the updates of multiple documents, how would we decide which document's history to trim? And the issue is not really the size of the database, but the size of each document's updates, right?

@krassowski

And the issue is not really the size of the database, but the size of each document's updates, right?

I agree, and I also think we should measure the size of each document. While we can only reliably measure the size each document would take up in RAM, that size should be proportional to the size it takes up when saved on disk in the database; I would think this is a reasonable approach.
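For illustration, one way to estimate a document's on-disk footprint is to sum the byte length of its stored updates, assuming the same `yupdates` table layout as above; the helper name is hypothetical:

```python
import aiosqlite


async def document_history_size(db: aiosqlite.Connection, path: str) -> int:
    """Approximate one document's footprint as the total size of its stored updates."""
    cursor = await db.execute(
        "SELECT COALESCE(SUM(LENGTH(yupdate)), 0) FROM yupdates WHERE path = ?",
        (path,),
    )
    (size,) = await cursor.fetchone()
    return size
```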

@Darshan808
Contributor Author

how would we decide which document history to trim?

I was thinking that when a user edits a document and the database file size has exceeded a certain threshold, we can reasonably assume that the currently edited document has also contributed to the increase. Based on that, we could prioritize squashing the history of the currently active document. While this may create a bias where initial notebooks might have greater history in the database, and later notebooks might not be able to store as much history, it helps ensure that the overall history size (db size) doesn't exceed a specified maximum.
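A sketch of that write-path trigger, assuming the same `yupdates` schema and pycrdt `Doc` API as above; all names here are illustrative, not existing pycrdt-store API:

```python
import os
import time

import aiosqlite
from pycrdt import Doc


async def maybe_squash_active(
    db: aiosqlite.Connection, db_path: str, path: str, max_db_size: int
) -> None:
    """When the whole .db file has outgrown `max_db_size`, squash the history of
    the document currently being edited (`path`)."""
    if os.path.getsize(db_path) <= max_db_size:
        return
    # Merge all of the active document's updates into one.
    cursor = await db.execute(
        "SELECT yupdate FROM yupdates WHERE path = ? ORDER BY timestamp", (path,)
    )
    doc = Doc()
    for (update,) in await cursor.fetchall():
        doc.apply_update(update)
    squashed = doc.get_update()
    # Replace the document's full history with the single squashed update.
    await db.execute("DELETE FROM yupdates WHERE path = ?", (path,))
    await db.execute(
        "INSERT INTO yupdates (path, yupdate, metadata, timestamp) VALUES (?, ?, ?, ?)",
        (path, squashed, b"", time.time()),
    )
    await db.commit()
```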

And the issue is not really the size of the database, but the size of each document's updates, right?

A large number of updates does contribute to the overall size of the database, which in turn causes slower load times and degraded performance in JupyterLab and Notebook, particularly when used with jupyter-collaboration and pycrdt-store. We’ve also received feedback from users reporting they run out of disk space more quickly. So, I believe it's important to enforce a maximum cap on the database size.

I’ve explored this issue in more depth and proposed some possible solutions, such as checkpointing and a compression algorithm for each document's updates, here. I’d really appreciate your thoughts and feedback on the approach.

Labels: enhancement (New feature or request)
Successfully merging this pull request may close: document_ttl squash logic may never be invoked for busy notebooks
3 participants