
Add history_length and min_cleanup_interval configurables and pruning logic #2


Open · wants to merge 6 commits into main

Conversation

@Darshan808 (Contributor) commented Apr 30, 2025

Fixes y-crdt/pycrdt-websocket#117

Description

  • Introduce two new attributes on SQLiteYStore:
    • history_length: maximum age (in seconds) of history to retain (None = infinite)
    • min_cleanup_interval: minimum seconds between automatic cleanups
  • Update pruning logic to trigger once the oldest entry is older than history_length + min_cleanup_interval, then remove everything older than now - history_length
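For illustration, a minimal sketch of such a pruning pass, assuming a `yupdates (path, yupdate, metadata, timestamp)` table like the one SQLiteYStore keeps its updates in; the `_prune_history` helper and its explicit connection argument are hypothetical, not the actual implementation:

```python
import time

import aiosqlite


async def _prune_history(
    db: aiosqlite.Connection,
    path: str,
    history_length: float | None,
    min_cleanup_interval: float,
) -> None:
    """Drop updates older than `history_length` seconds for one document."""
    if history_length is None:
        # None means "keep infinite history": never prune.
        return
    cursor = await db.execute(
        "SELECT MIN(timestamp) FROM yupdates WHERE path = ?", (path,)
    )
    (oldest,) = await cursor.fetchone()
    now = time.time()
    # Only clean up once the oldest entry has aged past
    # history_length + min_cleanup_interval, so pruning does not run on every write.
    if oldest is None or now - oldest < history_length + min_cleanup_interval:
        return
    await db.execute(
        "DELETE FROM yupdates WHERE path = ? AND timestamp < ?",
        (path, now - history_length),
    )
    await db.commit()
```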

This approach is inspired by PR y-crdt/pycrdt-websocket#77, which has been inactive for a while.

Note

This PR is based on the branch from PR #1 and depends on the changes introduced there.

@Darshan808 (Contributor, Author)

I initially took the idea of defining history_length based on the time difference between the oldest and newest update from this PR.
However, after reconsidering, and prompted by comment y-crdt/pycrdt-websocket#77 (comment), I believe defining history_length in terms of the .db file size might be a more robust approach.
This way, we can prevent the SQLite database file from growing uncontrollably, regardless of how frequently notebooks are edited.
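For illustration only, a size-based trigger could be as simple as checking the file on disk; the `max_history_size` name is hypothetical:

```python
import os


def db_exceeds_cap(db_path: str, max_history_size: int) -> bool:
    """Return True once the SQLite file on disk has grown past the configured cap."""
    return os.path.exists(db_path) and os.path.getsize(db_path) > max_history_size
```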

Open to suggestions on this:

  • Should we maintain both history_length (based on time) and history_size (based on file size)?

  • Also, should we define a sensible default value for history_size?

CC @davidbrochart @krassowski

@krassowski

I think trimming or otherwise compressing once the history size for any notebook reaches 1 GB would be reasonable; it is rare to see notebooks larger than 500 MB.

@Darshan808 (Contributor, Author) commented May 2, 2025

I think setting a fixed, large max_history_size can actually hurt users who don't often work with large notebooks. For example, in PR jupyterlab/jupyter-collaboration#430 (comment), the author notes that even opening a notebook whose .db file has grown to 100 MB can feel sluggish.

Instead, I’d propose an adaptive cap strategy:

  1. Start with a reasonable base cap, say 25 MB.
  2. Whenever the .db file exceeds that cap:
    • Squash the history into a single update
    • Re-measure the .db size on disk (will be proportional to notebook size).
    • Set the new cap to twice that size, giving 100% headroom for further edits.

This way, smaller notebooks don’t pay the penalty of large cap limits, and large notebooks get enough space to retain history without uncontrolled growth.
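A rough sketch of this adaptive cap, assuming pycrdt's `Doc.apply_update`/`Doc.get_update` for squashing; the helper names and the way the cap is recomputed are illustrative only:

```python
import os

from pycrdt import Doc

BASE_CAP = 25 * 1024 * 1024  # 25 MB starting cap


def squash_updates(updates: list[bytes]) -> bytes:
    """Replay every stored update into a fresh Doc and return one merged update."""
    doc = Doc()
    for update in updates:
        doc.apply_update(update)
    return doc.get_update()


def next_cap(db_path: str) -> int:
    """After squashing (and e.g. VACUUM-ing) the database, set the new cap to
    twice the on-disk size, i.e. 100% headroom for further edits."""
    return max(BASE_CAP, 2 * os.path.getsize(db_path))
```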

I’m open to suggestions or alternative approaches here; keen to hear what others think!

@davidbrochart (Collaborator)

Let's try not to talk about notebooks here, since pycrdt-store is independent of Jupyter and this might confuse users.
Actually, I don't think we can trim the database based on its size alone. Because it holds the updates of multiple documents, how would we decide which document's history to trim? And the issue is not really the size of the database, but the size of each document's updates, right?

@krassowski

And the issue is not really the size of the database, but the size of each document's updates, right?

I agree, and I also think we should measure the size of each document. While we can only reliably measure the size each document would take up in RAM, that size should be proportional to the size it takes up when saved on disk in the database; I would think this is a reasonable approach.
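For illustration, one way to estimate a document's on-disk footprint is to sum the byte length of its stored updates, assuming the same `yupdates` table layout as above; the helper name is hypothetical:

```python
import aiosqlite


async def document_history_size(db: aiosqlite.Connection, path: str) -> int:
    """Approximate one document's footprint as the total size of its stored updates."""
    cursor = await db.execute(
        "SELECT COALESCE(SUM(LENGTH(yupdate)), 0) FROM yupdates WHERE path = ?",
        (path,),
    )
    (size,) = await cursor.fetchone()
    return size
```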

@Darshan808
Contributor Author

how would we decide which document history to trim?

I was thinking that when a user edits a document and the database file size has exceeded a certain threshold, we can reasonably assume that the currently edited document has also contributed to the increase. Based on that, we could prioritize squashing the history of the currently active document. While this may create a bias where initial notebooks might have greater history in the database, and later notebooks might not be able to store as much history, it helps ensure that the overall history size (db size) doesn't exceed a specified maximum.
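A sketch of that write-path trigger, assuming the same `yupdates` schema and pycrdt `Doc` API as above; all names here are illustrative, not existing pycrdt-store API:

```python
import os
import time

import aiosqlite
from pycrdt import Doc


async def maybe_squash_active(
    db: aiosqlite.Connection, db_path: str, path: str, max_db_size: int
) -> None:
    """When the whole .db file has outgrown `max_db_size`, squash the history of
    the document currently being edited (`path`)."""
    if os.path.getsize(db_path) <= max_db_size:
        return
    # Merge all of the active document's updates into one.
    cursor = await db.execute(
        "SELECT yupdate FROM yupdates WHERE path = ? ORDER BY timestamp", (path,)
    )
    doc = Doc()
    for (update,) in await cursor.fetchall():
        doc.apply_update(update)
    squashed = doc.get_update()
    # Replace the document's full history with the single squashed update.
    await db.execute("DELETE FROM yupdates WHERE path = ?", (path,))
    await db.execute(
        "INSERT INTO yupdates (path, yupdate, metadata, timestamp) VALUES (?, ?, ?, ?)",
        (path, squashed, b"", time.time()),
    )
    await db.commit()
```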

And the issue is not really the size of the database, but the size of each document's updates, right?

A large number of updates does contribute to the overall size of the database, which in turn causes slower load times and degraded performance in JupyterLab and Notebook, particularly when used with jupyter-collaboration and pycrdt-store. We’ve also received feedback from users reporting they run out of disk space more quickly. So, I believe it's important to enforce a maximum cap on the database size.

I’ve explored this issue in more depth and proposed some possible solutions, such as checkpointing and a compression algorithm for each document's updates, here. I’d really appreciate your thoughts and feedback on the approach.

Labels: enhancement (New feature or request)
Successfully merging this pull request may close: document_ttl squash logic may never be invoked for busy notebooks
3 participants