
Feature: introduce incremental snapshotting #756

Open
desertfury wants to merge 8 commits into unum-cloud:main from desertfury:feat/global_rebuild
Contributor

@desertfury desertfury commented May 22, 2026

Motivation

Persisting a large HNSW index with save is a stop-the-world operation:
the index serves no reads or writes for the entire flush. This PR adds an
interruptible, resumable serialization path and a global_rebuild_gt
adapter that periodically reconstructs and persists the index without
that pause - and without doubling memory.

Resumable serialization

index.hpp / index_dense.hpp gain a save_to_stream(output, state&, budget)
overload alongside the existing blocking one:

  • index_serialized_state_t / index_dense_serialized_state_t are
    continuation cursors (stage + position + frozen counts).
  • Each call writes at most budget nodes/vectors, then returns; pass the
    same cursor back to resume. The byte layout is identical to the blocking
    save_to_stream, so the file loads with the regular load.
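The cursor mechanics described above can be sketched with a toy index. This is a minimal illustration, not the PR's code: `toy_index_t`, `save_cursor_t`, `save_blocking`, and `save_step` are hypothetical names, and the "file format" is just a count followed by raw node values. It shows the two properties the description claims: each call writes at most `budget` nodes, and the stepped output is byte-identical to the blocking one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the USearch types.
enum class save_stage_t { header, nodes, done };

struct save_cursor_t {
    save_stage_t stage = save_stage_t::header;
    std::size_t position = 0;     // index of the next node to write
    std::size_t frozen_count = 0; // node count captured when the save began
};

struct toy_index_t {
    std::vector<std::uint32_t> nodes;
};

// Blocking reference: writes the header and every node in one call.
inline void save_blocking(toy_index_t const& index, std::ostream& out) {
    std::uint64_t count = index.nodes.size();
    out.write(reinterpret_cast<char const*>(&count), sizeof(count));
    for (std::uint32_t node : index.nodes)
        out.write(reinterpret_cast<char const*>(&node), sizeof(node));
}

// Resumable variant: writes at most `budget` nodes per call and returns
// true once the whole index has been written. A zero budget makes no progress.
inline bool save_step(toy_index_t const& index, std::ostream& out,
                      save_cursor_t& cursor, std::size_t budget) {
    if (cursor.stage == save_stage_t::header) {
        // Freeze the count so nodes appended later are not part of this save.
        cursor.frozen_count = index.nodes.size();
        std::uint64_t count = cursor.frozen_count;
        out.write(reinterpret_cast<char const*>(&count), sizeof(count));
        cursor.stage = save_stage_t::nodes;
    }
    if (cursor.stage == save_stage_t::nodes) {
        std::size_t end = std::min(cursor.position + budget, cursor.frozen_count);
        for (; cursor.position < end; ++cursor.position) {
            std::uint32_t node = index.nodes[cursor.position];
            out.write(reinterpret_cast<char const*>(&node), sizeof(node));
        }
        if (cursor.position == cursor.frozen_count)
            cursor.stage = save_stage_t::done;
    }
    return cursor.stage == save_stage_t::done;
}
```

A caller would loop on `save_step` with the same cursor, doing other work between calls. The real overload presumably has more stages (levels, vectors, metadata); the sketch keeps only header and nodes.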

global_rebuild.hpp - the orchestrator

global_rebuild_gt wraps a live "primary" index and drives a rebuild in
small budgeted steps, so the index keeps serving traffic throughout:

  1. migrate - re-insert the primary's key-set (as of begin()) into a
    fresh "shadow" peer, reconstructing the HNSW graph from scratch;
  2. save - stream the now-frozen shadow to disk via the resumable
    save_to_stream;
  3. done - close the file, release the shadow, replay deferred removals.

Concurrent-mutation routing: add always goes to the primary; remove
during a rebuild is tombstoned and replayed at completion, so the on-disk
snapshot equals the begin() generation exactly.
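The phases and the mutation routing above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the adapter itself: `toy_rebuild_t` and all of its members are hypothetical, keys stand in for graph nodes, a `std::set` stands in for the index, and an in-memory vector stands in for the output file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

enum class phase_t { idle, migrate, save, done };

// Hypothetical toy model of the orchestrator; not the global_rebuild_gt API.
class toy_rebuild_t {
    std::set<std::uint64_t> primary_;       // live index, always serves adds
    std::vector<std::uint64_t> pending_;    // key-set frozen at begin()
    std::set<std::uint64_t> shadow_;        // peer rebuilt from scratch
    std::vector<std::uint64_t> snapshot_;   // stands in for the on-disk file
    std::vector<std::uint64_t> tombstones_; // removals deferred during a rebuild
    std::set<std::uint64_t>::const_iterator save_pos_;
    std::size_t cursor_ = 0;
    phase_t phase_ = phase_t::idle;

    bool rebuilding() const { return phase_ == phase_t::migrate || phase_ == phase_t::save; }

  public:
    void add(std::uint64_t key) { primary_.insert(key); } // always the primary

    void remove(std::uint64_t key) {
        if (rebuilding()) tombstones_.push_back(key); // replayed at completion
        else primary_.erase(key);
    }

    void begin() {
        pending_.assign(primary_.begin(), primary_.end());
        shadow_.clear();
        snapshot_.clear();
        cursor_ = 0;
        phase_ = phase_t::migrate;
    }

    // Performs at most `budget` units of work; returns true once done.
    bool step(std::size_t budget) {
        if (phase_ == phase_t::migrate) {
            std::size_t end = std::min(cursor_ + budget, pending_.size());
            for (; cursor_ < end; ++cursor_) shadow_.insert(pending_[cursor_]);
            if (cursor_ == pending_.size()) {
                phase_ = phase_t::save; // the shadow is frozen from here on
                save_pos_ = shadow_.begin();
            }
        } else if (phase_ == phase_t::save) {
            for (std::size_t i = 0; i < budget && save_pos_ != shadow_.end(); ++i, ++save_pos_)
                snapshot_.push_back(*save_pos_);
            if (save_pos_ == shadow_.end()) {
                shadow_.clear(); // release the shadow
                pending_.clear();
                for (std::uint64_t key : tombstones_) primary_.erase(key);
                tombstones_.clear();
                phase_ = phase_t::done;
            }
        }
        return phase_ == phase_t::done;
    }

    std::set<std::uint64_t> const& primary() const { return primary_; }
    std::vector<std::uint64_t> const& snapshot() const { return snapshot_; }
};
```

In this model a key added after `begin()` lands in the primary but not in the snapshot, and a key removed during the rebuild still appears in the snapshot and only leaves the primary at completion. How the real adapter treats reads of tombstoned keys while a rebuild is in flight is not stated in the description, so the sketch leaves that out.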

The idea behind the adapter is described in Chapter 5 of the book The Design of Dynamic Data Structures (https://link.springer.com/book/10.1007/BFb0014927).

Memory: no doubling

The shadow needs its own graph, but not a second copy of the vectors.
It is built with add(..., copy_vector = false), so every shadow node
references the primary's stored vector bytes (via the new
index_dense_gt::vector_data accessor) instead of duplicating them. Peak
rebuild overhead drops from vectors + graph (~2x) to one graph, which is
released at completion.
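The ownership split behind "no doubling" can be illustrated with a toy vector store. This is a sketch under assumptions: `toy_vector_store_t` and its members are hypothetical, and only the `copy_vector` flag, the `vector_data` accessor name, and the `vectors_allocated` counter echo names from this description. A store either copies an incoming vector into its own buffer or keeps a non-owning pointer to bytes that live elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy store for illustration; not the index_dense_gt implementation.
class toy_vector_store_t {
    std::vector<std::unique_ptr<float[]>> owned_; // buffers this store allocated
    std::vector<float const*> views_;             // one pointer per slot, owned or borrowed
    std::size_t dimensions_;

  public:
    explicit toy_vector_store_t(std::size_t dimensions) : dimensions_(dimensions) {}

    // Returns the slot of the new entry. With copy_vector == false the caller
    // must keep `vector` alive and unchanged for as long as this store uses it.
    std::size_t add(float const* vector, bool copy_vector = true) {
        if (copy_vector) {
            auto buffer = std::make_unique<float[]>(dimensions_);
            std::copy_n(vector, dimensions_, buffer.get());
            views_.push_back(buffer.get());
            owned_.push_back(std::move(buffer));
        } else {
            views_.push_back(vector); // borrow the caller's bytes, allocate nothing
        }
        return views_.size() - 1;
    }

    float const* vector_data(std::size_t slot) const { return views_[slot]; }
    std::size_t vectors_allocated() const { return owned_.size(); }
    std::size_t size() const { return views_.size(); }
};
```

A shadow built by passing `primary.vector_data(slot)` with `copy_vector = false` reports zero allocated vectors, which mirrors the assertion listed under Testing. The trade-off is a lifetime dependency: the borrowed bytes must stay valid until the shadow is released.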

Testing

  • test_global_rebuild - drives a non-blocking rebuild on a clustered
    12k x 256d dataset with interleaved add/remove, reloads the persisted
    file, and asserts: recall is preserved (1.000 -> 1.000), the shadow
    duplicates zero vectors (memory_stats().vectors_allocated == 0), and
    the process RSS sampled across the rebuild grows ~10% - never doubling.

@desertfury
Contributor Author

I also ran global_rebuild_gt through the insert-heavy tests to confirm the adapter doesn't fall over under a variety of workloads, and wired the adapter into test_collection across the whole matrix (scalars, connectivity, dimensions, collection sizes). I didn't commit that last change.

@desertfury desertfury marked this pull request as draft May 22, 2026 21:16
@desertfury desertfury marked this pull request as ready for review May 26, 2026 22:11