
feat(rag): generation/pointer model for crash-safe re-chunking and embedding lifecycle #438

@sqhyz55


Context

PR #202 addresses stale chunk rows on re-chunk (#199) with a short-term delete/insert + scoped locking approach (replace_chunks, per-document locks, visible embedding cleanup failures).

Review feedback (2026-05-15) notes that this pattern remains fundamentally limited:

  • Crash safety depends on insert-before-delete ordering; incomplete runs can leave duplicate or partially updated generations searchable until cleanup completes.
  • Concurrency is hard to make correct with process-local locks alone: destructive updates are keyed by (collection, doc_id, parse_hash, user scope), while ingestion locks may be keyed differently (e.g. source_path vs file_id-derived doc_id). Two concurrent runs for the same logical document can still race across chunk replacement and embedding writes.
  • Embedding cleanup failures are high-impact: retrieval reads embeddings_* directly, so stale embedding rows remain searchable even after the corresponding chunks rows have been replaced.

The durable fix is an explicit generation / active-pointer model rather than relying on delete/insert timing.

Proposed design (high level)

  1. Each re-chunk / re-embed run writes under a new immutable generation_id (or equivalent).
  2. Maintain a small active-generation pointer table keyed by document scope, e.g. (collection, doc_id, parse_hash, user scope) → active_generation_id.
  3. Retrieval only reads chunks and embeddings belonging to the active generation.
  4. Publish flow:
    • Write all chunks + embeddings for the new generation completely.
    • Atomically update the active pointer to the new generation (this is the only step requiring strict atomicity).
  5. Clean up old generations asynchronously (best-effort is acceptable once they are no longer active/searchable).
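
The flow above can be sketched with sqlite3 standing in for the real store. This is a minimal illustration under stated assumptions, not the project's implementation: the table layout, the function names (open_store, write_generation, publish, search, collect_garbage), and the collapsing of the scope tuple into a single string are all hypothetical.

```python
import sqlite3
import uuid


def open_store() -> sqlite3.Connection:
    # In-memory database; table and column names are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chunks (
            scope TEXT NOT NULL,
            generation_id TEXT NOT NULL,
            chunk_index INTEGER NOT NULL,
            text TEXT NOT NULL
        );
        CREATE TABLE active_generation (
            scope TEXT PRIMARY KEY,
            active_generation_id TEXT NOT NULL
        );
        """
    )
    return conn


def write_generation(conn: sqlite3.Connection, scope: str, texts: list[str]) -> str:
    """Write every chunk of a new generation; nothing is searchable yet."""
    generation_id = uuid.uuid4().hex
    with conn:
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?, ?)",
            [(scope, generation_id, i, t) for i, t in enumerate(texts)],
        )
    return generation_id


def publish(conn: sqlite3.Connection, scope: str, generation_id: str) -> None:
    """Move the active pointer in a single transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO active_generation VALUES (?, ?)",
            (scope, generation_id),
        )


def search(conn: sqlite3.Connection, scope: str) -> list[str]:
    """Return only rows that belong to the active generation."""
    rows = conn.execute(
        "SELECT c.text FROM chunks c JOIN active_generation a "
        "ON a.scope = c.scope AND a.active_generation_id = c.generation_id "
        "WHERE c.scope = ? ORDER BY c.chunk_index",
        (scope,),
    ).fetchall()
    return [row[0] for row in rows]


def collect_garbage(conn: sqlite3.Connection, scope: str) -> int:
    """Best-effort delete of inactive generations; returns rows removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM chunks WHERE scope = ? AND generation_id != "
            "(SELECT active_generation_id FROM active_generation WHERE scope = ?)",
            (scope, scope),
        )
    return cur.rowcount
```

In this sketch a generation that was written but never published stays invisible to search, and collect_garbage deletes nothing for a scope that has no pointer yet, because the comparison against a missing pointer evaluates to NULL.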

Benefits

| Concern | Current delete/insert model | Generation/pointer model |
| --- | --- | --- |
| Crash mid-run | Duplicates or partial state may be searchable | Incomplete generations never published |
| Concurrent re-chunk | Race between chunk replace and embedding write-back | Old runs cannot publish; pointer move is atomic |
| Embedding cleanup failure | Stale embeddings may remain searchable | Inactive generations ignored by retrieval |
| Multi-worker | Requires cross-process locks | Pointer update + read path scoped by generation |

Short-term (tracked in PR #202)

Until this issue is implemented, PR #202 uses:

  • Scoped locking keyed by actual replace scope (collection, doc_id, parse_hash, user scope) through chunk replace → embedding write
  • Cross-process lock (filelock) if ingestion runs in multiple workers
  • Visible embedding cascade-delete failures (raise / surface partial failure), not silent best-effort success
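
A rough sketch of the scoped locking described above, assuming a process-local threading.Lock per scope. The names scope_key, replace_scope_lock, and run_replace are hypothetical, and the steps argument stands in for the real chunk-replace and embedding-write calls. For multiple workers, a filelock.FileLock on a path derived from the same key could take the place of the in-process lock.

```python
import hashlib
import threading
from contextlib import contextmanager

# One lock per replace scope; the registry itself is guarded by a lock.
_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()


def scope_key(collection: str, doc_id: str, parse_hash: str, user_scope: str) -> str:
    """Derive a stable key from the replace scope, not from source_path."""
    raw = "\x1f".join([collection, doc_id, parse_hash, user_scope])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@contextmanager
def replace_scope_lock(collection: str, doc_id: str, parse_hash: str, user_scope: str):
    """Hold the lock for one scope for the duration of the with-block."""
    key = scope_key(collection, doc_id, parse_hash, user_scope)
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        yield key


def run_replace(collection, doc_id, parse_hash, user_scope, steps):
    """Run every step under one lock; any failure propagates to the caller."""
    with replace_scope_lock(collection, doc_id, parse_hash, user_scope) as key:
        for step in steps:
            step(key)  # e.g. chunk replace, then embedding write (placeholders)
```

Because run_replace does not catch exceptions, a failed embedding cleanup step surfaces to the caller instead of being reported as success, and the lock is still released when the with-block exits.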

Scope / likely touch points

  • Storage abstraction: VectorIndexStore — generation-aware write/read APIs
  • chunk_document / replace_chunks — write under new generation instead of in-place delete
  • vector_manager / embedding upsert — tag rows with generation_id
  • Retrieval (dense/sparse/hybrid) — filter by active generation from pointer table
  • Migration: backfill pointer for existing data (single implicit generation per scope)
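
The migration bullet could look roughly like the following, assuming existing rows are plain dicts without a generation_id and that a fixed placeholder id is acceptable for the implicit generation. The backfill function, the LEGACY_GENERATION_ID constant, and the row field names are hypothetical.

```python
# Fixed id for the single implicit generation of pre-migration data.
# A constant keeps the backfill deterministic if it is interrupted and re-run.
LEGACY_GENERATION_ID = "legacy"


def backfill(rows: list[dict], pointers: dict[tuple, str]) -> dict[tuple, str]:
    """Tag untagged rows with the implicit generation and point each scope at it."""
    for row in rows:
        if row.get("generation_id") is not None:
            continue  # already tagged; skipping makes the backfill idempotent
        scope = (row["collection"], row["doc_id"], row["parse_hash"], row["user_scope"])
        row["generation_id"] = LEGACY_GENERATION_ID
        # Never overwrite a pointer that a generation-aware run already published.
        pointers.setdefault(scope, LEGACY_GENERATION_ID)
    return pointers
```

Using setdefault means legacy rows become the active generation only for scopes that have no pointer yet; where a newer generation is already published, the legacy rows are tagged but stay inactive.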

Acceptance criteria (draft)

  • After a re-chunk with a different config_hash, retrieval never returns chunks/embeddings from a non-active generation
  • Crash after partial write of new generation does not change searchable results until pointer publish
  • Concurrent re-chunk for same (collection, doc_id, parse_hash) cannot resurrect stale embeddings
  • Old generations can be garbage-collected without affecting active retrieval
  • Documented migration path for existing LanceDB deployments
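
The crash and concurrency criteria could be exercised against an in-memory stand-in like the one below. It is a sketch under assumptions: GenerationStore and its methods are hypothetical, and the rule that the pointer only moves forward (so a run that started earlier cannot publish over a later one) is one possible policy, not something this issue prescribes.

```python
import itertools


class GenerationStore:
    """In-memory stand-in for a generation-aware chunk/embedding store."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.rows: dict[tuple, list[str]] = {}  # (scope, generation) -> chunks
        self.active: dict[str, int] = {}        # scope -> active generation

    def begin(self, scope: str) -> int:
        """Allocate a new generation; later runs get larger numbers."""
        generation = next(self._counter)
        self.rows[(scope, generation)] = []
        return generation

    def write(self, scope: str, generation: int, text: str) -> None:
        self.rows[(scope, generation)].append(text)

    def publish(self, scope: str, generation: int) -> bool:
        """Move the pointer forward only; a stale run gets False."""
        if generation <= self.active.get(scope, 0):
            return False
        self.active[scope] = generation
        return True

    def search(self, scope: str) -> list[str]:
        generation = self.active.get(scope)
        return list(self.rows.get((scope, generation), []))


def test_crash_before_publish_keeps_old_results() -> None:
    store = GenerationStore()
    g1 = store.begin("doc")
    store.write("doc", g1, "old")
    assert store.publish("doc", g1)
    g2 = store.begin("doc")
    store.write("doc", g2, "partial")  # simulated crash: publish never runs
    assert store.search("doc") == ["old"]


def test_stale_run_cannot_publish() -> None:
    store = GenerationStore()
    run_a = store.begin("doc")  # starts first, finishes last
    run_b = store.begin("doc")
    store.write("doc", run_a, "stale")
    store.write("doc", run_b, "fresh")
    assert store.publish("doc", run_b)
    assert not store.publish("doc", run_a)
    assert store.search("doc") == ["fresh"]
```

Both functions follow the pytest naming convention, so they can be collected as-is or called directly.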
