Context
PR #202 addresses stale chunk rows on re-chunk (#199) with a short-term delete/insert + scoped locking approach (replace_chunks, per-document locks, visible embedding cleanup failures).
Review feedback (2026-05-15) notes that this pattern remains fundamentally limited:
- Crash safety depends on insert-before-delete ordering; incomplete runs can leave duplicate or partially updated generations searchable until cleanup completes.
- Concurrency is hard to make correct with process-local locks alone: destructive updates are keyed by (collection, doc_id, parse_hash, user scope), while ingestion locks may be keyed differently (e.g. source_path vs file_id-derived doc_id). Two concurrent runs for the same logical document can still race across chunk replacement and embedding writes.
- Embedding cleanup failures are high-impact: retrieval reads embeddings_* directly; stale embedding rows remain searchable even when chunks has been replaced.
The durable fix is an explicit generation / active-pointer model rather than relying on delete/insert timing.
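The lock-key mismatch described above can be made concrete with a small sketch. It assumes a process-local registry that hands out one lock per key; the helper `lock_for` and the sample key values are hypothetical and are not taken from PR #202.

```python
import threading
from collections import defaultdict

# Hypothetical process-local registry: one lock object per distinct key.
_locks = defaultdict(threading.Lock)


def lock_for(key):
    """Return the process-local lock registered for `key`."""
    return _locks[key]


# Assumed keys: ingestion locks on source_path, while the destructive
# replace locks on the replace scope. Both describe the same document.
ingest_key = ("source_path", "/data/report.pdf")
replace_key = ("kb", "doc-123", "parse-abc", "user-1")

ingest_lock = lock_for(ingest_key)
replace_lock = lock_for(replace_key)

# Different keys yield different lock objects, so holding the ingestion
# lock does not stop another run from taking the replace lock.
with ingest_lock:
    acquired = replace_lock.acquire(blocking=False)
    if acquired:
        replace_lock.release()

same_lock = ingest_lock is replace_lock
```

Here `acquired` ends up true and `same_lock` false: the two runs never contend, which is the race the review feedback points at.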
Proposed design (high level)
- Each re-chunk / re-embed run writes under a new immutable generation_id (or equivalent).
- Maintain a small active-generation pointer table keyed by document scope, e.g. (collection, doc_id, parse_hash, user scope) → active_generation_id.
- Retrieval only reads chunks and embeddings belonging to the active generation.
- Publish flow:
  - Write all chunks + embeddings for the new generation completely.
  - Atomically update the active pointer to the new generation (this is the only step requiring strict atomicity).
  - Clean up old generations asynchronously (best-effort is acceptable once they are no longer active/searchable).
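The flow above can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the project's schema: the table layouts, the single-string `scope` column (standing in for the full (collection, doc_id, parse_hash, user scope) key), the placeholder vector strings, and the function names are all hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    scope TEXT, generation_id TEXT, chunk_index INTEGER, text TEXT
);
CREATE TABLE embeddings (
    scope TEXT, generation_id TEXT, chunk_index INTEGER, vector TEXT
);
CREATE TABLE active_generation (
    scope TEXT PRIMARY KEY, active_generation_id TEXT NOT NULL
);
""")


def write_generation(conn, scope, texts):
    """Write all chunks and embeddings under a fresh generation_id."""
    generation_id = uuid.uuid4().hex
    with conn:  # commits on success, rolls back on error
        for i, text in enumerate(texts):
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?, ?)",
                (scope, generation_id, i, text),
            )
            conn.execute(
                "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
                (scope, generation_id, i, f"vec:{text}"),  # placeholder vector
            )
    return generation_id


def publish(conn, scope, generation_id):
    """Point the scope at the new generation in a single statement."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO active_generation VALUES (?, ?)",
            (scope, generation_id),
        )


def read_active_embeddings(conn, scope):
    """Return only the embeddings that belong to the active generation."""
    rows = conn.execute(
        "SELECT e.vector FROM embeddings e "
        "JOIN active_generation a "
        "  ON a.scope = e.scope AND a.active_generation_id = e.generation_id "
        "WHERE e.scope = ? ORDER BY e.chunk_index",
        (scope,),
    ).fetchall()
    return [row[0] for row in rows]


scope = "kb/doc-123/parse-abc/user-1"
gen1 = write_generation(conn, scope, ["old a", "old b"])
publish(conn, scope, gen1)

gen2 = write_generation(conn, scope, ["new a"])
before = read_active_embeddings(conn, scope)  # gen2 is written but unpublished
publish(conn, scope, gen2)
after = read_active_embeddings(conn, scope)   # gen1 rows still exist, but are ignored
```

Until `publish` runs for `gen2`, readers keep seeing `gen1`; afterwards the `gen1` rows are merely garbage awaiting cleanup, so a failed or delayed cleanup no longer affects search results.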
Benefits
| Concern | Current delete/insert model | Generation/pointer model |
| --- | --- | --- |
| Crash mid-run | Duplicates or partial state may be searchable | Incomplete generations never published |
| Concurrent re-chunk | Race between chunk replace and embedding write-back | Old runs cannot publish; pointer move is atomic |
| Embedding cleanup failure | Stale embeddings may remain searchable | Inactive generations ignored by retrieval |
| Multi-worker | Requires cross-process locks | Pointer update + read path scoped by generation |
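The issue does not say how an old run is prevented from publishing. One possible mechanism, shown here purely as an assumption, is a compare-and-swap on the pointer: a run records the active generation when it starts and publishes only if the pointer still holds that value. The table layout and the `try_publish` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE active_generation ("
    "scope TEXT PRIMARY KEY, active_generation_id TEXT NOT NULL)"
)
conn.execute("INSERT INTO active_generation VALUES ('doc-1', 'gen-0')")
conn.commit()


def try_publish(conn, scope, expected_old, new_generation):
    """Move the pointer only if it still holds the generation this run saw."""
    with conn:
        cur = conn.execute(
            "UPDATE active_generation SET active_generation_id = ? "
            "WHERE scope = ? AND active_generation_id = ?",
            (new_generation, scope, expected_old),
        )
    # rowcount is 1 if the pointer moved, 0 if another run got there first.
    return cur.rowcount == 1


# Two runs both started while gen-0 was active.
run_a_published = try_publish(conn, "doc-1", "gen-0", "gen-a")
run_b_published = try_publish(conn, "doc-1", "gen-0", "gen-b")  # stale view

active = conn.execute(
    "SELECT active_generation_id FROM active_generation WHERE scope = 'doc-1'"
).fetchone()[0]
```

In this sketch the first publisher wins and the second is rejected without any cross-process lock; whether the rejected run should retry or discard its generation is a policy choice the issue leaves open.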
Short-term (tracked in PR #202)
Until this issue is implemented, PR #202 uses:
- Scoped locking keyed by actual replace scope (collection, doc_id, parse_hash, user scope) through chunk replace → embedding write
- Cross-process lock (filelock) if ingestion runs in multiple workers
- Visible embedding cascade-delete failures (raise / surface partial failure), not silent best-effort success
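A minimal sketch of the short-term locking, assuming the third-party filelock package and a lock file derived from the replace scope. The helper names, the lock directory, and the callback parameters are hypothetical; they are not the signatures used in PR #202.

```python
import hashlib
import os
import tempfile


def lock_path_for_scope(collection, doc_id, parse_hash, user_scope, lock_dir=None):
    """Derive a stable lock-file path from the replace scope (hypothetical helper)."""
    lock_dir = lock_dir or tempfile.gettempdir()
    # Join with a separator that is unlikely to appear in the fields,
    # then hash so the file name is short and filesystem-safe.
    key = "\x1f".join([collection, doc_id, parse_hash, user_scope])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, f"rechunk-{digest}.lock")


def replace_and_embed(scope, replace_chunks, write_embeddings, timeout=30):
    """Hold one cross-process lock across chunk replace and embedding write.

    `replace_chunks` and `write_embeddings` are caller-supplied callables
    (assumed here); any exception they raise propagates to the caller, so
    a partial failure stays visible instead of being swallowed.
    """
    from filelock import FileLock  # third-party: pip install filelock

    with FileLock(lock_path_for_scope(*scope), timeout=timeout):
        replace_chunks()
        write_embeddings()


path_a = lock_path_for_scope("kb", "doc-1", "h1", "u1", lock_dir="/tmp/locks")
path_b = lock_path_for_scope("kb", "doc-1", "h1", "u1", lock_dir="/tmp/locks")
path_c = lock_path_for_scope("kb", "doc-2", "h1", "u1", lock_dir="/tmp/locks")
```

Because the path depends only on the replace scope, every worker that touches the same scope contends on the same file (`path_a == path_b`), while unrelated documents get distinct lock files.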
Scope / likely touch points
- Storage abstraction (VectorIndexStore): generation-aware write/read APIs
- chunk_document / replace_chunks: write under new generation instead of in-place delete
- vector_manager / embedding upsert: tag rows with generation_id
- Retrieval (dense/sparse/hybrid): filter by active generation from pointer table
- Migration: backfill pointer for existing data (single implicit generation per scope)
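The migration step might look like the following sketch, which assumes existing rows carry no generation yet (a NULL generation_id column added by the schema change). The table layout, the `user_scope` column name, and the `legacy-0` label are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    collection TEXT, doc_id TEXT, parse_hash TEXT, user_scope TEXT,
    generation_id TEXT, text TEXT
);
CREATE TABLE active_generation (
    collection TEXT, doc_id TEXT, parse_hash TEXT, user_scope TEXT,
    active_generation_id TEXT NOT NULL,
    PRIMARY KEY (collection, doc_id, parse_hash, user_scope)
);
INSERT INTO chunks VALUES
    ('kb', 'doc-1', 'h1', 'u1', NULL, 'a'),
    ('kb', 'doc-1', 'h1', 'u1', NULL, 'b'),
    ('kb', 'doc-2', 'h9', 'u1', NULL, 'c');
""")

LEGACY_GENERATION = "legacy-0"  # assumed label for pre-existing data


def backfill(conn):
    """Tag legacy rows with one implicit generation and publish it per scope."""
    with conn:
        conn.execute(
            "UPDATE chunks SET generation_id = ? WHERE generation_id IS NULL",
            (LEGACY_GENERATION,),
        )
        # INSERT OR IGNORE keeps the migration idempotent and never
        # overwrites a pointer that a newer run has already published.
        conn.execute(
            "INSERT OR IGNORE INTO active_generation "
            "SELECT DISTINCT collection, doc_id, parse_hash, user_scope, ? "
            "FROM chunks WHERE generation_id = ?",
            (LEGACY_GENERATION, LEGACY_GENERATION),
        )


backfill(conn)
backfill(conn)  # a second run changes nothing

pointer_count = conn.execute("SELECT COUNT(*) FROM active_generation").fetchone()[0]
untagged = conn.execute(
    "SELECT COUNT(*) FROM chunks WHERE generation_id IS NULL"
).fetchone()[0]
```

The same backfill would need to tag the embedding rows as well; it is omitted here to keep the sketch short.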
Acceptance criteria (draft)
- [...] config_hash never returns chunks/embeddings from a non-active generation
- [...] (collection, doc_id, parse_hash) cannot resurrect stale embeddings
References
- [...] replace_chunks + locking (review: generation model as follow-up)