Skip to content

[FEATURE] Version-driven ingestion — push targets a version, not live chunks #282

@pjmalandrino

Description

@pjmalandrino

Context

Today push targets the live chunks of a document, not a version. The relationship between History (the version timeline) and the Ingest tab (the push timeline) is implicit — they share a chunkset_hash but nothing else.

Timeline 1: History       v1 ──── v2 ──── v3 ──── (live edits)
                          │       │       │
                          frozen  frozen  frozen

Timeline 2: Pushes        [hash=A] ─── [hash=B] ─── [hash=C]
                          push#1       push#2      push#3

Symptoms of the gap:

  • The Stale badge on the Ingest tab does not say which version drifted from what. The user has to deduce mentally.
  • chunk_pushes.chunk_ids is a list of IDs; a hard-deleted chunk leaves the audit row pointing at a ghost. The frozen chunks_snapshot in document_versions does not suffer from that, but pushes do not reference it today.
  • "Which version of doc X is in store Y right now?" requires joining hashes, not entities. Painful in queries, harder in conversations.
  • When prod RAG starts hitting the store, the question "which version of this doc is the RAG using?" has no clean answer.

The mental model we want is the git release pattern: you do not deploy a moving target, you tag, then deploy the tag. Push = deploy. The version is the tag.

Goal

Make the relationship explicit at the data + UX level:

  • The unit of ingestion is a document_version, not the live chunks.
  • Every chunk_pushes row references a document_versions.id.
  • The History drawer surfaces per-store ingestion state per version.
  • The Ingest tab states "v3 is in Neo4j Local" rather than "hash A93B…".
  • Diff endpoint becomes version A vs version B with dates and summaries.

Scope

In scope

  • Backend
    • domain/models.pyChunkPush gains document_version_id: str (mandatory after migration window).
    • domain/models.pyDocumentStoreLink switches its source-of-truth from chunkset_hash to document_version_id (or carries both during the transition).
    • persistence/database.pychunk_pushes.document_version_id column + FK to document_versions(id) ON DELETE SET NULL (audit survives version purge).
    • services/chunk_service.py::push_to_store — resolution policy: if no fresh version exists, auto-create a CHUNKS version snapshot before pushing (analogous to git tag before git push). Configurable: implicit (default) vs explicit (require user to "+ Generate chunks" first).
    • services/store_service.py — diff endpoint takes two version_ids instead of (doc, store) → (doc, store).
  • Frontend
    • History drawer — each version row carries per-store badges: [Neo4j Local ✓ · OpenSearch ✗].
    • Ingest tab — replaces hash-based language with version language: "v3 is in Neo4j Local since 14:32".
    • Restore-from-version — show "this version is not ingested anywhere yet" affordance.

Out of scope (follow-ups)

  • Push of a version other than the current one (i.e. push v2 while live is v4). Useful for rollback but adds UX complexity around what "live" means after restore. Track as a separate ticket.
  • Multi-store dispatch ([FEAT] Workspace Ingest view + Index→Ingest vocabulary cleanup #225 / 0.9.0).
  • Cross-doc version groups (releasing a corpus, not a single doc).

Dependent / blocked issues

These all assume the live chunks model and need either re-spec or to wait on this:

Recommended sequencing: this issue → re-spec the three above → implement them.

ADR required

The semantic change — "push targets a version, not live chunks" — is load-bearing enough to warrant an ADR under docs/architecture/adrs/. Topics to cover:

  • Why not keep the live-chunks model and just expose the hash better in the UI? (Answer: the data-integrity argument around hard-deleted chunks + the audit story).
  • Implicit version-on-push vs explicit version-required-before-push? (Trade-off: UX friction vs version-table growth.)
  • Migration story for the existing schema (the design doc on [FEATURE] Move store connection (URI/user/password) into the Store entity #279 already lands document_version_id reservation? No — out of scope for 0.6.1.).

Acceptance criteria

  • chunk_pushes rows carry a document_version_id. Repository + API tests assert it.
  • Push without a recent version auto-creates a CHUNKS version snapshot before the ingest call. Test asserts the version appears in History.
  • document_store_links state semantics rewritten in terms of versions:
    • Ingested = the latest version of the doc matches the version recorded on the link.
    • Stale = the doc has a newer version than the one in the link.
    • Failed = unchanged.
  • History drawer renders per-store badges on each version row.
  • Ingest tab uses version language (v3 in Neo4j Local), not hash language.
  • Diff endpoint accepts ?from=<version_id>&to=<version_id> and returns a chunk-by-chunk diff between two frozen snapshots.
  • ADR under docs/architecture/adrs/ documents the decision (implicit vs explicit, why versions over hash, migration plan).
  • Dependent issues [FEATURE] Sync button — re-embedding and OpenSearch update #95 / [FEATURE] Visual marking of modified chunks (dirty/synced/original) #93 / [FEATURE] Sync indicator — dirty chunks counter and progress bar #96 re-specced or explicitly noted as blocked.
  • Backend tests cover: push auto-creates version, push attaches existing version, restore-then-push uses the restored version.
  • Frontend tests cover: History drawer per-store badges, Ingest tab version-string rendering.

Impact / priority

  • Severity: P0 for 0.8.0 (Production ready). The "which version is in prod?" question is foundational for any production RAG deployment.
  • Why not 0.6.1: 0.6.1 is a stabilisation milestone; this is a model change with cross-cutting impact on at least three other open issues. Slipping it in would derail the v0.6.1 tag.
  • Why not 0.7.0: there is no 0.7.0 milestone yet (UI redesign milestone is named after the maquette work, not numbered). 0.8.0 (Production ready) is the natural home: production semantics need this.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    P0Must havearchitectureHexagonal architecture, ports & adaptersbackendBackend (Python/FastAPI)enhancementNew feature or requestfrontendFrontend (Vue/TypeScript)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions