
feat(rag): add active generation schema and store (#438 PR1) #452

Open: sqhyz55 wants to merge 2 commits into xorbitsai:main from sqhyz55:feat/rag-active-generation-schema

Conversation

sqhyz55 (Collaborator) commented May 19, 2026

Introduce nullable generation_id columns, active_generations table, and ActiveGenerationStore for crash-safe re-chunking. Keep legacy chunk/embedding merge keys until ingestion publishes generations in a follow-up PR.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces the ActiveGenerationStore and its LanceDB implementation to track published searchable generations, updating the schemas for chunks and embeddings tables to include generation_id and config_hash while providing migration utilities for existing data. Feedback recommends optimizing performance by caching table existence checks and improving the atomicity of the publication process to reduce database round-trips.

Comment on lines +3177 to +3180
def _ensure_table(self) -> None:
    from ..LanceDB.schema_manager import ensure_active_generations_table

    ensure_active_generations_table(self._get_sync_connection())
medium

The _ensure_table method is called on every database operation (publish, get, list). Since it performs a directory listing via list_table_names to check for table existence, this can become a performance bottleneck during high-volume ingestion. Caching the result of this check after the first successful execution would significantly reduce overhead.

Suggested change
- def _ensure_table(self) -> None:
-     from ..LanceDB.schema_manager import ensure_active_generations_table
-     ensure_active_generations_table(self._get_sync_connection())
+ def _ensure_table(self) -> None:
+     if getattr(self, "_table_ensured", False):
+         return
+     from ..LanceDB.schema_manager import ensure_active_generations_table
+     ensure_active_generations_table(self._get_sync_connection())
+     self._table_ensured = True
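
To make the effect of the suggested caching concrete, here is a minimal, self-contained sketch. `FakeConnection`, `CachedStore`, and the simplified `ensure_active_generations_table` are hypothetical stand-ins written for illustration only; they are not the project's classes and do not call LanceDB. The sketch only shows that the existence check runs once, and that the flag is set after the helper returns without raising.

```python
class FakeConnection:
    """Hypothetical stand-in for a DB connection; counts table listings."""

    def __init__(self):
        self.list_calls = 0
        self.tables = set()

    def list_table_names(self):
        self.list_calls += 1
        return sorted(self.tables)

    def create_table(self, name):
        self.tables.add(name)


def ensure_active_generations_table(conn):
    # Hypothetical simplification of the schema-manager helper:
    # list the tables and create the one we need if it is missing.
    if "active_generations" not in conn.list_table_names():
        conn.create_table("active_generations")


class CachedStore:
    """Illustrative store that remembers a successful ensure call."""

    def __init__(self, conn):
        self._conn = conn
        self._table_ensured = False

    def _ensure_table(self):
        if self._table_ensured:
            return
        ensure_active_generations_table(self._conn)
        # Only mark as ensured once the helper has returned normally,
        # so a failed attempt is retried on the next call.
        self._table_ensured = True


conn = FakeConnection()
store = CachedStore(conn)
for _ in range(3):
    store._ensure_table()
```

One trade-off of a per-instance flag like this is that it will not notice a table dropped by another process after the first check; whether that matters depends on how the store is deployed.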

Comment on lines +3205 to +3242
now = datetime.now(timezone.utc).replace(tzinfo=None)
existing = self.get_active_generation(
    collection=collection,
    doc_id=doc_id,
    parse_hash=parse_hash,
    user_id=user_id,
    model_tag=model_tag,
)
created_at = existing["created_at"] if existing else now

record = {
    "collection": collection,
    "doc_id": doc_id,
    "parse_hash": parse_hash,
    "user_id": user_id,
    "model_tag": normalized_tag,
    "generation_id": generation_id,
    "config_hash": config_hash,
    "created_at": created_at,
    "updated_at": now,
    "published_at": now,
    "operator": operator or "unknown",
}

(
    table.merge_insert(
        on=[
            "collection",
            "doc_id",
            "parse_hash",
            "user_id",
            "model_tag",
        ]
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute([record])
)
medium

The current implementation performs an extra round-trip to the database to fetch the existing record just to preserve the created_at timestamp. This can be achieved more efficiently and atomically using merge_insert with a specific when_matched_update clause that excludes created_at. This avoids the overhead of a separate search operation.

        try:
            now = datetime.now(timezone.utc).replace(tzinfo=None)
            record = {
                "collection": collection,
                "doc_id": doc_id,
                "parse_hash": parse_hash,
                "user_id": user_id,
                "model_tag": normalized_tag,
                "generation_id": generation_id,
                "config_hash": config_hash,
                "created_at": now,
                "updated_at": now,
                "published_at": now,
                "operator": operator or "unknown",
            }

            (
                table.merge_insert(
                    on=[
                        "collection",
                        "doc_id",
                        "parse_hash",
                        "user_id",
                        "model_tag",
                    ]
                )
                .when_matched_update(
                    updates={
                        "generation_id": "source.generation_id",
                        "config_hash": "source.config_hash",
                        "updated_at": "source.updated_at",
                        "published_at": "source.published_at",
                        "operator": "source.operator",
                    }
                )
                .when_not_matched_insert_all()
                .execute([record])
            )
        finally:
            _safe_close_table(table)

sqhyz55 (Collaborator, Author) replied:

The LanceDB version we currently depend on, 0.30.2, exposes only these merge_insert clauses:

when_matched_update_all()
when_not_matched_insert_all()
when_not_matched_by_source_delete()

It has no when_matched_update(updates={...}).

…tsai#438 PR1)

Address PR review on LanceDBActiveGenerationStore:

- Cache _table_ensured so publish/get/list skip list_table_names after the
  first call.
- Replace the extra get_active_generation() round-trip in
  publish_active_generation with an inline filter on the already-opened
  table while keeping created_at preserved across republishes.
- Add regression tests for republish created_at semantics and ensure-table
  caching, plus apply ruff format to the generation schema test file.
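
The republish semantics described in this commit message (created_at fixed at first publish, updated_at and published_at refreshed on every publish) can be illustrated with a small in-memory sketch. This is not the PR's implementation and does not use LanceDB: the `publish` function, the `rows` dict, and the sample values are all hypothetical, chosen only to show the behaviour the regression tests are said to cover.

```python
from datetime import datetime, timezone

# Merge key taken from the merge_insert call quoted in the review.
KEY_FIELDS = ("collection", "doc_id", "parse_hash", "user_id", "model_tag")


def publish(rows, record, now=None):
    """Upsert `record` into `rows`, preserving created_at on republish.

    `rows` is a plain dict keyed by the merge key; it stands in for the
    active_generations table (an assumption made for this sketch).
    """
    now = now or datetime.now(timezone.utc).replace(tzinfo=None)
    key = tuple(record[field] for field in KEY_FIELDS)
    existing = rows.get(key)
    # Keep the original creation time if the key was published before.
    created_at = existing["created_at"] if existing else now
    rows[key] = {
        **record,
        "created_at": created_at,
        "updated_at": now,
        "published_at": now,
    }
    return rows[key]


rows = {}
base = {
    "collection": "docs",
    "doc_id": "d1",
    "parse_hash": "p1",
    "user_id": 1,
    "model_tag": "m",
}
first = publish(
    rows,
    {**base, "generation_id": "g1", "config_hash": "c1"},
    now=datetime(2026, 1, 1),
)
second = publish(
    rows,
    {**base, "generation_id": "g2", "config_hash": "c2"},
    now=datetime(2026, 1, 2),
)
```

Unlike a single-statement merge, a read followed by a write is not atomic; two concurrent first-time publishes of the same key could each observe no existing row.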