refactor(source-backend-split): cleanup — relocate, re-key, flatten#176
Merged
Conversation
…tribute Move LocalChromaDBProvisioner from dataset_types/local_file_chromadb/ to vector_stores/chromadb_local/provisioner.py and rename its NAME from "local_file" (binding name) to "chromadb_local" (vector store name). Provisioning is a vector-store concern; two future bindings using the same vector store should share one running provisioner. Add BaseVectorStoreProvisioner to vector_stores/interfaces.py and PROVISIONER_CLS: ClassVar[type[BaseVectorStoreProvisioner] | None] to BaseVectorStore. Concrete vector stores declare PROVISIONER_CLS explicitly — None on read-only stores like Weaviate. Direct attribute access (no getattr fallback) so a future vector store that forgets to declare fails loud at first use instead of silently skipping its provisioner. DatasetTypeRegistry loses its provisioner methods and lazy registration; handlers now resolve the provisioner via a new private helper _get_provisioner_cls(dtype) that goes dataset_type → VECTOR_STORE_CLS → PROVISIONER_CLS. Eight handler call sites switched. provisioner_state table is untouched in this commit — DB row keying stays on dtype until the column re-key migration in the next commit.
…rite Provisioner ownership belongs to the vector store, not the binding — one running chroma subprocess can be referenced by any number of bindings that compose chromadb_local. This commit adds the vector_store_type column alongside the legacy dtype column, backfills existing rows, and re-keys repository + handler call paths to use the new column. A follow-up commit drops the legacy dtype column once the dual-write transition is complete. - Migration 764c34a2dd18 adds nullable vector_store_type, backfills chromadb_local for dtype='local_file', creates a UNIQUE INDEX (SQLite can't ALTER TABLE ADD CONSTRAINT). - ProvisionerStateRepository methods re-keyed: get_by_vector_store_type, get_running_by_vector_store_type, delete_by_vector_store_type. upsert_status / force_status_update take vector_store_type as lookup key + dtype for legacy dual-write. - DatasetHandler gains a _get_vector_store_type(dtype) helper; 18 call sites translate dtype to vector_store_type before repo calls. get_dataset / update_dataset / healthcheck_dataset switched to FK reads (get_by_id(dataset.provisioner_state_id)) — fewer hops, no translation, survives the dtype-column drop cleanly.
Completes the re-key from 764c34a2dd18. The dual-write phase is over: the legacy dtype column + its uniqueness guards are dropped, vector_store_type is promoted to NOT NULL, and the dtype parameter disappears from the upsert path. Internal handler helpers pivot to vector_store_type, looking up PROVISIONER_CLS through VECTOR_STORE_REGISTRY directly. Public endpoints still accept dtype and translate at the boundary. - Migration e95f227b8167 drops dtype + idx_provisioner_dtype, promotes vector_store_type to NOT NULL via batch_alter_table. - ProvisionerState entity loses the dtype field and the two deprecated __table_args__ entries; vector_store_type is non-nullable. - ProvisionerBusyError / InvalidProvisionerTransitionError become bare Exception subclasses — every caller forwards via str(e); the per-attribute state was unused. - ProvisionerStateRepository.upsert_status loses the dtype parameter + dual-write line. - DatasetHandler helpers (_get_provisioner_cls, _ensure_provisioner_running, _stop_provisioner) take vector_store_type. Startup/shutdown loops read state.vector_store_type directly; healthcheck derives provisioner_cls from the FK-loaded state. - ProvisionerInfoResponse.dtype renamed to vector_store_type.
Thin bindings hold one file each; the subdir buys nothing. Flatten
``dataset_types/{local_file_chromadb,weaviate_remote}/dataset_type.py``
and ``sources/noop/noop_source.py`` to flat modules at the parent
level. Filenames now describe the binding pair:
``dataset_types/local_file_chromadb.py`` and
``dataset_types/remote_weaviate.py`` (matching the registered NAME,
not the historical directory spelling).
- Registration paths in dataset_types/__init__.py and
sources/registry.py updated to the flat module paths.
- The one cross-binding import (remote_weaviate.py pulling NoOpSource)
updated to sources.noop_source.
- The NAME ClassVars on each binding class are unchanged; this is
purely a file layout change.
The ingest and search pipeline DTOs are contracts shared by sources, vector stores and bindings — they don't belong to any one layer. Move them to neutral homes so every consumer depends on a common base: - shared/ingest_types.py: IngestFile, IngestRequest, IngestContext - shared/search_types.py: SearchContext, SearchParameters, SearchResult, SearchedDocument Two files instead of one so the closures stay independent: ingest shape changes don't drive search shape changes, and endpoint code doesn't have to pull in ingest types just to do a query. With the DTOs out of dataset_types/interfaces.py and vector_stores/interfaces.py, the runtime cycle between those modules disappears. The TYPE_CHECKING block and `from __future__ import annotations` are no longer needed and have been removed; all references are plain runtime imports now.
Chunking is invoked exclusively by ingestable vector stores during ingest — sources don't see chunks, bindings don't see chunks. Move the module next to its consumers so the dependency arrow points inward (concrete vector store → shared vector_store utility) instead of outward into dataset_types/. The chunker docstring is also updated: it parses an IngestFile into text chunks plus per-page images and hands off to whichever IngestableVectorStore invoked it. Storage and retrieval stay with the vector store.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request implements a significant refactor and cleanup of the dataset type and provisioner architecture, as well as database schema migrations to support the new model. The main changes are the migration of provisioner ownership from dataset types to vector stores, the removal of dataset-type-level provisioner registration, and a re-keying of the
provisioner_statestable to usevector_store_typeinstead ofdtype. Additionally, the dataset type interfaces are streamlined, and related files are reorganized for clarity.Database schema migrations:
vector_store_typecolumn to theprovisioner_statestable, backfilled from the legacydtypecolumn, and created a unique index for it. This is a dual-write transition to prepare for re-keying provisioner state by vector store instead of dataset type.dtypecolumn fromprovisioner_statesand madevector_store_typeNOT NULL, completing the migration to keying by vector store. Legacy indexes and constraints ondtypewere also removed.Provisioner and dataset type architecture refactor:
BaseVectorStore.PROVISIONER_CLS; all related registration and interface code was deleted or updated. [1] [2] [3] [4]Interface and import cleanups:
interfaces.pyby removing now-obsolete provisioner interfaces and moving shared types (IngestContext,IngestRequest, etc.) to dedicated shared modules, updating imports accordingly. [1] [2]local_file_chromadb/dataset_type.pymoved tolocal_file_chromadb.py, and updated all relevant import paths. [1] [2] [3]Minor code and formatting improvements:
These changes collectively modernize the dataset/provisioner model, clarify ownership, and lay the groundwork for future extensibility.
Database migrations:
vector_store_typetoprovisioner_states, backfill fromdtype, and create a unique index.dtypefromprovisioner_states, makevector_store_typeNOT NULL, and remove related legacy indexes/constraints.Provisioner and dataset type architecture:
Codebase cleanup and reorganization:
Formatting and minor fixes: