
refactor(source-backend-split): cleanup — relocate, re-key, flatten#176

Merged
shubham3121 merged 6 commits into main from feat/cleanup-source-backend-split
Jun 5, 2026
@shubham3121 shubham3121 commented Jun 5, 2026

Member

This pull request implements a significant refactor and cleanup of the dataset type and provisioner architecture, as well as database schema migrations to support the new model. The main changes are the migration of provisioner ownership from dataset types to vector stores, the removal of dataset-type-level provisioner registration, and a re-keying of the provisioner_states table to use vector_store_type instead of dtype. Additionally, the dataset type interfaces are streamlined, and related files are reorganized for clarity.

Database schema migrations:

  • Added a new vector_store_type column to the provisioner_states table, backfilled from the legacy dtype column, and created a unique index for it. This is a dual-write transition to prepare for re-keying provisioner state by vector store instead of dataset type.
  • Dropped the legacy dtype column from provisioner_states and made vector_store_type NOT NULL, completing the migration to keying by vector store. Legacy indexes and constraints on dtype were also removed.

Provisioner and dataset type architecture refactor:

  • Removed the concept of dataset-type-level provisioners. Provisioners are now owned by vector stores and referenced via BaseVectorStore.PROVISIONER_CLS; all related registration and interface code was deleted or updated.
  • Updated the dataset type registry to only track dataset-type bindings, not provisioners. The registry now provides clearer docstrings and error messages.

Interface and import cleanups:

  • Simplified interfaces.py by removing now-obsolete provisioner interfaces and moving shared types (IngestContext, IngestRequest, etc.) to dedicated shared modules, updating imports accordingly.
  • Renamed and reorganized dataset type files for clarity, e.g., local_file_chromadb/dataset_type.py moved to local_file_chromadb.py, and updated all relevant import paths.

Minor code and formatting improvements:

  • Cleaned up formatting and improved code style in analytics entities and handlers for better readability.
  • Fixed import paths for chunking components to reflect new vector store organization.

Together, these changes make the vector store the single owner of provisioning, so that bindings sharing a vector store can share one running provisioner.

Database migrations:

  • Add vector_store_type to provisioner_states, backfill from dtype, and create a unique index.
  • Drop dtype from provisioner_states, make vector_store_type NOT NULL, and remove related legacy indexes/constraints.

Provisioner and dataset type architecture:

  • Remove dataset-type-level provisioners; provisioners are now owned by vector stores.
  • Update registry to only track dataset-type bindings and improve error handling.

Codebase cleanup and reorganization:

  • Move shared types to dedicated modules and update imports throughout.
  • Rename and reorganize dataset type files for clarity and maintainability.

Formatting and minor fixes:

  • Improve formatting and code style in analytics and other modules.
  • Fix import paths for chunking logic.

…tribute

Move LocalChromaDBProvisioner from dataset_types/local_file_chromadb/ to
vector_stores/chromadb_local/provisioner.py and rename its NAME from
"local_file" (binding name) to "chromadb_local" (vector store name).
Provisioning is a vector-store concern; two future bindings using the
same vector store should share one running provisioner.

Add BaseVectorStoreProvisioner to vector_stores/interfaces.py and
PROVISIONER_CLS: ClassVar[type[BaseVectorStoreProvisioner] | None] to
BaseVectorStore. Concrete vector stores declare PROVISIONER_CLS
explicitly — None on read-only stores like Weaviate. Direct attribute
access (no getattr fallback) so a future vector store that forgets to
declare fails loud at first use instead of silently skipping its
provisioner.

DatasetTypeRegistry loses its provisioner methods and lazy
registration; handlers now resolve the provisioner via a new private
helper _get_provisioner_cls(dtype) that goes dataset_type →
VECTOR_STORE_CLS → PROVISIONER_CLS. Eight handler call sites switched.

The provisioner_states table is untouched in this commit: DB row keying
stays on dtype until the column re-key migration in the next commit.
…rite

Provisioner ownership belongs to the vector store, not the binding —
one running chroma subprocess can be referenced by any number of
bindings that compose chromadb_local. This commit adds the
vector_store_type column alongside the legacy dtype column, backfills
existing rows, and re-keys repository + handler call paths to use the
new column. A follow-up commit drops the legacy dtype column once the
dual-write transition is complete.

- Migration 764c34a2dd18 adds nullable vector_store_type, backfills
  chromadb_local for dtype='local_file', creates a UNIQUE INDEX
  (SQLite can't ALTER TABLE ADD CONSTRAINT).
- ProvisionerStateRepository methods re-keyed:
  get_by_vector_store_type, get_running_by_vector_store_type,
  delete_by_vector_store_type. upsert_status / force_status_update
  take vector_store_type as lookup key + dtype for legacy dual-write.
- DatasetHandler gains a _get_vector_store_type(dtype) helper; 18
  call sites translate dtype to vector_store_type before repo calls.
  get_dataset / update_dataset / healthcheck_dataset switched to FK
  reads (get_by_id(dataset.provisioner_state_id)) — fewer hops, no
  translation, survives the dtype-column drop cleanly.
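
The three steps of migration 764c34a2dd18 can be reproduced against an in-memory SQLite database. This is a sketch using the standard-library sqlite3 module rather than the project's actual Alembic script; the id and status columns and the index name uq_provisioner_vector_store_type are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed pre-migration shape: only the dtype key is from the PR text.
conn.executescript(
    """
    CREATE TABLE provisioner_states (
        id INTEGER PRIMARY KEY,
        dtype TEXT NOT NULL,
        status TEXT NOT NULL
    );
    INSERT INTO provisioner_states (dtype, status)
    VALUES ('local_file', 'running');
    """
)

# 1. Add the new column as nullable so existing rows stay valid.
conn.execute("ALTER TABLE provisioner_states ADD COLUMN vector_store_type TEXT")

# 2. Backfill from the legacy dtype column.
conn.execute(
    "UPDATE provisioner_states "
    "SET vector_store_type = 'chromadb_local' "
    "WHERE dtype = 'local_file'"
)

# 3. Enforce uniqueness with an index, since SQLite cannot
#    ALTER TABLE ... ADD CONSTRAINT on an existing table.
conn.execute(
    "CREATE UNIQUE INDEX uq_provisioner_vector_store_type "
    "ON provisioner_states (vector_store_type)"
)
conn.commit()

rows = conn.execute(
    "SELECT dtype, vector_store_type FROM provisioner_states"
).fetchall()
```

After these steps both columns are populated, which is what lets the repository look rows up by vector_store_type while still writing dtype during the dual-write phase.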
Completes the re-key from 764c34a2dd18. The dual-write phase is over:
the legacy dtype column + its uniqueness guards are dropped,
vector_store_type is promoted to NOT NULL, and the dtype parameter
disappears from the upsert path. Internal handler helpers pivot to
vector_store_type, looking up PROVISIONER_CLS through
VECTOR_STORE_REGISTRY directly. Public endpoints still accept dtype
and translate at the boundary.

- Migration e95f227b8167 drops dtype + idx_provisioner_dtype, promotes
  vector_store_type to NOT NULL via batch_alter_table.
- ProvisionerState entity loses the dtype field and the two deprecated
  __table_args__ entries; vector_store_type is non-nullable.
- ProvisionerBusyError / InvalidProvisionerTransitionError become bare
  Exception subclasses — every caller forwards via str(e); the
  per-attribute state was unused.
- ProvisionerStateRepository.upsert_status loses the dtype parameter
  + dual-write line.
- DatasetHandler helpers (_get_provisioner_cls,
  _ensure_provisioner_running, _stop_provisioner) take
  vector_store_type. Startup/shutdown loops read state.vector_store_type
  directly; healthcheck derives provisioner_cls from the FK-loaded state.
- ProvisionerInfoResponse.dtype renamed to vector_store_type.
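
Migration e95f227b8167 goes through Alembic's batch_alter_table, which on SQLite rebuilds the table by copying rows into a new one. The sketch below spells out that rebuild in raw SQL with the standard-library sqlite3 module; it is not the project's migration script. The id and status columns, the temporary table name, the new index name, and whether idx_provisioner_dtype was unique are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed state after the dual-write migration.
conn.executescript(
    """
    CREATE TABLE provisioner_states (
        id INTEGER PRIMARY KEY,
        dtype TEXT,
        vector_store_type TEXT,
        status TEXT NOT NULL
    );
    CREATE UNIQUE INDEX idx_provisioner_dtype ON provisioner_states (dtype);
    INSERT INTO provisioner_states (dtype, vector_store_type, status)
    VALUES ('local_file', 'chromadb_local', 'running');
    """
)

# Rebuild: new table without dtype and with NOT NULL on vector_store_type,
# copy the rows, drop the old table (and its indexes), then rename.
conn.executescript(
    """
    BEGIN;
    CREATE TABLE _new_provisioner_states (
        id INTEGER PRIMARY KEY,
        vector_store_type TEXT NOT NULL,
        status TEXT NOT NULL
    );
    INSERT INTO _new_provisioner_states (id, vector_store_type, status)
    SELECT id, vector_store_type, status FROM provisioner_states;
    DROP TABLE provisioner_states;
    ALTER TABLE _new_provisioner_states RENAME TO provisioner_states;
    CREATE UNIQUE INDEX uq_provisioner_vector_store_type
        ON provisioner_states (vector_store_type);
    COMMIT;
    """
)

# Column names of the rebuilt table, in declaration order.
columns = [row[1] for row in conn.execute("PRAGMA table_info(provisioner_states)")]
```

The copy step only succeeds if every existing row already has a vector_store_type, which is why the backfill had to land in the earlier migration.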
Thin bindings hold one file each; the subdir buys nothing. Flatten
``dataset_types/{local_file_chromadb,weaviate_remote}/dataset_type.py``
and ``sources/noop/noop_source.py`` to flat modules at the parent
level. Filenames now describe the binding pair:
``dataset_types/local_file_chromadb.py`` and
``dataset_types/remote_weaviate.py`` (matching the registered NAME,
not the historical directory spelling).

- Registration paths in dataset_types/__init__.py and
  sources/registry.py updated to the flat module paths.
- The one cross-binding import (remote_weaviate.py pulling NoOpSource)
  updated to sources.noop_source.
- The NAME ClassVars on each binding class are unchanged; this is
  purely a file layout change.

@shubham3121 shubham3121 changed the title refactor(source-backend-split): cleanup — relocate, re-key, flatten [WIP] refactor(source-backend-split): cleanup — relocate, re-key, flatten Jun 5, 2026

The ingest and search pipeline DTOs are contracts shared by sources,
vector stores and bindings — they don't belong to any one layer. Move
them to neutral homes so every consumer depends on a common base:

- shared/ingest_types.py: IngestFile, IngestRequest, IngestContext
- shared/search_types.py: SearchContext, SearchParameters,
  SearchResult, SearchedDocument

Two files instead of one so the closures stay independent: ingest
shape changes don't drive search shape changes, and endpoint code
doesn't have to pull in ingest types just to do a query.

With the DTOs out of dataset_types/interfaces.py and
vector_stores/interfaces.py, the runtime cycle between those modules
disappears. The TYPE_CHECKING block and `from __future__ import
annotations` are no longer needed and have been removed; all
references are plain runtime imports now.

Chunking is invoked exclusively by ingestable vector stores during
ingest — sources don't see chunks, bindings don't see chunks. Move
the module next to its consumers so the dependency arrow points
inward (concrete vector store → shared vector_store utility) instead
of outward into dataset_types/.

The chunker docstring is also updated: it parses an IngestFile into
text chunks plus per-page images and hands off to whichever
IngestableVectorStore invoked it. Storage and retrieval stay with
the vector store.

@shubham3121 shubham3121 changed the title [WIP] refactor(source-backend-split): cleanup — relocate, re-key, flatten refactor(source-backend-split): cleanup — relocate, re-key, flatten Jun 5, 2026
@shubham3121 shubham3121 merged commit 30dfe09 into main Jun 5, 2026
3 checks passed
@shubham3121 shubham3121 deleted the feat/cleanup-source-backend-split branch June 5, 2026 11:27