refactor(source-backend-split): cleanup — relocate, re-key, flatten by shubham3121 · Pull Request #176 · OpenMined/syft-space

shubham3121 · 2026-06-05T08:39:10Z

This pull request implements a significant refactor and cleanup of the dataset type and provisioner architecture, as well as database schema migrations to support the new model. The main changes are the migration of provisioner ownership from dataset types to vector stores, the removal of dataset-type-level provisioner registration, and a re-keying of the provisioner_states table to use vector_store_type instead of dtype. Additionally, the dataset type interfaces are streamlined, and related files are reorganized for clarity.

Database schema migrations:

Added a new vector_store_type column to the provisioner_states table, backfilled from the legacy dtype column, and created a unique index for it. This is a dual-write transition to prepare for re-keying provisioner state by vector store instead of dataset type.
Dropped the legacy dtype column from provisioner_states and made vector_store_type NOT NULL, completing the migration to keying by vector store. Legacy indexes and constraints on dtype were also removed.

Provisioner and dataset type architecture refactor:

Removed the concept of dataset-type-level provisioners. Provisioners are now owned by vector stores and referenced via BaseVectorStore.PROVISIONER_CLS; all related registration and interface code was deleted or updated.
Updated the dataset type registry to only track dataset-type bindings, not provisioners. The registry now provides clearer docstrings and error messages.

Interface and import cleanups:

Simplified interfaces.py by removing now-obsolete provisioner interfaces and moving shared types (IngestContext, IngestRequest, etc.) to dedicated shared modules, updating imports accordingly.
Renamed and reorganized dataset type files for clarity, e.g., local_file_chromadb/dataset_type.py moved to local_file_chromadb.py, and updated all relevant import paths.

Minor code and formatting improvements:

Cleaned up formatting and improved code style in analytics entities and handlers for better readability.
Fixed import paths for chunking components to reflect new vector store organization.

These changes collectively modernize the dataset/provisioner model, clarify ownership, and lay the groundwork for future extensibility.

Database migrations:

Add vector_store_type to provisioner_states, backfill from dtype, and create a unique index.
Drop dtype from provisioner_states, make vector_store_type NOT NULL, and remove related legacy indexes/constraints.

Provisioner and dataset type architecture:

Remove dataset-type-level provisioners; provisioners are now owned by vector stores.
Update registry to only track dataset-type bindings and improve error handling.

Codebase cleanup and reorganization:

Move shared types to dedicated modules and update imports throughout.
Rename and reorganize dataset type files for clarity and maintainability.

Formatting and minor fixes:

Improve formatting and code style in analytics and other modules.
Fix import paths for chunking logic.

…tribute

Move LocalChromaDBProvisioner from dataset_types/local_file_chromadb/ to vector_stores/chromadb_local/provisioner.py and rename its NAME from "local_file" (binding name) to "chromadb_local" (vector store name). Provisioning is a vector-store concern; two future bindings using the same vector store should share one running provisioner.

Add BaseVectorStoreProvisioner to vector_stores/interfaces.py and PROVISIONER_CLS: ClassVar[type[BaseVectorStoreProvisioner] | None] to BaseVectorStore. Concrete vector stores declare PROVISIONER_CLS explicitly (None on read-only stores like Weaviate). Attribute access is direct, with no getattr fallback, so a future vector store that forgets to declare it fails loudly at first use instead of silently skipping its provisioner.

DatasetTypeRegistry loses its provisioner methods and lazy registration; handlers now resolve the provisioner via a new private helper _get_provisioner_cls(dtype) that goes dataset_type → VECTOR_STORE_CLS → PROVISIONER_CLS. Eight handler call sites switched.

The provisioner_state table is untouched in this commit: DB row keying stays on dtype until the column re-key migration in the next commit.
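The ownership chain described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class bodies, the `DATASET_TYPES` dict, `ForgetfulStore`, and the public name `get_provisioner_cls` are hypothetical stand-ins, and the base class is assumed to carry only an annotation (no default) for `PROVISIONER_CLS`, which is what makes a missing declaration fail on first access.

```python
from typing import ClassVar, Optional


class BaseVectorStoreProvisioner:
    """Hypothetical stand-in for the provisioner base class."""


class LocalChromaDBProvisioner(BaseVectorStoreProvisioner):
    NAME = "chromadb_local"  # keyed by vector store name, not binding name


class BaseVectorStore:
    # Annotation only, no default (an assumption): reading the attribute on a
    # subclass that never declared it raises AttributeError.
    PROVISIONER_CLS: ClassVar[Optional[type[BaseVectorStoreProvisioner]]]


class ChromaDBLocalStore(BaseVectorStore):
    PROVISIONER_CLS = LocalChromaDBProvisioner


class WeaviateStore(BaseVectorStore):
    PROVISIONER_CLS = None  # read-only store: nothing to provision


class ForgetfulStore(BaseVectorStore):
    pass  # forgot to declare PROVISIONER_CLS


class LocalFileChromaDB:
    NAME = "local_file"
    VECTOR_STORE_CLS = ChromaDBLocalStore


class RemoteWeaviate:
    NAME = "remote_weaviate"
    VECTOR_STORE_CLS = WeaviateStore


# Hypothetical registry: binding name -> binding class.
DATASET_TYPES = {cls.NAME: cls for cls in (LocalFileChromaDB, RemoteWeaviate)}


def get_provisioner_cls(dtype: str):
    """Resolve dataset_type -> VECTOR_STORE_CLS -> PROVISIONER_CLS."""
    dataset_type = DATASET_TYPES[dtype]
    # Direct attribute access, deliberately no getattr(..., None) fallback.
    return dataset_type.VECTOR_STORE_CLS.PROVISIONER_CLS
```

With a `getattr(..., None)` fallback, `ForgetfulStore` would be indistinguishable from a store that intentionally has no provisioner; direct access keeps the two cases apart.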

…rite

Provisioner ownership belongs to the vector store, not the binding: one running chroma subprocess can be referenced by any number of bindings that compose chromadb_local. This commit adds the vector_store_type column alongside the legacy dtype column, backfills existing rows, and re-keys repository + handler call paths to use the new column. A follow-up commit drops the legacy dtype column once the dual-write transition is complete.

- Migration 764c34a2dd18 adds nullable vector_store_type, backfills chromadb_local for dtype='local_file', creates a UNIQUE INDEX (SQLite can't ALTER TABLE ADD CONSTRAINT).
- ProvisionerStateRepository methods re-keyed: get_by_vector_store_type, get_running_by_vector_store_type, delete_by_vector_store_type. upsert_status / force_status_update take vector_store_type as lookup key + dtype for legacy dual-write.
- DatasetHandler gains a _get_vector_store_type(dtype) helper; 18 call sites translate dtype to vector_store_type before repo calls. get_dataset / update_dataset / healthcheck_dataset switched to FK reads (get_by_id(dataset.provisioner_state_id)): fewer hops, no translation, survives the dtype-column drop cleanly.
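The three steps of this first migration (add nullable column, backfill, unique index) can be sketched against an in-memory SQLite database using only the standard library. This is an illustration of the technique, not the Alembic migration itself: the `id` and `status` columns and the index name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Legacy shape (hypothetical columns apart from dtype).
conn.execute(
    "CREATE TABLE provisioner_states ("
    "id INTEGER PRIMARY KEY, dtype TEXT NOT NULL, status TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO provisioner_states (dtype, status) VALUES ('local_file', 'running')"
)

# 1. Add the new column as nullable so existing rows remain valid.
conn.execute("ALTER TABLE provisioner_states ADD COLUMN vector_store_type TEXT")

# 2. Backfill from the legacy binding name.
conn.execute(
    "UPDATE provisioner_states SET vector_store_type = 'chromadb_local' "
    "WHERE dtype = 'local_file'"
)

# 3. SQLite cannot ALTER TABLE ADD CONSTRAINT, so uniqueness is enforced
#    with a unique index instead (index name is hypothetical).
conn.execute(
    "CREATE UNIQUE INDEX idx_provisioner_vector_store_type "
    "ON provisioner_states (vector_store_type)"
)
conn.commit()

rows = conn.execute(
    "SELECT dtype, vector_store_type FROM provisioner_states"
).fetchall()
```

During the dual-write phase both columns are populated, so the legacy dtype column can be dropped later without losing the key.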

Completes the re-key from 764c34a2dd18. The dual-write phase is over: the legacy dtype column + its uniqueness guards are dropped, vector_store_type is promoted to NOT NULL, and the dtype parameter disappears from the upsert path. Internal handler helpers pivot to vector_store_type, looking up PROVISIONER_CLS through VECTOR_STORE_REGISTRY directly. Public endpoints still accept dtype and translate at the boundary.

- Migration e95f227b8167 drops dtype + idx_provisioner_dtype, promotes vector_store_type to NOT NULL via batch_alter_table.
- ProvisionerState entity loses the dtype field and the two deprecated __table_args__ entries; vector_store_type is non-nullable.
- ProvisionerBusyError / InvalidProvisionerTransitionError become bare Exception subclasses: every caller forwards via str(e); the per-attribute state was unused.
- ProvisionerStateRepository.upsert_status loses the dtype parameter + dual-write line.
- DatasetHandler helpers (_get_provisioner_cls, _ensure_provisioner_running, _stop_provisioner) take vector_store_type. Startup/shutdown loops read state.vector_store_type directly; healthcheck derives provisioner_cls from the FK-loaded state.
- ProvisionerInfoResponse.dtype renamed to vector_store_type.
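On SQLite, Alembic's batch mode changes nullability by rebuilding the table (a "move and copy"). The sketch below shows that rebuild by hand with the standard library, as an approximation of what the second migration achieves rather than its actual contents. The `id` and `status` columns, the temporary table name, and expressing uniqueness as an inline `UNIQUE` constraint are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# State after the first migration: both columns present (hypothetical schema).
conn.executescript("""
CREATE TABLE provisioner_states (
    id INTEGER PRIMARY KEY,
    dtype TEXT,
    vector_store_type TEXT,
    status TEXT NOT NULL
);
CREATE UNIQUE INDEX idx_provisioner_dtype ON provisioner_states (dtype);
INSERT INTO provisioner_states (dtype, vector_store_type, status)
VALUES ('local_file', 'chromadb_local', 'running');
""")

# Rebuild: create the target shape, copy rows, swap the tables.
conn.executescript("""
BEGIN;
CREATE TABLE provisioner_states_new (
    id INTEGER PRIMARY KEY,
    vector_store_type TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL
);
INSERT INTO provisioner_states_new (id, vector_store_type, status)
SELECT id, vector_store_type, status FROM provisioner_states;
DROP TABLE provisioner_states;  -- also drops idx_provisioner_dtype
ALTER TABLE provisioner_states_new RENAME TO provisioner_states;
COMMIT;
""")

# Column names of the rebuilt table, in declaration order.
columns = [row[1] for row in conn.execute("PRAGMA table_info(provisioner_states)")]
rows = conn.execute(
    "SELECT vector_store_type, status FROM provisioner_states"
).fetchall()
```

The copy step only succeeds if every existing row already has a vector_store_type, which is why the backfill had to land in the earlier migration.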

Thin bindings hold one file each; the subdir buys nothing. Flatten ``dataset_types/{local_file_chromadb,weaviate_remote}/dataset_type.py`` and ``sources/noop/noop_source.py`` to flat modules at the parent level. Filenames now describe the binding pair: ``dataset_types/local_file_chromadb.py`` and ``dataset_types/remote_weaviate.py`` (matching the registered NAME, not the historical directory spelling).

- Registration paths in dataset_types/__init__.py and sources/registry.py updated to the flat module paths.
- The one cross-binding import (remote_weaviate.py pulling NoOpSource) updated to sources.noop_source.
- The NAME ClassVars on each binding class are unchanged; this is purely a file layout change.

The ingest and search pipeline DTOs are contracts shared by sources, vector stores and bindings; they don't belong to any one layer. Move them to neutral homes so every consumer depends on a common base:

- shared/ingest_types.py: IngestFile, IngestRequest, IngestContext
- shared/search_types.py: SearchContext, SearchParameters, SearchResult, SearchedDocument

Two files instead of one so the closures stay independent: ingest shape changes don't drive search shape changes, and endpoint code doesn't have to pull in ingest types just to do a query.

With the DTOs out of dataset_types/interfaces.py and vector_stores/interfaces.py, the runtime cycle between those modules disappears. The TYPE_CHECKING block and `from __future__ import annotations` are no longer needed and have been removed; all references are plain runtime imports now.

Chunking is invoked exclusively by ingestable vector stores during ingest — sources don't see chunks, bindings don't see chunks. Move the module next to its consumers so the dependency arrow points inward (concrete vector store → shared vector_store utility) instead of outward into dataset_types/. The chunker docstring is also updated: it parses an IngestFile into text chunks plus per-page images and hands off to whichever IngestableVectorStore invoked it. Storage and retrieval stay with the vector store.

shubham3121 added 4 commits June 4, 2026 19:04

shubham3121 changed the title ~~refactor(source-backend-split): cleanup — relocate, re-key, flatten~~ [WIP] refactor(source-backend-split): cleanup — relocate, re-key, flatten Jun 5, 2026

shubham3121 added 2 commits June 5, 2026 16:25

shubham3121 changed the title ~~[WIP] refactor(source-backend-split): cleanup — relocate, re-key, flatten~~ refactor(source-backend-split): cleanup — relocate, re-key, flatten Jun 5, 2026

shubham3121 merged commit 30dfe09 into main Jun 5, 2026
3 checks passed

shubham3121 deleted the feat/cleanup-source-backend-split branch June 5, 2026 11:27

shubham3121 merged 6 commits into main from feat/cleanup-source-backend-split
