refactor(ingestion): drop legacy file_* columns #177

Merged: shubham3121 merged 4 commits into main from feat/drop-legacy-ingestion-columns on Jun 9, 2026
@shubham3121

This pull request completes the migration of the ingestion pipeline to source-agnostic identifiers, removing legacy file-specific columns and logic from both the backend and the frontend. The ingestion_jobs table and related code now rely solely on external_id and fingerprint: both are required, and uniqueness is enforced on (dataset_id, external_id). This simplifies the data model and the APIs. The frontend is updated to match and no longer references file-specific fields.

Database schema and data model migration:

  • Dropped legacy file-specific columns (file_path, file_name, file_size, file_mtime_ns) and the associated uniqueness constraint from the ingestion_jobs table. The new uniqueness constraint is on (dataset_id, external_id), and both external_id and fingerprint are now NOT NULL. The migration includes upgrade and downgrade logic.
  • Updated the IngestionJob SQLModel entity to remove deprecated file-specific fields and enforce non-nullable external_id and fingerprint. Table constraints and indexes are updated to match the new schema.
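
As a rough illustration of the resulting invariant (not the migration shipped in this PR), the sketch below builds a reduced ingestion_jobs table in an in-memory SQLite database and upserts against the (dataset_id, external_id) unique index. The column set, the index name, and the upsert_by_external_id signature are assumptions made for the example; the real table has more columns, and the real repository method may differ. It also assumes SQLite 3.24 or newer for ON CONFLICT ... DO UPDATE.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a reduced, hypothetical post-migration ingestion_jobs table."""
    conn.execute(
        """
        CREATE TABLE ingestion_jobs (
            id INTEGER PRIMARY KEY,
            dataset_id INTEGER NOT NULL,
            external_id TEXT NOT NULL,   -- source-agnostic identifier
            fingerprint TEXT NOT NULL    -- change-detection token
        )
        """
    )
    # The unique index replaces the old inline (dataset_id, file_path) UNIQUE.
    # The index name is made up for this sketch.
    conn.execute(
        """
        CREATE UNIQUE INDEX uq_ingestion_jobs_dataset_external
        ON ingestion_jobs (dataset_id, external_id)
        """
    )


def upsert_by_external_id(
    conn: sqlite3.Connection,
    dataset_id: int,
    external_id: str,
    fingerprint: str,
) -> None:
    """Insert a job, or refresh its fingerprint if the pair already exists."""
    conn.execute(
        """
        INSERT INTO ingestion_jobs (dataset_id, external_id, fingerprint)
        VALUES (?, ?, ?)
        ON CONFLICT (dataset_id, external_id)
        DO UPDATE SET fingerprint = excluded.fingerprint
        """,
        (dataset_id, external_id, fingerprint),
    )
```

Calling upsert_by_external_id twice with the same dataset_id and external_id leaves a single row carrying the latest fingerprint, while the same external_id under a different dataset_id gets its own row. Passing None for fingerprint raises sqlite3.IntegrityError because of the NOT NULL constraint.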

Backend logic and API updates:

  • Refactored ingestion manager and repository logic to remove all dual-write and legacy file field handling, including upsert logic, event handling, and job processing. All operations now use only external_id and fingerprint.
  • Updated ingestion job response schemas to remove file-specific fields and require external_id and fingerprint.
  • Enhanced NoOpSource to include an IS_NOOP flag, allowing the ingestion manager to skip unnecessary tasks for externally-fed sources.
  • Improved dataset source detection logic to skip NoOpSource instances, reducing unnecessary bookkeeping.
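
The IS_NOOP check described above might look roughly like the following. This is a minimal sketch under stated assumptions: the Source base class, FolderSource, the free function has_source, and datasets_needing_tasks are hypothetical names for illustration. Only NoOpSource, IS_NOOP, and the idea of a has-source check on the ingestion manager come from this PR.

```python
from typing import Dict, List, Optional


class Source:
    """Hypothetical base class; real sources are assumed to do actual work."""

    IS_NOOP = False


class NoOpSource(Source):
    """Placeholder for externally-fed bindings; nothing to poll or watch."""

    IS_NOOP = True


class FolderSource(Source):
    """Hypothetical source that would need a per-dataset task."""

    def __init__(self, path: str) -> None:
        self.path = path


def has_source(source: Optional[Source]) -> bool:
    """Return True only for sources that need ingestion bookkeeping."""
    # getattr with a default keeps sources that lack the flag working.
    return source is not None and not getattr(source, "IS_NOOP", False)


def datasets_needing_tasks(bindings: Dict[str, Optional[Source]]) -> List[str]:
    """List the dataset names for which a per-dataset task should be spawned."""
    return [name for name, source in bindings.items() if has_source(source)]
```

A class-level flag keeps the check cheap and avoids isinstance tests against a concrete class, so any future externally-fed source can opt out the same way.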

Frontend updates:

  • Updated TypeScript API types and all frontend components to remove file-specific fields from ingestion jobs and use external_id and fingerprint instead. Display logic now infers the filename from external_id.
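
The filename inference can be sketched as taking the basename of external_id. The actual frontend code is TypeScript and is not shown in this description; the Python below only illustrates the logic, and the function name display_name and the fallback for identifiers without a path separator are assumptions.

```python
import posixpath


def display_name(external_id: str) -> str:
    """Derive a human-readable label from a source-agnostic identifier."""
    # Drop trailing slashes so "dir/sub/" yields "sub" rather than "".
    trimmed = external_id.rstrip("/")
    name = posixpath.basename(trimmed)
    # Fall back to the raw identifier when no basename can be extracted.
    return name or external_id
```

For a path-like identifier such as "/data/docs/report.pdf" this yields "report.pdf", while an opaque identifier with no separator is returned unchanged.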

…sources

Completes the source-agnostic ingestion_jobs schema and removes the
last of the file-shaped scaffolding:

- Migration drops file_path / file_name / file_size / file_mtime_ns
  and the inline (dataset_id, file_path) UNIQUE; external_id and
  fingerprint become NOT NULL, with a UNIQUE INDEX on (dataset_id,
  external_id) taking over as the uniqueness invariant.
- IngestionJob entity, repository upsert and IngestionJobResponse
  schema lose every reference to the file_* fields. The manager's
  _derive_legacy_file_fields helper and the dual-write threading
  through upsert_by_external_id are gone.
- NoOpSource declares IS_NOOP = True; IngestionManager._has_source
  reads it and skips spawning per-dataset tasks for externally-fed
  bindings (remote Weaviate today).
- Frontend: IngestionJobResponse TS interface mirrors the new shape;
  DatasetDetailPage renders the basename of external_id and drops the
  now-orphan formatFileSize helper; EndpointDetailPage filters by
  external_id.
@shubham3121 force-pushed the feat/drop-legacy-ingestion-columns branch 3 times, most recently from d0c9bd0 to cd3b4ae, on June 8, 2026 13:25
@shubham3121 merged commit a2b2dc3 into main on Jun 9, 2026
3 checks passed
@shubham3121 deleted the feat/drop-legacy-ingestion-columns branch on June 9, 2026 09:39