feat!: expose deduplication run status and results via admin API #381 #408

Draft
Fedir-Yatsenko wants to merge 6 commits into development from feat/381-deduplication-job-status


@Fedir-Yatsenko
Collaborator

@Fedir-Yatsenko Fedir-Yatsenko commented May 26, 2026

Applicable issues

Description of changes

POST /admin/api/v1/channels/{id}/datasets/deduplicate was fire-and-forget: it returned 202 Accepted with no way for the caller to learn whether the run finished, succeeded, or failed. The per-dimension counts of remapped rows and deleted orphan documents only appeared in application logs, so admins had to inspect logs to confirm a result. The CLI compounded the issue by printing "Deduplication completed" as soon as the 202 landed, regardless of whether the background task had actually finished.

This PR introduces a DeduplicationJob record (mirroring the existing AutoUpdateJob pattern from #383) so deduplication is fully observable: callers can poll job status and retrieve a structured per-dimension summary, and the CLI now waits for real completion and displays the counts.

API changes

  • POST /channels/{channel_id}/datasets/deduplicate ⚠️ breaking: the response shape changed from Channel to DeduplicationJob. The status code remains 202 Accepted. The job is created with status=QUEUED, returned immediately, and processed in the background.
  • GET /channels/{channel_id}/datasets/deduplication-jobs — new, paginated list of dedup jobs for a channel.
  • GET /channels/deduplication-jobs/{job_id} — new, single-job lookup for polling status and reading counts.

DeduplicationJob exposes: id, channel_id, status (QUEUED → IN_PROGRESS → COMPLETED/FAILED), reason_for_failure, and four per-dimension count columns:

  • non_indicator_remapped, non_indicator_deleted
  • special_remapped, special_deleted
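
For reviewers, the response shape listed above can be pictured roughly as follows. This is a minimal sketch, not the actual model: the `JobStatus` enum, the `DeduplicationJobSketch` dataclass, the integer ID types, and the nullable counts are assumptions made for illustration. Only the field names and status values come from this description.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    """Hypothetical stand-in for the status enum used by the real record."""

    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class DeduplicationJobSketch:
    """Illustrative shape only; the real DeduplicationJob is a persisted record."""

    id: int  # assumption: the real ID type may differ
    channel_id: int  # assumption: the real ID type may differ
    status: JobStatus = JobStatus.QUEUED
    reason_for_failure: Optional[str] = None
    # Assumption: counts stay empty until the run completes.
    non_indicator_remapped: Optional[int] = None
    non_indicator_deleted: Optional[int] = None
    special_remapped: Optional[int] = None
    special_deleted: Optional[int] = None

    def is_terminal(self) -> bool:
        """A job stops changing once it is COMPLETED or FAILED."""
        return self.status in (JobStatus.COMPLETED, JobStatus.FAILED)


# A freshly created job, as the POST endpoint would return it.
job = DeduplicationJobSketch(id=1, channel_id=42)
payload = asdict(job)
```

A caller would treat `is_terminal()` as the signal to stop polling and then read either the four counts or `reason_for_failure`.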

CLI

statgpt deduplicate -c <channel> now polls the new endpoint via the existing spinner pattern (mirroring import_handler):

  • Spinner shows live Deduplication status: … updates.
  • On COMPLETED, prints a Rich table with the four per-dimension counts.
  • On FAILED, prints reason_for_failure.

AdminClient gains get_deduplication_job(job_id), and deduplicate_channel(channel_id) now returns a DeduplicationJob for polling.
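
The polling behaviour described above could look roughly like the sketch below. It is not the CLI code from this PR: `wait_for_job`, the `fetch_job` callable, the polling interval, and the timeout are all hypothetical, and jobs are modelled as plain dicts. In the real CLI the fetch step would presumably go through `AdminClient.get_deduplication_job(job_id)`.

```python
import time
from typing import Callable, Optional

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    on_status: Optional[Callable[[str], None]] = None,
) -> dict:
    """Poll fetch_job() until the job reaches a terminal status.

    The timeout is an assumption added for this sketch; the PR does not
    say whether the CLI enforces one.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if on_status is not None:
            on_status(job["status"])  # e.g. update the spinner text
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("deduplication job did not finish in time")
        time.sleep(interval)


# Simulated server responses standing in for successive GET calls.
_responses = iter(
    [
        {"status": "QUEUED"},
        {"status": "IN_PROGRESS"},
        {"status": "COMPLETED", "non_indicator_remapped": 3, "special_deleted": 1},
    ]
)
seen = []
result = wait_for_job(lambda: next(_responses), interval=0.0, on_status=seen.append)
```

Passing the fetch step as a callable keeps the loop independent of the HTTP client, which makes it easy to exercise with canned responses as shown.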

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

Replace the fire-and-forget deduplicate endpoint with a tracked
DeduplicationJob, mirroring the AutoUpdateJob pattern. POST
/datasets/deduplicate now returns the created job; new GET endpoints
expose list and by-id lookup so callers can poll until completion. Each
job records per-dimension counts of remapped rows and deleted orphan
documents on success, or reason_for_failure on error. The CLI polls the
job and prints a results table when the run finishes. The shared
_set_failed_status SQL moved into statgpt/admin/services/status_recovery
so the new channel-service recovery hook does not cycle imports through
the dataset service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Fedir-Yatsenko Fedir-Yatsenko self-assigned this May 26, 2026
@Fedir-Yatsenko Fedir-Yatsenko requested a review from ypldan as a code owner May 26, 2026 14:16
@Fedir-Yatsenko Fedir-Yatsenko added enhancement New feature or request python Pull requests that update python code labels May 26, 2026
@Fedir-Yatsenko Fedir-Yatsenko changed the title feat: expose deduplication run status and results via admin API #381 feat!: expose deduplication run status and results via admin API #381 May 26, 2026
Fedir-Yatsenko and others added 4 commits May 26, 2026 17:41
Move the QUEUED -> IN_PROGRESS -> COMPLETED/FAILED lifecycle into
process_deduplication_job, mirroring reload_channel_dataset_in_background.
The background task becomes a thin wrapper, the redundant channel_id arg is
dropped, and _set_deduplication_job_status mutates an in-scope model instead
of refetching. Also validate channel IDs in create_deduplication_jobs and
switch the CLI poll loop to PreprocessingStatusEnum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
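
As a rough illustration of the lifecycle this commit centralises, the sketch below walks a job through QUEUED -> IN_PROGRESS -> COMPLETED/FAILED while mutating the in-scope object. It is a simplified stand-in, not the body of process_deduplication_job: `process_job_sketch`, the `run_dedup` callable, and the dict-based job are hypothetical, and persistence is omitted entirely.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def process_job_sketch(
    job: dict,
    run_dedup: Callable[[int], Awaitable[Dict[str, int]]],
) -> dict:
    """Drive one job through its lifecycle, mutating the passed-in dict."""
    job["status"] = "IN_PROGRESS"
    try:
        counts = await run_dedup(job["channel_id"])
    except Exception as exc:
        # Failures are recorded on the job instead of propagating.
        job["status"] = "FAILED"
        job["reason_for_failure"] = str(exc)
    else:
        job.update(counts)
        job["status"] = "COMPLETED"
    return job


async def _ok(channel_id: int) -> Dict[str, int]:
    # Hypothetical result for a successful run.
    return {"non_indicator_remapped": 5, "non_indicator_deleted": 2}


async def _boom(channel_id: int) -> Dict[str, int]:
    raise RuntimeError("vector store unavailable")


done = asyncio.run(process_job_sketch({"channel_id": 1, "status": "QUEUED"}, _ok))
failed = asyncio.run(process_job_sketch({"channel_id": 2, "status": "QUEUED"}, _boom))
```

With the lifecycle owned by one function, the background task only needs to call it, which matches the "thin wrapper" described in the commit message.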
The `return_exceptions=True` branches in `_process_jobs` and `_deduplicate_channels` never fire — both task wrappers already record per-job failures themselves. Also wrap `create_deduplication_jobs` so a channel deleted mid-run skips the dedup phase instead of aborting, and align the two new deduplication endpoints with the file's positional decorator style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
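
The reasoning about `return_exceptions=True` can be checked with a small standalone example. The names below (`recording_wrapper`, `results`) are hypothetical and unrelated to the real `_process_jobs` code; the point is only that when every awaited wrapper catches its own exceptions, `asyncio.gather` never receives one to return.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def recording_wrapper(
    job_id: int,
    work: Callable[[], Awaitable[int]],
    results: Dict[int, Tuple[str, object]],
) -> None:
    """Run work() and record the outcome; never lets an exception escape."""
    try:
        results[job_id] = ("COMPLETED", await work())
    except Exception as exc:
        results[job_id] = ("FAILED", str(exc))


async def main():
    results: Dict[int, Tuple[str, object]] = {}

    async def ok() -> int:
        return 1

    async def boom() -> int:
        raise ValueError("bad dataset")

    # return_exceptions=True is kept here only to show that it has no effect.
    outcomes = await asyncio.gather(
        recording_wrapper(1, ok, results),
        recording_wrapper(2, boom, results),
        return_exceptions=True,
    )
    return outcomes, results


outcomes, results = asyncio.run(main())
```

Every entry in `outcomes` is `None`, so a branch that inspects the gathered values for exception instances has nothing to act on.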
@Fedir-Yatsenko
Collaborator Author

Fedir-Yatsenko commented May 27, 2026

/deploy-review

GitHub actions run: 26499168229

Stage Status
deploy-review Success ✅
matrix.application Skipped ➖

@Fedir-Yatsenko
Collaborator Author

⏱️ This PR has been deferred until the next release.

@Fedir-Yatsenko Fedir-Yatsenko marked this pull request as draft May 27, 2026 12:03
# Conflicts:
#	statgpt/admin/auto_update.py
#	statgpt/admin/routers/channel.py
#	statgpt/admin/services/channel.py
#	statgpt/common/vectorstore/__init__.py
#	statgpt/common/vectorstore/base.py
#	statgpt/common/vectorstore/pg_vector_store/pg_vector_store.py
Development

Successfully merging this pull request may close these issues.

Expose deduplication run status and results via admin API

1 participant