feat!: expose deduplication run status and results via admin API #381 #408
Draft
Fedir-Yatsenko wants to merge 6 commits into
Replace the fire-and-forget deduplicate endpoint with a tracked DeduplicationJob, mirroring the AutoUpdateJob pattern. POST /datasets/deduplicate now returns the created job; new GET endpoints expose list and by-id lookup so callers can poll until completion. Each job records per-dimension counts of remapped rows and deleted orphan documents on success, or reason_for_failure on error. The CLI polls the job and prints a results table when the run finishes. The shared _set_failed_status SQL moved into statgpt/admin/services/status_recovery so the new channel-service recovery hook does not cycle imports through the dataset service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
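As a rough illustration of the "poll until completion" flow this commit describes, a caller-side loop might look like the sketch below. It is not the PR's CLI code: `poll_job`, `fake_fetch`, the timeout handling, and the canned responses are all hypothetical, and the fetch callable only stands in for a request to the by-id GET endpoint.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the PR's lifecycle).
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_job(
    fetch_job: Callable[[int], dict],
    job_id: int,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Fetch the job repeatedly until it reaches a terminal status.

    `fetch_job` is a hypothetical stand-in for an HTTP call to the
    by-id lookup endpoint; it must return a dict with a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)


# Canned responses (made-up values) replayed in order, simulating a job
# that moves through QUEUED and IN_PROGRESS before completing.
_responses = iter(
    [
        {"id": 7, "status": "QUEUED"},
        {"id": 7, "status": "IN_PROGRESS"},
        {"id": 7, "status": "COMPLETED", "non_indicator_remapped": 12},
    ]
)


def fake_fetch(job_id: int) -> dict:
    return next(_responses)


# Pass a no-op sleep so the example finishes instantly.
final = poll_job(fake_fetch, job_id=7, sleep=lambda _seconds: None)
print(final["status"])
```

Injecting the fetch and sleep callables keeps the loop testable without a running admin API; a real client would presumably wrap its HTTP call in `fetch_job`.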
Move the QUEUED -> IN_PROGRESS -> COMPLETED/FAILED lifecycle into process_deduplication_job, mirroring reload_channel_dataset_in_background. The background task becomes a thin wrapper, the redundant channel_id arg is dropped, and _set_deduplication_job_status mutates an in-scope model instead of refetching. Also validate channel IDs in create_deduplication_jobs and switch the CLI poll loop to PreprocessingStatusEnum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
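The lifecycle described in this commit can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the code in `process_deduplication_job`: the `Status` enum, the `Job` dataclass, and `process_job` are hypothetical stand-ins (the PR itself uses `PreprocessingStatusEnum` and a database model), and persistence is omitted entirely.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Status(str, Enum):
    # Hypothetical stand-in for the status enum used by the PR.
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Job:
    # Hypothetical in-memory stand-in for the job record.
    id: int
    status: Status = Status.QUEUED
    reason_for_failure: Optional[str] = None
    counts: dict = field(default_factory=dict)


def process_job(job: Job, run: Callable[[], dict]) -> Job:
    """Drive one job through its lifecycle, mutating the in-scope instance."""
    job.status = Status.IN_PROGRESS
    try:
        job.counts = run()
    except Exception as exc:
        # Record the failure on the job instead of propagating it.
        job.status = Status.FAILED
        job.reason_for_failure = str(exc)
    else:
        job.status = Status.COMPLETED
    return job


def failing_run() -> dict:
    raise RuntimeError("channel not found")


ok = process_job(Job(id=1), lambda: {"non_indicator_remapped": 3})
bad = process_job(Job(id=2), failing_run)
print(ok.status.value, bad.status.value, bad.reason_for_failure)
```

Keeping every transition in one function means a thin background wrapper only has to call it, which is the shape the commit message describes.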
The `return_exceptions=True` branches in `_process_jobs` and `_deduplicate_channels` never fire, because both task wrappers already record per-job failures themselves. Also wrap `create_deduplication_jobs` so a channel deleted mid-run skips the dedup phase instead of aborting, and align the two new deduplication endpoints with the file's positional decorator style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
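The point about `return_exceptions=True` can be reproduced with a small standalone example. It assumes the wrappers catch `Exception` themselves, as the commit message states; `risky`, `wrapper`, and the `failures` dict are hypothetical names used only for illustration and do not appear in the PR.

```python
import asyncio


async def risky(job_id: int) -> str:
    # Hypothetical unit of work: odd job IDs fail.
    if job_id % 2:
        raise RuntimeError(f"job {job_id} failed")
    return f"job {job_id} ok"


async def wrapper(job_id: int, failures: dict) -> None:
    """Task wrapper that records its own failure and never raises."""
    try:
        await risky(job_id)
    except Exception as exc:
        failures[job_id] = str(exc)


async def main() -> tuple:
    failures: dict = {}
    # return_exceptions=True would place raised exceptions in the result
    # list, but the wrappers swallow them first, so none ever show up.
    results = await asyncio.gather(
        *(wrapper(i, failures) for i in range(4)), return_exceptions=True
    )
    return results, failures


results, failures = asyncio.run(main())
print(any(isinstance(r, Exception) for r in results))
print(failures)
```

Since no exception object ever reaches the result list here, any code that inspects the results for exceptions is unreachable in this arrangement, which is why such branches can be dropped.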
/deploy-review
⏱️ This PR has been deferred until the next release.
# Conflicts:
#   statgpt/admin/auto_update.py
#   statgpt/admin/routers/channel.py
#   statgpt/admin/services/channel.py
#   statgpt/common/vectorstore/__init__.py
#   statgpt/common/vectorstore/base.py
#   statgpt/common/vectorstore/pg_vector_store/pg_vector_store.py
Applicable issues
Description of changes
`POST /admin/api/v1/channels/{id}/datasets/deduplicate` was fire-and-forget: it returned `202 Accepted` with no way for the caller to learn whether the run finished, succeeded, or failed. The per-dimension counts of remapped rows and deleted orphan documents only appeared in application logs, so admins had to inspect logs to confirm a result. The CLI compounded the issue by printing "Deduplication completed" as soon as the 202 landed, regardless of whether the background task had actually finished.

This PR introduces a `DeduplicationJob` record (mirroring the existing `AutoUpdateJob` pattern from #383) so deduplication is fully observable: callers can poll job status and retrieve a structured per-dimension summary, and the CLI now waits for real completion and displays the counts.

API changes

- `POST /channels/{channel_id}/datasets/deduplicate`: response changes from `Channel` to `DeduplicationJob`. Status code remains `202 Accepted`. The job is created with `status=QUEUED`, returned immediately, and processed in the background.
- `GET /channels/{channel_id}/datasets/deduplication-jobs`: new, paginated list of dedup jobs for a channel.
- `GET /channels/deduplication-jobs/{job_id}`: new, single-job lookup for polling status and reading counts.

`DeduplicationJob` exposes: `id`, `channel_id`, `status` (QUEUED → IN_PROGRESS → COMPLETED/FAILED), `reason_for_failure`, and four per-dimension count columns:

- `non_indicator_remapped`, `non_indicator_deleted`
- `special_remapped`, `special_deleted`

CLI

`statgpt deduplicate -c <channel>` now polls the new endpoint via the existing spinner pattern (mirroring `import_handler`):

- `Deduplication status: …` updates.
- On `COMPLETED`, prints a Rich table with the four per-dimension counts.
- On `FAILED`, prints `reason_for_failure`.

`AdminClient` gains `get_deduplication_job(job_id)`, and `deduplicate_channel(channel_id)` now returns a `DeduplicationJob` for polling.

Checklist

Review environment.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.
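For readers consuming the new endpoints, the `DeduplicationJob` fields listed under API changes could be modelled and summarized roughly as below. This is a sketch under assumptions: the class name `DeduplicationJobView`, the field types and nullability, the `summarize` helper, and the sample payload values are all hypothetical, and plain string formatting stands in for the Rich table the CLI prints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeduplicationJobView:
    # Field names follow the PR description; types are assumptions.
    id: int
    channel_id: int
    status: str
    reason_for_failure: Optional[str] = None
    non_indicator_remapped: Optional[int] = None
    non_indicator_deleted: Optional[int] = None
    special_remapped: Optional[int] = None
    special_deleted: Optional[int] = None


def summarize(job: DeduplicationJobView) -> str:
    """Render a plain-text summary of a job, depending on its status."""
    if job.status == "FAILED":
        return f"Deduplication failed: {job.reason_for_failure}"
    if job.status != "COMPLETED":
        return f"Deduplication status: {job.status}"
    rows = [
        ("non_indicator", job.non_indicator_remapped, job.non_indicator_deleted),
        ("special", job.special_remapped, job.special_deleted),
    ]
    lines = [f"{'dimension':<15}{'remapped':>10}{'deleted':>10}"]
    for name, remapped, deleted in rows:
        # str() keeps the formatting safe even if a count is missing.
        lines.append(f"{name:<15}{str(remapped):>10}{str(deleted):>10}")
    return "\n".join(lines)


# Made-up payload shaped like a completed job response.
payload = {
    "id": 7,
    "channel_id": 3,
    "status": "COMPLETED",
    "non_indicator_remapped": 12,
    "non_indicator_deleted": 4,
    "special_remapped": 0,
    "special_deleted": 1,
}
summary = summarize(DeduplicationJobView(**payload))
print(summary)
```

A failed job would instead yield a single line carrying `reason_for_failure`, matching the CLI behaviour described above.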