feat!: expose deduplication run status and results via admin API #381 by Fedir-Yatsenko · Pull Request #408 · epam/statgpt-backend

Fedir-Yatsenko · 2026-05-26T14:16:02Z

Applicable issues

closes Expose deduplication run status and results via admin API #381

Description of changes

POST /admin/api/v1/channels/{id}/datasets/deduplicate was fire-and-forget: it returned 202 Accepted with no way for the caller to learn whether the run finished, succeeded, or failed. The per-dimension counts of remapped rows and deleted orphan documents only appeared in application logs, so admins had to inspect logs to confirm a result. The CLI compounded the issue by printing "Deduplication completed" as soon as the 202 landed, regardless of whether the background task had actually finished.

This PR introduces a DeduplicationJob record (mirroring the existing AutoUpdateJob pattern from #383) so deduplication is fully observable: callers can poll job status and retrieve a structured per-dimension summary, and the CLI now waits for real completion and displays the counts.

API changes

POST /channels/{channel_id}/datasets/deduplicate — ⚠️ breaking: response shape changed from Channel to DeduplicationJob. Status code remains 202 Accepted. The job is created with status=QUEUED, returned immediately, and processed in the background.
GET /channels/{channel_id}/datasets/deduplication-jobs — new, paginated list of dedup jobs for a channel.
GET /channels/deduplication-jobs/{job_id} — new, single-job lookup for polling status and reading counts.

DeduplicationJob exposes: id, channel_id, status (QUEUED → IN_PROGRESS → COMPLETED/FAILED), reason_for_failure, and four per-dimension count columns:

non_indicator_remapped, non_indicator_deleted
special_remapped, special_deleted
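As a rough illustration of the shape described above, the job record could be modelled like this. It is a sketch only: the class names `JobStatusSketch` and `DeduplicationJobSketch`, the integer ID types, the nullable counts, and the `is_terminal` helper are assumptions made for the example, not the PR's actual schema (the PR reuses `PreprocessingStatusEnum` for the status values).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatusSketch(str, Enum):
    """Lifecycle states named in the PR description (hypothetical enum name)."""

    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class DeduplicationJobSketch:
    """Illustrative stand-in for the DeduplicationJob response body."""

    id: int  # assumption: integer primary key
    channel_id: int  # assumption: integer channel ID
    status: JobStatusSketch = JobStatusSketch.QUEUED
    reason_for_failure: Optional[str] = None
    # Assumption: counts stay unset until the run completes successfully.
    non_indicator_remapped: Optional[int] = None
    non_indicator_deleted: Optional[int] = None
    special_remapped: Optional[int] = None
    special_deleted: Optional[int] = None

    @property
    def is_terminal(self) -> bool:
        """True once the job can no longer change state."""
        return self.status in (JobStatusSketch.COMPLETED, JobStatusSketch.FAILED)
```

A caller polling the by-id endpoint would stop as soon as `is_terminal` is true, then read either the counts or `reason_for_failure`.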

CLI

statgpt deduplicate -c <channel> now polls the new endpoint via the existing spinner pattern (mirroring import_handler):

Spinner shows live Deduplication status: … updates.
On COMPLETED, prints a Rich table with the four per-dimension counts.
On FAILED, prints reason_for_failure.

AdminClient gains get_deduplication_job(job_id), and deduplicate_channel(channel_id) now returns a DeduplicationJob that callers can poll.
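The polling behaviour described above might look roughly like the sketch below. It is not the CLI's actual code: `wait_for_job`, `summarize`, the `on_status` callback (standing in for the spinner update), the timeout, and the dict-shaped job are all hypothetical. In the real CLI the fetch step would be `AdminClient.get_deduplication_job(job_id)` and the summary would be a Rich table.

```python
import time
from typing import Any, Callable, Dict, Optional

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}
COUNT_FIELDS = (
    "non_indicator_remapped",
    "non_indicator_deleted",
    "special_remapped",
    "special_deleted",
)


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    on_status: Optional[Callable[[str], None]] = None,
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_job until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if on_status is not None:
            on_status(job["status"])  # e.g. refresh the spinner text
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("deduplication job did not finish in time")
        sleep(interval)


def summarize(job: Dict[str, Any]) -> str:
    """Render the final job as plain text: counts on success, reason on failure."""
    if job["status"] == "FAILED":
        return f"Deduplication failed: {job.get('reason_for_failure')}"
    return "\n".join(f"{name}: {job.get(name)}" for name in COUNT_FIELDS)
```

Passing `fetch_job` as a callable keeps the loop independent of the HTTP client, so it can be exercised with canned responses in tests.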

Checklist

Title of the pull request follows Conventional Commits specification
Deployed and tested in a Review environment.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

Replace the fire-and-forget deduplicate endpoint with a tracked DeduplicationJob, mirroring the AutoUpdateJob pattern. POST /datasets/deduplicate now returns the created job; new GET endpoints expose list and by-id lookup so callers can poll until completion. Each job records per-dimension counts of remapped rows and deleted orphan documents on success, or reason_for_failure on error. The CLI polls the job and prints a results table when the run finishes. The shared _set_failed_status SQL moved into statgpt/admin/services/status_recovery so the new channel-service recovery hook does not cycle imports through the dataset service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the QUEUED -> IN_PROGRESS -> COMPLETED/FAILED lifecycle into process_deduplication_job, mirroring reload_channel_dataset_in_background. The background task becomes a thin wrapper, the redundant channel_id arg is dropped, and _set_deduplication_job_status mutates an in-scope model instead of refetching. Also validate channel IDs in create_deduplication_jobs and switch the CLI poll loop to PreprocessingStatusEnum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `return_exceptions=True` branches in `_process_jobs` and `_deduplicate_channels` never fire: both task wrappers already record per-job failures themselves. Also wrap `create_deduplication_jobs` so a channel deleted mid-run skips the dedup phase instead of aborting, and align the two new deduplication endpoints with the file's positional decorator style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
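To illustrate the reasoning in that commit message, the sketch below shows why `return_exceptions=True` adds nothing when every task wrapper catches its own errors. The names `run_job` and `process_jobs`, the status strings, and the even-ID failure rule are invented for the example and do not come from the repository.

```python
import asyncio
from typing import Dict, List


async def run_job(job_id: int, statuses: Dict[int, str]) -> None:
    """Hypothetical task wrapper: records the outcome itself and never raises."""
    statuses[job_id] = "IN_PROGRESS"
    try:
        if job_id % 2 == 0:  # simulate a failure for even job IDs
            raise RuntimeError(f"job {job_id} failed")
        statuses[job_id] = "COMPLETED"
    except Exception as exc:
        # The failure is stored on the job, so nothing propagates to gather().
        statuses[job_id] = f"FAILED: {exc}"


async def process_jobs(job_ids: List[int]) -> Dict[int, str]:
    """Run all wrappers concurrently and return the recorded statuses."""
    statuses: Dict[int, str] = {}
    # A plain gather() is enough here: the wrappers swallow their own
    # exceptions, so there is nothing for return_exceptions=True to collect.
    await asyncio.gather(*(run_job(job_id, statuses) for job_id in job_ids))
    return statuses
```

Running `asyncio.run(process_jobs([1, 2, 3]))` completes without raising even though job 2 fails, because the failure is recorded on the job rather than surfaced through `gather`.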

Fedir-Yatsenko · 2026-05-27T08:09:46Z

/deploy-review

GitHub actions run: 26499168229

Stage	Status
deploy-review	Success ✅
matrix.application	Skipped ➖

Fedir-Yatsenko · 2026-05-27T12:02:20Z

⏱️ This PR has been deferred until the next release.

# Conflicts:
#	statgpt/admin/auto_update.py
#	statgpt/admin/routers/channel.py
#	statgpt/admin/services/channel.py
#	statgpt/common/vectorstore/__init__.py
#	statgpt/common/vectorstore/base.py
#	statgpt/common/vectorstore/pg_vector_store/pg_vector_store.py

Fedir-Yatsenko self-assigned this May 26, 2026

Fedir-Yatsenko requested a review from ypldan as a code owner May 26, 2026 14:16

Fedir-Yatsenko added the enhancement and python labels May 26, 2026

Fedir-Yatsenko changed the title ~~feat: expose deduplication run status and results via admin API #381~~ feat!: expose deduplication run status and results via admin API #381 May 26, 2026

Fedir-Yatsenko and others added 4 commits May 26, 2026 17:41

Merge branch 'development' into feat/381-deduplication-job-status

231faf7

Merge branch 'development' into feat/381-deduplication-job-status

02f4a8f

Fedir-Yatsenko marked this pull request as draft May 27, 2026 12:03

Fedir-Yatsenko wants to merge 6 commits into development from feat/381-deduplication-job-status
