Refactor: Consolidate WEB API & HTTP API for document delete API #14254
JinHai-CN merged 3 commits into infiniflow:main from
Conversation
📝 Walkthrough
Consolidates document deletion by removing legacy
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant REST as REST API (document_api.delete_documents)
    participant KB as KnowledgebaseService
    participant Doc as DocumentService
    participant File as FileService / ThreadPool
    Client->>REST: DELETE /datasets/{dataset_id}/documents\njson: { ids } or { delete_all: true }
    REST->>KB: accessible(kb_id, user_id)
    KB-->>REST: accessible / not accessible
    alt accessible
        REST->>Doc: if delete_all -> query(kb_id) else validate ids
        Doc-->>REST: doc IDs (or validation errors)
        REST->>Doc: check_duplicate_ids(doc_ids)
        REST->>File: thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)
        File-->>REST: success / errors
        REST-->>Client: { deleted: <count> } or error
    else not accessible
        REST-->>Client: auth error (401)
    end
```
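The ids/delete_all contract shown in the diagram can be sketched as a small validation helper. This is a hedged sketch, not the project's code: `resolve_delete_request` and its `(value, error)` return shape are illustrative stand-ins for the real handler and its `get_error_data_result` responses.

```python
def resolve_delete_request(req: dict, dataset_id: str):
    """Validate the ids-XOR-delete_all contract from the request body.

    Illustrative stand-in for the real handler: returns (doc_ids, error)
    instead of a framework error response.
    """
    doc_ids = req.get("ids") or []
    delete_all = req.get("delete_all", False)
    if not delete_all and len(doc_ids) == 0:
        return None, f"should either provide doc ids or set delete_all(true), dataset: {dataset_id}."
    if doc_ids and delete_all:
        return None, f"should not provide both doc ids and delete_all(true), dataset: {dataset_id}."
    # delete_all=True yields an empty list here; the handler then queries
    # all document ids for the dataset.
    return doc_ids, None
```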
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)
666-748: Add audit logging for the destructive delete flow.

This new endpoint should log the validated delete intent and outcome, without logging full document ID lists. That makes bulk deletes and `delete_all` traceable in production. As per coding guidelines (`**/*.py`: Add logging for new flows).

🪵 Proposed logging additions

```diff
 if delete_all:
     doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]
+    logging.info(
+        "Deleting documents from dataset %s: delete_all=%s, count=%s",
+        dataset_id,
+        delete_all,
+        len(doc_ids),
+    )
+
 # make sure each id is unique
 unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
 if duplicate_messages:
     logging.warning(f"duplicate_messages:{duplicate_messages}")
 else:
     doc_ids = unique_doc_ids
@@
 if errors:
+    logging.warning("Document deletion failed for dataset %s: %s", dataset_id, errors)
     return get_error_data_result(message=str(errors))
+logging.info("Deleted %s documents from dataset %s", len(doc_ids), dataset_id)
 return get_result(data={"deleted": len(doc_ids)})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 666 - 748, Add audit logging to the delete_documents flow to record the validated delete intent and outcome without printing full document id lists: after the access check and after preparing doc_ids (and after check_duplicate_ids) emit an audit log (using logging.info or logging.warning) that includes dataset_id, tenant_id, whether delete_all is true, and the number of documents to delete (len(doc_ids)) and note if duplicates were detected (duplicate_messages present) — do this before calling FileService.delete_docs; then log the deletion outcome after await thread_pool_exec(FileService.delete_docs, ...) including success/failure, count deleted (len(doc_ids) on success) and any error summary (do not include full ids). Reference functions/classes: delete_documents, KnowledgebaseService.accessible, check_duplicate_ids, DocumentService.query, FileService.delete_docs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 88753fc8-625d-4da1-9c5e-c2f4f798b726
📒 Files selected for processing (15)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- api/apps/sdk/doc.py
- api/utils/validation_utils.py
- test/testcases/test_http_api/test_file_management_within_dataset/test_delete_documents.py
- test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
- test/testcases/test_sdk_api/test_file_management_within_dataset/test_delete_documents.py
- test/testcases/test_web_api/test_chunk_app/test_create_chunk.py
- test/testcases/test_web_api/test_chunk_app/test_update_chunk.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/conftest.py
- test/testcases/test_web_api/test_document_app/test_rm_documents.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
💤 Files with no reviewable changes (2)
- test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
- api/apps/document_app.py
```python
# Get documents to delete
doc_ids = req.get("ids") or []
delete_all = req.get("delete_all", False)
if not delete_all and len(doc_ids) == 0:
    return get_error_data_result(message=f"should either provide doc ids or set delete_all(true), dataset: {dataset_id}. ")

if len(doc_ids) > 0 and delete_all:
    return get_error_data_result(message=f"should not provide both doc ids and delete_all(true), dataset: {dataset_id}. ")
if delete_all:
    doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]

# make sure each id is unique
unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
if duplicate_messages:
    logging.warning(f"duplicate_messages:{duplicate_messages}")
else:
    doc_ids = unique_doc_ids

# Delete documents using existing FileService.delete_docs
errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)
```
Preflight document IDs before deleting.
This route checks access to dataset_id, but then passes arbitrary doc_ids to FileService.delete_docs, which deletes by document ID and resolves the document tenant internally. That allows a request scoped to one dataset to delete a document from another dataset if its ID is supplied. Also, duplicate IDs are only logged, so duplicate requests can partially delete and then fail.
Reject duplicates and verify every requested ID belongs to dataset_id before calling FileService.delete_docs.
🛡️ Proposed fix

```diff
 if delete_all:
     doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]
 # make sure each id is unique
 unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
 if duplicate_messages:
-    logging.warning(f"duplicate_messages:{duplicate_messages}")
-else:
-    doc_ids = unique_doc_ids
+    logging.warning(f"duplicate_messages:{duplicate_messages}")
+    return get_error_data_result(message="; ".join(duplicate_messages), code=RetCode.ARGUMENT_ERROR)
+doc_ids = unique_doc_ids
+
+dataset_doc_ids = set(KnowledgebaseService.list_documents_by_ids([dataset_id]))
+missing_doc_ids = [doc_id for doc_id in doc_ids if doc_id not in dataset_doc_ids]
+if missing_doc_ids:
+    return get_error_data_result(message=f"Document not found: {missing_doc_ids[0]}")
 # Delete documents using existing FileService.delete_docs
 errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 721 - 740, Reject
duplicate IDs and verify ownership before calling FileService.delete_docs: if
check_duplicate_ids(doc_ids, "document") returns any duplicates, return an error
instead of only logging; then preflight all doc_ids by fetching their records
(e.g., via DocumentService.query or a DocumentService.get_by_ids helper) and
ensure each returned document belongs to the requested dataset_id (and tenant_id
if applicable); if any requested id is missing or belongs to a different
dataset/tenant, return an error listing offending ids; only after duplicates are
absent and all ids are verified to belong to dataset_id call
FileService.delete_docs(doc_ids, tenant_id).
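The preflight described above can be sketched independently of the service layer. Everything here is a hypothetical stand-in: `fetch_kb_ids` plays the role of a `DocumentService.get_by_ids`-style lookup, and the error shape is illustrative.

```python
def preflight_doc_ids(doc_ids, dataset_id, fetch_kb_ids):
    """Verify every requested id exists and belongs to dataset_id.

    fetch_kb_ids(doc_ids) is a hypothetical lookup returning a
    {doc_id: kb_id} mapping for the ids it could find; missing ids
    simply have no entry, so they fail the membership check too.
    """
    found = fetch_kb_ids(doc_ids)
    offending = [d for d in doc_ids if found.get(d) != dataset_id]
    if offending:
        raise ValueError(f"documents missing or outside dataset {dataset_id}: {offending}")
    return doc_ids
```

Only after this returns cleanly would the handler hand `doc_ids` to the actual delete call.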
```diff
 export const useRemoveDocument = () => {
   const queryClient = useQueryClient();
   const { id: datasetId } = useParams();
   const {
     data,
     isPending: loading,
     mutateAsync,
   } = useMutation({
     mutationKey: [DocumentApiAction.RemoveDocument],
     mutationFn: async (documentIds: string | string[]) => {
-      const { data } = await kbService.documentRm({ doc_id: documentIds });
+      const ids = Array.isArray(documentIds) ? documentIds : [documentIds];
+      const { data } = await deleteDocument(datasetId!, ids);
```
Resolve the dataset id the same way as document list/filter.
Line 330 uses datasetId! from useParams(), but this file’s list/filter flows use knowledgeId || id. On routes where the dataset id comes from search params, delete calls /datasets/undefined/documents.
🐛 Proposed fix

```diff
 export const useRemoveDocument = () => {
   const queryClient = useQueryClient();
-  const { id: datasetId } = useParams();
+  const { knowledgeId } = useGetKnowledgeSearchParams();
+  const { id } = useParams();
   const {
     data,
     isPending: loading,
     mutateAsync,
   } = useMutation({
     mutationKey: [DocumentApiAction.RemoveDocument],
     mutationFn: async (documentIds: string | string[]) => {
       const ids = Array.isArray(documentIds) ? documentIds : [documentIds];
-      const { data } = await deleteDocument(datasetId!, ids);
+      const datasetId = knowledgeId || id;
+      if (!datasetId) {
+        throw new Error('Dataset ID is required');
+      }
+      const { data } = await deleteDocument(datasetId, ids);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
export const useRemoveDocument = () => {
  const queryClient = useQueryClient();
  const { knowledgeId } = useGetKnowledgeSearchParams();
  const { id } = useParams();
  const {
    data,
    isPending: loading,
    mutateAsync,
  } = useMutation({
    mutationKey: [DocumentApiAction.RemoveDocument],
    mutationFn: async (documentIds: string | string[]) => {
      const ids = Array.isArray(documentIds) ? documentIds : [documentIds];
      const datasetId = knowledgeId || id;
      if (!datasetId) {
        throw new Error('Dataset ID is required');
      }
      const { data } = await deleteDocument(datasetId, ids);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/src/hooks/use-document-request.ts` around lines 319 - 330, The mutation
in useRemoveDocument uses datasetId from useParams() directly (datasetId!) which
can be undefined for routes that supply the dataset via the knowledgeId || id
pattern used elsewhere; change the dataset resolution to mirror the list/filter
flow (derive datasetId using the same logic as the other hooks — e.g., resolve
const datasetId = knowledgeId || id or read from search params the same way the
document list/filter does) and pass that resolved datasetId into
deleteDocument(documentIds) so deleteDocument is never called with undefined;
update the useRemoveDocument function where mutationFn calls deleteDocument to
use the resolved id.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main   #14254   +/-   ##
=======================================
  Coverage   96.66%   96.66%
=======================================
  Files          10       10
  Lines         690      690
  Branches      108      108
=======================================
  Hits          667      667
  Misses          8        8
  Partials       15       15
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)
738-743: ⚠️ Potential issue | 🔴 Critical — Deduplication logic is inverted; duplicates fall through unchanged.

When `check_duplicate_ids` finds duplicates (`duplicate_messages` non-empty), the code only logs and leaves `doc_ids` as the original list with duplicates; the `else` branch is the only place `doc_ids = unique_doc_ids` happens, which is the case where there were no duplicates to strip anyway. Result: duplicate IDs are passed to `FileService.delete_docs`, and the second attempt at the same ID will raise `"Document not found!"` (see `api/db/services/file_service.py` L582-625), turning a duplicate-input request into a partial-delete + error response.

This also overlaps with the previously raised ownership/preflight issue, where the suggested fix was to reject duplicates outright rather than silently deduplicate.
🛠️ Minimal fix (always use the deduplicated list)

```diff
-# make sure each id is unique
-unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
-if duplicate_messages:
-    logging.warning(f"duplicate_messages:{duplicate_messages}")
-else:
-    doc_ids = unique_doc_ids
+# make sure each id is unique
+unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
+if duplicate_messages:
+    logging.warning(f"duplicate_messages:{duplicate_messages}")
+doc_ids = unique_doc_ids
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 738 - 743, The deduplication branch is inverted: when check_duplicate_ids returns duplicate_messages the code logs but keeps the original doc_ids; change this to reject duplicate input instead of falling through. In the handler where check_duplicate_ids(doc_ids, "document") is called, if duplicate_messages is non-empty return a 400/BadRequest indicating duplicate IDs (include duplicate_messages) and do not call FileService.delete_docs; otherwise set doc_ids = unique_doc_ids and proceed. Ensure you reference check_duplicate_ids, duplicate_messages, unique_doc_ids, doc_ids and FileService.delete_docs when making the change.
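To make the inverted branch concrete, here is a self-contained sketch: a plausible order-preserving `check_duplicate_ids` (the real helper lives in the project's utils and may differ) plus the corrected control flow, which always adopts the deduplicated list and can optionally reject on duplicates as the review recommends.

```python
from collections import Counter

def check_duplicate_ids(ids, kind="document"):
    """Order-preserving sketch: return (unique_ids, duplicate messages)."""
    counts = Counter(ids)
    unique_ids = list(dict.fromkeys(ids))  # preserves first-seen order
    messages = [f"Duplicate {kind} id: {i}" for i, c in counts.items() if c > 1]
    return unique_ids, messages

def dedupe_or_reject(doc_ids, strict=False):
    """Corrected flow: always use the deduplicated list.

    In strict mode the request is refused outright, the safer choice
    for a destructive endpoint.
    """
    unique_ids, messages = check_duplicate_ids(doc_ids)
    if messages and strict:
        raise ValueError("; ".join(messages))
    return unique_ids
```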
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)
735-736: `delete_all` materializes all documents in the dataset in memory.

For large datasets, `[doc.id for doc in DocumentService.query(kb_id=dataset_id)]` loads every document row just to extract IDs, then `FileService.delete_docs` processes them serially. Consider batching (page through IDs) or adding a dedicated `DocumentService.list_ids_by_kb` projection to reduce memory/DB load, and/or cap the number of docs deletable per request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 735 - 736, The current delete_all path materializes every Document row via DocumentService.query(kb_id=dataset_id) which can OOM for large datasets; change this to stream or page IDs and delete in batches: add a new projection/utility like DocumentService.list_ids_by_kb(kb_id, page_size) that yields only document IDs (or modify DocumentService.query to support id-only projection), then call FileService.delete_docs in fixed-size batches (and enforce an optional max_delete cap per request). Update the delete_all branch to iterate the paginated ID generator, collect a batch of IDs, call FileService.delete_docs(batch), and repeat until exhausted.
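A batched delete_all can be sketched with two injected callables; both names are hypothetical, standing in for an id-only projection and for `FileService.delete_docs`:

```python
def delete_in_batches(fetch_ids, delete_batch, batch_size=100, max_delete=None):
    """Delete ids in fixed-size batches instead of materializing them all.

    fetch_ids(limit) -> up to `limit` ids still present (hypothetical
    id-only projection); delete_batch(ids) -> list of error strings.
    Assumes delete_batch removes the ids fetch_ids would return next.
    """
    deleted, errors = 0, []
    while True:
        limit = batch_size if max_delete is None else min(batch_size, max_delete - deleted)
        if limit <= 0:
            break  # per-request cap reached
        ids = fetch_ids(limit)
        if not ids:
            break  # dataset exhausted
        errors.extend(delete_batch(ids))
        deleted += len(ids)
    return deleted, errors
```

The `max_delete` cap corresponds to the "cap the number of docs deletable per request" suggestion above.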
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c83f74ee-2646-4c20-99f9-104abbbb0c4b
📒 Files selected for processing (6)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- test/testcases/test_web_api/test_common.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
💤 Files with no reviewable changes (1)
- api/apps/document_app.py
🚧 Files skipped from review as they are similar to previous changes (4)
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
- test/testcases/test_web_api/test_common.py
- web/src/hooks/use-document-request.ts
```python
errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)

if errors:
    return get_error_data_result(message=str(errors))

return get_result(data={"deleted": len(doc_ids)})
```
Partial-failure reporting loses successful deletions.
FileService.delete_docs iterates IDs and concatenates exception messages into a single string without stopping, so on partial failure it has already deleted some docs. Here any non-empty errors causes a generic error response and len(doc_ids) is never returned, so the caller cannot tell how many documents were actually removed. Consider (a) having delete_docs return (deleted_ids, errors) or (b) at minimum, including the attempted/failed counts in the error payload so clients can reconcile state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 746 - 751,
FileService.delete_docs currently swallows partial successes by concatenating
exception messages and returning a non-empty errors string so the API never
reports how many docs were actually deleted; change FileService.delete_docs to
return structured results (e.g., (deleted_ids, errors) or (deleted_count,
errors_list)) and update the callsite in document_api.py (where
thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id) is invoked) to
unpack that tuple, return get_result with the deleted count when any docs were
removed, and when returning get_error_data_result include both the failed error
details and the attempted/failed counts so clients can reconcile partial
failures instead of only receiving a generic error.
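The structured result this comment proposes can be sketched like so; `delete_one` is a hypothetical stand-in for the per-document work inside `FileService.delete_docs`:

```python
def delete_docs_structured(doc_ids, delete_one):
    """Delete each id, reporting successes and failures separately.

    delete_one(doc_id) raises on failure. Returning (deleted_ids, errors)
    lets the API report partial success instead of a generic error.
    """
    deleted_ids, errors = [], []
    for doc_id in doc_ids:
        try:
            delete_one(doc_id)
            deleted_ids.append(doc_id)
        except Exception as exc:  # collect and continue, matching current behavior
            errors.append(f"{doc_id}: {exc}")
    return deleted_ids, errors
```

A caller could then return the deleted count alongside the error list, so clients can reconcile partial failures.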
What problem does this PR solve?
Before consolidation:
- Web API: POST /v1/document/rm
- HTTP API: DELETE /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: DELETE /api/v1/datasets/<dataset_id>/documents
Type of change