
Refactor: Consolidate Web API & HTTP API for document delete API #14254

Merged
JinHai-CN merged 3 commits into infiniflow:main from xugangqiang:merge-doc-delete on Apr 22, 2026

Conversation

@xugangqiang

@xugangqiang xugangqiang commented Apr 21, 2026

What problem does this PR solve?

Before consolidation there were two endpoints:

  • Web API: POST /v1/document/rm
  • HTTP API: DELETE /api/v1/datasets/<dataset_id>/documents

After consolidation there is a single RESTful API: DELETE /api/v1/datasets/<dataset_id>/documents

Type of change

  • Refactoring
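For client code, the migration means moving from a POST carrying doc_id to a dataset-scoped DELETE carrying an ids list or a delete_all flag. A minimal sketch, assuming a local base URL and a bearer-token header (both deployment details not specified in this PR):

```python
import json
import urllib.request

def build_delete_request(base_url, dataset_id, ids=None, delete_all=False):
    """Mirror the server-side rules from this PR: provide ids XOR delete_all."""
    if not delete_all and not ids:
        raise ValueError("provide doc ids or set delete_all=True")
    if ids and delete_all:
        raise ValueError("do not provide both ids and delete_all")
    url = f"{base_url}/api/v1/datasets/{dataset_id}/documents"
    body = {"delete_all": True} if delete_all else {"ids": list(ids)}
    return url, body

def delete_documents(base_url, api_key, dataset_id, ids=None, delete_all=False):
    # The bearer-token header is an assumption about the deployment.
    url, body = build_delete_request(base_url, dataset_id, ids, delete_all)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method="DELETE",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:  # network call; not run here
        return json.load(resp)
```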

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Consolidates document deletion by removing legacy POST /rm and adding a dataset-scoped DELETE /datasets/<dataset_id>/documents endpoint; updates SDK, frontend API, hooks, and tests to use the new request shape and async deletion flow.

Changes

Backend — REST endpoint added / legacy removed
  Files: api/apps/document_app.py, api/apps/restful_apis/document_api.py, api/apps/sdk/doc.py
  Removed the legacy POST /rm handler; added a DELETE /datasets/<dataset_id>/documents REST handler with JSON DeleteDocumentReq parsing, a dataset accessibility check, ids/delete_all validation, a duplicate-id check, and async deletion via FileService.delete_docs/thread_pool_exec. Removed the duplicate SDK route.

Validation models
  Files: api/utils/validation_utils.py
  Added a DeleteDocumentReq(DeleteReq) request model for document deletion requests.

Frontend — API surface / service / hooks
  Files: web/src/utils/api.ts, web/src/services/knowledge-service.ts, web/src/hooks/use-document-request.ts
  Removed the documentRm mapping; changed documentDelete to a function taking datasetId; added a deleteDocument(datasetId, ids) service function; updated the useRemoveDocument hook to call the new API and normalize ids.

Backend tests (HTTP / SDK / unit)
  Files: test/testcases/test_http_api/test_file_management_within_dataset/test_delete_documents.py, test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py, test/testcases/test_sdk_api/test_file_management_within_dataset/test_delete_documents.py
  Adjusted expected HTTP auth codes/messages; replaced doc_id-based tests with ids/delete_all validation and field-level errors; removed the SDK route unit test covering the old handler.

Web tests / helpers / fixtures
  Files: test/testcases/test_web_api/test_chunk_app/test_create_chunk.py, test/testcases/test_web_api/test_chunk_app/test_update_chunk.py, test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/conftest.py, test/testcases/test_web_api/test_document_app/test_rm_documents.py
  Updated the delete_document helper to accept dataset_id and send DELETE /datasets/<dataset_id>/documents with {"ids": [...]} or {"delete_all": true}; updated fixtures and tests to the new signature and adjusted assertions.

Frontend usage
  Files: web/src/services/..., web/src/hooks/...
  Rewired frontend calls to use the new dataset-scoped delete API and adjusted error messaging in one hook.
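The validation behind the new request model can be sketched in plain Python. This is a hypothetical stand-in: the real DeleteDocumentReq subclasses DeleteReq in api/utils/validation_utils.py, and only the ids/delete_all fields and their mutual-exclusivity rules are taken from this PR.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch of the field-level checks behind DeleteDocumentReq.
# Duplicate-id detection happens later in the handler, not in the model.
@dataclass
class DeleteDocumentReq:
    ids: Optional[List[str]] = None
    delete_all: bool = False

    def field_errors(self) -> List[str]:
        errors = []
        ids = self.ids or []
        if not self.delete_all and not ids:
            errors.append("either provide ids or set delete_all to true")
        if ids and self.delete_all:
            errors.append("do not provide both ids and delete_all")
        return errors
```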

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant REST as REST API (document_api.delete_documents)
  participant KB as KnowledgebaseService
  participant Doc as DocumentService
  participant File as FileService / ThreadPool

  Client->>REST: DELETE /datasets/{dataset_id}/documents\njson: { ids } or { delete_all: true }
  REST->>KB: accessible(kb_id, user_id)
  KB-->>REST: accessible / not accessible
  alt accessible
    REST->>Doc: if delete_all -> query(kb_id) else validate ids
    Doc-->>REST: doc IDs (or validation errors)
    REST->>Doc: check_duplicate_ids(doc_ids)
    REST->>File: thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)
    File-->>REST: success / errors
    REST-->>Client: { deleted: <count> } or error
  else not accessible
    REST-->>Client: auth error (401)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • JinHai-CN
  • yingfeng
  • yuzhichang

Poem

🐰 I hopped from /rm to /datasets wide,
ids in a list, no more single-slide,
Threads hum in clusters to tidy each file,
Deletions now tidy, efficient, and agile. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 14.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the title accurately describes the main objective: consolidating two document deletion endpoints into a single RESTful API.
  • Description check — ✅ Passed: the description addresses the problem and specifies the refactoring type, but lacks detail on implementation scope and breaking changes.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.


@xugangqiang xugangqiang added the ci Continue Integration label Apr 21, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 21, 2026 07:35
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. ☯️ refactor Pull request that refactor/refine code 🐖api The modified files are located under directory 'api/apps/sdk' 🧪 test Pull requests that update test cases. labels Apr 21, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)

666-748: Add audit logging for the destructive delete flow.

This new endpoint should log the validated delete intent and outcome, without logging full document ID lists, so that bulk deletes and delete_all remain traceable in production. This follows the coding guideline for **/*.py: add logging for new flows.

🪵 Proposed logging additions
         if delete_all:
             doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]
 
+        logging.info(
+            "Deleting documents from dataset %s: delete_all=%s, count=%s",
+            dataset_id,
+            delete_all,
+            len(doc_ids),
+        )
+
         # make sure each id is unique
         unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
         if duplicate_messages:
             logging.warning(f"duplicate_messages:{duplicate_messages}")
         else:
@@
         if errors:
+            logging.warning("Document deletion failed for dataset %s: %s", dataset_id, errors)
             return get_error_data_result(message=str(errors))
 
+        logging.info("Deleted %s documents from dataset %s", len(doc_ids), dataset_id)
         return get_result(data={"deleted": len(doc_ids)})
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 88753fc8-625d-4da1-9c5e-c2f4f798b726

📥 Commits

Reviewing files that changed from the base of the PR and between 05c39b9 and 3dc56d6.

📒 Files selected for processing (15)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • api/apps/sdk/doc.py
  • api/utils/validation_utils.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_delete_documents.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • test/testcases/test_sdk_api/test_file_management_within_dataset/test_delete_documents.py
  • test/testcases/test_web_api/test_chunk_app/test_create_chunk.py
  • test/testcases/test_web_api/test_chunk_app/test_update_chunk.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/conftest.py
  • test/testcases/test_web_api/test_document_app/test_rm_documents.py
  • web/src/hooks/use-document-request.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (2)
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • api/apps/document_app.py

Comment on lines +721 to +740
        # Get documents to delete
        doc_ids = req.get("ids") or []
        delete_all = req.get("delete_all", False)
        if not delete_all and len(doc_ids) == 0:
            return get_error_data_result(message=f"should either provide doc ids or set delete_all(true), dataset: {dataset_id}. ")

        if len(doc_ids) > 0 and delete_all:
            return get_error_data_result(message=f"should not provide both doc ids and delete_all(true), dataset: {dataset_id}. ")
        if delete_all:
            doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]

        # make sure each id is unique
        unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
        if duplicate_messages:
            logging.warning(f"duplicate_messages:{duplicate_messages}")
        else:
            doc_ids = unique_doc_ids

        # Delete documents using existing FileService.delete_docs
        errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)
⚠️ Potential issue | 🔴 Critical

Preflight document IDs before deleting.

This route checks access to dataset_id, but then passes arbitrary doc_ids to FileService.delete_docs, which deletes by document ID and resolves the document tenant internally. That allows a request scoped to one dataset to delete a document from another dataset if its ID is supplied. Also, duplicate IDs are only logged, so duplicate requests can partially delete and then fail.

Reject duplicates and verify every requested ID belongs to dataset_id before calling FileService.delete_docs.

🛡️ Proposed fix
         if delete_all:
             doc_ids = [doc.id for doc in DocumentService.query(kb_id=dataset_id)]
 
         # make sure each id is unique
         unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
         if duplicate_messages:
-            logging.warning(f"duplicate_messages:{duplicate_messages}")
-        else:
-            doc_ids = unique_doc_ids
+            logging.warning(f"duplicate_messages:{duplicate_messages}")
+            return get_error_data_result(message="; ".join(duplicate_messages), code=RetCode.ARGUMENT_ERROR)
+        doc_ids = unique_doc_ids
+
+        dataset_doc_ids = set(KnowledgebaseService.list_documents_by_ids([dataset_id]))
+        missing_doc_ids = [doc_id for doc_id in doc_ids if doc_id not in dataset_doc_ids]
+        if missing_doc_ids:
+            return get_error_data_result(message=f"Document not found: {missing_doc_ids[0]}")
 
         # Delete documents using existing FileService.delete_docs
         errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)

Comment on lines 319 to +330
 export const useRemoveDocument = () => {
   const queryClient = useQueryClient();
   const { id: datasetId } = useParams();
   const {
     data,
     isPending: loading,
     mutateAsync,
   } = useMutation({
     mutationKey: [DocumentApiAction.RemoveDocument],
     mutationFn: async (documentIds: string | string[]) => {
-      const { data } = await kbService.documentRm({ doc_id: documentIds });
+      const ids = Array.isArray(documentIds) ? documentIds : [documentIds];
+      const { data } = await deleteDocument(datasetId!, ids);
⚠️ Potential issue | 🟠 Major

Resolve the dataset id the same way as document list/filter.

Line 330 uses datasetId! from useParams(), but this file’s list/filter flows use knowledgeId || id. On routes where the dataset id comes from search params, delete calls /datasets/undefined/documents.

🐛 Proposed fix
 export const useRemoveDocument = () => {
   const queryClient = useQueryClient();
-  const { id: datasetId } = useParams();
+  const { knowledgeId } = useGetKnowledgeSearchParams();
+  const { id } = useParams();
   const {
     data,
     isPending: loading,
     mutateAsync,
   } = useMutation({
     mutationKey: [DocumentApiAction.RemoveDocument],
     mutationFn: async (documentIds: string | string[]) => {
       const ids = Array.isArray(documentIds) ? documentIds : [documentIds];
-      const { data } = await deleteDocument(datasetId!, ids);
+      const datasetId = knowledgeId || id;
+      if (!datasetId) {
+        throw new Error('Dataset ID is required');
+      }
+      const { data } = await deleteDocument(datasetId, ids);

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (009e538) to head (47380c9).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14254   +/-   ##
=======================================
  Coverage   96.66%   96.66%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          667      667           
  Misses          8        8           
  Partials       15       15           

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)

738-743: ⚠️ Potential issue | 🔴 Critical

Deduplication logic is inverted — duplicates fall through unchanged.

When check_duplicate_ids finds duplicates (duplicate_messages non-empty), the code only logs and leaves doc_ids as the original list with duplicates; the else branch is the only place doc_ids = unique_doc_ids happens, which is the case where there were no duplicates to strip anyway. Result: duplicate IDs are passed to FileService.delete_docs, and the second attempt at the same ID will raise "Document not found!" (see api/db/services/file_service.py L582-625), turning a duplicate-input request into a partial-delete + error response.

This also overlaps with the previously raised ownership/preflight issue, where the suggested fix was to reject duplicates outright rather than silently deduplicate.

🛠️ Minimal fix (always use the deduplicated list)
-        # make sure each id is unique
-        unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
-        if duplicate_messages:
-            logging.warning(f"duplicate_messages:{duplicate_messages}")
-        else:
-            doc_ids = unique_doc_ids
+        # make sure each id is unique
+        unique_doc_ids, duplicate_messages = check_duplicate_ids(doc_ids, "document")
+        if duplicate_messages:
+            logging.warning(f"duplicate_messages:{duplicate_messages}")
+        doc_ids = unique_doc_ids
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)

735-736: delete_all materializes all documents in the dataset in memory.

For large datasets, [doc.id for doc in DocumentService.query(kb_id=dataset_id)] loads every document row just to extract IDs, then FileService.delete_docs processes them serially. Consider batching (page through IDs) or adding a dedicated DocumentService.list_ids_by_kb projection to reduce memory/DB load, and/or cap the number of docs deletable per request.
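The batching idea can be sketched as follows. list_id_pages stands in for the hypothetical DocumentService.list_ids_by_kb projection suggested above and is stubbed here so the batching logic itself is runnable; the batch size and cap are arbitrary.

```python
from typing import Callable, Iterable, List, Optional

def delete_all_in_batches(
    list_id_pages: Callable[[str, int], Iterable[List[str]]],
    delete_docs: Callable[[List[str], str], List[str]],
    kb_id: str,
    tenant_id: str,
    batch_size: int = 500,
    max_delete: Optional[int] = None,
):
    """Page through document ids and delete them batch by batch."""
    deleted, errors = 0, []
    for batch in list_id_pages(kb_id, batch_size):
        if max_delete is not None and deleted + len(batch) > max_delete:
            batch = batch[: max_delete - deleted]  # enforce per-request cap
        if not batch:
            break
        errors.extend(delete_docs(batch, tenant_id) or [])
        deleted += len(batch)
        if max_delete is not None and deleted >= max_delete:
            break
    return deleted, errors

# Stub paginator standing in for the proposed id-only projection.
def fake_id_pages(kb_id: str, page_size: int):
    ids = [f"doc{i}" for i in range(1050)]
    for i in range(0, len(ids), page_size):
        yield ids[i : i + page_size]
```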

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c83f74ee-2646-4c20-99f9-104abbbb0c4b

📥 Commits

Reviewing files that changed from the base of the PR and between 3dc56d6 and 47380c9.

📒 Files selected for processing (6)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • web/src/hooks/use-document-request.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (1)
  • api/apps/document_app.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
  • test/testcases/test_web_api/test_common.py
  • web/src/hooks/use-document-request.ts

Comment on lines +746 to +751
        errors = await thread_pool_exec(FileService.delete_docs, doc_ids, tenant_id)

        if errors:
            return get_error_data_result(message=str(errors))

        return get_result(data={"deleted": len(doc_ids)})
⚠️ Potential issue | 🟡 Minor

Partial-failure reporting loses successful deletions.

FileService.delete_docs iterates IDs and concatenates exception messages into a single string without stopping, so on partial failure it has already deleted some docs. Here any non-empty errors causes a generic error response and len(doc_ids) is never returned, so the caller cannot tell how many documents were actually removed. Consider (a) having delete_docs return (deleted_ids, errors) or (b) at minimum, including the attempted/failed counts in the error payload so clients can reconcile state.
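Option (a) could look like this sketch, where delete_one is a hypothetical per-document deletion standing in for FileService internals (not shown in this PR): failures are collected per id while successes keep accumulating, so the caller can report both.

```python
from typing import Callable, List, Tuple

def delete_docs_structured(
    doc_ids: List[str],
    tenant_id: str,
    delete_one: Callable[[str, str], None],
) -> Tuple[List[str], List[str]]:
    """Return (deleted_ids, errors) instead of one concatenated error string."""
    deleted_ids, errors = [], []
    for doc_id in doc_ids:
        try:
            delete_one(doc_id, tenant_id)
            deleted_ids.append(doc_id)
        except Exception as e:  # keep going; record the per-id failure
            errors.append(f"{doc_id}: {e}")
    return deleted_ids, errors
```

The callsite can then return the deleted count even when errors is non-empty, letting clients reconcile partial failures.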


@xugangqiang xugangqiang self-assigned this Apr 22, 2026
@JinHai-CN JinHai-CN merged commit 3d8a82c into infiniflow:main Apr 22, 2026
2 of 3 checks passed