
Refactor: Consolidation WEB API & HTTP API for document list_docs #14176

Merged
yingfeng merged 9 commits into infiniflow:main from xugangqiang:merge-doc-list
Apr 20, 2026

Conversation

@xugangqiang
Contributor

@xugangqiang xugangqiang commented Apr 17, 2026

What problem does this PR solve?

Before consolidation:

  • Web API: POST /v1/document/list
  • HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:

  • RESTful API: GET /api/v1/datasets/<dataset_id>/documents

Type of change

  • Refactoring

@xugangqiang xugangqiang self-assigned this Apr 17, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Adds a RESTful GET documents endpoint at /api/v1/datasets/<dataset_id>/documents, migrates clients and frontend to use dataset_id (was kb_id), replaces numeric run codes with semantic strings (e.g., "DONE"), and updates admin client polling/metadata logic and many tests to the new API and status values.

Changes

Cohort / File(s) Summary
Server: New GET listing & helpers
api/apps/restful_apis/document_api.py
Adds GET /datasets/<dataset_id>/documents with ownership checks, pagination/sort/keyword parsing, suffix/types/run validation and mapping, metadata-derived doc-id filtering helper, and response post-processing (key mapping, thumbnail prefixing, parser metadata → JSON-schema).
Server: Removed legacy POST route
api/apps/document_app.py
Removes legacy POST /list list_docs route and its metadata/filter logic and imports.
Server SDK route removal
api/apps/sdk/doc.py
Removes duplicate SDK list_docs handler for datasets; logic consolidated into the new RESTful handler.
Admin client & polling
admin/client/ragflow_client.py
set_metadata now reads dataset_id from fetched document info; _list_documents uses GET /datasets/<dataset_id>/documents (use_api_base=True); _wait_parse_done treats doc["run"] == "DONE" as parsed completion.
Frontend API & types
web/src/utils/api.ts, web/src/services/knowledge-service.ts, web/src/interfaces/database/document.ts
getDocumentList is now a function (datasetId) => /datasets/${datasetId}/documents; frontend listDocument switches from POST to GET and validates params.id; IDocumentInfo.kb_id → dataset_id.
Frontend usage & enums
web/src/constants/knowledge.ts, web/src/pages/.../use-dataset-table-columns.tsx, web/src/pages/.../use-rename-document.ts
RunningStatus enum values changed from numeric strings to semantic names (UNSTART, RUNNING, CANCEL, DONE, FAIL, SCHEDULE); UI hooks updated to read/write dataset_id.
Service docs / whitespace
api/apps/services/document_api_service.py
Docstring clarified for _process_run_mapping; only comment/whitespace edits.
Tests: migrate and adapt
test/.../test_web_api/*, test/.../test_http_api/*, test/.../test_sdk_api/*
Multiple test updates: switch listing calls to GET and kb_id → dataset_id, update run-status expectations ("3" → "DONE", "2" → "CANCEL"), modify pagination/param expectations, remove legacy unit tests, and add/adjust validation and create-time tests.
Test minor whitespace
test/testcases/test_sdk_api/.../conftest.py
Whitespace-only edit in a fixture.
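The recurring theme in the changes above is the switch from numeric run codes to semantic status strings. A minimal sketch of that translation (the constant name, helper name, and the "5" → "SCHEDULE" entry are illustrative assumptions, not the project's actual definitions):

```python
# Hypothetical mapping between legacy numeric run codes and the
# semantic status strings this PR moves clients and tests onto.
RUN_CODE_TO_STATUS = {
    "0": "UNSTART",
    "1": "RUNNING",
    "2": "CANCEL",
    "3": "DONE",
    "4": "FAIL",
    "5": "SCHEDULE",
}

def to_semantic_run_status(code):
    """Translate a legacy numeric run code to its semantic name.

    Unknown or missing codes fall back to "UNSTART", mirroring the
    default behavior discussed in the review comments.
    """
    return RUN_CODE_TO_STATUS.get(str(code), "UNSTART")
```

A client migrating off the old codes would compare `doc["run"]` against `"DONE"` instead of `"3"`.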

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (frontend / admin)
    participant Gateway as API Gateway / Router
    participant DocAPI as Document API
    participant DocSvc as DocumentService
    participant MetaSvc as DocMetadataService
    participant DB as Database

    Client->>Gateway: GET /api/v1/datasets/{dataset_id}/documents?{filters}
    Gateway->>DocAPI: route to list_docs(dataset_id, params)
    DocAPI->>DocSvc: verify ownership & fetch documents(filters, doc_ids_filter)
    alt metadata condition present
        DocAPI->>MetaSvc: compute allowed doc IDs from metadata
        MetaSvc-->>DocAPI: doc IDs
    end
    DocSvc-->>DocAPI: documents list
    DocAPI->>DB: (optional) apply in-memory time-range filtering
    DocAPI-->>Gateway: JSON response (mapped keys, thumbnails, run statuses)
    Gateway-->>Client: HTTP 200 with documents

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

☯️ refactor, 🐖api, 🧪 test

Suggested reviewers

  • JinHai-CN
  • yuzhichang
  • yingfeng

Poem

🐰 I hopped from POST to GET with cheer,

dataset_id replaced the old kb_id here.
DONE now rings where "3" once sang,
Tests and UI learned a restful twang.
A rabbit munches code and gives a happy cheer.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.19%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check | ❓ Inconclusive | The description addresses the problem and change type but lacks detail on implementation, affected endpoints, and related changes throughout the codebase. | Expand the description to explain which files/components were modified, key implementation details, breaking changes (if any), and how clients should migrate to the new consolidated endpoint.

✅ Passed checks (1 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main objective of consolidating two document-listing endpoints into a single RESTful API.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@xugangqiang xugangqiang added the ci Continue Integration label Apr 17, 2026
@xugangqiang xugangqiang changed the title Consolidation WEB API & HTTP API for document list_docs Refactor: Consolidation WEB API & HTTP API for document list_docs Apr 17, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 17, 2026 02:45
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 17, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/apps/services/document_api_service.py (1)

252-264: ⚠️ Potential issue | 🟠 Major

Missing SCHEDULE mapping causes incorrect run status normalization.

At Line 261, unknown statuses default to "0". Since "5" is a valid task status, scheduled docs are currently downgraded to UNSTART in API responses.

🔧 Proposed fix
 def _process_run_mapping(doc, run_status):
@@
     run_mapping = {
         "0": "UNSTART",
         "1": "RUNNING",
         "2": "CANCEL",
         "3": "DONE",
         "4": "FAIL",
+        "5": "SCHEDULE",
     }
@@
-    if run_status is None or run_status not in run_mapping.keys():
+    if run_status is None:
+        run_status = "0"
+    else:
+        run_status = str(run_status)
+    if run_status not in run_mapping:
         run_status = "0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/services/document_api_service.py` around lines 252 - 264, The run
status normalization is missing the "SCHEDULE" mapping so run_mapping should
include the "5": "SCHEDULE" entry; update run_mapping in document_api_service.py
to add "5": "SCHEDULE" (keeping existing keys), ensure the lookup logic that
checks run_status (the run_status variable and the doc["run"] assignment)
continues to validate against run_mapping keys so scheduled tasks map to
"SCHEDULE" instead of falling back to "0".
test/testcases/test_web_api/test_kb_app/test_kb_pipeline_tasks.py (1)

33-33: ⚠️ Potential issue | 🟠 Major

Polling completion still uses legacy "3" status.

TASK_STATUS_DONE at Line 33 is numeric, but doc["run"] is now semantic ("DONE"). This can keep _wait_for_docs_parsed waiting until timeout.

🔧 Proposed fix
-TASK_STATUS_DONE = "3"
+TASK_STATUS_DONE = "DONE"

Also applies to: 85-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_kb_app/test_kb_pipeline_tasks.py` at line
33, The test uses a legacy numeric status constant TASK_STATUS_DONE ("3") while
the code now sets doc["run"] to semantic strings like "DONE", which causes
_wait_for_docs_parsed to never detect completion; update TASK_STATUS_DONE to
"DONE" (and any other occurrences, e.g., the second usage) or change the
wait/assert logic in _wait_for_docs_parsed to compare against the semantic
string values so the test and function look for "DONE" rather than "3".
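The fix described above amounts to polling for the semantic string rather than the numeric code. A minimal sketch of such a wait loop (the `fetch_docs` callable and the helper name are hypothetical stand-ins, not the repo's `_wait_for_docs_parsed`):

```python
import time

TASK_STATUS_DONE = "DONE"  # semantic status introduced by this PR

def wait_for_docs_parsed(fetch_docs, timeout=30.0, interval=0.5):
    """Poll fetch_docs() until every doc reports run == "DONE".

    fetch_docs is any callable returning a list of document dicts,
    e.g. a thin wrapper around GET /datasets/<dataset_id>/documents.
    Returns True on completion, False if the timeout elapses.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        docs = fetch_docs()
        if docs and all(d.get("run") == TASK_STATUS_DONE for d in docs):
            return True
        time.sleep(interval)
    return False
```

With the old constant `"3"`, the same loop would never observe completion against the new API and would always hit the timeout.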
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@admin/client/ragflow_client.py`:
- Around line 1767-1773: The synchronous parse path breaks because parse_dataset
relies on _wait_parse_done checking doc["run"] == "3" while the new GET
/api/v1/datasets/<dataset_id>/documents endpoint returns semantic states like
"DONE"; update _wait_parse_done (and any callers in parse_dataset) to detect the
new state representation (e.g., check doc["state"] == "DONE" or map the returned
state strings to the old run codes) and return immediately when the semantic
success state is seen, ensuring the synchronous parse no longer waits the full
timeout.
- Around line 1767-1773: The request path in the _list_documents call is
incorrectly double-prefixed with /api/v1; update the path passed to
self.http_client.request (the GET call that currently uses
f"/api/v1/datasets/{dataset_id}/documents") to be relative to the API root (use
f"/datasets/{dataset_id}/documents") so HttpClient.build_url with
use_api_base=True produces the correct URL; verify the change in
ragflow_client.py where self.http_client.request is invoked and ensure other
calls follow the same pattern.

In `@api/apps/restful_apis/document_api.py`:
- Around line 474-491: The current code applies create_time_from/create_time_to
after DocumentService.get_by_kb_id (which already paginates), causing total to
be incorrect and dropping matches on later pages; also query docs state 0 should
disable a bound but the code treats -1 as disabled. Fix by interpreting 0 as “no
bound” (e.g., treat incoming create_time_from/to == 0 as unset) and move the
create-time filtering into the pagination step: either add
create_time_from/create_time_to parameters to DocumentService.get_by_kb_id and
let that method apply the time window before computing total and paging, or (if
you cannot change the service) call DocumentService.get_by_kb_id without paging
to get the full candidate set, apply the time filter (using
create_time_from/create_time_to semantics where 0 = no bound), then compute
total and slice for page/page_size before mapping via map_doc_keys and the
IMG_BASE64_PREFIX/turn2jsonschema thumbnail/metadata adjustments and returning
via get_result.
- Around line 325-550: Add high-signal logging in list_docs to aid diagnosis:
log the incoming request context (dataset_id, tenant_id, relevant query params
like
id/name/page/page_size/orderby/desc/suffix/types/run/create_time_from/create_time_to)
at the start of list_docs; log an auth/access denial when
KnowledgebaseService.accessible(...) fails; log when DocumentService.query(...)
denies document/name ownership; log invalid filter details when checking
VALID_FILE_TYPES and when converting run_status (include invalid_types or
invalid_status values); log when _parse_doc_id_filter_with_metadata(req, kb_id)
returns an early error/empty_result_or_error (include
metadata_condition/metadata summary); and finally log the result summary just
before return (total and number of docs returned from
DocumentService.get_by_kb_id). Use the module’s logger (e.g., logger) and
appropriate levels (info/warn/error) and include the function name list_docs and
service names in messages.
- Around line 498-511: The code currently calls json.loads directly on
req.get("metadata_condition") and req.get("metadata") which raises on malformed
JSON and causes a server error; wrap each json.loads for metadata_condition and
metadata in try/except catching json.JSONDecodeError (or ValueError for
compatibility) and on parse failure return (None,
get_data_error_result(message="metadata_condition must be a valid JSON object.")
, return_empty_metadata) or similarly for metadata with message "metadata must
be a valid JSON object."; preserve the existing empty_metadata handling (check
for "empty_metadata" key on the parsed dict and set return_empty_metadata/clear
metadata accordingly) so flow after parsing remains unchanged.
- Around line 456-460: The code is reading the query param "types" with
q.get("types", []) which returns a scalar string and causes iteration over
characters; change it to q.getlist("types") so "types" becomes a list (like
suffix and run_status), then keep the existing validation logic using
VALID_FILE_TYPES and get_data_error_result (invalid_types = {t for t in types if
t not in VALID_FILE_TYPES}) so repeated params like ?types=pdf&types=docx are
handled correctly and single values aren't split into characters.

In `@test/testcases/test_web_api/test_common.py`:
- Around line 376-382: The helper list_documents builds a URL using kb_id but
doesn't validate it, causing silent requests to "/datasets/None/documents";
update list_documents to explicitly validate and fail fast: check params (and
any alias like "id") to extract kb_id, and if kb_id is missing or falsy raise a
clear exception (e.g., ValueError("kb_id is required for list_documents"))
before composing the URL so callers with params=None or legacy {"id":...} fail
loudly; reference the kb_id local variable and the list_documents function when
making the change.

In `@test/testcases/test_web_api/test_document_app/test_list_documents.py`:
- Around line 184-195: The new tests test_missing_kb_id and
test_unauthorized_dataset conflict with the earlier test_invalid_dataset_id
expectations for list_documents; reconcile them by choosing the canonical
contract and updating the failing tests accordingly: either change
test_missing_kb_id to assert the empty-kb_id behavior matched in
test_invalid_dataset_id (code 101 and message 'Lack of "KB ID"') or update
test_invalid_dataset_id to expect 100/405 for empty kb_id, and similarly align
the invalid/non-existent kb_id cases to use either code 103 or 102 and the
matching message; update the assertions in the tests (test_missing_kb_id,
test_unauthorized_dataset, test_invalid_dataset_id) so they all assert the same
codes and message strings for the same inputs to list_documents.

In `@test/testcases/test_web_api/test_document_app/test_paser_documents.py`:
- Around line 153-156: The polling loops that call list_documents(_auth,
{"kb_id": _kb_id, "page_size": _document_num}) still check for the old numeric
run state (doc["run"] == "3"); update these checks in the polling helpers used
by test_parse_100_files and test_concurrent_parse to compare against the new
string state "DONE" (i.e., change any doc["run"] == "3" to doc["run"] ==
"DONE"), and ensure both occurrences (the block starting at the shown lines and
the similar block at lines 174-177) are updated so the tests wait for the REST
response's "DONE" run state.

In `@web/src/services/knowledge-service.ts`:
- Around line 244-253: listDocument drops other query params from params when
switching to GET: only params.id is used for the path and body is sent as the
query, losing page/page_size/orderby/desc/keywords. Fix listDocument by
extracting params.id for api.getDocumentList(params.id) but merge the remaining
params (all keys except id) with body into the GET query object (e.g., const
query = { ...paramsWithoutId, ...body }) so existing pagination/sort/search
params are preserved when calling request.get; update any typing if needed to
accept merged query.
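Several of the comments above hinge on how repeated query parameters are read. The difference between a scalar read and a list read can be shown with the standard library alone (using urllib.parse here purely for illustration, in place of Flask's request.args):

```python
from urllib.parse import parse_qs

query = "types=pdf&types=docx&run=DONE"
params = parse_qs(query)

# parse_qs always yields lists, much like request.args.getlist():
types = params.get("types", [])        # ["pdf", "docx"]

# Taking only the first value mimics request.args.get("types"),
# which returns a scalar string; iterating it yields characters.
scalar = params.get("types", [""])[0]  # "pdf"
chars = list(scalar)                   # ["p", "d", "f"] - the shape of the bug
```

This is why validating `{t for t in types ...}` against VALID_FILE_TYPES misbehaves when `types` is a scalar string: the set comprehension iterates single characters instead of file-type names.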

---

Outside diff comments:
In `@api/apps/services/document_api_service.py`:
- Around line 252-264: The run status normalization is missing the "SCHEDULE"
mapping so run_mapping should include the "5": "SCHEDULE" entry; update
run_mapping in document_api_service.py to add "5": "SCHEDULE" (keeping existing
keys), ensure the lookup logic that checks run_status (the run_status variable
and the doc["run"] assignment) continues to validate against run_mapping keys so
scheduled tasks map to "SCHEDULE" instead of falling back to "0".

In `@test/testcases/test_web_api/test_kb_app/test_kb_pipeline_tasks.py`:
- Line 33: The test uses a legacy numeric status constant TASK_STATUS_DONE ("3")
while the code now sets doc["run"] to semantic strings like "DONE", which causes
_wait_for_docs_parsed to never detect completion; update TASK_STATUS_DONE to
"DONE" (and any other occurrences, e.g., the second usage) or change the
wait/assert logic in _wait_for_docs_parsed to compare against the semantic
string values so the test and function look for "DONE" rather than "3".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 527d808f-3885-42b0-9052-fd2f2203981e

📥 Commits

Reviewing files that changed from the base of the PR and between f194a09 and 375edba.

📒 Files selected for processing (22)
  • admin/client/ragflow_client.py
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • api/apps/sdk/doc.py
  • api/apps/services/document_api_service.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py
  • test/testcases/test_web_api/conftest.py
  • test/testcases/test_web_api/test_chunk_app/conftest.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/conftest.py
  • test/testcases/test_web_api/test_document_app/test_create_document.py
  • test/testcases/test_web_api/test_document_app/test_list_documents.py
  • test/testcases/test_web_api/test_document_app/test_paser_documents.py
  • test/testcases/test_web_api/test_document_app/test_rm_documents.py
  • test/testcases/test_web_api/test_kb_app/test_kb_pipeline_tasks.py
  • web/src/constants/knowledge.ts
  • web/src/interfaces/database/document.ts
  • web/src/pages/dataset/dataset/use-dataset-table-columns.tsx
  • web/src/pages/dataset/dataset/use-rename-document.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (3)
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • api/apps/document_app.py
  • api/apps/sdk/doc.py

Comment thread admin/client/ragflow_client.py
Comment thread api/apps/restful_apis/document_api.py Outdated
Comment on lines +325 to +550
def list_docs(dataset_id, tenant_id):
    """
    List documents in a dataset.
    ---
    tags:
      - Documents
    security:
      - ApiKeyAuth: []
    parameters:
      - in: path
        name: dataset_id
        type: string
        required: true
        description: ID of the dataset.
      - in: query
        name: id
        type: string
        required: false
        description: Filter by document ID.
      - in: query
        name: page
        type: integer
        required: false
        default: 1
        description: Page number.
      - in: query
        name: page_size
        type: integer
        required: false
        default: 30
        description: Number of items per page.
      - in: query
        name: orderby
        type: string
        required: false
        default: "create_time"
        description: Field to order by.
      - in: query
        name: desc
        type: boolean
        required: false
        default: true
        description: Order in descending.
      - in: query
        name: create_time_from
        type: integer
        required: false
        default: 0
        description: Unix timestamp for filtering documents created after this time. 0 means no filter.
      - in: query
        name: create_time_to
        type: integer
        required: false
        default: 0
        description: Unix timestamp for filtering documents created before this time. 0 means no filter.
      - in: query
        name: suffix
        type: array
        items:
          type: string
        required: false
        description: Filter by file suffix (e.g., ["pdf", "txt", "docx"]).
      - in: query
        name: run
        type: array
        items:
          type: string
        required: false
        description: Filter by document run status. Supports both numeric ("0", "1", "2", "3", "4") and text formats ("UNSTART", "RUNNING", "CANCEL", "DONE", "FAIL").
      - in: header
        name: Authorization
        type: string
        required: true
        description: Bearer token for authentication.
    responses:
      200:
        description: List of documents.
        schema:
          type: object
          properties:
            total:
              type: integer
              description: Total number of documents.
            docs:
              type: array
              items:
                type: object
                properties:
                  id:
                    type: string
                    description: Document ID.
                  name:
                    type: string
                    description: Document name.
                  chunk_count:
                    type: integer
                    description: Number of chunks.
                  token_count:
                    type: integer
                    description: Number of tokens.
                  dataset_id:
                    type: string
                    description: ID of the dataset.
                  chunk_method:
                    type: string
                    description: Chunking method used.
                  run:
                    type: string
                    description: Processing status.
    """
    if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
        return get_error_data_result(message=f"You don't own the dataset {dataset_id}. ")

    q = request.args
    document_id = q.get("id")
    name = q.get("name")

    if document_id and not DocumentService.query(id=document_id, kb_id=dataset_id):
        return get_error_data_result(message=f"You don't own the document {document_id}.")
    if name and not DocumentService.query(name=name, kb_id=dataset_id):
        return get_error_data_result(message=f"You don't own the document {name}.")

    page = int(q.get("page", 1))
    page_size = int(q.get("page_size", 30))
    orderby = q.get("orderby", "create_time")
    desc = str(q.get("desc", "true")).strip().lower() != "false"
    keywords = q.get("keywords", "")

    # filters - align with OpenAPI parameter names
    suffix = q.getlist("suffix")

    types = q.get("types", [])
    if types:
        invalid_types = {t for t in types if t not in VALID_FILE_TYPES}
        if invalid_types:
            return get_data_error_result(message=f"Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}")

    # map run status (text or numeric) - align with API parameter
    run_status = q.getlist("run")
    run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
    run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
    if run_status_converted:
        invalid_status = {s for s in run_status_converted if s not in run_status_text_to_numeric.values()}
        if invalid_status:
            return get_data_error_result(message=f"Invalid filter run status conditions: {', '.join(invalid_status)}")

    doc_ids_filter, empty_result_or_error, return_empty_metadata = _parse_doc_id_filter_with_metadata(q, dataset_id)
    if empty_result_or_error is not None:
        return empty_result_or_error
    docs, total = DocumentService.get_by_kb_id(dataset_id, page, page_size, orderby, desc, keywords, run_status_converted, types, suffix, doc_ids_filter, return_empty_metadata=return_empty_metadata)

    # time range filter (0 means no bound)
    create_time_from = int(q.get("create_time_from", -1))
    create_time_to = int(q.get("create_time_to", -1))
    if create_time_from >= 0 or create_time_to >= 0:
        docs = [d for d in docs if (create_time_from == -1 or d.get("create_time", 0) >= create_time_from) and (create_time_to == -1 or d.get("create_time", 0) <= create_time_to)]

    renamed_doc_list = [map_doc_keys(doc) for doc in docs]
    for doc_item in renamed_doc_list:
        if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
            doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
        if doc_item.get("source_type"):
            doc_item["source_type"] = doc_item["source_type"].split("/")[0]
        if doc_item["parser_config"].get("metadata"):
            doc_item["parser_config"]["metadata"] = turn2jsonschema(doc_item["parser_config"]["metadata"])

    return get_result(data={"total": total, "docs": renamed_doc_list})


def _parse_doc_id_filter_with_metadata(req, kb_id):
    return_empty_metadata = req.get("return_empty_metadata", False)
    if isinstance(return_empty_metadata, str):
        return_empty_metadata = return_empty_metadata.lower() == "true"

    import json
    metadata_condition = json.loads(req.get("metadata_condition", "{}"))
    metadata = json.loads(req.get("metadata", "{}"))
    if isinstance(metadata, dict) and metadata.get("empty_metadata"):
        return_empty_metadata = True
        metadata = {k: v for k, v in metadata.items() if k != "empty_metadata"}
    if return_empty_metadata:
        metadata_condition = {}
        metadata = {}
    else:
        if metadata_condition and not isinstance(metadata_condition, dict):
            return None, get_data_error_result(message="metadata_condition must be an object."), return_empty_metadata
        if metadata and not isinstance(metadata, dict):
            return None, get_data_error_result(message="metadata must be an object."), return_empty_metadata

    doc_ids_filter = None
    metas = None
    if metadata_condition or metadata:
        metas = DocMetadataService.get_flatted_meta_by_kbs([kb_id])

    if metadata_condition:
        doc_ids_filter = set(meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and")))
        if metadata_condition.get("conditions") and not doc_ids_filter:
            return None, get_json_result(data={"total": 0, "docs": []}), return_empty_metadata

    if metadata:
        metadata_doc_ids = None
        for key, values in metadata.items():
            if not values:
                continue
            if not isinstance(values, list):
                values = [values]
            values = [str(v) for v in values if v is not None and str(v).strip()]
            if not values:
                continue
            key_doc_ids = set()
            for value in values:
                key_doc_ids.update(metas.get(key, {}).get(value, []))
            if metadata_doc_ids is None:
                metadata_doc_ids = key_doc_ids
            else:
                metadata_doc_ids &= key_doc_ids
            if not metadata_doc_ids:
                return None, get_json_result(data={"total": 0, "docs": []}), return_empty_metadata
        if metadata_doc_ids is not None:
            if doc_ids_filter is None:
                doc_ids_filter = metadata_doc_ids
            else:
                doc_ids_filter &= metadata_doc_ids
            if not doc_ids_filter:
                return None, get_json_result(data={"total": 0, "docs": []}), return_empty_metadata

    return list(doc_ids_filter) if doc_ids_filter is not None else list(), None, return_empty_metadata
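The doc-id filtering in _parse_doc_id_filter_with_metadata boils down to intersecting per-key candidate sets. A reduced sketch of just that intersection logic (the flattened-metadata shape {key: {value: [doc_ids]}} follows the code above; the function name and everything else is a simplified illustration):

```python
def doc_ids_matching_metadata(metas, metadata):
    """Intersect candidate doc-id sets across metadata keys.

    metas maps key -> value -> list of doc ids (the flattened shape
    used above); metadata maps key -> value or list of values.
    Within a key, values are OR'ed; across keys, results are AND'ed.
    Returns a set of doc ids, or None when metadata imposes no
    constraint at all.
    """
    result = None
    for key, values in metadata.items():
        if not isinstance(values, list):
            values = [values]
        values = [str(v) for v in values if v is not None and str(v).strip()]
        if not values:
            continue
        key_ids = set()
        for value in values:
            key_ids.update(metas.get(key, {}).get(value, []))
        result = key_ids if result is None else result & key_ids
        if not result:
            return set()  # early exit: no doc can satisfy every key
    return result
```

The empty-set early exit mirrors the handler's short-circuit that returns `{"total": 0, "docs": []}` without querying the database.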
Contributor


🛠️ Refactor suggestion | 🟠 Major

Add logging around this new list-documents flow.

This endpoint adds multiple validation and early-return branches, but none of them emit logs. Please add high-signal request/error logging here so filter failures and authorization issues are diagnosable in production.

As per coding guidelines, "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 325 - 550, Add
high-signal logging in list_docs to aid diagnosis: log the incoming request
context (dataset_id, tenant_id, relevant query params like
id/name/page/page_size/orderby/desc/suffix/types/run/create_time_from/create_time_to)
at the start of list_docs; log an auth/access denial when
KnowledgebaseService.accessible(...) fails; log when DocumentService.query(...)
denies document/name ownership; log invalid filter details when checking
VALID_FILE_TYPES and when converting run_status (include invalid_types or
invalid_status values); log when _parse_doc_id_filter_with_metadata(req, kb_id)
returns an early error/empty_result_or_error (include
metadata_condition/metadata summary); and finally log the result summary just
before return (total and number of docs returned from
DocumentService.get_by_kb_id). Use the module’s logger (e.g., logger) and
appropriate levels (info/warn/error) and include the function name list_docs and
service names in messages.
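A sketch of the kind of high-signal logging this comment asks for (the logger name, message wording, and the stand-in `accessible` callable are illustrative assumptions, not the project's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def list_docs_logged(dataset_id, tenant_id, accessible, params):
    """Illustrative wrapper showing where list_docs could emit logs.

    accessible is a callable standing in for
    KnowledgebaseService.accessible(); params is the parsed query dict.
    """
    # Log incoming request context at the start of the flow.
    logger.info("list_docs: dataset_id=%s tenant_id=%s params=%s",
                dataset_id, tenant_id, params)
    if not accessible(dataset_id, tenant_id):
        # Log the denial branch so auth failures are diagnosable.
        logger.warning("list_docs: access denied dataset_id=%s tenant_id=%s",
                       dataset_id, tenant_id)
        return {"code": 403, "message": "access denied"}
    logger.info("list_docs: access granted dataset_id=%s", dataset_id)
    return {"code": 0}
```

The same pattern extends to the other early returns (invalid types, invalid run status, metadata parse failures) and to a summary log of `total` just before the success response.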

Comment thread api/apps/restful_apis/document_api.py Outdated
Comment thread api/apps/restful_apis/document_api.py Outdated
Comment on lines +474 to +491
    docs, total = DocumentService.get_by_kb_id(dataset_id, page, page_size, orderby, desc, keywords, run_status_converted, types, suffix, doc_ids_filter, return_empty_metadata=return_empty_metadata)

    # time range filter (0 means no bound)
    create_time_from = int(q.get("create_time_from", -1))
    create_time_to = int(q.get("create_time_to", -1))
    if create_time_from >= 0 or create_time_to >= 0:
        docs = [d for d in docs if (create_time_from == -1 or d.get("create_time", 0) >= create_time_from) and (create_time_to == -1 or d.get("create_time", 0) <= create_time_to)]

    renamed_doc_list = [map_doc_keys(doc) for doc in docs]
    for doc_item in renamed_doc_list:
        if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
            doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
        if doc_item.get("source_type"):
            doc_item["source_type"] = doc_item["source_type"].split("/")[0]
        if doc_item["parser_config"].get("metadata"):
            doc_item["parser_config"]["metadata"] = turn2jsonschema(doc_item["parser_config"]["metadata"])

    return get_result(data={"total": total, "docs": renamed_doc_list})
Contributor


⚠️ Potential issue | 🟠 Major

Apply the create-time window before pagination and total calculation.

DocumentService.get_by_kb_id(...) pages first, then this block filters in memory. That makes data.total report the unfiltered count and can drop matching documents that live on later pages. It also treats create_time_to=0 as “before epoch”, even though the route docs say 0 disables the bound.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 474 - 491, The current
code applies create_time_from/create_time_to after DocumentService.get_by_kb_id
(which already paginates), causing total to be incorrect and dropping matches on
later pages; also query docs state 0 should disable a bound but the code treats
-1 as disabled. Fix by interpreting 0 as “no bound” (e.g., treat incoming
create_time_from/to == 0 as unset) and move the create-time filtering into the
pagination step: either add create_time_from/create_time_to parameters to
DocumentService.get_by_kb_id and let that method apply the time window before
computing total and paging, or (if you cannot change the service) call
DocumentService.get_by_kb_id without paging to get the full candidate set, apply
the time filter (using create_time_from/create_time_to semantics where 0 = no
bound), then compute total and slice for page/page_size before mapping via
map_doc_keys and the IMG_BASE64_PREFIX/turn2jsonschema thumbnail/metadata
adjustments and returning via get_result.
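The intended order of operations, filter by the create-time window first (with 0 meaning no bound), then compute total and slice the page, can be sketched independently of DocumentService (the helper name and dict shape are illustrative assumptions):

```python
def filter_and_paginate(docs, page, page_size, create_time_from=0, create_time_to=0):
    """Apply a create-time window before pagination.

    A value of 0 for either bound disables it, matching the route's
    documented semantics. docs is a list of dicts carrying a
    "create_time" key. Returns (total_after_filter, page_of_docs).
    """
    if create_time_from or create_time_to:
        docs = [
            d for d in docs
            if (not create_time_from or d.get("create_time", 0) >= create_time_from)
            and (not create_time_to or d.get("create_time", 0) <= create_time_to)
        ]
    total = len(docs)  # total reflects the filtered set, not the raw count
    start = (page - 1) * page_size
    return total, docs[start:start + page_size]
```

Filtering after pagination, as the reviewed code does, would instead report the unfiltered total and silently drop matches that land on later pages.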

Comment thread api/apps/restful_apis/document_api.py Outdated
Comment on lines 376 to 382
 def list_documents(auth, params=None, payload=None, *, headers=HEADERS, data=None):
     kb_id = params.get("kb_id") if params else None
     url = f"{HOST_ADDRESS}/api/{VERSION}/datasets/{kb_id}/documents"
     if payload is None:
         payload = {}
-    res = requests.post(url=f"{HOST_ADDRESS}{DOCUMENT_APP_URL}/list", headers=headers, auth=auth, params=params, json=payload, data=data)
+    res = requests.get(url=url, headers=headers, auth=auth, params=params, json=payload, data=data)
     return res.json()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e

# Find all test call sites that still need inspection after the helper contract change.
rg -n -C2 --glob 'test/**/*.py' 'list_documents\s*\('
rg -n -C2 --glob 'test/**/*.py' '"id"\s*:'
rg -n -C2 --glob 'test/**/*.py' '"kb_id"\s*:'

Repository: infiniflow/ragflow

Length of output: 50375


🏁 Script executed:

# Check the current implementation to see if logging should be added
grep -A 10 "def list_documents" test/testcases/test_web_api/test_common.py | head -15

Repository: infiniflow/ragflow

Length of output: 628


Add validation to fail fast when kb_id is missing.

The helper always bakes kb_id into the URL path. With params=None or unconverted callers still sending {"id": ...} instead of {"kb_id": ...}, this silently creates /datasets/None/documents, obscuring the real migration bug.

Example of active legacy call site: test/testcases/test_web_api/test_document_app/test_list_documents.py:35 uses {"id": "dataset_id"} which will break.

Suggested fix
 def list_documents(auth, params=None, payload=None, *, headers=HEADERS, data=None):
     kb_id = params.get("kb_id") if params else None
+    if not kb_id:
+        raise ValueError("params['kb_id'] is required for list_documents")
     url = f"{HOST_ADDRESS}/api/{VERSION}/datasets/{kb_id}/documents"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_common.py` around lines 376 - 382, The
helper list_documents builds a URL using kb_id but doesn't validate it, causing
silent requests to "/datasets/None/documents"; update list_documents to
explicitly validate and fail fast: check params (and any alias like "id") to
extract kb_id, and if kb_id is missing or falsy raise a clear exception (e.g.,
ValueError("kb_id is required for list_documents")) before composing the URL so
callers with params=None or legacy {"id":...} fail loudly; reference the kb_id
local variable and the list_documents function when making the change.

Comment on lines +244 to +253
+// Using RESTful API: GET /api/v1/datasets/{dataset_id}/documents
 export const listDocument = (
   params?: IFetchKnowledgeListRequestParams,
   body?: IFetchDocumentListRequestBody,
-) => request.post(api.getDocumentList, { data: body || {}, params });
+) => {
+  if (!params || !params.id) {
+    throw new Error('params and params.id are required');
+  }
+  return request.get(api.getDocumentList(params.id), { params: body });
+};

⚠️ Potential issue | 🟠 Major

Preserve the non-id query params when moving to GET.

This only uses params.id for the path and then sends body as the query string. Any existing page, page_size, orderby, desc, keywords, etc. carried in params are now silently dropped, so the document table can fetch the wrong slice.

Suggested fix
 export const listDocument = (
   params?: IFetchKnowledgeListRequestParams,
   body?: IFetchDocumentListRequestBody,
 ) => {
   if (!params || !params.id) {
     throw new Error('params and params.id are required');
   }
-  return request.get(api.getDocumentList(params.id), { params: body });
+  const { id: datasetId, ...queryParams } = params;
+  return request.get(api.getDocumentList(datasetId), {
+    params: { ...queryParams, ...body },
+  });
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/services/knowledge-service.ts` around lines 244 - 253, listDocument
drops other query params from params when switching to GET: only params.id is
used for the path and body is sent as the query, losing
page/page_size/orderby/desc/keywords. Fix listDocument by extracting params.id
for api.getDocumentList(params.id) but merge the remaining params (all keys
except id) with body into the GET query object (e.g., const query = {
...paramsWithoutId, ...body }) so existing pagination/sort/search params are
preserved when calling request.get; update any typing if needed to accept merged
query.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/testcases/test_sdk_api/test_file_management_within_dataset/test_list_documents.py (2)

21-177: ⚠️ Potential issue | 🟠 Major

Please restore id/name filter test coverage for list_documents.

Coverage for filtering by id, name, and combined filters appears removed, but those parameters are still supported by both SDK and backend handlers. This leaves a meaningful regression gap for a core query path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_sdk_api/test_file_management_within_dataset/test_list_documents.py`
around lines 21 - 177, The test suite removed coverage for filtering by document
id and name in TestDocumentsList—restore tests that exercise
dataset.list_documents with the id, name, and combined id+name filters
(including cases: exact id, exact name, no-match id/name) to ensure SDK/backend
handlers are exercised; add new parametrized test methods (or extend existing
test_keywords/test_page variants) that call dataset.list_documents(id=<value>),
dataset.list_documents(name=<value>), and dataset.list_documents(id=<value>,
name=<value>) and assert the returned list length and contents (use existing
add_documents fixture to provide known ids/names and assert expected results and
expected empty results for unknown values).
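The filtering contract those restored tests would pin down can be sketched with an in-memory stand-in. The real assertions would go through dataset.list_documents against the add_documents fixture; the helper and data below are illustrative only:

```python
def match_docs(docs, doc_id=None, name=None):
    # Mirrors the expected backend semantics: exact id match, exact name
    # match, and both filters ANDed together when combined.
    result = docs
    if doc_id is not None:
        result = [d for d in result if d["id"] == doc_id]
    if name is not None:
        result = [d for d in result if d["name"] == name]
    return result

# Parametrize-style case table: (filter kwargs, expected matching ids).
CASES = [
    ({"doc_id": "d1"}, ["d1"]),                   # exact id
    ({"name": "b.txt"}, ["d2"]),                  # exact name
    ({"doc_id": "d1", "name": "a.txt"}, ["d1"]),  # combined, consistent
    ({"doc_id": "d1", "name": "b.txt"}, []),      # combined, contradictory
    ({"doc_id": "missing"}, []),                  # no-match id
]
```

Each row would become one parametrized call to dataset.list_documents with an assertion on the returned ids.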

62-83: ⚠️ Potential issue | 🟡 Minor

Please keep explicit boundary coverage for page_size=0 (and similarly page=0).

The new valid case on Line 66 is fine, but replacing zero-input scenarios removes a useful contract test for invalid boundaries (ge=1). Keeping explicit 0 cases helps prevent silent regression in validation behavior.

Suggested patch
@@
         [
             ({"page": None, "page_size": 2}, 2, "not instance of"),
+            ({"page": 0, "page_size": 2}, 0, "greater than or equal to 1"),
             ({"page": 1, "page_size": 2}, 2, ""),
             ({"page": 2, "page_size": 2}, 2, ""),
@@
         [
             ({"page_size": None}, 5, "not instance of"),
+            ({"page_size": 0}, 0, "greater than or equal to 1"),
             ({"page_size": 2}, 2, ""),
             ({"page_size": 1}, 1, ""),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_sdk_api/test_file_management_within_dataset/test_list_documents.py`
around lines 62 - 83, The test removed explicit boundary checks for zero;
restore explicit invalid-zero cases in the pytest parametrize table in
test_list_documents.py so we assert validation for page_size=0 (and similarly
add/keep a test for page=0) rather than relying on nearby cases: add a param
entry with {"page_size": 0} expecting the error/message used for zero bounds
(e.g., "Invalid page size") and ensure any test that covers "page" also includes
{"page": 0} expecting the same invalid-bound behavior; locate the parametrize
block that defines "params, expected_page_size, expected_message" and add these
explicit zero-bound params so the ge=1 contract is preserved.
♻️ Duplicate comments (1)
test/testcases/test_web_api/test_document_app/test_paser_documents.py (1)

174-177: ⚠️ Potential issue | 🔴 Critical

Test will timeout: run status still checks for "3" instead of "DONE".

Line 176 still uses the old numeric run state check. The REST response now returns "DONE" as a string, so this condition will never be true and test_concurrent_parse will timeout.

🐛 Proposed fix
     def condition(_auth, _kb_id, _document_num):
         res = list_documents(_auth, {"kb_id": _kb_id, "page_size": _document_num})
         for doc in res["data"]["docs"]:
-            if doc["run"] != "3":
+            if doc["run"] != "DONE":
                 return False
         return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_paser_documents.py` around
lines 174 - 177, The test is checking the old numeric run state ("3") so it
never matches the new REST response which uses the string "DONE"; update the
condition in the loop that inspects list_documents(_auth, {"kb_id": _kb_id,
"page_size": _document_num}) — specifically replace the check doc["run"] != "3"
with a check for doc["run"] != "DONE" (in the test function
test_concurrent_parse / the loop over res["data"]["docs"]) so the test correctly
detects completed runs.
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (2)

453-453: Clean up debug artifacts in log messages.

The dashes ("------------") appear to be debug artifacts. Consider cleaning up log messages for production quality.

♻️ Example cleanup
-            logging.error(msg=f"------------Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}")
+            logging.warning(f"list_docs: Invalid type filter conditions: {', '.join(invalid_types)}")

Similarly for lines 463, 468, 498, and 503.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` at line 453, Remove the debug dashes
from the log messages and make them clean, human-readable strings: replace
logging.error(msg=f"------------Invalid filter conditions: {',
'.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}") with a
concise message such as logging.error(msg=f"Invalid filter conditions: {',
'.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}"), and
similarly remove the leading "------------" from the other logging calls in the
same function/method that reference invalid_types or related variables (the
logging.error and logging.warning calls shown at the other commented lines).
Ensure the message text remains informative but without debug-artifact prefixes.

326-326: Consider making list_docs an async function for consistency.

Other endpoints in this file (update_document, upload_document, metadata_summary) use async def. While Quart can handle sync handlers, using async def here would be consistent with the codebase pattern.

♻️ Suggested change
-def list_docs(dataset_id, tenant_id):
+async def list_docs(dataset_id, tenant_id):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` at line 326, The handler function
list_docs is currently synchronous while other endpoints use async; change its
signature to async def list_docs(...) and update its internal calls to await any
coroutine functions it calls (e.g., calls to DB, I/O, or helper functions),
ensuring any helpers used by list_docs are awaited or made async if needed, and
adjust any imports or references that expect a coroutine so the Quart routing
remains consistent with the other async endpoints.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Line 449: The call to q.getlist("types", []) is using an invalid second
argument (a default list) — remove the second parameter so it becomes
q.getlist("types"); update the assignment at the location where types is set
(variable name: types, call: q.getlist) to rely on getlist's built-in empty-list
behavior when the key is missing.
- Line 440: Remove the stray debug print statement print(f"page:{page}, page
size:{page_size}") in document_api.py; either delete it or replace it with a
proper logger debug call (e.g., logger.debug(...) or
current_app.logger.debug(...)) using the existing logging mechanism so page and
page_size are logged instead of printed to stdout.

---

Outside diff comments:
In
`@test/testcases/test_sdk_api/test_file_management_within_dataset/test_list_documents.py`:
- Around line 21-177: The test suite removed coverage for filtering by document
id and name in TestDocumentsList—restore tests that exercise
dataset.list_documents with the id, name, and combined id+name filters
(including cases: exact id, exact name, no-match id/name) to ensure SDK/backend
handlers are exercised; add new parametrized test methods (or extend existing
test_keywords/test_page variants) that call dataset.list_documents(id=<value>),
dataset.list_documents(name=<value>), and dataset.list_documents(id=<value>,
name=<value>) and assert the returned list length and contents (use existing
add_documents fixture to provide known ids/names and assert expected results and
expected empty results for unknown values).
- Around line 62-83: The test removed explicit boundary checks for zero; restore
explicit invalid-zero cases in the pytest parametrize table in
test_list_documents.py so we assert validation for page_size=0 (and similarly
add/keep a test for page=0) rather than relying on nearby cases: add a param
entry with {"page_size": 0} expecting the error/message used for zero bounds
(e.g., "Invalid page size") and ensure any test that covers "page" also includes
{"page": 0} expecting the same invalid-bound behavior; locate the parametrize
block that defines "params, expected_page_size, expected_message" and add these
explicit zero-bound params so the ge=1 contract is preserved.

---

Duplicate comments:
In `@test/testcases/test_web_api/test_document_app/test_paser_documents.py`:
- Around line 174-177: The test is checking the old numeric run state ("3") so
it never matches the new REST response which uses the string "DONE"; update the
condition in the loop that inspects list_documents(_auth, {"kb_id": _kb_id,
"page_size": _document_num}) — specifically replace the check doc["run"] != "3"
with a check for doc["run"] != "DONE" (in the test function
test_concurrent_parse / the loop over res["data"]["docs"]) so the test correctly
detects completed runs.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Line 453: Remove the debug dashes from the log messages and make them clean,
human-readable strings: replace logging.error(msg=f"------------Invalid filter
conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else
''}") with a concise message such as logging.error(msg=f"Invalid filter
conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else
''}"), and similarly remove the leading "------------" from the other logging
calls in the same function/method that reference invalid_types or related
variables (the logging.error and logging.warning calls shown at the other
commented lines). Ensure the message text remains informative but without
debug-artifact prefixes.
- Line 326: The handler function list_docs is currently synchronous while other
endpoints use async; change its signature to async def list_docs(...) and update
its internal calls to await any coroutine functions it calls (e.g., calls to DB,
I/O, or helper functions), ensuring any helpers used by list_docs are awaited or
made async if needed, and adjust any imports or references that expect a
coroutine so the Quart routing remains consistent with the other async
endpoints.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 91bc4448-fcfa-413f-a885-37a36d5e9f94

📥 Commits

Reviewing files that changed from the base of the PR and between 375edba and 93454f3.

📒 Files selected for processing (8)
  • admin/client/ragflow_client.py
  • api/apps/restful_apis/document_api.py
  • api/apps/sdk/doc.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py
  • test/testcases/test_sdk_api/test_file_management_within_dataset/conftest.py
  • test/testcases/test_sdk_api/test_file_management_within_dataset/test_list_documents.py
  • test/testcases/test_web_api/test_document_app/test_paser_documents.py
  • web/src/utils/api.ts
💤 Files with no reviewable changes (1)
  • api/apps/sdk/doc.py
✅ Files skipped from review due to trivial changes (1)
  • test/testcases/test_sdk_api/test_file_management_within_dataset/conftest.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • web/src/utils/api.ts
  • admin/client/ragflow_client.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (2)

453-454: Clean up log message formatting.

The log messages contain debug-style prefix dashes (e.g., "------------Invalid filter...") which appear to be leftover from development. Consider using cleaner, structured log messages for production.

♻️ Suggested cleanup
-            logging.error(msg=f"------------Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}")
+            logging.error(f"list_docs: Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}")

Apply similar cleanup to other log messages at lines 463, 468, 535, and 540.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 453 - 454, Replace the
debug-style prefixed log messages (e.g., the logging.error call that currently
logs "------------Invalid filter conditions: ...") with clean, structured
messages: remove the leading dashes and use a clear message string such as
"Invalid filter conditions: {', '.join(invalid_types)}" (maintain the
pluralization logic already used). Update the corresponding logging.error calls
that mirror this pattern (including the one paired with the return
get_data_error_result(...) and the similar messages at the other occurrences
noted) so they all log the same clean, consistent message while keeping the
existing invalid_types variable and pluralization logic and leaving the
get_data_error_result(...) return content unchanged.

323-326: Inconsistent async definition with other endpoints.

This handler is defined synchronously (def list_docs) while other endpoints in this file (update_document, upload_document) use async def. For consistency and to follow Quart best practices, consider making this async.

♻️ Suggested change
 @manager.route("/datasets/<dataset_id>/documents", methods=["GET"])  # noqa: F821
 @login_required
 @add_tenant_id_to_kwargs
-def list_docs(dataset_id, tenant_id):
+async def list_docs(dataset_id, tenant_id):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 323 - 326, The handler
list_docs is defined synchronously while other handlers like update_document and
upload_document are async; change def list_docs to async def list_docs and
ensure any internal calls that await (e.g., database or I/O calls) are awaited
accordingly, keeping existing decorators (`@manager.route`, `@login_required`,
`@add_tenant_id_to_kwargs`) intact and updating any callers/uses if they expect a
coroutine; verify return values remain compatible with Quart async handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 474-475: The integer conversions for query params
create_time_from/create_time_to can raise ValueError; wrap the
int(q.get("create_time_from", 0)) and int(q.get("create_time_to", 0))
conversions in a validation block (e.g., try/except) similar to the pagination
handling, catch ValueError and return a clear 400 Bad Request (or raise the
framework's BadRequest) with a message indicating invalid time range parameters,
or fall back to safe defaults after logging; update the code around the
create_time_from and create_time_to assignments to perform this validation and
error response.
- Around line 480-482: The loop over renamed_doc_list currently accesses
doc_item["thumbnail"] directly which can raise KeyError if map_doc_keys omitted
that key; change to use doc_item.get("thumbnail") for the check (e.g., thumb =
doc_item.get("thumbnail")) and only rewrite doc_item["thumbnail"] when thumb is
truthy and not starting with IMG_BASE64_PREFIX, using dataset_id to construct
the URL; leave the key absent if get() returns None so you don't introduce
incorrect keys.
- Around line 437-438: Parsing page and page_size using int(q.get(...)) can
raise ValueError for non-integer query values and cause a 500; wrap the
conversions for page and page_size (the int(q.get("page", 1)) and
int(q.get("page_size", 30")) calls) in a validation block that catches
ValueError (and optionally TypeError), and return a clear 400-level response
(e.g., raise an HTTPException/return a validation error) with a message
indicating which parameter is invalid and the expected integer type; ensure you
still enforce sensible defaults and bounds after successful parsing.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 453-454: Replace the debug-style prefixed log messages (e.g., the
logging.error call that currently logs "------------Invalid filter conditions:
...") with clean, structured messages: remove the leading dashes and use a clear
message string such as "Invalid filter conditions: {', '.join(invalid_types)}"
(maintain the pluralization logic already used). Update the corresponding
logging.error calls that mirror this pattern (including the one paired with the
return get_data_error_result(...) and the similar messages at the other
occurrences noted) so they all log the same clean, consistent message while
keeping the existing invalid_types variable and pluralization logic and leaving
the get_data_error_result(...) return content unchanged.
- Around line 323-326: The handler list_docs is defined synchronously while
other handlers like update_document and upload_document are async; change def
list_docs to async def list_docs and ensure any internal calls that await (e.g.,
database or I/O calls) are awaited accordingly, keeping existing decorators
(`@manager.route`, `@login_required`, `@add_tenant_id_to_kwargs`) intact and updating
any callers/uses if they expect a coroutine; verify return values remain
compatible with Quart async handlers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6a844ca4-3429-401d-9384-44d57f661be5

📥 Commits

Reviewing files that changed from the base of the PR and between 93454f3 and dd56609.

📒 Files selected for processing (2)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_document_app/test_list_documents.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/testcases/test_web_api/test_document_app/test_list_documents.py

Comment on lines +437 to +438
page = int(q.get("page", 1))
page_size = int(q.get("page_size", 30))

⚠️ Potential issue | 🟡 Minor

Add error handling for invalid integer query parameters.

int(q.get("page", 1)) and int(q.get("page_size", 30)) will raise ValueError if the client provides non-integer values (e.g., ?page=abc). This would result in an unhandled 500 error instead of a user-friendly validation message.

🛡️ Proposed fix
     q = request.args

-    page = int(q.get("page", 1))
-    page_size = int(q.get("page_size", 30))
+    try:
+        page = int(q.get("page", 1))
+        page_size = int(q.get("page_size", 30))
+    except ValueError:
+        return get_data_error_result(message="page and page_size must be integers.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    page = int(q.get("page", 1))
-    page_size = int(q.get("page_size", 30))
+    try:
+        page = int(q.get("page", 1))
+        page_size = int(q.get("page_size", 30))
+    except ValueError:
+        return get_data_error_result(message="page and page_size must be integers.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 437 - 438, Parsing page
and page_size using int(q.get(...)) can raise ValueError for non-integer query
values and cause a 500; wrap the conversions for page and page_size (the
int(q.get("page", 1)) and int(q.get("page_size", 30)) calls) in a validation
block that catches ValueError (and optionally TypeError), and return a clear
400-level response (e.g., raise an HTTPException/return a validation error) with
a message indicating which parameter is invalid and the expected integer type;
ensure you still enforce sensible defaults and bounds after successful parsing.

Comment on lines +474 to +475
create_time_from = int(q.get("create_time_from", 0))
create_time_to = int(q.get("create_time_to", 0))

⚠️ Potential issue | 🟡 Minor

Add error handling for time range parameters.

Similar to pagination, int(q.get("create_time_from", 0)) and int(q.get("create_time_to", 0)) will raise ValueError if non-integer values are provided.

🛡️ Proposed fix
     # time range filter (0 means no bound)
-    create_time_from = int(q.get("create_time_from", 0))
-    create_time_to = int(q.get("create_time_to", 0))
+    try:
+        create_time_from = int(q.get("create_time_from", 0))
+        create_time_to = int(q.get("create_time_to", 0))
+    except ValueError:
+        return get_data_error_result(message="create_time_from and create_time_to must be integers.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 474 - 475, The integer
conversions for query params create_time_from/create_time_to can raise
ValueError; wrap the int(q.get("create_time_from", 0)) and
int(q.get("create_time_to", 0)) conversions in a validation block (e.g.,
try/except) similar to the pagination handling, catch ValueError and return a
clear 400 Bad Request (or raise the framework's BadRequest) with a message
indicating invalid time range parameters, or fall back to safe defaults after
logging; update the code around the create_time_from and create_time_to
assignments to perform this validation and error response.

Comment on lines +480 to +482
for doc_item in renamed_doc_list:
if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"

⚠️ Potential issue | 🟡 Minor

Use .get() to safely access thumbnail key.

If map_doc_keys returns a document without the thumbnail key, doc_item["thumbnail"] will raise a KeyError. Use .get() for defensive access.

🛡️ Proposed fix
     renamed_doc_list = [map_doc_keys(doc) for doc in docs]
     for doc_item in renamed_doc_list:
-        if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
+        thumbnail = doc_item.get("thumbnail")
+        if thumbnail and not thumbnail.startswith(IMG_BASE64_PREFIX):
             doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 480 - 482, The loop over
renamed_doc_list currently accesses doc_item["thumbnail"] directly which can
raise KeyError if map_doc_keys omitted that key; change to use
doc_item.get("thumbnail") for the check (e.g., thumb =
doc_item.get("thumbnail")) and only rewrite doc_item["thumbnail"] when thumb is
truthy and not starting with IMG_BASE64_PREFIX, using dataset_id to construct
the URL; leave the key absent if get() returns None so you don't introduce
incorrect keys.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/testcases/test_web_api/test_document_app/test_list_documents.py (1)

202-218: Make the create-time test cover pagination semantics too.

With only 5 fixture docs and the default page_size of 30, this still passes even if the handler applies create_time_from / create_time_to after pagination. Add a smaller page_size and assert data.total too, otherwise that regression stays invisible.

Based on learnings, "Applies to tests/**/*.py : Add/adjust tests for behavior changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_list_documents.py` around
lines 202 - 218, The test test_create_time_filter currently uses the default
large page_size so it cannot detect if create_time_from/create_time_to are
applied after pagination; update the test to pass a small page_size to
list_documents (e.g., page_size=2) and assert res["data"]["total"] alongside the
docs length to ensure filtering affects the total count; locate calls to
list_documents in test_create_time_filter and add page_size to both the
inclusive and exclusive time-range calls, then assert res["data"]["total"]
reflects expected number of matching documents as well as the docs list lengths.
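To see why the small page_size matters, here is a self-contained sketch (all names hypothetical) of a handler that filters after paging. The naive length check still passes against it; only the total assertion catches the bug:

```python
def buggy_list(docs, t_from, t_to, page_size):
    # Paginate first (the wrong order)...
    page = docs[:page_size]
    # ...then filter only the current page slice.
    hits = [d for d in page if t_from <= d["create_time"] <= t_to]
    # total ignores the filter entirely.
    return {"docs": hits, "total": len(docs)}

docs = [{"create_time": t} for t in (10, 20, 30, 40, 50)]
res = buggy_list(docs, 20, 40, page_size=2)
assert len(res["docs"]) <= 2  # naive length check: still passes
assert res["total"] != 3      # but total is wrong, so the new assertion would fail
```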
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 446-471: The endpoint never reads the query parameter q.get("id"),
so requests like GET /datasets/<dataset_id>/documents?id=<doc_id> fall back to a
full dataset listing; fix by extracting id_param = q.get("id") (and
split/normalize into a list if multiple IDs are allowed) and ensure that
doc_ids_filter is set from that value (or passed into
_parse_doc_id_filter_with_metadata/q call) before calling
DocumentService.get_by_kb_id; update the usage of
_parse_doc_id_filter_with_metadata, doc_ids_filter, and the call to
DocumentService.get_by_kb_id so the explicit id filter from q.get("id") is
honored.
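One way the missing `q.get("id")` extraction could look, as a standalone sketch — the helper name and the comma-separated multi-id support are assumptions for illustration, not part of the PR:

```python
def parse_id_filter(args):
    """Hypothetical helper: extract an explicit document-id filter from query args.

    Accepts a single id or a comma-separated list; returns None when the
    parameter is absent so the caller falls back to a full dataset listing.
    """
    raw = args.get("id")
    if raw is None:
        return None
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    return ids or None

assert parse_id_filter({}) is None                          # no filter -> full listing
assert parse_id_filter({"id": "doc1"}) == ["doc1"]
assert parse_id_filter({"id": "doc1, doc2"}) == ["doc1", "doc2"]
```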

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_update_document.py`:
- Around line 160-166: The test may silently succeed if the target document
isn't on the returned page; when calling list_documents(HttpApiAuth, dataset_id,
{"id": document_ids[0]}) you must explicitly assert the target was found before
comparing meta_fields. Modify the loop over res["data"]["docs"] (and/or after
it) to set a found flag when doc["id"] == document_ids[0], assert found is True
(failing fast if missing), and only then assert doc["meta_fields"] ==
meta_fields; reference variables/functions: list_documents, document_ids,
meta_fields, res["data"]["docs"] to locate and implement the check.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6191ad80-b5e1-4d89-b3b1-5c605a58ac85

📥 Commits

Reviewing files that changed from the base of the PR and between dd56609 and 25fa825.

📒 Files selected for processing (4)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_update_document.py
  • test/testcases/test_web_api/test_document_app/test_list_documents.py
✅ Files skipped from review due to trivial changes (1)
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py

Comment thread api/apps/restful_apis/document_api.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_update_document.py (1)

45-46: ⚠️ Potential issue | 🟠 Major

Document lookup assertions are still incomplete after id/name filter removal.

With list_documents no longer filtering by id/name, these success paths can validate the wrong record or fail unclearly: test_name still uses the first doc (Line 94), test_meta_fields can silently pass when the target is absent, and test_chunk_method/test_parser_config dereference doc_of_id without asserting it was found.

Suggested fix
@@
         if expected_code == 0:
             res = list_documents(HttpApiAuth, dataset_id, {"id": document_ids[0]})
-            assert res["data"]["docs"][0]["name"] == name
+            doc_of_id = next((doc for doc in res["data"]["docs"] if doc["id"] == document_ids[0]), None)
+            assert doc_of_id is not None, res
+            assert doc_of_id["name"] == name
@@
         if expected_code == 0:
             res = list_documents(HttpApiAuth, dataset_id, {"id": document_ids[0]})
-            for doc in res["data"]["docs"]:
-                if doc["id"] != document_ids[0]:
-                    continue
-                else:
-                    assert doc["meta_fields"] == meta_fields
+            doc_of_id = next((doc for doc in res["data"]["docs"] if doc["id"] == document_ids[0]), None)
+            assert doc_of_id is not None, res
+            assert doc_of_id["meta_fields"] == meta_fields
@@
         if expected_code == 0:
             res = list_documents(HttpApiAuth, dataset_id, {"id": document_ids[0]})
             doc_of_id = None
             for doc in res["data"]["docs"]:
                 if doc["id"] == document_ids[0]:
                     doc_of_id = doc
                     break
+            assert doc_of_id is not None, res
             if chunk_method == "":
                 assert doc_of_id["chunk_method"] == "naive"
             else:
                 assert doc_of_id["chunk_method"] == chunk_method
@@
         if expected_code == 0:
             res = list_documents(HttpApiAuth, dataset_id, {"id": document_ids[0]})
 
             doc_of_id = None
             for doc in res["data"]["docs"]:
                 if doc["id"] == document_ids[0]:
                     doc_of_id = doc
                     break
+            assert doc_of_id is not None, res
 
             if parser_config == {}:
                 assert doc_of_id["parser_config"] == DEFAULT_PARSER_CONFIG

Based on learnings, "Applies to tests/**/*.py : Add/adjust tests for behavior changes".

Also applies to: 161-165, 215-223, 612-622

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_update_document.py`
around lines 45 - 46, Tests rely on list_documents to filter by id/name but
list_documents no longer does that; update the tests (references: test_name,
test_meta_fields, test_chunk_method, test_parser_config) to explicitly find the
target document in the returned list (e.g., search for matching id or name using
the doc_of_id variable or name string), assert that the document is present
before accessing its fields, and replace any implicit use of the “first doc”
with an explicit lookup and clear assertion so failures are deterministic and
informative.
🧹 Nitpick comments (3)
test/testcases/test_sdk_api/test_file_management_within_dataset/test_update_document.py (1)

67-68: Avoid brittle [0] indexing when locating the updated document.

If list_documents() does not include the target doc (filter not applied, pagination window, or timing lag), this throws IndexError and makes failures non-diagnostic.

Proposed fix
-            docs = dataset.list_documents(id=document.id)
-            updated_doc = [doc for doc in docs if doc.id == document.id][0]
+            docs = dataset.list_documents(id=document.id)
+            updated_doc = next((doc for doc in docs if doc.id == document.id), None)
+            assert updated_doc is not None, f"Document {document.id} not found in list_documents() result"
-            docs = dataset.list_documents()
-            updated_doc = [doc for doc in docs if doc.id == document.id][0]
+            docs = dataset.list_documents()
+            updated_doc = next((doc for doc in docs if doc.id == document.id), None)
+            assert updated_doc is not None, f"Document {document.id} not found in list_documents() result"

Also applies to: 142-143, 484-485

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_sdk_api/test_file_management_within_dataset/test_update_document.py`
around lines 67 - 68, The list_documents() result is being indexed with [0]
after filtering which is brittle and can raise IndexError if the target doc
isn't present; update the lookup to use a safe search such as next((d for d in
docs if d.id == document.id), None) and then assert or raise a clear error if
None (e.g., assert updated_doc is not None, f"document {document.id} not found
in list_documents()") so failures are diagnostic; apply the same fix to the
other similar occurrences that filter list_documents() and then index with [0]
(the same pattern around updated_doc/document.id in this test file).
test/testcases/test_http_api/test_file_management_within_dataset/test_parse_documents.py (1)

44-46: Add lightweight debug logging for this new validation flow.

This helper now depends on a dataset-level list call; logging requested vs returned document counts will make flaky/failing CI cases much faster to triage.

As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_parse_documents.py`
around lines 44 - 46, In validate_document_details, add lightweight debug
logging that reports the requested document count and the returned document
count from the dataset-level list_documents call: create/get a logger (e.g.,
logging.getLogger(__name__)), then after res = list_documents(auth, dataset_id,
params={}) log a debug message showing len(document_ids) and the number of
documents returned (e.g., len(res["documents"]) or the actual collection in
res), and optionally log which requested IDs were missing by computing the set
difference between document_ids and the returned document IDs; reference
validate_document_details, list_documents, res and document_ids when making
these additions.
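A minimal sketch of the suggested debug logging, assuming the response shape used by these tests (`data.docs` with an `id` per doc); the helper name is illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def log_listing_coverage(document_ids, returned_docs):
    # Hypothetical debug helper: compare requested vs returned document ids
    # so a flaky CI run shows at a glance which documents went missing.
    returned_ids = {doc["id"] for doc in returned_docs}
    missing = sorted(set(document_ids) - returned_ids)
    logger.debug(
        "requested %d docs, listing returned %d, missing: %s",
        len(document_ids), len(returned_docs), missing or "none",
    )
    return missing

assert log_listing_coverage(["a", "b"], [{"id": "a"}]) == ["b"]
assert log_listing_coverage(["a"], [{"id": "a"}, {"id": "c"}]) == []
```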
test/testcases/test_web_api/test_document_app/test_list_documents.py (1)

185-191: Add one happy-path run filter case.

This PR changes the public contract to semantic run statuses, but the new coverage only checks rejection of "INVALID". A broken mapping for accepted values would still pass. Please add at least one success case for a valid status and assert the returned docs all satisfy that filter.

Based on learnings, "Applies to tests/**/*.py : Add/adjust tests for behavior changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_list_documents.py` around
lines 185 - 191, Add a happy-path test next to test_invalid_run_status_filter
that calls list_documents(WebApiAuth, {"kb_id": kb_id, "run": "<VALID_STATUS>"})
(use a valid semantic status used by the API, e.g., "COMPLETED" or "SUCCEEDED")
using the same fixtures (add_documents, WebApiAuth), assert res["code"] == 0 (or
the success code your API returns), and assert that every returned document in
res["data"] has a run/status field equal to the requested status (or otherwise
satisfies the filter). Ensure the new test uses the same kb_id from
add_documents and matches the response shape checked by the other tests.
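The happy-path assertion could be sketched like this, assuming the response shape checked by the other tests in this file (`code`, `data.docs`, and a `run` field per doc); the helper name and the stubbed payload are illustrative only:

```python
def assert_run_filter_applied(response, expected_run):
    # Hypothetical helper: the listing call must succeed and every
    # returned document must carry the requested run status.
    assert response["code"] == 0, response
    for doc in response["data"]["docs"]:
        assert doc["run"] == expected_run, doc

# Example against a stubbed response shaped like the API's success payload.
assert_run_filter_applied(
    {"code": 0, "data": {"docs": [{"run": "DONE"}, {"run": "DONE"}]}},
    "DONE",
)
```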
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_parse_documents.py`:
- Around line 46-54: The test iterates expected document_ids but only asserts
fields for IDs that are present in the single page returned by
list_documents(auth, dataset_id, params={}), allowing silent passes when an
expected document is missing; modify the test to first verify presence of each
expected id in res["data"]["docs"] (e.g., collect returned_ids = {doc["id"] for
doc in res["data"]["docs"]} and assert each document_id in returned_ids) and
then perform the existing assertions on the matched doc entries from
res["data"]["docs"] (or implement pagination/adjust params to ensure all
expected ids are returned) so missing documents fail the test instead of being
skipped.

In `@test/testcases/test_web_api/test_document_app/test_list_documents.py`:
- Around line 44-45: The tests are still passing kb_id both as a path param and
as a query param, so they never catch regressions where the path param is
ignored; update the test helper used by list_documents (and analogous helpers)
so when you build the URL using params["kb_id"] to form
/datasets/{kb_id}/documents you remove kb_id from the params passed to
requests.get (i.e., call list_documents(WebApiAuth, params_without_kb_id) or
strip "kb_id" before forwarding), and then update the affected test cases (those
calling list_documents with {"kb_id": kb_id}) to rely only on the path parameter
rather than sending kb_id as a query to ensure the new path-only behavior is
tested.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6c0e088a-fb8b-4375-868b-284dab422296

📥 Commits

Reviewing files that changed from the base of the PR and between 25fa825 and 9173807.

📒 Files selected for processing (6)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_parse_documents.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_update_document.py
  • test/testcases/test_sdk_api/test_file_management_within_dataset/test_update_document.py
  • test/testcases/test_web_api/test_document_app/test_list_documents.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_list_documents.py

Comment on lines +46 to +54
+    res = list_documents(auth, dataset_id, params={})
     for document_id in document_ids:
-        res = list_documents(auth, dataset_id, params={"id": document_id})
-        doc = res["data"]["docs"][0]
-        assert doc["run"] == "DONE"
-        assert len(doc["process_begin_at"]) > 0
-        assert doc["process_duration"] > 0
-        assert doc["progress"] > 0
-        assert "Task done" in doc["progress_msg"]
+        for doc_of_id in res["data"]["docs"]:
+            if doc_of_id["id"] == document_id:
+                assert doc_of_id["run"] == "DONE"
+                assert len(doc_of_id["process_begin_at"]) > 0
+                assert doc_of_id["process_duration"] > 0
+                assert doc_of_id["progress"] > 0
+                assert "Task done" in doc_of_id["progress_msg"]
Contributor


⚠️ Potential issue | 🟠 Major

Prevent silent passes when requested documents are missing from the list response.

On Line 46 and Lines 47-54, the helper only asserts fields for IDs that happen to be present in the returned page. If an expected document_id is absent (e.g., pagination/default page size), the test still passes.

✅ Suggested fix
 def validate_document_details(auth, dataset_id, document_ids):
     # currently list_documents not support search by document id
-    res = list_documents(auth, dataset_id, params={})
-    for document_id in document_ids:
-        for doc_of_id in res["data"]["docs"]:
-            if doc_of_id["id"] == document_id:
-                assert doc_of_id["run"] == "DONE"
-                assert len(doc_of_id["process_begin_at"]) > 0
-                assert doc_of_id["process_duration"] > 0
-                assert doc_of_id["progress"] > 0
-                assert "Task done" in doc_of_id["progress_msg"]
+    res = list_documents(auth, dataset_id, params={"page_size": len(document_ids)})
+    assert res["code"] == 0
+    docs_by_id = {doc["id"]: doc for doc in res["data"]["docs"]}
+
+    missing_ids = [doc_id for doc_id in document_ids if doc_id not in docs_by_id]
+    assert not missing_ids, f"Documents not returned by list endpoint: {missing_ids}"
+
+    for document_id in document_ids:
+        doc_of_id = docs_by_id[document_id]
+        assert doc_of_id["run"] == "DONE"
+        assert len(doc_of_id["process_begin_at"]) > 0
+        assert doc_of_id["process_duration"] > 0
+        assert doc_of_id["progress"] > 0
+        assert "Task done" in doc_of_id["progress_msg"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_parse_documents.py`
around lines 46 - 54, The test iterates expected document_ids but only asserts
fields for IDs that are present in the single page returned by
list_documents(auth, dataset_id, params={}), allowing silent passes when an
expected document is missing; modify the test to first verify presence of each
expected id in res["data"]["docs"] (e.g., collect returned_ids = {doc["id"] for
doc in res["data"]["docs"]} and assert each document_id in returned_ids) and
then perform the existing assertions on the matched doc entries from
res["data"]["docs"] (or implement pagination/adjust params to ensure all
expected ids are returned) so missing documents fail the test instead of being
skipped.

@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (f906a20) to head (57dfaa7).
⚠️ Report is 25 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14176   +/-   ##
=======================================
  Coverage   96.66%   96.66%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          667      667           
  Misses          8        8           
  Partials       15       15           


@xugangqiang
Contributor Author

@yingfeng @JinHai-CN @yuzhichang
Please help to review & merge

@yingfeng yingfeng merged commit 9399336 into infiniflow:main Apr 20, 2026
2 checks passed

Labels

ci Continuous Integration size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants