Refactor: Consolidation WEB API & HTTP API for document infos #14239
JinHai-CN merged 18 commits into infiniflow:main from
Conversation
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
sdk/python/test/test_frontend_api/test_chunk.py (1)

58-63: ⚠️ Potential issue | 🟠 Major

Poll `res["data"]["docs"]`, not `res["data"]`. The migrated endpoint now returns `data = {"total": ..., "docs": [...]}`. Iterating `res["data"]` walks the object's keys, so `doc_info["progress"]` fails instead of checking document completion.

💡 Minimal fix

```diff
- for doc_info in res['data']:
+ for doc_info in res['data']['docs']:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/python/test/test_frontend_api/test_chunk.py` around lines 58-63: the test currently iterates `for doc_info in res['data']`, which walks object keys and makes `doc_info['progress']` fail; change the loop to iterate the docs array returned by the migrated endpoint (`res['data']['docs']`), e.g. read `docs = res['data']['docs']` or loop `for doc_info in res['data']['docs']` when checking `doc_info['progress']` against `doc_count`/`doc_id_list` via `get_docs_info`.
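The failure mode is easy to reproduce without the SDK. In this sketch, `res` is a stand-in for a `get_docs_info()` response; the `{"total": ..., "docs": [...]}` layout is the one described in the comment above:

```python
# Sketch of the shape change flagged above. The response layout
# {"total": ..., "docs": [...]} is taken from the review; the payload
# below is a stand-in for a real get_docs_info() result.
res = {"data": {"total": 2, "docs": [{"progress": 1.0}, {"progress": 0.5}]}}

# Old loop: `for doc_info in res["data"]` walks the dict keys "total" and
# "docs", so doc_info["progress"] indexes into strings and fails.
# New loop: iterate the docs array instead.
completed = [doc for doc in res["data"]["docs"] if doc["progress"] == 1.0]
print(len(completed))  # 1
```

Iterating a dict yields its keys, which is why the old loop broke only after the response gained a second top-level field.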
sdk/python/ragflow_sdk/modules/dataset.py (1)

83-98: ⚠️ Potential issue | 🔴 Critical

`params.append()` will fail: `params` is a dict, not a list. When `ids` is provided, the code calls `params.append(("ids", doc_id))` on a dict object, raising `AttributeError` before the request is sent. The `requests` library accepts a list of tuples for repeated query parameters. Additionally, the new `ids` flow lacks logging, which the coding guidelines require for new flows in Python files.

Fix

```diff
-        params = {
-            "id": id,
-            "name": name,
-            "keywords": keywords,
-            "page": page,
-            "page_size": page_size,
-            "orderby": orderby,
-            "desc": desc,
-            "create_time_from": create_time_from,
-            "create_time_to": create_time_to,
-        }
+        params = [
+            (k, v)
+            for k, v in {
+                "id": id,
+                "name": name,
+                "keywords": keywords,
+                "page": page,
+                "page_size": page_size,
+                "orderby": orderby,
+                "desc": desc,
+                "create_time_from": create_time_from,
+                "create_time_to": create_time_to,
+            }.items()
+            if v is not None
+        ]
         # Handle ids parameter - convert to multiple query params
         if ids:
             for doc_id in ids:
                 params.append(("ids", doc_id))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/python/ragflow_sdk/modules/dataset.py` around lines 83-98: `params` is a dict, so `params.append` will raise an `AttributeError`; build query params as a list of tuples when `ids` is provided (e.g., convert the base params dict to a list of (key, value) pairs, or create a new `params_list` and extend it with `("ids", doc_id)` for each id), then pass that list to `self.get(f"/datasets/{self.id}/documents", params=...)` so repeated ids become repeated query params; also add a debug log (using the module/class logger) before the request that records the ids being included (reference: `params`, `ids`, and `self.get`).
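A minimal runnable sketch of the suggested list-of-tuples construction; the field names mirror the diff above, and the concrete values are illustrative:

```python
# Build query params as a list of (key, value) tuples so repeated "ids"
# entries survive; a dict would silently keep only the last "ids" value.
base = {
    "id": None,
    "name": "report.pdf",  # illustrative values
    "page": 1,
    "page_size": 30,
}
ids = ["doc-a", "doc-b"]

params = [(k, v) for k, v in base.items() if v is not None]
for doc_id in ids:
    params.append(("ids", doc_id))

print(params)
# [('name', 'report.pdf'), ('page', 1), ('page_size', 30), ('ids', 'doc-a'), ('ids', 'doc-b')]
```

Filtering out `None` values also keeps unset optional filters off the query string entirely, rather than sending `id=None`.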
api/apps/restful_apis/document_api.py (1)

435-451: ⚠️ Potential issue | 🟠 Major

`type=filter` is aggregating only the current page. `_get_docs_with_request()` applies `page`/`page_size` before this branch, so the returned filter buckets are built from at most one page of docs while `total` still reflects the wider match set. On larger datasets this will produce incomplete filter counts and misleading UI state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 435-451: the filter branch aggregates only the paginated page because `_get_docs_with_request()` applies `page`/`page_size`; change the logic so that when `request.args.get("type") == "filter"` you fetch the full match set before calling `_aggregate_filters`, e.g. call `_get_docs_with_request` for the same query with pagination disabled (or `page_size` set to the total/match-all) so `docs` contains all matches, then pass that full docs list to `_aggregate_filters` and return `get_json_result` with the original `total`; keep the rest of the response handling (`map_doc_keys`, `IMG_BASE64_PREFIX`, `turn2jsonschema`) unchanged.
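The undercounting can be shown with a toy aggregator. The internals of `_aggregate_filters` are not visible here, so this sketch assumes a simple bucket-by-suffix count; the real function may bucket more fields, but the pagination pitfall is the same:

```python
from collections import Counter

def aggregate_filters(docs):
    # Toy aggregator; the real _aggregate_filters may bucket more fields.
    return Counter(doc["suffix"] for doc in docs)

all_docs = [{"suffix": "pdf"}] * 7 + [{"suffix": "docx"}] * 5  # 12 matches total
page_docs = all_docs[:10]  # what a page_size=10 first page would contain

print(aggregate_filters(page_docs))  # Counter({'pdf': 7, 'docx': 3}), undercounts docx
print(aggregate_filters(all_docs))   # Counter({'pdf': 7, 'docx': 5}), correct
```

The page-sized input produces buckets that sum to 10 while `total` reports 12, which is exactly the inconsistent UI state the comment describes.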
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

51-58: ⚠️ Potential issue | 🟡 Minor

Restore the migrated `metadata_summary` web-API coverage. These cases are commented out instead of being adapted to the dataset-scoped GET route, so this refactor drops regression coverage for that changed behavior at the web-test layer. If you want, I can draft the updated pytest cases for the new request shape. Based on learnings "Applies to tests/**/*.py : Add/adjust tests for behavior changes".

Also applies to: 101-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 51-58: the commented-out auth tests removed coverage for the migrated `metadata_summary` endpoint; restore and adapt them by re-enabling `test_metadata_summary_auth_invalid` and updating calls to `document_metadata_summary` to send the new GET request shape (include `"doc_ids": []` or an appropriate list, and keep `"kb_id"` if still required) so each `INVALID_AUTH_CASES` case asserts `res["code"] == expected_code` and `expected_fragment in res["message"]`; also update the equivalent commented block around lines 101-108 the same way, so both sets use the new request payload shape and the existing helpers (`document_metadata_summary`, `INVALID_AUTH_CASES`) rather than the old payload.
🧹 Nitpick comments (2)
api/db/services/document_service.py (1)
150-151: Add a trace point for the new `doc_ids` branch.

The consolidated document-info flow now has a dedicated ID-filter path, but nothing records when it is used. A small debug log with `kb_id` and the requested ID count would make rollout issues much easier to diagnose without logging the raw IDs. As per coding guidelines, "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/db/services/document_service.py` around lines 150-151: add a debug trace when the `doc_ids` branch is used. Before applying the filter (the block with `if doc_ids: docs = docs.where(cls.model.id.in_(doc_ids))`), emit a debug log that records the `kb_id` and the number of requested IDs (`len(doc_ids)`) but not the raw IDs; use the service/module logger (e.g. `cls.logger` or a module-level logger) and a message like "Applying doc_ids filter for kb_id=%s count=%d", passing `kb_id` and `len(doc_ids)`, so callers can see the new ID-filter path is taken without leaking ID values.
api/apps/restful_apis/document_api.py (1)

439-451: Add trace logging for the new consolidated query modes.

The new `type=filter` and repeated-`ids` branches materially change this endpoint's behavior, but there is no log telling us which branch was taken or how many IDs were supplied. A small structured log here would make this migration much easier to debug in CI and prod. As per coding guidelines, "Add logging for new flows".

Also applies to: 530-534

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 439-451: add structured trace logging to record which consolidated query branch is taken, plus the relevant counts. When `request.args.get("type") == "filter"` (the `_aggregate_filters(docs)` branch), log at trace/debug level that the "filter" branch was taken, including `total` and the number of filters/docs; in the else branch (where `renamed_doc_list` is built via `map_doc_keys` and thumbnail/`source_type`/`parser_config` adjustments), log that the "docs" branch was taken, including `total` and `len(renamed_doc_list)`; also add analogous logs for the repeated-`ids` handling (the branch around lines 530-534) to record the branch and the number of ids supplied. Use the existing logger in this module and include clear keys (e.g. branch="filter" or branch="docs", total=..., docs_count=..., ids_count=...) so the logs are structured and searchable.
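A minimal sketch of the requested trace logging, assuming a module-level logger. The logger name, message text, and key names below are illustrative; the one firm rule carried over from the review is to log counts, never the raw IDs:

```python
import logging

logger = logging.getLogger("document_api")  # illustrative logger name

def log_list_branch(branch, total, docs_count):
    # Structured, grep-able key=value pairs; lazy %-formatting avoids
    # building the string when DEBUG is disabled.
    logger.debug("list documents branch=%s total=%d docs_count=%d",
                 branch, total, docs_count)

def log_ids_branch(ids):
    # Record only how many ids were supplied, not their values.
    logger.debug("list documents branch=ids ids_count=%d", len(ids))
```

With consistent keys (`branch=`, `total=`, `ids_count=`), a single log query can tell which consolidated mode a request took in CI or prod.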
📒 Files selected for processing (11)

- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- api/db/services/document_service.py
- sdk/python/ragflow_sdk/modules/dataset.py
- sdk/python/test/test_frontend_api/common.py
- sdk/python/test/test_frontend_api/test_chunk.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
```diff
 if doc_id:
     if not DocumentService.query(id=doc_id, kb_id=dataset_id):
         return RetCode.DATA_ERROR, f"You don't own the document {doc_id}.", [], 0
     doc_ids_filter = [doc_id]  # id provided, ignore other filters
 if doc_name and not DocumentService.query(name=doc_name, kb_id=dataset_id):
     return RetCode.DATA_ERROR, f"You don't own the document {doc_name}.", [], 0

 doc_ids = q.getlist("ids")
 if doc_id and len(doc_ids) > 0:
     return RetCode.DATA_ERROR, f"Should not provide both 'id':{doc_id} and 'ids'{doc_ids}"
 if len(doc_ids) > 0:
     doc_ids_filter = doc_ids

 docs, total = DocumentService.get_by_kb_id(dataset_id, page, page_size, orderby, desc, keywords, run_status_converted, types, suffix,
-                                           doc_id=doc_id, name=doc_name, doc_ids_filter=doc_ids_filter, return_empty_metadata=return_empty_metadata)
+                                           name=doc_name, doc_ids=doc_ids_filter, return_empty_metadata=return_empty_metadata)
```
Validate `ids` the same way you validate `id`.
The single-document path rejects a non-owned/nonexistent `id` before querying, but the new repeated-`ids` path just forwards the list into `get_by_kb_id()`. That turns invalid multi-ID lookups from an explicit error into a 200 with an empty docs list, so `id` and `ids` now behave inconsistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 523-537: the code
validates a single doc_id but not the list doc_ids; before calling
DocumentService.get_by_kb_id() iterate over q.getlist("ids") (doc_ids) and for
each id call DocumentService.query(id=<id>, kb_id=dataset_id) and if any check
fails return RetCode.DATA_ERROR with a message like "You don't own the document
<id>" (mirroring the single-id branch), then set doc_ids_filter = doc_ids and
proceed; ensure you also preserve the existing mutual-exclusion check between
doc_id and doc_ids.
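A runnable sketch of the suggested validation, mirroring the single-id branch. `RetCode.DATA_ERROR`'s value is assumed for illustration, and `query_fn` is a stub standing in for `DocumentService.query(id=..., kb_id=...)`:

```python
# Sketch of the suggested ids validation, mirroring the single-id branch.
class RetCode:
    DATA_ERROR = 102  # assumed value, for illustration only

def validate_doc_ids(query_fn, doc_ids, dataset_id):
    """Return an error tuple for the first unowned id, or None if all pass."""
    for doc_id in doc_ids:
        if not query_fn(id=doc_id, kb_id=dataset_id):
            return RetCode.DATA_ERROR, f"You don't own the document {doc_id}.", [], 0
    return None  # caller may then set doc_ids_filter = doc_ids and proceed

owned = {"d1", "d2"}
stub_query = lambda id, kb_id: id in owned
print(validate_doc_ids(stub_query, ["d1", "d2"], "kb"))     # None
print(validate_doc_ids(stub_query, ["d1", "d9"], "kb")[1])  # You don't own the document d9.
```

Failing fast on the first unowned id keeps the error shape identical to the existing single-`id` branch, so clients see the same contract for both query forms.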
```python
params = {}
if doc_ids:
    # Multiple IDs
    for id in doc_ids:
        params.append(("ids", id))
elif doc_id:
    # Single ID
    params["id"] = doc_id
```
Fix `params` initialization to support repeated query parameters.

Line 96 initializes `params = {}`, but line 100 calls `.append()` on it. Dicts don't have `.append()`, so the first `doc_ids` request fails with `AttributeError`. The fix is to initialize `params = []` and convert the single-ID branch to append as well:

Fix

```diff
- params = {}
+ params = []
  if doc_ids:
      # Multiple IDs
      for id in doc_ids:
          params.append(("ids", id))
  elif doc_id:
      # Single ID
-     params["id"] = doc_id
+     params.append(("id", doc_id))
```

Additionally, ruff reports 4 whitespace linting violations (W293) at lines 81, 87, 94, 104. Run `ruff check --fix` to resolve them per the coding guidelines.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/python/test/test_frontend_api/common.py` around lines 96 - 103, The bug
is that params is initialized as a dict but used like a list when handling
repeated query params in the function handling doc_ids/doc_id; change params =
{} to params = [] and update the single-ID branch to use params.append(("id",
doc_id)) so both branches build a list of tuples (used for repeated query
parameters), ensuring symbols to edit are params, doc_ids and doc_id; also run
ruff check --fix to address the W293 whitespace violations reported around the
nearby lines.
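The reason a list of tuples is the right container can be shown with the stdlib alone; `requests` serializes a list-of-tuples `params=` argument into the same wire format:

```python
from urllib.parse import urlencode

# A dict can hold only one value per key, so a second "ids" entry would
# overwrite the first. A list of (key, value) tuples preserves repeats.
as_list = [("ids", "id1"), ("ids", "id2"), ("page", 1)]
print(urlencode(as_list))  # ids=id1&ids=id2&page=1

as_dict = {"ids": "id1"}
as_dict["ids"] = "id2"     # the first id is silently lost
print(urlencode(as_dict))  # ids=id2
```

The repeated `ids=...&ids=...` form is exactly what the consolidated `GET /api/v1/datasets/<dataset_id>/documents` endpoint expects.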
```diff
 def test_infos_auth_invalid(self, invalid_auth, expected_code, expected_fragment):
-    res = document_infos(invalid_auth, {"doc_ids": ["doc_id"]})
+    res = document_infos(invalid_auth, "kb_id", {"doc_ids": ["doc_id"]})
     assert res["code"] == expected_code, res
```
Update the remaining `document_infos` tests to the new dataset-scoped contract.
Line 47 still uses `doc_ids` instead of `ids`, and lines 178-180 still use the removed `(auth, payload)` call shape. The invalid-doc case will now request `/datasets/{'doc_ids': ['invalid_id']}/documents`, so it no longer validates the migrated endpoint behavior.
Suggested test update

```diff
- res = document_infos(invalid_auth, "kb_id", {"doc_ids": ["doc_id"]})
+ res = document_infos(invalid_auth, "kb_id", {"ids": ["doc_id"]})
  ...
- def test_infos_invalid_doc_id(self, WebApiAuth):
-     res = document_infos(WebApiAuth, {"doc_ids": ["invalid_id"]})
+ def test_infos_invalid_doc_id(self, WebApiAuth, add_document_func):
+     dataset_id, _ = add_document_func
+     res = document_infos(WebApiAuth, dataset_id, {"ids": ["invalid_id"]})
```

Also applies to: 177-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 46 - 48, The test uses the old document-scoped contract: update
test_infos_auth_invalid and the other failing cases to the dataset-scoped
contract by replacing payload key "doc_ids" with "ids" and by changing calls
that currently use the removed (auth, payload) shape to the new (auth,
dataset_id, payload) shape; e.g., call document_infos(invalid_auth, "kb_id",
{"ids": ["doc_id"]}) instead of document_infos(invalid_auth, {"doc_ids":
["doc_id"]}) so the tests hit /datasets/{dataset_id}/documents with the new
payload format.
```typescript
const { data } = await documentFilter(knowledgeId || id);
if (data.code === 0) {
  return data.data;
```
Forward `keywords` to the filter request.

`useFetchDocumentList` still filters the document list by `keywords`, but this call no longer does. Since the consolidated `type=filter` endpoint aggregates from the same request args, the sidebar counts will be computed for the whole dataset whenever a search term is active.

💡 Minimal fix

```diff
- const { data } = await documentFilter(knowledgeId || id);
+ const { data } = await documentFilter(knowledgeId || id, {
+   keywords: debouncedSearchString,
+ });
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/src/hooks/use-document-request.ts` around lines 218-220: the call to
documentFilter in use-document-request.ts is not passing the active search term,
so sidebar counts use the whole dataset; change the invocation of documentFilter
(the line calling documentFilter(knowledgeId || id)) to include the current
keywords parameter (forward the keywords variable used by useFetchDocumentList)
in the request payload/args so the filter endpoint receives the search term;
ensure you pass the same keywords shape the filter endpoint expects (e.g., add a
second arg or include keywords on the options object) and keep the existing
response handling (data.code === 0) intact.
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main   #14239      +/-   ##
==========================================
- Coverage   96.66%   95.97%    -0.70%
==========================================
  Files          10       10
  Lines         690      695        +5
  Branches      108      111        +3
==========================================
  Hits          667      667
- Misses          8       11        +3
- Partials       15       17        +2
```

☔ View full report in Codecov by Sentry.
What problem does this PR solve?

Before consolidation:
- Web API: POST /v1/document/infos
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation, RESTful API: GET /api/v1/datasets/<dataset_id>/documents?ids=id1&ids=id2

Type of change