Refactor: Consolidation WEB API & HTTP API for document get_filter (#14248)
JinHai-CN merged 12 commits into infiniflow:main
Conversation
📝 Walkthrough
This pull request removes the POST `/v1/document/filter` web endpoint and consolidates document filtering into the RESTful `GET /api/v1/datasets/<dataset_id>/documents` endpoint behind a `type=filter` query parameter.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)
671-727: Consider: Code duplication with `DocumentService.get_filter_by_kb_id`.
The `_aggregate_filters` function duplicates the aggregation logic from `DocumentService.get_filter_by_kb_id` (document_service.py lines 189-276). Both functions:
- Count documents by suffix
- Count documents by run status
- Aggregate metadata field values
- Track empty metadata count
If the filter aggregation needs to operate on in-memory documents, consider extracting the shared aggregation logic to a utility function. However, as noted in the previous comment, using the existing service method that performs SQL-level aggregation would be more efficient and eliminate this duplication.
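If the in-memory path is kept, the shared helper suggested above might look like this. This is a minimal sketch: the function name `aggregate_document_filters`, the document field names (`suffix`, `run`, `meta_fields`), and the exact counting rules are illustrative assumptions; only the return-value keys (`suffix`, `run_status`, `metadata` with `empty_metadata`) follow the shape described in this review.

```python
def aggregate_document_filters(docs):
    """Aggregate filter facets over an iterable of document dicts.

    Illustrative sketch of a shared utility both call sites could use.
    Returns: {"suffix": {...}, "run_status": {...},
              "metadata": {<field>: {<value>: count}, "empty_metadata": n}}
    """
    suffix_counts = {}
    run_status_counts = {}
    metadata_values = {}
    empty_metadata = 0
    for doc in docs:
        # Count documents by file suffix.
        suffix = doc.get("suffix", "")
        suffix_counts[suffix] = suffix_counts.get(suffix, 0) + 1
        # Count documents by run status.
        run = doc.get("run", "")
        run_status_counts[run] = run_status_counts.get(run, 0) + 1
        # Aggregate metadata field values, tracking docs with no metadata.
        meta = doc.get("meta_fields") or {}
        if not meta:
            empty_metadata += 1
        for key, value in meta.items():
            metadata_values.setdefault(key, {})
            metadata_values[key][value] = metadata_values[key].get(value, 0) + 1
    metadata_values["empty_metadata"] = empty_metadata
    return {"suffix": suffix_counts, "run_status": run_status_counts, "metadata": metadata_values}
```

Extracting one helper like this keeps the two call sites consistent, though (as noted above) reusing the SQL-level service method avoids the duplication entirely.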
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 671 - 727, The _aggregate_filters function duplicates aggregation logic already implemented in DocumentService.get_filter_by_kb_id; refactor by extracting the shared aggregation into a single utility (e.g., function aggregate_document_filters) used by both places or, preferably, replace the in-memory aggregation call in document_api.py to invoke DocumentService.get_filter_by_kb_id so SQL-level aggregation is reused; ensure the new utility or service method returns the same structure (keys: "suffix", "run_status", "metadata" with "empty_metadata") and update callers (_aggregate_filters or its callers) to use that central function.test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
151-156: Consider: Test now validates route matching rather than business logic.
The test `test_filter_missing_kb_id` changed from validating business logic (missing KB ID in request body) to validating route matching (an empty path segment causes a 405). While this still catches the error case, the exact assertion on the error message `"<MethodNotAllowed '405: Method Not Allowed'>"` is brittle. Consider using a substring match or just checking the code:

```diff
- assert "<MethodNotAllowed '405: Method Not Allowed'>" == res["message"], res
+ assert "405" in res["message"] or "Method Not Allowed" in res["message"], res
```

This makes the test more resilient to minor message format changes.
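The resilience argument is easy to demonstrate: a substring check accepts both the exact Werkzeug repr and plausible reformatted variants of it (the sample strings below are illustrative, only the first matches the current message verbatim).

```python
def is_method_not_allowed(message):
    # Check the essentials rather than pinning the framework's exact repr.
    return "405" in message or "Method Not Allowed" in message

# Passes for the current exact message and for minor format changes alike.
assert is_method_not_allowed("<MethodNotAllowed '405: Method Not Allowed'>")
assert is_method_not_allowed("405 Method Not Allowed")
# Still rejects unrelated errors.
assert not is_method_not_allowed("404 Not Found")
```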
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 151 - 156, The test test_filter_missing_kb_id should not assert the exact MethodNotAllowed string; update the assertions in this test (which calls document_filter with WebApiAuth and an empty kb_id path segment) to be resilient: keep asserting res["code"] == 100 but replace the exact-match on res["message"] with either a substring check like "MethodNotAllowed" in res["message"] or drop the message assertion entirely and only assert the code; this touches the test_filter_missing_kb_id function and its use of document_filter/WebApiAuth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 65f3f092-aa2b-4223-bf0f-c4cdbdc245ee
📒 Files selected for processing (8)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- api/db/services/document_service.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
```python
if request.args.get("type") == "filter":
    docs_filter = _aggregate_filters(docs)
    return get_json_result(data={"total": total, "filter": docs_filter})
else:
    renamed_doc_list = [map_doc_keys(doc) for doc in docs]
    for doc_item in renamed_doc_list:
        if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
            doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
        if doc_item.get("source_type"):
            doc_item["source_type"] = doc_item["source_type"].split("/")[0]
        if doc_item["parser_config"].get("metadata"):
            doc_item["parser_config"]["metadata"] = turn2jsonschema(doc_item["parser_config"]["metadata"])
    return get_json_result(data={"total": total, "docs": renamed_doc_list})
```
Critical: Filter aggregation is limited by pagination, returning incomplete results.
When `type=filter` is requested, the code fetches documents using `_get_docs_with_request`, which applies pagination (default `page_size=30`). The `_aggregate_filters` function then only aggregates the paginated subset, not all documents in the dataset.
This means the filter counts will be incorrect and incomplete. For example, a dataset with 1000 documents across 10 different file types might only show 2-3 types if the first 30 documents happen to be of those types.
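The skew described above can be reproduced in miniature: aggregating only a paginated slice silently drops every file type that does not appear on the first page (toy data below, with counts shrunk for illustration).

```python
from collections import Counter

# 40 docs: the first 30 happen to all be PDFs, the last 10 are DOCX.
docs = [{"suffix": "pdf"}] * 30 + [{"suffix": "docx"}] * 10

page_size = 30
paginated = docs[:page_size]  # what _aggregate_filters would actually see

paginated_counts = Counter(d["suffix"] for d in paginated)
full_counts = Counter(d["suffix"] for d in docs)

print(dict(paginated_counts))  # {'pdf': 30} — 'docx' is missing entirely
print(dict(full_counts))       # {'pdf': 30, 'docx': 10}
```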
The previous implementation (DocumentService.get_filter_by_kb_id) performed SQL-level aggregation across all matching documents.
Consider either:
- Fetching all documents when `type=filter` (set `page_size=0` or use a different query)
- Using the existing `DocumentService.get_filter_by_kb_id` method for filter aggregation
```diff
 if request.args.get("type") == "filter":
-    docs_filter = _aggregate_filters(docs)
-    return get_json_result(data={"total": total, "filter": docs_filter})
+    # Use dedicated filter aggregation that queries all documents
+    keywords = request.args.get("keywords", "")
+    run_status = request.args.getlist("run")
+    run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
+    run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
+    types = request.args.getlist("types")
+    suffix = request.args.getlist("suffix")
+    docs_filter, total = DocumentService.get_filter_by_kb_id(dataset_id, keywords, run_status_converted, types, suffix)
+    return get_json_result(data={"total": total, "filter": docs_filter})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 439 - 451, The filter
aggregation currently calls _aggregate_filters over the paginated docs returned
by _get_docs_with_request, causing incomplete filter results; change the branch
handling request.args.get("type") == "filter" to use
DocumentService.get_filter_by_kb_id (or otherwise query all docs without
pagination) to perform SQL-level aggregation across the entire dataset_id and
params, then return get_json_result with the full "filter" payload; keep the
existing map_doc_keys/get_json_result usage for the non-filter branch and avoid
iterating the paginated docs for aggregation in _aggregate_filters.
What problem does this PR solve?
Before consolidation:
- Web API: POST /v1/document/filter
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: GET /api/v1/datasets/<dataset_id>/documents?type=filter
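The consolidated behaviour can be sketched as a single handler that branches on the `type` query parameter. This is a toy dispatcher with an in-memory store; all names and the response shapes here are illustrative, not the project's actual code.

```python
def list_documents(dataset_id, args, docs_by_dataset):
    """Toy version of GET /api/v1/datasets/<dataset_id>/documents."""
    docs = docs_by_dataset.get(dataset_id, [])
    if args.get("type") == "filter":
        # Facet-aggregation path, replacing the old POST /v1/document/filter.
        suffix_counts = {}
        for doc in docs:
            suffix_counts[doc["suffix"]] = suffix_counts.get(doc["suffix"], 0) + 1
        return {"total": len(docs), "filter": {"suffix": suffix_counts}}
    # Default path: plain document listing.
    return {"total": len(docs), "docs": docs}

store = {"ds1": [{"name": "a.pdf", "suffix": "pdf"}, {"name": "b.pdf", "suffix": "pdf"}]}
print(list_documents("ds1", {"type": "filter"}, store))
# {'total': 2, 'filter': {'suffix': {'pdf': 2}}}
```

One URL, two behaviours: clients pick listing or filter facets with a query parameter instead of calling two differently shaped endpoints.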
Type of change
- [x] Refactoring