Skip to content

Refactor: Consolidation WEB API & HTTP API for document get_filter#14248

Merged
JinHai-CN merged 12 commits intoinfiniflow:mainfrom
xugangqiang:migrate-doc-get-filter
Apr 21, 2026
Merged

Refactor: Consolidation WEB API & HTTP API for document get_filter#14248
JinHai-CN merged 12 commits intoinfiniflow:mainfrom
xugangqiang:migrate-doc-get-filter

Conversation

@xugangqiang
Copy link
Copy Markdown
Contributor

@xugangqiang xugangqiang commented Apr 21, 2026

What problem does this PR solve?

Before consolidation
Web API: POST /v1/document/filter
Http API - GET /api/v1/datasets/<dataset_id>/documents

After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents?type=filter

Type of change

  • Refactoring

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

This pull request removes the POST /filter endpoint from the document app and consolidates document filtering functionality into a GET parameter on the dataset documents endpoint. The changes propagate through the API layer, service signatures, frontend client code, and test cases to align with the new endpoint structure.

Changes

Cohort / File(s) Summary
Backend Endpoint Removal
api/apps/document_app.py
Removed the @manager.route("/filter", methods=["POST"]) endpoint (get_filter) along with its authentication checks, request validation, and service integration logic.
Backend Filter Aggregation
api/apps/restful_apis/document_api.py
Added conditional filter aggregation in list_docs when type=filter query parameter is present; introduces new _aggregate_filters helper to compute counts for suffix, run status, and metadata values; refactored _get_docs_with_request to handle id parameter via renamed doc_ids argument.
Service Layer Signature
api/db/services/document_service.py
Updated DocumentService.get_by_kb_id signature to remove doc_id and doc_ids_filter parameters, replacing them with a single doc_ids parameter; adjusted filtering logic accordingly.
Test Utilities
test/testcases/test_web_api/test_common.py
Updated document_filter test helper to accept dataset_id parameter and changed from POST request to GET with query parameters; updated list_documents to use new URL construction pattern.
Metadata Tests
test/testcases/test_web_api/test_document_app/test_document_metadata.py
Updated test invocations to use new document_filter signature; removed unit tests that directly exercised the deleted get_filter endpoint; lowered priority marker for one test case.
Frontend Hook
web/src/hooks/use-document-request.ts
Simplified useGetDocumentFilter query execution to call documentFilter(knowledgeId || id) without passing keywords parameter.
Frontend Service
web/src/services/knowledge-service.ts
Changed documentFilter method from POST to GET and updated implementation to pass dataset ID as URL parameter.
Frontend API Utilities
web/src/utils/api.ts
Converted getDatasetFilter from a static URL string to a parameterized function that accepts datasetId and constructs dataset-scoped endpoint URL.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

☯️ refactor, size:XL, 🐖api, 🧪 test

Suggested reviewers

  • yingfeng
  • yuzhichang
  • JinHai-CN

Poem

🐰 A filter finds a cleaner home,
From POST to GET through code we roam,
The document endpoint takes a bow,
Consolidated queries show us how,
Refactoring hops along the way! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description addresses the core problem and includes the refactoring type, but lacks details on implementation scope, affected files, and testing approach.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately describes the main change: consolidating two document filtering endpoints into a single RESTful API, which is reflected across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@xugangqiang xugangqiang self-assigned this Apr 21, 2026
@xugangqiang xugangqiang added the ci Continue Integration label Apr 21, 2026
@xugangqiang xugangqiang changed the title Refactor: Consolidation WEB API & HTTP API for document get_filter WIP Refactor: Consolidation WEB API & HTTP API for document get_filter Apr 21, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 21, 2026 05:59
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. ☯️ refactor Pull request that refactor/refine code labels Apr 21, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)

671-727: Consider: Code duplication with DocumentService.get_filter_by_kb_id.

The _aggregate_filters function duplicates the aggregation logic from DocumentService.get_filter_by_kb_id (document_service.py lines 189-276). Both functions:

  • Count documents by suffix
  • Count documents by run status
  • Aggregate metadata field values
  • Track empty metadata count

If the filter aggregation needs to operate on in-memory documents, consider extracting the shared aggregation logic to a utility function. However, as noted in the previous comment, using the existing service method that performs SQL-level aggregation would be more efficient and eliminate this duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 671 - 727, The
_aggregate_filters function duplicates aggregation logic already implemented in
DocumentService.get_filter_by_kb_id; refactor by extracting the shared
aggregation into a single utility (e.g., function aggregate_document_filters)
used by both places or, preferably, replace the in-memory aggregation call in
document_api.py to invoke DocumentService.get_filter_by_kb_id so SQL-level
aggregation is reused; ensure the new utility or service method returns the same
structure (keys: "suffix", "run_status", "metadata" with "empty_metadata") and
update callers (_aggregate_filters or its callers) to use that central function.
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

151-156: Consider: Test now validates route matching rather than business logic.

The test test_filter_missing_kb_id changed from validating business logic (missing KB ID in request body) to validating route matching (empty path segment causes 405). While this still catches the error case, the exact assertion on the error message "<MethodNotAllowed '405: Method Not Allowed'>" is brittle.

Consider using a substring match or just checking the code:

-        assert "<MethodNotAllowed '405: Method Not Allowed'>" == res["message"], res
+        assert "405" in res["message"] or "Method Not Allowed" in res["message"], res

This makes the test more resilient to minor message format changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 151 - 156, The test test_filter_missing_kb_id should not assert the
exact MethodNotAllowed string; update the assertions in this test (which calls
document_filter with WebApiAuth and an empty kb_id path segment) to be
resilient: keep asserting res["code"] == 100 but replace the exact-match on
res["message"] with either a substring check like "MethodNotAllowed" in
res["message"] or drop the message assertion entirely and only assert the code;
this touches the test_filter_missing_kb_id function and its use of
document_filter/WebApiAuth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 439-451: The filter aggregation currently calls _aggregate_filters
over the paginated docs returned by _get_docs_with_request, causing incomplete
filter results; change the branch handling request.args.get("type") == "filter"
to use DocumentService.get_filter_by_kb_id (or otherwise query all docs without
pagination) to perform SQL-level aggregation across the entire dataset_id and
params, then return get_json_result with the full "filter" payload; keep the
existing map_doc_keys/get_json_result usage for the non-filter branch and avoid
iterating the paginated docs for aggregation in _aggregate_filters.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 671-727: The _aggregate_filters function duplicates aggregation
logic already implemented in DocumentService.get_filter_by_kb_id; refactor by
extracting the shared aggregation into a single utility (e.g., function
aggregate_document_filters) used by both places or, preferably, replace the
in-memory aggregation call in document_api.py to invoke
DocumentService.get_filter_by_kb_id so SQL-level aggregation is reused; ensure
the new utility or service method returns the same structure (keys: "suffix",
"run_status", "metadata" with "empty_metadata") and update callers
(_aggregate_filters or its callers) to use that central function.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 151-156: The test test_filter_missing_kb_id should not assert the
exact MethodNotAllowed string; update the assertions in this test (which calls
document_filter with WebApiAuth and an empty kb_id path segment) to be
resilient: keep asserting res["code"] == 100 but replace the exact-match on
res["message"] with either a substring check like "MethodNotAllowed" in
res["message"] or drop the message assertion entirely and only assert the code;
this touches the test_filter_missing_kb_id function and its use of
document_filter/WebApiAuth.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 65f3f092-aa2b-4223-bf0f-c4cdbdc245ee

📥 Commits

Reviewing files that changed from the base of the PR and between b3891ba and 3a5cd39.

📒 Files selected for processing (8)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • api/db/services/document_service.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/hooks/use-document-request.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts

Comment on lines +439 to +451
if request.args.get("type") == "filter":
docs_filter = _aggregate_filters(docs)
return get_json_result(data={"total": total, "filter": docs_filter})
else:
renamed_doc_list = [map_doc_keys(doc) for doc in docs]
for doc_item in renamed_doc_list:
if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
if doc_item.get("source_type"):
doc_item["source_type"] = doc_item["source_type"].split("/")[0]
if doc_item["parser_config"].get("metadata"):
doc_item["parser_config"]["metadata"] = turn2jsonschema(doc_item["parser_config"]["metadata"])
return get_json_result(data={"total": total, "docs": renamed_doc_list})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Filter aggregation is limited by pagination, returning incomplete results.

When type=filter is requested, the code fetches documents using _get_docs_with_request which applies pagination (default page_size=30). The _aggregate_filters function then only aggregates the paginated subset, not all documents in the dataset.

This means the filter counts will be incorrect and incomplete. For example, a dataset with 1000 documents across 10 different file types might only show 2-3 types if the first 30 documents happen to be of those types.

The previous implementation (DocumentService.get_filter_by_kb_id) performed SQL-level aggregation across all matching documents.

Consider either:

  1. Fetching all documents when type=filter (set page_size=0 or use a different query)
  2. Using the existing DocumentService.get_filter_by_kb_id method for filter aggregation
     if request.args.get("type") == "filter":
-        docs_filter = _aggregate_filters(docs)
-        return get_json_result(data={"total": total, "filter": docs_filter})
+        # Use dedicated filter aggregation that queries all documents
+        keywords = request.args.get("keywords", "")
+        run_status = request.args.getlist("run")
+        run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
+        run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
+        types = request.args.getlist("types")
+        suffix = request.args.getlist("suffix")
+        docs_filter, total = DocumentService.get_filter_by_kb_id(dataset_id, keywords, run_status_converted, types, suffix)
+        return get_json_result(data={"total": total, "filter": docs_filter})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 439 - 451, The filter
aggregation currently calls _aggregate_filters over the paginated docs returned
by _get_docs_with_request, causing incomplete filter results; change the branch
handling request.args.get("type") == "filter" to use
DocumentService.get_filter_by_kb_id (or otherwise query all docs without
pagination) to perform SQL-level aggregation across the entire dataset_id and
params, then return get_json_result with the full "filter" payload; keep the
existing map_doc_keys/get_json_result usage for the non-filter branch and avoid
iterating the paginated docs for aggregation in _aggregate_filters.

@yingfeng yingfeng changed the title WIP Refactor: Consolidation WEB API & HTTP API for document get_filter Refactor: Consolidation WEB API & HTTP API for document get_filter Apr 21, 2026
@JinHai-CN JinHai-CN merged commit 009e538 into infiniflow:main Apr 21, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continue Integration ☯️ refactor Pull request that refactor/refine code size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants