Refactor: Consolidation WEB API & HTTP API for document get_filter (#14248)
JinHai-CN merged 12 commits into infiniflow:main
Conversation
📝 Walkthrough
This pull request removes the POST `/v1/document/filter` web endpoint and consolidates document filtering into the RESTful `GET /api/v1/datasets/<dataset_id>/documents` endpoint behind a `type=filter` query parameter.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)
671-727: Consider: Code duplication with `DocumentService.get_filter_by_kb_id`.
The `_aggregate_filters` function duplicates the aggregation logic from `DocumentService.get_filter_by_kb_id` (document_service.py lines 189-276). Both functions:
- Count documents by suffix
- Count documents by run status
- Aggregate metadata field values
- Track empty metadata count
If the filter aggregation needs to operate on in-memory documents, consider extracting the shared aggregation logic to a utility function. However, as noted in the previous comment, using the existing service method that performs SQL-level aggregation would be more efficient and eliminate this duplication.
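If the in-memory path is kept, the shared helper suggested above might look like this. This is a minimal sketch: the function name `aggregate_document_filters`, the document field names (`suffix`, `run`, `meta_fields`), and the exact counting rules are illustrative assumptions; only the return-value keys (`suffix`, `run_status`, `metadata` with `empty_metadata`) follow the shape described in this review.

```python
def aggregate_document_filters(docs):
    """Aggregate filter facets over an iterable of document dicts.

    Illustrative sketch of a shared utility both call sites could use.
    Returns: {"suffix": {...}, "run_status": {...},
              "metadata": {<field>: {<value>: count}, "empty_metadata": n}}
    """
    suffix_counts = {}
    run_status_counts = {}
    metadata_values = {}
    empty_metadata = 0
    for doc in docs:
        # Count documents by file suffix.
        suffix = doc.get("suffix", "")
        suffix_counts[suffix] = suffix_counts.get(suffix, 0) + 1
        # Count documents by run status.
        run = doc.get("run", "")
        run_status_counts[run] = run_status_counts.get(run, 0) + 1
        # Aggregate metadata field values, tracking docs with no metadata.
        meta = doc.get("meta_fields") or {}
        if not meta:
            empty_metadata += 1
        for key, value in meta.items():
            metadata_values.setdefault(key, {})
            metadata_values[key][value] = metadata_values[key].get(value, 0) + 1
    metadata_values["empty_metadata"] = empty_metadata
    return {"suffix": suffix_counts, "run_status": run_status_counts, "metadata": metadata_values}
```

Extracting one helper like this keeps the two call sites consistent, though (as noted above) reusing the SQL-level service method avoids the duplication entirely.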
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 671 - 727, The _aggregate_filters function duplicates aggregation logic already implemented in DocumentService.get_filter_by_kb_id; refactor by extracting the shared aggregation into a single utility (e.g., function aggregate_document_filters) used by both places or, preferably, replace the in-memory aggregation call in document_api.py to invoke DocumentService.get_filter_by_kb_id so SQL-level aggregation is reused; ensure the new utility or service method returns the same structure (keys: "suffix", "run_status", "metadata" with "empty_metadata") and update callers (_aggregate_filters or its callers) to use that central function.test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
151-156: Consider: Test now validates route matching rather than business logic.
The test `test_filter_missing_kb_id` changed from validating business logic (missing KB ID in request body) to validating route matching (an empty path segment causes a 405). While this still catches the error case, the exact assertion on the error message `"<MethodNotAllowed '405: Method Not Allowed'>"` is brittle. Consider using a substring match or just checking the code:

```diff
- assert "<MethodNotAllowed '405: Method Not Allowed'>" == res["message"], res
+ assert "405" in res["message"] or "Method Not Allowed" in res["message"], res
```

This makes the test more resilient to minor message format changes.
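The resilience argument is easy to demonstrate: a substring check accepts both the exact Werkzeug repr and plausible reformatted variants of it (the sample strings below are illustrative, only the first matches the current message verbatim).

```python
def is_method_not_allowed(message):
    # Check the essentials rather than pinning the framework's exact repr.
    return "405" in message or "Method Not Allowed" in message

# Passes for the current exact message and for minor format changes alike.
assert is_method_not_allowed("<MethodNotAllowed '405: Method Not Allowed'>")
assert is_method_not_allowed("405 Method Not Allowed")
# Still rejects unrelated errors.
assert not is_method_not_allowed("404 Not Found")
```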
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 151 - 156, The test test_filter_missing_kb_id should not assert the exact MethodNotAllowed string; update the assertions in this test (which calls document_filter with WebApiAuth and an empty kb_id path segment) to be resilient: keep asserting res["code"] == 100 but replace the exact-match on res["message"] with either a substring check like "MethodNotAllowed" in res["message"] or drop the message assertion entirely and only assert the code; this touches the test_filter_missing_kb_id function and its use of document_filter/WebApiAuth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 65f3f092-aa2b-4223-bf0f-c4cdbdc245ee
📒 Files selected for processing (8)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- api/db/services/document_service.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
```python
if request.args.get("type") == "filter":
    docs_filter = _aggregate_filters(docs)
    return get_json_result(data={"total": total, "filter": docs_filter})
else:
    renamed_doc_list = [map_doc_keys(doc) for doc in docs]
    for doc_item in renamed_doc_list:
        if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
            doc_item["thumbnail"] = f"/v1/document/image/{dataset_id}-{doc_item['thumbnail']}"
        if doc_item.get("source_type"):
            doc_item["source_type"] = doc_item["source_type"].split("/")[0]
        if doc_item["parser_config"].get("metadata"):
            doc_item["parser_config"]["metadata"] = turn2jsonschema(doc_item["parser_config"]["metadata"])
    return get_json_result(data={"total": total, "docs": renamed_doc_list})
```
Critical: Filter aggregation is limited by pagination, returning incomplete results.
When `type=filter` is requested, the code fetches documents using `_get_docs_with_request`, which applies pagination (default `page_size=30`). The `_aggregate_filters` function then only aggregates the paginated subset, not all documents in the dataset.
This means the filter counts will be incorrect and incomplete. For example, a dataset with 1000 documents across 10 different file types might only show 2-3 types if the first 30 documents happen to be of those types.
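The skew described above can be reproduced in miniature: aggregating only a paginated slice silently drops every file type that does not appear on the first page (toy data below, with counts shrunk for illustration).

```python
from collections import Counter

# 40 docs: the first 30 happen to all be PDFs, the last 10 are DOCX.
docs = [{"suffix": "pdf"}] * 30 + [{"suffix": "docx"}] * 10

page_size = 30
paginated = docs[:page_size]  # what _aggregate_filters would actually see

paginated_counts = Counter(d["suffix"] for d in paginated)
full_counts = Counter(d["suffix"] for d in docs)

print(dict(paginated_counts))  # {'pdf': 30} — 'docx' is missing entirely
print(dict(full_counts))       # {'pdf': 30, 'docx': 10}
```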
The previous implementation (DocumentService.get_filter_by_kb_id) performed SQL-level aggregation across all matching documents.
Consider either:
- Fetching all documents when `type=filter` (set `page_size=0` or use a different query)
- Using the existing `DocumentService.get_filter_by_kb_id` method for filter aggregation
```diff
 if request.args.get("type") == "filter":
-    docs_filter = _aggregate_filters(docs)
-    return get_json_result(data={"total": total, "filter": docs_filter})
+    # Use dedicated filter aggregation that queries all documents
+    keywords = request.args.get("keywords", "")
+    run_status = request.args.getlist("run")
+    run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
+    run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
+    types = request.args.getlist("types")
+    suffix = request.args.getlist("suffix")
+    docs_filter, total = DocumentService.get_filter_by_kb_id(dataset_id, keywords, run_status_converted, types, suffix)
+    return get_json_result(data={"total": total, "filter": docs_filter})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 439 - 451, The filter
aggregation currently calls _aggregate_filters over the paginated docs returned
by _get_docs_with_request, causing incomplete filter results; change the branch
handling request.args.get("type") == "filter" to use
DocumentService.get_filter_by_kb_id (or otherwise query all docs without
pagination) to perform SQL-level aggregation across the entire dataset_id and
params, then return get_json_result with the full "filter" payload; keep the
existing map_doc_keys/get_json_result usage for the non-filter branch and avoid
iterating the paginated docs for aggregation in _aggregate_filters.
What problem does this PR solve?
Before consolidation:
- Web API: POST /v1/document/filter
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: GET /api/v1/datasets/<dataset_id>/documents?type=filter
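The consolidated behaviour can be sketched as a single handler that branches on the `type` query parameter. This is a toy dispatcher with an in-memory store; all names and the response shapes here are illustrative, not the project's actual code.

```python
def list_documents(dataset_id, args, docs_by_dataset):
    """Toy version of GET /api/v1/datasets/<dataset_id>/documents."""
    docs = docs_by_dataset.get(dataset_id, [])
    if args.get("type") == "filter":
        # Facet-aggregation path, replacing the old POST /v1/document/filter.
        suffix_counts = {}
        for doc in docs:
            suffix_counts[doc["suffix"]] = suffix_counts.get(doc["suffix"], 0) + 1
        return {"total": len(docs), "filter": {"suffix": suffix_counts}}
    # Default path: plain document listing.
    return {"total": len(docs), "docs": docs}

store = {"ds1": [{"name": "a.pdf", "suffix": "pdf"}, {"name": "b.pdf", "suffix": "pdf"}]}
print(list_documents("ds1", {"type": "filter"}, store))
# {'total': 2, 'filter': {'suffix': {'pdf': 2}}}
```

One URL, two behaviours: clients pick listing or filter facets with a query parameter instead of calling two differently shaped endpoints.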
Type of change
- [x] Refactoring