
Refactor: Doc metadata update #14289

Merged
yingfeng merged 11 commits into infiniflow:main from xugangqiang:doc-metadata-update
Apr 23, 2026

Conversation

@xugangqiang
Contributor

@xugangqiang xugangqiang commented Apr 22, 2026

What problem does this PR solve?

Before migration (Web API):
POST /v1/document/metadata/update

After migration (RESTful API):
PATCH /api/v2/datasets/<dataset_id>/documents/metadatas

Type of change

  • Refactoring
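For consumers migrating off the old route, here is a minimal sketch of the new request shape. The route and the payload fields (selector, updates, deletes) come from this PR's changes; the helper name, base URL, and example values are illustrative only.

```python
# Hypothetical helper sketching the migrated call shape. The route and the
# {selector, updates, deletes} payload are described in this PR; everything
# else here (names, values) is illustrative.
def build_metadata_update_request(base_url, dataset_id, document_ids, updates, deletes=None):
    url = f"{base_url}/api/v2/datasets/{dataset_id}/documents/metadatas"
    payload = {
        "selector": {"document_ids": document_ids},
        "updates": updates,        # e.g. [{"key": "author", "value": "alice"}]
        "deletes": deletes or [],  # e.g. [{"key": "status"}]
    }
    return "PATCH", url, payload

method, url, payload = build_metadata_update_request(
    "http://localhost", "ds-1", ["doc-1"], [{"key": "author", "value": "alice"}]
)
print(method, url)
```

The tuple can then be fed to whatever HTTP client the consumer already uses; the point is only the method, path, and body shape that replace the old POST /v1/document/metadata/update call.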

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.
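For reference, this setting can presumably be placed in a repository-level .coderabbit.yaml; the sketch below infers the nesting from the dotted setting path above. The structure and the example value are assumptions, so check the CodeRabbit configuration docs for the authoritative schema.

```yaml
reviews:
  auto_review:
    # Number of reviewed commits after which automatic reviews pause;
    # the value shown is illustrative, not a documented default.
    auto_pause_after_reviewed_commits: 10
```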

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Removed legacy document-app POST handlers and SDK endpoint; added dataset-scoped RESTful endpoints for batch metadata updates (PATCH) and per-document parser config (PUT); updated tests, frontend API surface, and UI hooks to call the new routes and request shapes.

Changes

Cohort / File(s) / Summary

  • Removed legacy handlers (api/apps/document_app.py, api/apps/sdk/doc.py):
    Deleted the old POST endpoints /metadata/update and /update_metadata_setting and their SDK wrapper.
  • New RESTful API (api/apps/restful_apis/document_api.py):
    Added PATCH /datasets/<dataset_id>/documents/metadatas (selector/updates/deletes validation, dataset and document checks, optional metadata_condition filtering, calls DocMetadataService.batch_update_metadata) and PUT /datasets/<dataset_id>/documents/<document_id>/metadata/config (updates parser config via DocumentService.update_parser_config).
  • Backend tests & helpers (test/testcases/test_http_api/common.py, test/testcases/test_http_api/.../test_doc_sdk_routes_unit.py, test/testcases/test_http_api/.../test_metadata_batch_update.py, test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/test_document_metadata.py):
    Added an update_documents_metadata HTTP helper, removed the old SDK-route unit test, updated E2E/validation tests to use the new endpoints, and added fixtures and expanded test coverage for validation and success cases.
  • Frontend API surface (web/src/utils/api.ts, web/src/services/knowledge-service.ts):
    Replaced updateMetaData with updateDocumentsMetadata(datasetId), added updateDocumentMetaDataConfig, and changed payloads to { selector, updates, deletes } and { data: { metadata } }.
  • Frontend UI hook (web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts):
    Refactored the save flows to call updateDocumentsMetadata with { selector, updates, deletes } and updateDocumentMetaDataConfig for single-file settings; adjusted payload construction and conditional calls.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Frontend
  participant REST_API
  participant KB_Service
  participant Doc_Service
  participant DocMetadata_Service

  Client->>Frontend: trigger metadata save
  Frontend->>REST_API: PATCH /datasets/{dataset_id}/documents/metadatas {selector, updates, deletes}
  REST_API->>KB_Service: verify dataset ownership / accessible(dataset_id, tenant)
  alt dataset accessible
    REST_API->>Doc_Service: list/filter documents (document_ids or metadata_condition)
    Doc_Service->>Doc_Service: fetch & flatten metadata for filtering
    REST_API->>DocMetadata_Service: batch_update_metadata(dataset_id, selector, updates, deletes)
    DocMetadata_Service-->>REST_API: {updated, matched_docs}
    REST_API-->>Frontend: 200 {updated, matched_docs}
    Frontend-->>Client: show success
  else not accessible
    KB_Service-->>REST_API: error
    REST_API-->>Frontend: 4xx error
    Frontend-->>Client: show error
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

☯️ refactor

Suggested reviewers

  • yuzhichang
  • JinHai-CN
  • yingfeng

Poem

🐰 I hopped from old paths to RESTful trails,
PATCH and PUT tucked in my tiny tails,
Selectors prance and updates chime,
Configs settle, one request at a time,
Metadata morning — hop, hooray!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 78.95%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check (⚠️ Warning): the PR description is incomplete and lacks required detail. It only describes the endpoint migrations and omits problem context, motivation, and impact analysis. Resolution: expand the description with (1) why this refactoring is necessary, (2) what problem it solves, (3) the impact on existing code, (4) migration notes for consumers, and (5) the testing approach for the new endpoints.

✅ Passed checks (3 passed)

  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): the title accurately summarizes the main refactoring: moving document metadata update endpoints from the legacy Web API to a new RESTful API structure, which is the primary focus of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@xugangqiang xugangqiang force-pushed the doc-metadata-update branch from 517f8b9 to a4d33d7 on April 22, 2026 07:54
@xugangqiang xugangqiang added the ci Continue Integration label Apr 22, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 22, 2026 07:56
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. 🐖api The modified files are located under directory 'api/apps/sdk' 🧪 test Pull requests that update test cases. labels Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py (1)

419-481: ⚠️ Potential issue | 🔴 Critical

Replace the remaining stale module.metadata_batch_update calls.

module is no longer initialized in this test, so Line 424 and the later monkeypatches/calls will fail with NameError instead of testing document_api_module.update_metadata.

🐛 Proposed fix
         monkeypatch.setattr(
             document_api_module,
             "get_request_json",
             lambda: _AwaitableValue({"selector": {"document_ids": "doc-1"}, "updates": [], "deletes": []}),
         )
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert res["message"] == "document_ids must be a list."
 
         monkeypatch.setattr(
-            module,
+            document_api_module,
             "get_request_json",
             lambda: _AwaitableValue({"selector": {}, "updates": [{"key": ""}], "deletes": []}),
         )
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert "Each update requires key and value." in res["message"]
 
         monkeypatch.setattr(
-            module,
+            document_api_module,
             "get_request_json",
             lambda: _AwaitableValue({"selector": {}, "updates": [], "deletes": [{"x": "y"}]}),
         )
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert "Each delete requires key." in res["message"]
 
         monkeypatch.setattr(
-            module,
+            document_api_module,
             "get_request_json",
             lambda: _AwaitableValue(
                 {
@@
                 }
             ),
         )
-        monkeypatch.setattr(module.KnowledgebaseService, "list_documents_by_ids", lambda _ids: ["doc-1"])
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        monkeypatch.setattr(document_api_module.KnowledgebaseService, "list_documents_by_ids", lambda _ids: ["doc-1"])
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert "do not belong to dataset" in res["message"]
 
         monkeypatch.setattr(
-            module,
+            document_api_module,
             "get_request_json",
             lambda: _AwaitableValue(
                 {
@@
                 }
             ),
         )
-        monkeypatch.setattr(module, "meta_filter", lambda *_args, **_kwargs: [])
-        monkeypatch.setattr(module.DocMetadataService, "get_flatted_meta_by_kbs", lambda _kbs: [])
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        monkeypatch.setattr(document_api_module, "meta_filter", lambda *_args, **_kwargs: [])
+        monkeypatch.setattr(document_api_module.DocMetadataService, "get_flatted_meta_by_kbs", lambda _kbs: [])
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert res["code"] == 0
         assert res["data"]["updated"] == 0
         assert res["data"]["matched_docs"] == 0
 
-        monkeypatch.setattr(module, "meta_filter", lambda *_args, **_kwargs: ["doc-1"])
-        monkeypatch.setattr(module.DocMetadataService, "batch_update_metadata", lambda *_args, **_kwargs: 1)
-        res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+        monkeypatch.setattr(document_api_module, "meta_filter", lambda *_args, **_kwargs: ["doc-1"])
+        monkeypatch.setattr(document_api_module.DocMetadataService, "batch_update_metadata", lambda *_args, **_kwargs: 1)
+        res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
         assert res["code"] == 0
         assert res["data"]["updated"] == 1
         assert res["data"]["matched_docs"] == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`
around lines 419 - 481, The test still references the old module variable
(module.metadata_batch_update and module.* monkeypatches) which is no longer
initialized; replace those remaining uses with document_api_module equivalents
and call document_api_module.update_metadata.__wrapped__ instead of
module.metadata_batch_update.__wrapped__; also update monkeypatch.setattr
targets (e.g., module.KnowledgebaseService, module.meta_filter,
module.DocMetadataService) to document_api_module.KnowledgebaseService,
document_api_module.meta_filter, document_api_module.DocMetadataService so the
test patches and calls exercise document_api_module.update_metadata correctly.
web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts (1)

434-455: ⚠️ Potential issue | 🟡 Minor

Include id in the callback dependencies.

handleSaveSingleFileSettings reads id on line 440 but the dependency list on line 454 omits it. This violates the exhaustive-deps rule and can cause a stale closure if the dataset ID changes.

Proposed fix
-    [tableData, t, otherData],
+    [tableData, t, otherData, id],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts` around
lines 434 - 455, The callback handleSaveSingleFileSettings captures the outer
variable id but the dependency array only includes [tableData, t, otherData],
which can create a stale closure; update the dependency array for useCallback
that wraps handleSaveSingleFileSettings to include id (i.e., [tableData, t,
otherData, id]) so the function re-creates when the dataset id changes and
continues to call updateDocumentMetaDataConfig with the correct kb_id.
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

179-190: ⚠️ Potential issue | 🟠 Major

Update the remaining legacy helper calls at lines 181 and 188.

These test calls pass dictionaries as the second argument, but the helpers now require dataset_id as a separate positional parameter. The calls will fail with TypeError before reaching their assertions.

  • Line 181: document_infos() expects (auth, dataset_id, params=None, payload=None, ...) but receives (auth, dict)
  • Line 188: document_update_metadata_setting() expects (auth, dataset_id, doc_id, payload=None, ...) but receives (auth, dict)

Additionally, line 188 unpacks only one value from add_document_func, but the fixture now returns two values (dataset_id, doc_id).

Proposed fix
     `@pytest.mark.p3`
     def test_infos_invalid_doc_id(self, WebApiAuth):
-        res = document_infos(WebApiAuth, {"doc_ids": ["invalid_id"]})
+        res = document_infos(WebApiAuth, "kb_id", {"ids": ["invalid_id"]})
         assert res["code"] == 109, res
         assert "No authorization" in res["message"], res
 
     `@pytest.mark.p3`
     def test_update_metadata_setting_missing_metadata(self, WebApiAuth, add_document_func):
-        _, doc_id = add_document_func
-        res = document_update_metadata_setting(WebApiAuth, {"doc_id": doc_id})
+        dataset_id, doc_id = add_document_func
+        res = document_update_metadata_setting(WebApiAuth, dataset_id, doc_id, {})
         assert res["code"] == 101, res
         assert "required argument are missing" in res["message"], res
         assert "metadata" in res["message"], res
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 179 - 190, The failing tests call legacy helpers with a single dict
instead of the new signature; update test_infos_invalid_doc_id to call
document_infos(WebApiAuth, dataset_id, {"doc_ids": ["invalid_id"]}) by obtaining
dataset_id from the fixture (unpack add_document_func if needed) and update
test_update_metadata_setting to unpack add_document_func into (dataset_id,
doc_id) and call document_update_metadata_setting(WebApiAuth, dataset_id,
doc_id, payload=None or {} as appropriate) so dataset_id is passed as the
separate positional parameter required by document_infos and
document_update_metadata_setting.
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)

862-893: Add safe flow logs for the new metadata APIs.

These new Python flows should log start/result/error context such as dataset_id, document_id, target count, and updated count, but not metadata values. As per coding guidelines, **/*.py: Add logging for new flows.

🪵 Proposed logging additions
 async def update_metadata_config(tenant_id, dataset_id, document_id):
+    logging.info("Updating document metadata config: dataset_id=%s document_id=%s", dataset_id, document_id)
     # Verify ownership and existence of dataset
     if not KnowledgebaseService.query(id=dataset_id, tenant_id=tenant_id):
         return get_error_data_result(message="You don't own the dataset.")
@@
     try:
         DocumentService.update_parser_config(doc.id, {"metadata": req["metadata"]})
     except Exception as e:
         logging.error("error when update_parser_config", exc_info=e)
         return get_json_result(code=RetCode.EXCEPTION_ERROR, message=repr(e))
@@
-    return get_result(data=doc.to_dict())
+    logging.info("Updated document metadata config: dataset_id=%s document_id=%s", dataset_id, document_id)
+    return get_result(data=doc.to_dict())
@@
 async def update_metadata(tenant_id, dataset_id):
+    logging.info("Batch updating document metadata: dataset_id=%s", dataset_id)
     # Verify ownership of dataset
     if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
         return get_error_data_result(message=f"You don't own the dataset {dataset_id}.")
@@
     target_doc_ids = list(target_doc_ids)
     updated = DocMetadataService.batch_update_metadata(dataset_id, target_doc_ids, updates, deletes)
+    logging.info(
+        "Batch updated document metadata: dataset_id=%s matched_docs=%s updated=%s",
+        dataset_id,
+        len(target_doc_ids),
+        updated,
+    )
     return get_result(data={"updated": updated, "matched_docs": len(target_doc_ids)})

Also applies to: 959-1021

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 862 - 893, The new
metadata update flow lacks structured safe logs; add logging calls (without
metadata values) at start, on error, and on successful result around the block
that verifies dataset/document and calls DocumentService.update_parser_config
and DocumentService.get_by_id: log dataset_id and document_id at entry, log
target count and updated count after update (e.g., before/after sizes or a
success flag), and on exceptions log dataset_id/document_id and the exception
message (but never log req["metadata"]); reference the existing symbols
KnowledgebaseService.query, DocumentService.query,
DocumentService.update_parser_config, and DocumentService.get_by_id when
inserting these logs.
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

257-269: Assert that the metadata was actually persisted.

This success test only checks code == 0, so it would miss a no-op implementation that returns success without updating the document metadata.

🧪 Proposed stronger assertion
         )
         assert res["code"] == 0, res
+        info_res = document_infos(WebApiAuth, kb_id, {"ids": [doc_id]})
+        assert info_res["code"] == 0, info_res
+        meta_fields = info_res["data"]["docs"][0].get("meta_fields", {})
+        assert meta_fields.get("author") == "test_author", info_res

Based on learnings: Applies to tests/**/*.py : Add/adjust tests for behavior changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 257 - 269, Extend test_update_metadata_success to verify
persistence: after calling document_metadata_update(WebApiAuth, kb_id, ...),
fetch the document metadata for doc_id (e.g., via the existing document metadata
read endpoint or document read helper) and assert the 'author' key equals
"test_author" (or that the metadata contains that key/value). Use the same kb_id
and doc_id from add_document_func to locate the record and fail the test if the
persisted metadata doesn't match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ae7c72cf-fe31-4f30-8bee-4829ecbf37da

📥 Commits

Reviewing files that changed from the base of the PR and between f853a39 and a4d33d7.

📒 Files selected for processing (11)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • api/apps/sdk/doc.py
  • test/testcases/test_http_api/common.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (2)
  • api/apps/sdk/doc.py
  • api/apps/document_app.py

Comment on lines +1008 to +1016
    # Apply metadata_condition filtering if provided
    if metadata_condition:
        metas = DocMetadataService.get_flatted_meta_by_kbs([dataset_id])
        filtered_ids = set(
            meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
        )
        target_doc_ids = target_doc_ids & filtered_ids
        if metadata_condition.get("conditions") and not target_doc_ids:
            return get_result(data={"updated": 0, "matched_docs": 0})


⚠️ Potential issue | 🟠 Major

Handle metadata_condition without document_ids.

When the selector only contains metadata_condition, target_doc_ids is still an empty set, so Line 1014 turns every filtered_ids result into an empty match. Use the filtered set as the initial target when no explicit document IDs were provided.

🐛 Proposed fix
     if metadata_condition:
         metas = DocMetadataService.get_flatted_meta_by_kbs([dataset_id])
         filtered_ids = set(
             meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
         )
-        target_doc_ids = target_doc_ids & filtered_ids
+        target_doc_ids = target_doc_ids & filtered_ids if target_doc_ids else filtered_ids
         if metadata_condition.get("conditions") and not target_doc_ids:
             return get_result(data={"updated": 0, "matched_docs": 0})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1008 - 1016, The
metadata-only selector path incorrectly intersects filtered_ids with an
initially empty target_doc_ids, producing no matches; update the logic in
document_api.py so that when metadata_condition is present and target_doc_ids is
empty, you set target_doc_ids = filtered_ids (instead of intersecting),
otherwise keep the intersection; use DocMetadataService.get_flatted_meta_by_kbs,
meta_filter, and convert_conditions as currently used to compute filtered_ids,
and preserve the existing early return that calls get_result({"updated": 0,
"matched_docs": 0}) when metadata_condition has conditions and target_doc_ids
becomes empty after filtering.
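Distilled to plain Python, the pitfall is set intersection against an initially empty target set. A standalone sketch of the buggy and fixed behavior (function names here are hypothetical, not the actual handler code):

```python
# When the selector carries no explicit document_ids, target starts empty
# and `&` discards every metadata-filtered match.
def match_buggy(document_ids, filtered_ids):
    target = set(document_ids)
    return target & filtered_ids

def match_fixed(document_ids, filtered_ids):
    target = set(document_ids)
    # Fall back to the metadata-filtered set when no explicit IDs were given.
    return target & filtered_ids if target else filtered_ids

assert match_buggy([], {"doc-1", "doc-2"}) == set()           # metadata-only selector matches nothing
assert match_fixed([], {"doc-1", "doc-2"}) == {"doc-1", "doc-2"}
assert match_fixed(["doc-1"], {"doc-1", "doc-2"}) == {"doc-1"}
```

The conditional fallback preserves the intersection semantics when explicit IDs are supplied, while letting a metadata-only selector match the filtered documents.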

@xugangqiang xugangqiang changed the title WIP Doc metadata update WIP Refactor: Doc metadata update Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)

109-387: Class-scoped dataset is shared across state-mutating tests; validation tests pay the upload cost for no reason.

Two design concerns with the new dataset_with_docs usage:

  1. TestMetadataBatchUpdateSuccess tests all mutate metadata on the same 5 docs (author/status/category keys). Each test currently re-seeds its own pre-state so in-file order works, but this is fragile under pytest-xdist, pytest-randomly, or selective reruns (e.g. running test_batch_delete_metadata alone after another test already removed status). Prefer a function-scoped fixture that uploads fresh docs (or resets metadata) per test.
  2. TestMetadataBatchUpdateValidation pulls in dataset_with_docs for tests that only need a valid dataset_id (selector/updates/deletes shape checks). This costs 5 uploads per test unnecessarily. Consider a lightweight empty_dataset fixture for validation-only cases and reserve document seeding for the success suite.
Sketch
`@pytest.fixture`
def empty_dataset(HttpApiAuth, add_dataset):
    return add_dataset  # just the dataset_id, no docs

`@pytest.fixture`
def dataset_with_docs(HttpApiAuth, add_dataset, ragflow_tmp_dir):
    dataset_id = add_dataset
    fps = []
    for i in range(5):
        fp = ragflow_tmp_dir / f"test_doc_{i}.txt"
        fp.write_text(f"Test document content {i}\n" * 10)
        fps.append(fp)
    upload_res = upload_documents(HttpApiAuth, dataset_id, fps)
    assert upload_res["code"] == 0
    return dataset_id, [doc["id"] for doc in upload_res["data"]]

Then switch validation tests to empty_dataset and keep success tests on a function-scoped dataset_with_docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`
around lines 109 - 387, The test suite is using a class-scoped dataset_with_docs
for both validation and mutable success tests causing shared state and
unnecessary uploads; change fixtures and test usages so validation tests use a
lightweight function-scoped empty_dataset (returns only dataset_id) and success
tests use a function-scoped dataset_with_docs that seeds 5 fresh documents per
test (or resets metadata) before each test. Specifically, add an empty_dataset
fixture that returns add_dataset without uploading files, modify
dataset_with_docs to create and upload 5 docs per call (ensure
upload_documents(HttpApiAuth, dataset_id, fps) is asserted and returns ids), and
update TestMetadataBatchUpdateValidation to depend on empty_dataset instead of
dataset_with_docs while keeping TestMetadataBatchUpdateSuccess using the new
function-scoped dataset_with_docs to avoid cross-test interference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8f5ddd10-5db7-46e8-920c-7818d2e49bbf

📥 Commits

Reviewing files that changed from the base of the PR and between a4d33d7 and 91c6363.

📒 Files selected for processing (2)
  • api/apps/document_app.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
💤 Files with no reviewable changes (1)
  • api/apps/document_app.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)

92-92: ⚠️ Potential issue | 🟡 Minor

Fix the stale expected-count message.

The assertion checks 1010, but the failure message says 1100, which makes failures misleading.

Proposed fix
-        assert res["data"]["updated"] == 1010, f"Expected 1100 documents updated, got {res['data']['updated']}"
+        assert res["data"]["updated"] == 1010, f"Expected 1010 documents updated, got {res['data']['updated']}"
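Beyond fixing the literal, one illustrative pattern (not from the PR) that prevents this kind of drift is to name the expected count once and derive the failure message from it:

```python
def check_updated(res, expected):
    """Assert the batch-update count, with a message derived from `expected`.

    Because both the comparison and the message read the same variable,
    they can never disagree the way a hard-coded 1010/1100 pair can.
    """
    updated = res["data"]["updated"]
    assert updated == expected, f"Expected {expected} documents updated, got {updated}"

# Passes when the counts match.
check_updated({"data": {"updated": 1010}}, 1010)
```

`res` here is assumed to have the same shape as the API response used in the test (`res["data"]["updated"]`).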
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`
at line 92, The assertion compares res["data"]["updated"] to 1010 but the
failure message incorrectly states "Expected 1100 documents updated"; update the
failure message to reflect 1010 (or change the asserted expected value if 1100
is correct) so the assertion and its message match — locate the assertion using
res["data"]["updated"] in test_metadata_batch_update.py and make the message
consistent with the asserted value.
🧹 Nitpick comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)

254-287: Make the metadata-condition success case prove filtering.

This test gives every selected document category=test_category, so a bug that updates all selected IDs whenever the condition exists could still pass. Seed only a subset, assert subset counts, and verify non-matching docs stay unchanged.

Proposed stronger test shape
-        # First, set initial metadata
+        # First, set initial metadata only on a subset so the filter is observable.
+        matching_ids = document_ids[:2]
+        non_matching_ids = document_ids[2:]
         updates = [{"key": "category", "value": "test_category"}]
         res = update_documents_metadata(
             HttpApiAuth,
             dataset_id,
-            {"selector": {"document_ids": document_ids}, "updates": updates, "deletes": []},
+            {"selector": {"document_ids": matching_ids}, "updates": updates, "deletes": []},
         )
 
         assert res["code"] == 0
-        assert res["data"]["updated"] == 5
-        assert res["data"]["matched_docs"] == 5
+        assert res["data"]["updated"] == len(matching_ids)
+        assert res["data"]["matched_docs"] == len(matching_ids)
 
         # Now update only documents with category="test_category"
         updates = [{"key": "author", "value": "filtered_author"}]
         res = update_documents_metadata(
@@
 
         assert res["code"] == 0, f"Expected code 0, got {res.get('code')}: {res.get('message')}"
-        assert res["data"]["updated"] == 5
-        assert res["data"]["matched_docs"] == 5
+        assert res["data"]["updated"] == len(matching_ids)
+        assert res["data"]["matched_docs"] == len(matching_ids)
+
+        list_res = list_documents(HttpApiAuth, dataset_id, {"ids": document_ids})
+        assert list_res["code"] == 0
+        docs_by_id = {doc["id"]: doc for doc in list_res["data"]["docs"]}
+        for doc_id in matching_ids:
+            assert docs_by_id[doc_id]["meta_fields"].get("author") == "filtered_author"
+        for doc_id in non_matching_ids:
+            assert docs_by_id[doc_id]["meta_fields"].get("author") != "filtered_author"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`
around lines 254 - 287, Change the test_batch_update_with_metadata_condition to
actually prove filtering by seeding the "category" metadata on only a subset of
document_ids (e.g., first N of the returned document_ids from dataset_with_docs)
instead of all documents, then call update_documents_metadata with the same
metadata_condition and assert that res["data"]["updated"] and
res["data"]["matched_docs"] equal that subset size (not the full set) and that
the remaining documents do not have the new "author" metadata; keep references
to HttpApiAuth, dataset_with_docs, update_documents_metadata and document_ids to
locate the code to modify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4fef31f-2fdd-4227-ae3d-579bfe50b0ed

📥 Commits

Reviewing files that changed from the base of the PR and between bb00e96 and ce4de54.

📒 Files selected for processing (4)
  • test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/utils/api.ts
  • web/src/services/knowledge-service.ts

Comment on lines 103 to 104
# Cleanup
delete_documents(HttpApiAuth, dataset_id, {"ids": document_ids})

⚠️ Potential issue | 🟡 Minor

Make bulk-upload cleanup failure-safe.

If any assertion above fails, these 1,010 uploaded documents are not deleted. Register cleanup with request.addfinalizer or wrap the verification block in try/finally, matching the safer pattern used by dataset_with_docs.
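A minimal sketch of the try/finally variant the comment suggests. `delete_documents` and `HttpApiAuth` are the test suite's own helpers; the stub below only stands in for them, and the ids are made up:

```python
# Stand-in for the suite's delete_documents helper; only control flow matters here.
deleted = []

def delete_documents_stub(dataset_id, document_ids):
    deleted.append((dataset_id, tuple(document_ids)))

def verify_and_cleanup(dataset_id, document_ids):
    try:
        # ... verification assertions on the uploaded documents go here ...
        assert len(document_ids) == 1010, "simulated failing assertion"
    finally:
        # Runs whether the assertions above pass or fail.
        delete_documents_stub(dataset_id, document_ids)

try:
    verify_and_cleanup("ds-1", ["doc-1"])  # fails the assertion on purpose
except AssertionError:
    pass

# The cleanup still ran despite the failure.
assert deleted == [("ds-1", ("doc-1",))]
```

The `request.addfinalizer` route achieves the same guarantee by registering the delete call with pytest up front, so it also covers failures in later fixtures, at the cost of tying the test to the `request` fixture.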

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`
around lines 103 - 104, The cleanup call delete_documents(HttpApiAuth,
dataset_id, {"ids": document_ids}) is executed directly so if an assertion fails
earlier the uploaded 1,010 documents won’t be removed; change this to a
failure-safe pattern by registering the cleanup via request.addfinalizer(lambda:
delete_documents(HttpApiAuth, dataset_id, {"ids": document_ids})) or wrap the
verification block in try/finally and call delete_documents in the finally so
delete_documents, HttpApiAuth, dataset_id and document_ids are always cleaned up
(match the pattern used in dataset_with_docs).

@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.97%. Comparing base (c08cd8e) to head (2aeb041).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14289      +/-   ##
==========================================
+ Coverage   95.25%   95.97%   +0.71%     
==========================================
  Files          10       10              
  Lines         695      695              
  Branches      111      111              
==========================================
+ Hits          662      667       +5     
+ Misses         15       11       -4     
+ Partials       18       17       -1     

☔ View full report in Codecov by Sentry.

@xugangqiang xugangqiang changed the title WIP Refactor: Doc metadata update Refactor: Doc metadata update Apr 23, 2026
@xugangqiang xugangqiang self-assigned this Apr 23, 2026
@yingfeng yingfeng merged commit dbf8c6e into infiniflow:main Apr 23, 2026
2 checks passed

Labels

🐖api The modified files are located under directory 'api/apps/sdk' ci Continue Integration size:L This PR changes 100-499 lines, ignoring generated files. 🧪 test Pull requests that update test cases.
