Refactor: Doc metadata update #14289
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Removed legacy document-app POST handlers and SDK endpoint; added dataset-scoped RESTful endpoints for batch metadata updates (PATCH) and per-document parser config (PUT); updated tests, frontend API surface, and UI hooks to call the new routes and request shapes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant REST_API
    participant KB_Service
    participant Doc_Service
    participant DocMetadata_Service
    Client->>Frontend: trigger metadata save
    Frontend->>REST_API: PATCH /datasets/{dataset_id}/documents/metadatas {selector, updates, deletes}
    REST_API->>KB_Service: verify dataset ownership / accessible(dataset_id, tenant)
    alt dataset accessible
        REST_API->>Doc_Service: list/filter documents (document_ids or metadata_condition)
        Doc_Service->>Doc_Service: fetch & flatten metadata for filtering
        REST_API->>DocMetadata_Service: batch_update_metadata(dataset_id, selector, updates, deletes)
        DocMetadata_Service-->>REST_API: {updated, matched_docs}
        REST_API-->>Frontend: 200 {updated, matched_docs}
        Frontend-->>Client: show success
    else not accessible
        KB_Service-->>REST_API: error
        REST_API-->>Frontend: 4xx error
        Frontend-->>Client: show error
    end
```
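The flow above can be sketched from the client side. The path and the `{selector, updates, deletes}` body shape come from this PR's walkthrough; the host, port, auth scheme, and all field values below are illustrative assumptions, not part of the PR.

```python
import json
import urllib.request


def build_metadata_patch(base_url, api_key, dataset_id, selector, updates, deletes):
    """Build a PATCH request for the new batch metadata endpoint (sketch only)."""
    url = f"{base_url}/api/v2/datasets/{dataset_id}/documents/metadatas"
    body = json.dumps({"selector": selector, "updates": updates, "deletes": deletes}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
    )


req = build_metadata_patch(
    "http://localhost:9380",  # hypothetical host/port
    "YOUR_API_KEY",           # hypothetical credential
    "ds-1",
    {"document_ids": ["doc-1", "doc-2"]},
    [{"key": "author", "value": "alice"}],
    [{"key": "status"}],
)
print(req.get_method(), req.full_url)
# A real caller would then urllib.request.urlopen(req) and read
# {"updated": ..., "matched_docs": ...} from the JSON response.
```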
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Force-pushed from 517f8b9 to a4d33d7
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py (1)
419-481: ⚠️ Potential issue | 🔴 Critical

Replace the remaining stale `module.metadata_batch_update` calls.

`module` is no longer initialized in this test, so line 424 and the later monkeypatches/calls will fail with `NameError` instead of testing `document_api_module.update_metadata`.

🐛 Proposed fix
```diff
     monkeypatch.setattr(
         document_api_module,
         "get_request_json",
         lambda: _AwaitableValue({"selector": {"document_ids": "doc-1"}, "updates": [], "deletes": []}),
     )
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert res["message"] == "document_ids must be a list."
     monkeypatch.setattr(
-        module,
+        document_api_module,
         "get_request_json",
         lambda: _AwaitableValue({"selector": {}, "updates": [{"key": ""}], "deletes": []}),
     )
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert "Each update requires key and value." in res["message"]
     monkeypatch.setattr(
-        module,
+        document_api_module,
         "get_request_json",
         lambda: _AwaitableValue({"selector": {}, "updates": [], "deletes": [{"x": "y"}]}),
     )
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert "Each delete requires key." in res["message"]
     monkeypatch.setattr(
-        module,
+        document_api_module,
         "get_request_json",
         lambda: _AwaitableValue(
             {
@@
             }
         ),
     )
-    monkeypatch.setattr(module.KnowledgebaseService, "list_documents_by_ids", lambda _ids: ["doc-1"])
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    monkeypatch.setattr(document_api_module.KnowledgebaseService, "list_documents_by_ids", lambda _ids: ["doc-1"])
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert "do not belong to dataset" in res["message"]
     monkeypatch.setattr(
-        module,
+        document_api_module,
         "get_request_json",
         lambda: _AwaitableValue(
             {
@@
             }
         ),
     )
-    monkeypatch.setattr(module, "meta_filter", lambda *_args, **_kwargs: [])
-    monkeypatch.setattr(module.DocMetadataService, "get_flatted_meta_by_kbs", lambda _kbs: [])
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    monkeypatch.setattr(document_api_module, "meta_filter", lambda *_args, **_kwargs: [])
+    monkeypatch.setattr(document_api_module.DocMetadataService, "get_flatted_meta_by_kbs", lambda _kbs: [])
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert res["code"] == 0
     assert res["data"]["updated"] == 0
     assert res["data"]["matched_docs"] == 0
-    monkeypatch.setattr(module, "meta_filter", lambda *_args, **_kwargs: ["doc-1"])
-    monkeypatch.setattr(module.DocMetadataService, "batch_update_metadata", lambda *_args, **_kwargs: 1)
-    res = _run(module.metadata_batch_update.__wrapped__("ds-1", "tenant-1"))
+    monkeypatch.setattr(document_api_module, "meta_filter", lambda *_args, **_kwargs: ["doc-1"])
+    monkeypatch.setattr(document_api_module.DocMetadataService, "batch_update_metadata", lambda *_args, **_kwargs: 1)
+    res = _run(document_api_module.update_metadata.__wrapped__("ds-1"))
     assert res["code"] == 0
     assert res["data"]["updated"] == 1
     assert res["data"]["matched_docs"] == 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py` around lines 419 - 481, The test still references the old module variable (module.metadata_batch_update and module.* monkeypatches) which is no longer initialized; replace those remaining uses with document_api_module equivalents and call document_api_module.update_metadata.__wrapped__ instead of module.metadata_batch_update.__wrapped__; also update monkeypatch.setattr targets (e.g., module.KnowledgebaseService, module.meta_filter, module.DocMetadataService) to document_api_module.KnowledgebaseService, document_api_module.meta_filter, document_api_module.DocMetadataService so the test patches and calls exercise document_api_module.update_metadata correctly.

web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts (1)
434-455: ⚠️ Potential issue | 🟡 Minor

Include `id` in the callback dependencies.

`handleSaveSingleFileSettings` reads `id` on line 440 but the dependency list on line 454 omits it. This violates the exhaustive-deps rule and can cause a stale closure if the dataset ID changes.

Proposed fix
```diff
-    [tableData, t, otherData],
+    [tableData, t, otherData, id],
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts` around lines 434 - 455, The callback handleSaveSingleFileSettings captures the outer variable id but the dependency array only includes [tableData, t, otherData], which can create a stale closure; update the dependency array for useCallback that wraps handleSaveSingleFileSettings to include id (i.e., [tableData, t, otherData, id]) so the function re-creates when the dataset id changes and continues to call updateDocumentMetaDataConfig with the correct kb_id.

test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
179-190: ⚠️ Potential issue | 🟠 Major

Update the remaining legacy helper calls at lines 181 and 188.

These test calls pass dictionaries as the second argument, but the helpers now require `dataset_id` as a separate positional parameter. The calls will fail with `TypeError` before reaching their assertions.

- Line 181: `document_infos()` expects `(auth, dataset_id, params=None, payload=None, ...)` but receives `(auth, dict)`
- Line 188: `document_update_metadata_setting()` expects `(auth, dataset_id, doc_id, payload=None, ...)` but receives `(auth, dict)`

Additionally, line 188 unpacks only one value from `add_document_func`, but the fixture now returns two values `(dataset_id, doc_id)`.

Proposed fix
```diff
 @pytest.mark.p3
 def test_infos_invalid_doc_id(self, WebApiAuth):
-    res = document_infos(WebApiAuth, {"doc_ids": ["invalid_id"]})
+    res = document_infos(WebApiAuth, "kb_id", {"ids": ["invalid_id"]})
     assert res["code"] == 109, res
     assert "No authorization" in res["message"], res

 @pytest.mark.p3
 def test_update_metadata_setting_missing_metadata(self, WebApiAuth, add_document_func):
-    _, doc_id = add_document_func
-    res = document_update_metadata_setting(WebApiAuth, {"doc_id": doc_id})
+    dataset_id, doc_id = add_document_func
+    res = document_update_metadata_setting(WebApiAuth, dataset_id, doc_id, {})
     assert res["code"] == 101, res
     assert "required argument are missing" in res["message"], res
     assert "metadata" in res["message"], res
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 179 - 190, The failing tests call legacy helpers with a single dict instead of the new signature; update test_infos_invalid_doc_id to call document_infos(WebApiAuth, dataset_id, {"doc_ids": ["invalid_id"]}) by obtaining dataset_id from the fixture (unpack add_document_func if needed) and update test_update_metadata_setting to unpack add_document_func into (dataset_id, doc_id) and call document_update_metadata_setting(WebApiAuth, dataset_id, doc_id, payload=None or {} as appropriate) so dataset_id is passed as the separate positional parameter required by document_infos and document_update_metadata_setting.
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)
862-893: Add safe flow logs for the new metadata APIs.

These new Python flows should log start/result/error context such as `dataset_id`, `document_id`, target count, and updated count, but not metadata values. As per coding guidelines (`**/*.py`): Add logging for new flows.

🪵 Proposed logging additions
```diff
 async def update_metadata_config(tenant_id, dataset_id, document_id):
+    logging.info("Updating document metadata config: dataset_id=%s document_id=%s", dataset_id, document_id)
     # Verify ownership and existence of dataset
     if not KnowledgebaseService.query(id=dataset_id, tenant_id=tenant_id):
         return get_error_data_result(message="You don't own the dataset.")
@@
     try:
         DocumentService.update_parser_config(doc.id, {"metadata": req["metadata"]})
     except Exception as e:
         logging.error("error when update_parser_config", exc_info=e)
         return get_json_result(code=RetCode.EXCEPTION_ERROR, message=repr(e))
@@
-    return get_result(data=doc.to_dict())
+    logging.info("Updated document metadata config: dataset_id=%s document_id=%s", dataset_id, document_id)
+    return get_result(data=doc.to_dict())
@@
 async def update_metadata(tenant_id, dataset_id):
+    logging.info("Batch updating document metadata: dataset_id=%s", dataset_id)
     # Verify ownership of dataset
     if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
         return get_error_data_result(message=f"You don't own the dataset {dataset_id}.")
@@
     target_doc_ids = list(target_doc_ids)
     updated = DocMetadataService.batch_update_metadata(dataset_id, target_doc_ids, updates, deletes)
+    logging.info(
+        "Batch updated document metadata: dataset_id=%s matched_docs=%s updated=%s",
+        dataset_id,
+        len(target_doc_ids),
+        updated,
+    )
     return get_result(data={"updated": updated, "matched_docs": len(target_doc_ids)})
```

Also applies to: 959-1021
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 862 - 893, The new metadata update flow lacks structured safe logs; add logging calls (without metadata values) at start, on error, and on successful result around the block that verifies dataset/document and calls DocumentService.update_parser_config and DocumentService.get_by_id: log dataset_id and document_id at entry, log target count and updated count after update (e.g., before/after sizes or a success flag), and on exceptions log dataset_id/document_id and the exception message (but never log req["metadata"]); reference the existing symbols KnowledgebaseService.query, DocumentService.query, DocumentService.update_parser_config, and DocumentService.get_by_id when inserting these logs.

test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
257-269: Assert that the metadata was actually persisted.

This success test only checks `code == 0`, so it would miss a no-op implementation that returns success without updating the document metadata.

🧪 Proposed stronger assertion
```diff
     )
     assert res["code"] == 0, res
+    info_res = document_infos(WebApiAuth, kb_id, {"ids": [doc_id]})
+    assert info_res["code"] == 0, info_res
+    meta_fields = info_res["data"]["docs"][0].get("meta_fields", {})
+    assert meta_fields.get("author") == "test_author", info_res
```

Based on learnings: Applies to `tests/**/*.py`: Add/adjust tests for behavior changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 257 - 269, Extend test_update_metadata_success to verify persistence: after calling document_metadata_update(WebApiAuth, kb_id, ...), fetch the document metadata for doc_id (e.g., via the existing document metadata read endpoint or document read helper) and assert the 'author' key equals "test_author" (or that the metadata contains that key/value). Use the same kb_id and doc_id from add_document_func to locate the record and fail the test if the persisted metadata doesn't match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1008-1016: The metadata-only selector path incorrectly intersects
filtered_ids with an initially empty target_doc_ids, producing no matches;
update the logic in document_api.py so that when metadata_condition is present
and target_doc_ids is empty, you set target_doc_ids = filtered_ids (instead of
intersecting), otherwise keep the intersection; use
DocMetadataService.get_flatted_meta_by_kbs, meta_filter, and convert_conditions
as currently used to compute filtered_ids, and preserve the existing early
return that calls get_result({"updated": 0, "matched_docs": 0}) when
metadata_condition has conditions and target_doc_ids becomes empty after
filtering.
---
Outside diff comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`:
- Around line 419-481: The test still references the old module variable
(module.metadata_batch_update and module.* monkeypatches) which is no longer
initialized; replace those remaining uses with document_api_module equivalents
and call document_api_module.update_metadata.__wrapped__ instead of
module.metadata_batch_update.__wrapped__; also update monkeypatch.setattr
targets (e.g., module.KnowledgebaseService, module.meta_filter,
module.DocMetadataService) to document_api_module.KnowledgebaseService,
document_api_module.meta_filter, document_api_module.DocMetadataService so the
test patches and calls exercise document_api_module.update_metadata correctly.
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 179-190: The failing tests call legacy helpers with a single dict
instead of the new signature; update test_infos_invalid_doc_id to call
document_infos(WebApiAuth, dataset_id, {"doc_ids": ["invalid_id"]}) by obtaining
dataset_id from the fixture (unpack add_document_func if needed) and update
test_update_metadata_setting to unpack add_document_func into (dataset_id,
doc_id) and call document_update_metadata_setting(WebApiAuth, dataset_id,
doc_id, payload=None or {} as appropriate) so dataset_id is passed as the
separate positional parameter required by document_infos and
document_update_metadata_setting.
In `@web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts`:
- Around line 434-455: The callback handleSaveSingleFileSettings captures the
outer variable id but the dependency array only includes [tableData, t,
otherData], which can create a stale closure; update the dependency array for
useCallback that wraps handleSaveSingleFileSettings to include id (i.e.,
[tableData, t, otherData, id]) so the function re-creates when the dataset id
changes and continues to call updateDocumentMetaDataConfig with the correct
kb_id.
---
Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 862-893: The new metadata update flow lacks structured safe logs;
add logging calls (without metadata values) at start, on error, and on
successful result around the block that verifies dataset/document and calls
DocumentService.update_parser_config and DocumentService.get_by_id: log
dataset_id and document_id at entry, log target count and updated count after
update (e.g., before/after sizes or a success flag), and on exceptions log
dataset_id/document_id and the exception message (but never log
req["metadata"]); reference the existing symbols KnowledgebaseService.query,
DocumentService.query, DocumentService.update_parser_config, and
DocumentService.get_by_id when inserting these logs.
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 257-269: Extend test_update_metadata_success to verify
persistence: after calling document_metadata_update(WebApiAuth, kb_id, ...),
fetch the document metadata for doc_id (e.g., via the existing document metadata
read endpoint or document read helper) and assert the 'author' key equals
"test_author" (or that the metadata contains that key/value). Use the same kb_id
and doc_id from add_document_func to locate the record and fail the test if the
persisted metadata doesn't match.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ae7c72cf-fe31-4f30-8bee-4829ecbf37da
📒 Files selected for processing (11)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- api/apps/sdk/doc.py
- test/testcases/test_http_api/common.py
- test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
- test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/pages/dataset/components/metedata/hooks/use-manage-modal.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
💤 Files with no reviewable changes (2)
- api/apps/sdk/doc.py
- api/apps/document_app.py
```python
# Apply metadata_condition filtering if provided
if metadata_condition:
    metas = DocMetadataService.get_flatted_meta_by_kbs([dataset_id])
    filtered_ids = set(
        meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
    )
    target_doc_ids = target_doc_ids & filtered_ids
    if metadata_condition.get("conditions") and not target_doc_ids:
        return get_result(data={"updated": 0, "matched_docs": 0})
```
Handle `metadata_condition` without `document_ids`.

When the selector only contains `metadata_condition`, `target_doc_ids` is still an empty set, so line 1014 turns every `filtered_ids` result into an empty match. Use the filtered set as the initial target when no explicit document IDs were provided.
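The intersection pitfall can be reproduced standalone (the document IDs below are illustrative, not from the PR):

```python
# target_doc_ids starts empty when the selector carries no explicit document_ids.
target_doc_ids = set()
filtered_ids = {"doc-1", "doc-2"}  # what the metadata filter matched

# Intersecting with an initially empty set discards every filtered ID.
buggy = target_doc_ids & filtered_ids
# Falling back to the filtered set when no explicit IDs were given keeps the matches.
fixed = target_doc_ids & filtered_ids if target_doc_ids else filtered_ids

print(sorted(buggy))  # []
print(sorted(fixed))  # ['doc-1', 'doc-2']
```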
🐛 Proposed fix
```diff
 if metadata_condition:
     metas = DocMetadataService.get_flatted_meta_by_kbs([dataset_id])
     filtered_ids = set(
         meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
     )
-    target_doc_ids = target_doc_ids & filtered_ids
+    target_doc_ids = target_doc_ids & filtered_ids if target_doc_ids else filtered_ids
     if metadata_condition.get("conditions") and not target_doc_ids:
         return get_result(data={"updated": 0, "matched_docs": 0})
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Apply metadata_condition filtering if provided
if metadata_condition:
    metas = DocMetadataService.get_flatted_meta_by_kbs([dataset_id])
    filtered_ids = set(
        meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
    )
    target_doc_ids = target_doc_ids & filtered_ids if target_doc_ids else filtered_ids
    if metadata_condition.get("conditions") and not target_doc_ids:
        return get_result(data={"updated": 0, "matched_docs": 0})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 1008 - 1016, The
metadata-only selector path incorrectly intersects filtered_ids with an
initially empty target_doc_ids, producing no matches; update the logic in
document_api.py so that when metadata_condition is present and target_doc_ids is
empty, you set target_doc_ids = filtered_ids (instead of intersecting),
otherwise keep the intersection; use DocMetadataService.get_flatted_meta_by_kbs,
meta_filter, and convert_conditions as currently used to compute filtered_ids,
and preserve the existing early return that calls get_result({"updated": 0,
"matched_docs": 0}) when metadata_condition has conditions and target_doc_ids
becomes empty after filtering.
🧹 Nitpick comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)
109-387: Class-scoped dataset is shared across state-mutating tests; validation tests pay the upload cost for no reason.

Two design concerns with the new `dataset_with_docs` usage:

- `TestMetadataBatchUpdateSuccess` tests all mutate metadata on the same 5 docs (author/status/category keys). Each test currently re-seeds its own pre-state so in-file order works, but this is fragile under `pytest-xdist`, `pytest-randomly`, or selective reruns (e.g. running `test_batch_delete_metadata` alone after another test already removed `status`). Prefer a function-scoped fixture that uploads fresh docs (or resets metadata) per test.
- `TestMetadataBatchUpdateValidation` pulls in `dataset_with_docs` for tests that only need a valid `dataset_id` (selector/updates/deletes shape checks). This costs 5 uploads per test unnecessarily. Consider a lightweight `empty_dataset` fixture for validation-only cases and reserve document seeding for the success suite.

Sketch

```python
@pytest.fixture
def empty_dataset(HttpApiAuth, add_dataset):
    return add_dataset  # just the dataset_id, no docs

@pytest.fixture
def dataset_with_docs(HttpApiAuth, add_dataset, ragflow_tmp_dir):
    dataset_id = add_dataset
    fps = []
    for i in range(5):
        fp = ragflow_tmp_dir / f"test_doc_{i}.txt"
        fp.write_text(f"Test document content {i}\n" * 10)
        fps.append(fp)
    upload_res = upload_documents(HttpApiAuth, dataset_id, fps)
    assert upload_res["code"] == 0
    return dataset_id, [doc["id"] for doc in upload_res["data"]]
```

Then switch validation tests to `empty_dataset` and keep success tests on a function-scoped `dataset_with_docs`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py` around lines 109 - 387, The test suite is using a class-scoped dataset_with_docs for both validation and mutable success tests causing shared state and unnecessary uploads; change fixtures and test usages so validation tests use a lightweight function-scoped empty_dataset (returns only dataset_id) and success tests use a function-scoped dataset_with_docs that seeds 5 fresh documents per test (or resets metadata) before each test. Specifically, add an empty_dataset fixture that returns add_dataset without uploading files, modify dataset_with_docs to create and upload 5 docs per call (ensure upload_documents(HttpApiAuth, dataset_id, fps) is asserted and returns ids), and update TestMetadataBatchUpdateValidation to depend on empty_dataset instead of dataset_with_docs while keeping TestMetadataBatchUpdateSuccess using the new function-scoped dataset_with_docs to avoid cross-test interference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`:
- Around line 109-387: The test suite is using a class-scoped dataset_with_docs
for both validation and mutable success tests causing shared state and
unnecessary uploads; change fixtures and test usages so validation tests use a
lightweight function-scoped empty_dataset (returns only dataset_id) and success
tests use a function-scoped dataset_with_docs that seeds 5 fresh documents per
test (or resets metadata) before each test. Specifically, add an empty_dataset
fixture that returns add_dataset without uploading files, modify
dataset_with_docs to create and upload 5 docs per call (ensure
upload_documents(HttpApiAuth, dataset_id, fps) is asserted and returns ids), and
update TestMetadataBatchUpdateValidation to depend on empty_dataset instead of
dataset_with_docs while keeping TestMetadataBatchUpdateSuccess using the new
function-scoped dataset_with_docs to avoid cross-test interference.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f5ddd10-5db7-46e8-920c-7818d2e49bbf
📒 Files selected for processing (2)
- api/apps/document_app.py
- test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
💤 Files with no reviewable changes (1)
- api/apps/document_app.py
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)
92: ⚠️ Potential issue | 🟡 Minor

Fix the stale expected-count message.

The assertion checks `1010`, but the failure message says `1100`, which makes failures misleading.

Proposed fix
```diff
-    assert res["data"]["updated"] == 1010, f"Expected 1100 documents updated, got {res['data']['updated']}"
+    assert res["data"]["updated"] == 1010, f"Expected 1010 documents updated, got {res['data']['updated']}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py` at line 92, The assertion compares res["data"]["updated"] to 1010 but the failure message incorrectly states "Expected 1100 documents updated"; update the failure message to reflect 1010 (or change the asserted expected value if 1100 is correct) so the assertion and its message match — locate the assertion using res["data"]["updated"] in test_metadata_batch_update.py and make the message consistent with the asserted value.
🧹 Nitpick comments (1)
test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py (1)
254-287: Make the metadata-condition success case prove filtering.

This test gives every selected document `category=test_category`, so a bug that updates all selected IDs whenever the condition exists could still pass. Seed only a subset, assert subset counts, and verify non-matching docs stay unchanged.

Proposed stronger test shape
```diff
-    # First, set initial metadata
+    # First, set initial metadata only on a subset so the filter is observable.
+    matching_ids = document_ids[:2]
+    non_matching_ids = document_ids[2:]
     updates = [{"key": "category", "value": "test_category"}]
     res = update_documents_metadata(
         HttpApiAuth,
         dataset_id,
-        {"selector": {"document_ids": document_ids}, "updates": updates, "deletes": []},
+        {"selector": {"document_ids": matching_ids}, "updates": updates, "deletes": []},
     )
     assert res["code"] == 0
-    assert res["data"]["updated"] == 5
-    assert res["data"]["matched_docs"] == 5
+    assert res["data"]["updated"] == len(matching_ids)
+    assert res["data"]["matched_docs"] == len(matching_ids)
     # Now update only documents with category="test_category"
     updates = [{"key": "author", "value": "filtered_author"}]
     res = update_documents_metadata(
@@
     assert res["code"] == 0, f"Expected code 0, got {res.get('code')}: {res.get('message')}"
-    assert res["data"]["updated"] == 5
-    assert res["data"]["matched_docs"] == 5
+    assert res["data"]["updated"] == len(matching_ids)
+    assert res["data"]["matched_docs"] == len(matching_ids)
+
+    list_res = list_documents(HttpApiAuth, dataset_id, {"ids": document_ids})
+    assert list_res["code"] == 0
+    docs_by_id = {doc["id"]: doc for doc in list_res["data"]["docs"]}
+    for doc_id in matching_ids:
+        assert docs_by_id[doc_id]["meta_fields"].get("author") == "filtered_author"
+    for doc_id in non_matching_ids:
+        assert docs_by_id[doc_id]["meta_fields"].get("author") != "filtered_author"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py` around lines 254 - 287, Change the test_batch_update_with_metadata_condition to actually prove filtering by seeding the "category" metadata on only a subset of document_ids (e.g., first N of the returned document_ids from dataset_with_docs) instead of all documents, then call update_documents_metadata with the same metadata_condition and assert that res["data"]["updated"] and res["data"]["matched_docs"] equal that subset size (not the full set) and that the remaining documents do not have the new "author" metadata; keep references to HttpApiAuth, dataset_with_docs, update_documents_metadata and document_ids to locate the code to modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`:
- Around line 103-104: The cleanup call delete_documents(HttpApiAuth,
dataset_id, {"ids": document_ids}) is executed directly so if an assertion fails
earlier the uploaded 1,010 documents won’t be removed; change this to a
failure-safe pattern by registering the cleanup via request.addfinalizer(lambda:
delete_documents(HttpApiAuth, dataset_id, {"ids": document_ids})) or wrap the
verification block in try/finally and call delete_documents in the finally so
delete_documents, HttpApiAuth, dataset_id and document_ids are always cleaned up
(match the pattern used in dataset_with_docs).
---
Outside diff comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`:
- Line 92: The assertion compares res["data"]["updated"] to 1010 but the failure
message incorrectly states "Expected 1100 documents updated"; update the failure
message to reflect 1010 (or change the asserted expected value if 1100 is
correct) so the assertion and its message match — locate the assertion using
res["data"]["updated"] in test_metadata_batch_update.py and make the message
consistent with the asserted value.
---
Nitpick comments:
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`:
- Around line 254-287: Change the test_batch_update_with_metadata_condition to
actually prove filtering by seeding the "category" metadata on only a subset of
document_ids (e.g., first N of the returned document_ids from dataset_with_docs)
instead of all documents, then call update_documents_metadata with the same
metadata_condition and assert that res["data"]["updated"] and
res["data"]["matched_docs"] equal that subset size (not the full set) and that
the remaining documents do not have the new "author" metadata; keep references
to HttpApiAuth, dataset_with_docs, update_documents_metadata and document_ids to
locate the code to modify.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c4fef31f-2fdd-4227-ae3d-579bfe50b0ed
📒 Files selected for processing (4)
- test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/utils/api.ts
- web/src/services/knowledge-service.ts
```python
# Cleanup
delete_documents(HttpApiAuth, dataset_id, {"ids": document_ids})
```
Make bulk-upload cleanup failure-safe.

If any assertion above fails, these 1,010 uploaded documents are not deleted. Register cleanup with `request.addfinalizer` or wrap the verification block in `try/finally`, matching the safer pattern used by `dataset_with_docs`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@test/testcases/test_http_api/test_file_management_within_dataset/test_metadata_batch_update.py`
around lines 103 - 104, The cleanup call delete_documents(HttpApiAuth,
dataset_id, {"ids": document_ids}) is executed directly so if an assertion fails
earlier the uploaded 1,010 documents won’t be removed; change this to a
failure-safe pattern by registering the cleanup via request.addfinalizer(lambda:
delete_documents(HttpApiAuth, dataset_id, {"ids": document_ids})) or wrap the
verification block in try/finally and call delete_documents in the finally so
delete_documents, HttpApiAuth, dataset_id and document_ids are always cleaned up
(match the pattern used in dataset_with_docs).
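The unsafe-versus-safe cleanup pattern flagged above can be demonstrated standalone. Here `delete_documents` is a stand-in for the real test helper, and the IDs are illustrative:

```python
cleaned_up = []


def delete_documents(auth, dataset_id, payload):
    """Stand-in for the real cleanup helper; records what was deleted."""
    cleaned_up.append((dataset_id, tuple(payload["ids"])))


def run_test_unsafe():
    # Cleanup placed after the assertions never runs if one of them fails.
    assert False, "verification failed"
    delete_documents("auth", "ds-1", {"ids": ["doc-1"]})


def run_test_safe():
    try:
        assert False, "verification failed"
    finally:
        # finally runs even when the assertion above raises.
        delete_documents("auth", "ds-1", {"ids": ["doc-1"]})


try:
    run_test_unsafe()
except AssertionError:
    pass
assert cleaned_up == []  # unsafe variant leaked the documents

try:
    run_test_safe()
except AssertionError:
    pass
assert cleaned_up == [("ds-1", ("doc-1",))]  # safe variant still cleaned up
```

`request.addfinalizer` in pytest gives the same guarantee as the `try/finally` shown here, with the finalizer also covering fixture teardown ordering.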
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main   #14289      +/-   ##
==========================================
+ Coverage   95.25%   95.97%   +0.71%
==========================================
  Files          10       10
  Lines         695      695
  Branches      111      111
==========================================
+ Hits          662      667       +5
+ Misses         15       11       -4
+ Partials       18       17       -1
```

☔ View full report in Codecov by Sentry.
What problem does this PR solve?
Before migration:

- Web API: `POST /v1/document/metadata/update`

After migration:

- RESTful API: `PATCH /api/v2/datasets/<dataset_id>/documents/metadatas`
Type of change