@zhuwenxing
Contributor

Summary

  • Add self_deduplicate() method to deduplicate documents within a collection based on MinHash similarity
  • Add deduplicate() method to check new documents against existing collection for duplicates
  • Add helper method _get_minhash_fields() to extract MinHash function field names

Both methods require the collection to have a MinHash function configured.
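For readers unfamiliar with MinHash-based deduplication, the idea behind the two methods can be sketched in plain Python. This is only an illustration under stated assumptions: it does not call Milvus or pymilvus, and every name in it (`shingles`, `minhash_signature`, `self_dedup_sketch`, `dedup_against_sketch`, the 0.8 threshold, 64 hash functions) is hypothetical rather than taken from this PR.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 64  # assumed signature length, not a value from the PR


def shingles(text: str, k: int = 3) -> Set[str]:
    """Split text into overlapping k-word shingles."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def self_dedup_sketch(docs: List[str], threshold: float = 0.8) -> List[int]:
    """Return indices of documents kept after removing near-duplicates."""
    kept_idx, kept_sigs = [], []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept_idx.append(i)
            kept_sigs.append(sig)
    return kept_idx


def dedup_against_sketch(existing: List[str], new_docs: List[str],
                         threshold: float = 0.8) -> List[int]:
    """Return indices of new documents that look like duplicates of existing ones."""
    existing_sigs = [minhash_signature(d) for d in existing]
    return [
        i for i, doc in enumerate(new_docs)
        if any(estimated_jaccard(minhash_signature(doc), s) >= threshold
               for s in existing_sigs)
    ]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "vector databases index embeddings for similarity search",
]
print(self_dedup_sketch(docs))                     # [0, 2]
print(dedup_against_sketch(docs[:1], docs[1:]))    # [0]
```

The pairwise comparison here is quadratic and only suitable for small inputs; in the PR, the similarity search is presumably delegated to the collection's MinHash function instead.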

Test plan

  • Unit tests for _get_minhash_fields() helper method
  • Unit tests for self_deduplicate() with mock data
  • Unit tests for deduplicate() with mock data
  • Integration test with actual Milvus instance

Add two deduplication methods for collections with MinHash function:
- self_deduplicate: deduplicate documents within a collection
- deduplicate: check new documents against existing collection

Both methods require the collection to have a MinHash function configured.

Signed-off-by: zhuwenxing <[email protected]>
@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhuwenxing
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 3.94737% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.91%. Comparing base (2a2e898) to head (b3576bc).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
pymilvus/milvus_client/milvus_client.py | 3.94% | 73 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3257      +/-   ##
==========================================
- Coverage   76.32%   75.91%   -0.42%     
==========================================
  Files          63       63              
  Lines       13230    13306      +76     
==========================================
+ Hits        10098    10101       +3     
- Misses       3132     3205      +73     
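The patch coverage figure can be cross-checked against the diff table. The snippet below is a rough check, assuming that the newly added lines and hits in the table correspond to the lines changed by this PR. The variable names are illustrative only.

```python
# Figures copied from the Coverage Diff table in this report.
base = {"lines": 13230, "hits": 10098, "misses": 3132}
head = {"lines": 13306, "hits": 10101, "misses": 3205}

added_lines = head["lines"] - base["lines"]      # lines added by the patch
added_hits = head["hits"] - base["hits"]         # of those, lines covered
added_misses = head["misses"] - base["misses"]   # of those, lines not covered

# Share of the added lines that are exercised by tests.
patch_coverage = 100 * added_hits / added_lines
# Overall coverage at the head commit.
head_coverage = 100 * head["hits"] / head["lines"]

print(f"added lines: {added_lines}, hits: {added_hits}, misses: {added_misses}")
print(f"patch coverage: {patch_coverage:.5f}%")
print(f"head project coverage: {head_coverage:.2f}%")
```

Under that assumption, only 3 of the 76 added lines are covered, which matches the reported 3.94737% and the 73 missing lines.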

@zhuwenxing
Contributor Author

/hold
