feat(client): add MinHash deduplication methods to MilvusClient#3257
zhuwenxing wants to merge 1 commit into milvus-io:master
Add two deduplication methods for collections with MinHash function:
- self_deduplicate: deduplicate documents within a collection
- deduplicate: check new documents against existing collection

Both methods require the collection to have a MinHash function configured.

Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhuwenxing
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3257      +/-   ##
==========================================
- Coverage   76.32%   75.91%   -0.42%
==========================================
  Files          63       63
  Lines       13230    13306      +76
==========================================
+ Hits        10098    10101       +3
- Misses       3132     3205      +73
/hold
Summary
- self_deduplicate() method to deduplicate documents within a collection based on MinHash similarity
- deduplicate() method to check new documents against existing collection for duplicates
- _get_minhash_fields() to extract MinHash function field names

Both methods require the collection to have a MinHash function configured.
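For reviewers less familiar with the technique, the two modes above can be illustrated with a small standalone sketch of MinHash-based near-duplicate detection. This is not the code in this PR and does not use pymilvus. The function names (`self_deduplicate_sketch`, `find_duplicates_sketch`), the 3-word shingling, the 64-hash signature, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Sequence, Set, Tuple

NUM_PERM = 64  # signature length; an illustrative value, not taken from the PR


def shingles(text: str, k: int = 3) -> Set[str]:
    """Split text into a set of k-word shingles (hypothetical tokenization)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> Tuple[int, ...]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return tuple(signature)


def estimated_jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def self_deduplicate_sketch(docs: Sequence[str], threshold: float = 0.8) -> List[int]:
    """Return indices of documents to keep, dropping later near-duplicates."""
    kept: List[int] = []
    kept_sigs: List[Tuple[int, ...]] = []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(i)
            kept_sigs.append(sig)
    return kept


def find_duplicates_sketch(
    new_docs: Sequence[str], existing_docs: Sequence[str], threshold: float = 0.8
) -> List[bool]:
    """For each new document, report whether it resembles any existing one."""
    existing_sigs = [minhash_signature(d) for d in existing_docs]
    flags = []
    for doc in new_docs:
        sig = minhash_signature(doc)
        flags.append(any(estimated_jaccard(sig, e) >= threshold for e in existing_sigs))
    return flags
```

The sketch compares every pair of signatures, which is quadratic in the number of documents. A real deployment would presumably rely on the collection's index to find candidate matches instead of a brute-force scan.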
Test plan
- _get_minhash_fields() helper method
- self_deduplicate() with mock data
- deduplicate() with mock data
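As a rough idea of what a mock-based test for the helper could look like, here is a minimal sketch using `unittest.mock`. The standalone `get_minhash_fields` function, the `"functions"` / `"type"` / `"output_field_names"` dictionary shape, and the `"MINHASH"` type marker are assumptions made for this example; the actual helper in this PR and the real return value of `describe_collection` may differ.

```python
from unittest.mock import MagicMock


def get_minhash_fields(client, collection_name):
    """Hypothetical stand-in for the PR's helper: collect output fields of MinHash functions."""
    info = client.describe_collection(collection_name)
    fields = []
    for fn in info.get("functions", []):
        # "MINHASH" is an assumed marker, not a confirmed pymilvus value.
        if fn.get("type") == "MINHASH":
            fields.extend(fn.get("output_field_names", []))
    return fields


def make_mock_client(functions):
    """Build a mock client whose describe_collection returns an assumed schema dict."""
    client = MagicMock()
    client.describe_collection.return_value = {
        "collection_name": "docs",
        "functions": functions,
    }
    return client


def test_returns_minhash_output_fields():
    client = make_mock_client([
        {"name": "mh", "type": "MINHASH", "output_field_names": ["minhash_vec"]},
        {"name": "bm25", "type": "BM25", "output_field_names": ["sparse_vec"]},
    ])
    assert get_minhash_fields(client, "docs") == ["minhash_vec"]
    client.describe_collection.assert_called_once_with("docs")


def test_no_minhash_function_returns_empty():
    client = make_mock_client([])
    assert get_minhash_fields(client, "docs") == []
```

Tests in this style exercise the helper without a running Milvus server, which is one way to cover the new lines flagged as missing in the coverage report.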