
Commit 9546df3

Update minhash-lsh-in-milvus-the-secret-weapon-for-fighting-duplicates-in-llm-training-data.md
1 parent 8e2e3a1 commit 9546df3


1 file changed: 12 additions, 2 deletions


blog/en/minhash-lsh-in-milvus-the-secret-weapon-for-fighting-duplicates-in-llm-training-data.md

Lines changed: 12 additions & 2 deletions
@@ -20,7 +20,7 @@ Large Language Models (LLMs) have transformed the AI landscape with their abilit

The challenge is that raw training data often contains significant redundancy. It's like teaching a child by repeating the same lessons over and over while skipping other important topics. A large AI company approached us with precisely this problem - they were building an ambitious new language model but struggled with deduplicating tens of billions of documents. Traditional matching methods couldn't scale to this volume, and specialized deduplication tools required massive computational resources, making them economically unviable.

-To solve this problem, our solution is: MinHash LSH (Locality Sensitive Hashing) indexing, which will be available in Milvus 2.6. This article will explore how MinHash LSH efficiently solves the data deduplication problem for LLM training.
+To solve this problem, we introduced MinHash LSH (Locality Sensitive Hashing) indexing in Milvus 2.6. This article will explore how MinHash LSH efficiently solves the data deduplication problem for LLM training.


![](https://assets.zilliz.com/Chat_GPT_Image_May_16_2025_09_46_39_PM_1f3290ce5e.png)
@@ -280,4 +280,14 @@ The returned results are **candidate near-duplicates**. To form complete dedupli

MinHash LSH in Milvus 2.6 is a leap forward in AI data processing. What started as a solution for LLM data deduplication now opens doors to broader use cases—web content cleanup, catalog management, plagiarism detection, and more.

-If you have a similar use case, please reach out to us on the [Milvus Discord](https://discord.com/invite/8uyFbECzPX) to sign up for an [Office Hour meeting](https://meetings.hubspot.com/chloe-williams1/milvus-office-hour)
+## Getting Started with Milvus 2.6
+
+Milvus 2.6 is available now. In addition to MinHash LSH, it introduces dozens of new features and performance optimizations, such as tiered storage, the RaBitQ quantization method, and enhanced full-text search and multitenancy, directly addressing the most pressing challenges in vector search today: scaling efficiently while keeping costs under control.
+
+Ready to explore everything Milvus offers? Dive into our [release notes](https://milvus.io/docs/release_notes.md), browse the [complete documentation](https://milvus.io/docs), or check out our [feature blogs](https://milvus.io/blog).
+
+If you have any questions or a similar use case, feel free to reach out to us through our [Discord community](https://discord.com/invite/8uyFbECzPX) or file an issue on [GitHub](https://github.com/milvus-io/milvus). We're here to help you make the most of Milvus 2.6.
+
+
+
+
