Commit 7a259cb

Add files via upload (#358)
1 parent 1ac50db commit 7a259cb

1 file changed

+244
-0
lines changed

@@ -0,0 +1,244 @@
---
id: get-started-with-hybrid-semantic-full-text-search-with-milvus-2-5.md
title: Getting Started with Hybrid Semantic / Full-Text Search with Milvus 2.5
author: Stefan Webb
date: 2024-12-17
cover: assets.zilliz.com/Full_Text_Search_with_Milvus_2_5_7ba74461be.png
tag: Engineering
tags: Milvus
recommend: false
canonicalUrl: https://milvus.io/blog/get-started-with-hybrid-semantic-full-text-search-with-milvus-2-5.md
---

In this article, we will show you how to quickly get up and running with the new full-text search feature in Milvus 2.5 and combine it with conventional semantic search based on vector embeddings.

## Requirement

First, ensure you have installed the latest version of PyMilvus, the Python SDK for Milvus 2.5:

```
pip install -U "pymilvus[model]"
```

You also need a running instance of Milvus Standalone (e.g. on your local machine), which you can set up using the [installation instructions in the Milvus docs](https://milvus.io/docs/prerequisite-docker.md).

## Building the Data Schema and Search Indices

We import the required classes and functions:

```
from pymilvus import MilvusClient, DataType, Function, FunctionType, model
```

You may have noticed two new entries for Milvus 2.5, `Function` and `FunctionType`, which we will explain shortly.

Next, we connect to a local Milvus Standalone instance and create the data schema. The schema comprises an integer primary key, a text string, a dense vector of dimension 768, and a sparse vector (of unlimited dimensionality).
Note that full-text search is currently supported only by Milvus Standalone and Milvus Distributed, not by Milvus Lite.

```
# Connect to the local Milvus Standalone instance
client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
# enable_analyzer=True makes the text field available for full-text search
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="dense", datatype=DataType.FLOAT_VECTOR, dim=768)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
```

```
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}], 'enable_dynamic_field': False}
```

You may have noticed the `enable_analyzer=True` parameter. This tells Milvus 2.5 to enable the lexical analyzer on this field and build a list of tokens and token frequencies, which are required for full-text search. The `sparse` field will hold a vector representation of each document as a bag-of-words produced from parsing `text`.
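
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It assumes a deliberately simple analyzer (lowercasing, then splitting on non-alphanumeric characters). The `tokenize` and `term_frequencies` helpers are hypothetical names used only for illustration, and the analyzer built into Milvus may behave differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text: str) -> Counter:
    """Count how often each token occurs: a bag-of-words view of the document."""
    return Counter(tokenize(text))


tf = term_frequencies("Information retrieval focuses on finding relevant information.")
print(tf["information"])  # 2
print(tf["retrieval"])    # 1
```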

But how do we connect the `text` and `sparse` fields, and tell Milvus how `sparse` should be calculated from `text`? This is where we need to invoke the `Function` object and add it to the schema:

```
bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)
```

```
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}], 'enable_dynamic_field': False, 'functions': [{'name': 'text_bm25_emb', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}
```

The `Function` abstraction is more general than full-text search. In the future, it may be used for other cases where one field needs to be a function of another field. In our case, we specify that `sparse` is a function of `text` via the function `FunctionType.BM25`. `BM25` is a widely used ranking function in information retrieval that scores a query's relevance to a document (relative to a collection of documents).
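
For intuition, the following is a rough sketch of a textbook BM25 scorer in plain Python. It is not the Milvus implementation: the `bm25_scores` helper is hypothetical, the `k1 = 1.2` and `b = 0.75` defaults are assumed (they are common choices), and the IDF variant used here is one of several in circulation. Documents and queries are passed as pre-tokenized lists of words.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "information retrieval is a field of study".split(),
    "data mining and information retrieval overlap in research".split(),
]
scores = bm25_scores("field of study".split(), docs)
print(scores)  # the first document scores higher; the second shares no query terms
```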

We use the default embedding model in Milvus, which is [paraphrase-albert-small-v2](https://huggingface.co/GPTCache/paraphrase-albert-small-v2):

```
embedding_fn = model.DefaultEmbeddingFunction()
```

The next step is to add our search indices. We have one for the dense vector and a separate one for the sparse vector. The sparse index uses the `SPARSE_INVERTED_INDEX` index type with the `BM25` metric, since full-text search requires a different search method than those for standard dense vectors.

```
index_params = client.prepare_index_params()

# Index for semantic search over the dense embeddings
index_params.add_index(
    field_name="dense",
    index_type="AUTOINDEX",
    metric_type="COSINE"
)

# Index for full-text search over the BM25 sparse vectors
index_params.add_index(
    field_name="sparse",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25"
)
```
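
As background on why an inverted index suits sparse, term-based vectors, here is a toy sketch in plain Python. It illustrates only the general idea (each term maps to the documents that contain it, so a query touches only the documents sharing a term with it) and says nothing about how Milvus stores its index. The `build_inverted_index` and `candidates` helpers are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs: list[list[str]]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def candidates(index: dict[str, list[tuple[int, int]]], query: list[str]) -> list[int]:
    """Return the ids of documents that contain at least one query term."""
    return sorted({doc_id for term in query for doc_id, _ in index.get(term, [])})


docs = [
    ["information", "retrieval"],
    ["data", "mining"],
    ["information", "information", "mining"],
]
index = build_inverted_index(docs)
print(index["information"])                      # [(0, 1), (2, 2)]
print(candidates(index, ["mining", "missing"]))  # [1, 2]
```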

Finally, we create our collection:

```
# Start from a clean slate by dropping any existing 'demo' collection
client.drop_collection('demo')
client.list_collections()
```

```
[]
```

```
client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)

client.list_collections()
```

```
['demo']
```

And with that, we have an empty collection set up to accept text documents and perform semantic and full-text searches!

## Inserting Data and Performing Full-Text Search

Inserting data is no different from previous versions of Milvus. Note that we supply only the `text` and `dense` fields: the `id` is generated automatically, and Milvus computes `sparse` from `text` using the BM25 function:

```
docs = [
    'information retrieval is a field of study.',
    'information retrieval focuses on finding relevant information in large datasets.',
    'data mining and information retrieval overlap in research.'
]

embeddings = embedding_fn(docs)

client.insert('demo', [
    {'text': doc, 'dense': vec} for doc, vec in zip(docs, embeddings)
])
```

```
{'insert_count': 3, 'ids': [454387371651630485, 454387371651630486, 454387371651630487], 'cost': 0}
```

Let's first illustrate a full-text search before we move on to hybrid search:

```
search_params = {
    'params': {'drop_ratio_search': 0.2},
}

results = client.search(
    collection_name='demo',
    data=['whats the focus of information retrieval?'],
    output_fields=['text'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)
```

The search parameter `drop_ratio_search` is the proportion of the smallest values in the query's sparse vector (that is, its lowest-weighted terms) to ignore during the search.
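
The sketch below shows one plausible reading of this kind of pruning in plain Python, assuming `drop_ratio_search` discards the lowest-weighted entries of the query's sparse vector. The `drop_small_values` helper and the term weights are made up for illustration, and the exact thresholding inside Milvus may differ.

```python
def drop_small_values(sparse_query: dict[str, float], drop_ratio: float) -> dict[str, float]:
    """Drop the given proportion of lowest-weighted terms from a sparse query vector."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    n_drop = int(len(sparse_query) * drop_ratio)
    # Sort terms by weight (ascending) and discard the n_drop smallest.
    kept = sorted(sparse_query.items(), key=lambda kv: kv[1])[n_drop:]
    return dict(kept)


# Hypothetical term weights for the example query (not produced by Milvus).
query_weights = {"whats": 0.1, "the": 0.2, "focus": 1.5, "of": 0.3, "information": 0.8}
pruned = drop_small_values(query_weights, 0.2)
print(sorted(pruned))  # ['focus', 'information', 'of', 'the']
```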

Let's view the results:

```
for hit in results[0]:
    print(hit)
```

```
{'id': 454387371651630485, 'distance': 1.3352930545806885, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.29726022481918335, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.2715056240558624, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}
```

## Performing Hybrid Semantic and Full-Text Search

Let's now combine what we've learned to perform a hybrid search that runs separate semantic and full-text searches and merges their results with a reranker:

```
from pymilvus import AnnSearchRequest, RRFRanker

query = 'whats the focus of information retrieval?'
query_dense_vector = embedding_fn([query])

# Request 1: semantic search over the dense embeddings
search_param_1 = {
    "data": query_dense_vector,
    "anns_field": "dense",
    "param": {
        "metric_type": "COSINE",
    },
    "limit": 3
}
request_1 = AnnSearchRequest(**search_param_1)

# Request 2: full-text (BM25) search over the sparse field, using the raw query text
search_param_2 = {
    "data": [query],
    "anns_field": "sparse",
    "param": {
        "metric_type": "BM25",
        "params": {"drop_ratio_search": 0.0}
    },
    "limit": 3
}
request_2 = AnnSearchRequest(**search_param_2)

reqs = [request_1, request_2]
```

```
ranker = RRFRanker()

res = client.hybrid_search(
    collection_name="demo",
    output_fields=['text'],
    reqs=reqs,
    ranker=ranker,
    limit=3
)
for hit in res[0]:
    print(hit)
```

```
{'id': 454387371651630485, 'distance': 0.032786883413791656, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.032258063554763794, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.0317460335791111, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}
```

As you may have noticed, this is no different from a hybrid search with two separate semantic fields (available since Milvus 2.4). The results are identical to full-text search in this simple example, but for larger databases and keyword-specific searches hybrid search typically has higher recall.
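
To see where the hybrid scores come from, here is a minimal sketch of reciprocal rank fusion (RRF) in plain Python. It assumes a smoothing constant of `k = 60` and 1-based ranks; both are assumptions, though they are consistent with the scores printed above (for example, 2/61 ≈ 0.03279 for a document ranked first by both searches). The `rrf_fuse` helper is hypothetical and the short integer ids stand in for the real primary keys.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """Fuse ranked lists of document ids by summing 1 / (k + rank) over the lists."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


dense_ranking = [1, 2, 3]   # order returned by the semantic search
sparse_ranking = [1, 2, 3]  # order returned by the full-text search
fused = rrf_fuse([dense_ranking, sparse_ranking])
for doc_id, score in fused.items():
    print(doc_id, round(score, 6))
```

Because RRF looks only at ranks and not at raw scores, it can combine BM25 scores and cosine similarities without first putting them on a common scale.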

## Summary

You're now equipped with all the knowledge needed to perform full-text and hybrid semantic/full-text search with Milvus 2.5. See the following articles for more discussion on how full-text search works and why it is complementary to semantic search:

- [Introducing Milvus 2.5: Full-Text Search, More Powerful Metadata Filtering, and Usability Improvements!](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md)
- [Semantic Search v.s. Full-Text Search: Which Do I Choose in Milvus 2.5?](https://milvus.io/blog/semantic-search-vs-full-text-search-which-one-should-i-choose-with-milvus-2-5.md)
