---
id: get-started-with-hybrid-semantic-full-text-search-with-milvus-2-5.md
title: Getting Started with Hybrid Semantic / Full-Text Search with Milvus 2.5
author: Stefan Webb
date: 2024-12-17
cover: assets.zilliz.com/Full_Text_Search_with_Milvus_2_5_7ba74461be.png
tag: Engineering
tags: Milvus
recommend: false
canonicalUrl: https://milvus.io/blog/get-started-with-hybrid-semantic-full-text-search-with-milvus-2-5.md
---

In this article, we will show you how to quickly get up and running with the new full-text search feature and combine it with conventional semantic search based on vector embeddings.

## Requirements

First, ensure you have installed Milvus 2.5:

```
pip install -U "pymilvus[model]"
```

and have a running instance of Milvus Standalone (e.g. on your local machine) using the [installation instructions in the Milvus docs](https://milvus.io/docs/prerequisite-docker.md).

## Building the Data Schema and Search Indices

We import the required classes and functions:

```
from pymilvus import MilvusClient, DataType, Function, FunctionType, model
```

You may have noticed two new entries for Milvus 2.5, `Function` and `FunctionType`, which we will explain shortly.

Next, we connect to our local Milvus Standalone instance and create the data schema. The schema comprises an integer primary key, a text string, a dense vector of dimension 768, and a sparse vector (of unlimited dimensionality). Note that Milvus Lite does not currently support full-text search, only Milvus Standalone and Milvus Distributed.

```
client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="dense", datatype=DataType.FLOAT_VECTOR, dim=768)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
```

```
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}], 'enable_dynamic_field': False}
```

You may have noticed the `enable_analyzer=True` parameter. This tells Milvus 2.5 to enable the lexical analyzer on this field and to build a list of tokens and token frequencies, which are required for full-text search. The `sparse` field will hold a vector representation of the document as a bag-of-words produced by parsing `text`.

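To make the bag-of-words idea concrete, here is a toy sketch of what the analyzer conceptually produces: each distinct token gets its own dimension, and the value at that dimension is the token's frequency in the document. (This is an illustration only, not Milvus's actual analyzer, which is configurable and handles punctuation, casing, stop words, and more.)

```
from collections import Counter

def to_sparse_bow(text, vocab):
    # Stand-in for the analyzer: lowercase, strip periods, split on whitespace
    tokens = text.lower().replace('.', '').split()
    sparse = {}
    for token, freq in Counter(tokens).items():
        # Assign each previously unseen token the next free dimension
        dim = vocab.setdefault(token, len(vocab))
        sparse[dim] = freq
    return sparse

vocab = {}
vec = to_sparse_bow('information retrieval is a field of study.', vocab)
# vec maps each token's dimension to its term frequency
```

Because every document shares the `vocab` mapping, the same token always lands in the same dimension, which is what makes the sparse vectors comparable.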
But how do we connect the `text` and `sparse` fields, and tell Milvus how `sparse` should be calculated from `text`? This is where we need to invoke the `Function` object and add it to the schema:

```
bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)
```

```
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 1000, 'enable_analyzer': True}}, {'name': 'dense', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'sparse', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}], 'enable_dynamic_field': False, 'functions': [{'name': 'text_bm25_emb', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}
```

The `Function` object is a more general abstraction than full-text search alone. In the future, it may be used for other cases where one field needs to be computed from another. In our case, we specify that `sparse` is a function of `text` via `FunctionType.BM25`. `BM25` refers to a common metric in information retrieval used for calculating a query's similarity to a document (relative to a collection of documents).

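For intuition, the BM25 score of a document against a query sums, over the query's terms, an inverse-document-frequency weight times a saturated, length-normalized term frequency. A toy sketch of the scoring function (illustrative only; Milvus computes this internally, and the `k1` and `b` values here are just common defaults):

```
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    # doc: one tokenized document; corpus: list of tokenized documents
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Term frequency saturates via k1; b controls length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Rare terms get a larger IDF weight, and repeating a term yields diminishing returns, which is why BM25 tends to outperform raw term counts.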
We use the default embedding model in Milvus, which is [paraphrase-albert-small-v2](https://huggingface.co/GPTCache/paraphrase-albert-small-v2):

```
embedding_fn = model.DefaultEmbeddingFunction()
```

The next step is to add our search indices. We have one for the dense vector and a separate one for the sparse vector. The index type is `SPARSE_INVERTED_INDEX` with `BM25`, since full-text search requires a different search method than those for standard dense vectors.

```
index_params = client.prepare_index_params()

index_params.add_index(
    field_name="dense",
    index_type="AUTOINDEX",
    metric_type="COSINE"
)

index_params.add_index(
    field_name="sparse",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25"
)
```

Finally, we create our collection, first dropping any existing collection named `demo` so we start from a clean slate:

```
client.drop_collection('demo')
client.list_collections()
```

```
[]
```

```
client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)

client.list_collections()
```

```
['demo']
```

And with that, we have an empty database set up to accept text documents and perform semantic and full-text searches!

## Inserting Data and Performing Full-Text Search

Inserting data is no different than in previous versions of Milvus:

```
docs = [
    'information retrieval is a field of study.',
    'information retrieval focuses on finding relevant information in large datasets.',
    'data mining and information retrieval overlap in research.'
]

embeddings = embedding_fn(docs)

client.insert('demo', [
    {'text': doc, 'dense': vec} for doc, vec in zip(docs, embeddings)
])
```

```
{'insert_count': 3, 'ids': [454387371651630485, 454387371651630486, 454387371651630487], 'cost': 0}
```

Note that we only supply `text` and `dense`; Milvus generates `sparse` automatically from `text` via the BM25 function we registered in the schema.

Let's first illustrate a full-text search before we move on to hybrid search:

```
search_params = {
    'params': {'drop_ratio_search': 0.2},
}

results = client.search(
    collection_name='demo',
    data=['whats the focus of information retrieval?'],
    output_fields=['text'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)
```

The search parameter `drop_ratio_search` specifies the proportion of the smallest values in the query's sparse vector to ignore during search, trading a small amount of accuracy for speed.

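Conceptually, this pruning keeps only the largest entries of the sparse query vector. A toy sketch of the idea (illustrative only, not Milvus's implementation):

```
def drop_smallest(sparse, drop_ratio):
    # sparse: {dimension: value}; drop the smallest `drop_ratio` fraction of entries
    n_drop = int(len(sparse) * drop_ratio)
    kept = sorted(sparse.items(), key=lambda kv: kv[1], reverse=True)
    return dict(kept[:len(sparse) - n_drop])

pruned = drop_smallest({0: 0.9, 1: 0.5, 2: 0.1, 3: 0.05, 4: 0.01}, 0.2)
# With drop_ratio 0.2, the smallest 20% of entries (dimension 4) is discarded
```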
Let's view the results:

```
for hit in results[0]:
    print(hit)
```

```
{'id': 454387371651630485, 'distance': 1.3352930545806885, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.29726022481918335, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.2715056240558624, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}
```

## Performing Hybrid Semantic and Full-Text Search

Let's now apply what we've learned to perform a hybrid search that fuses separate semantic and full-text searches with a reranker:

```
from pymilvus import AnnSearchRequest, RRFRanker

query = 'whats the focus of information retrieval?'
query_dense_vector = embedding_fn([query])

search_param_1 = {
    "data": query_dense_vector,
    "anns_field": "dense",
    "param": {
        "metric_type": "COSINE",
    },
    "limit": 3
}
request_1 = AnnSearchRequest(**search_param_1)

search_param_2 = {
    "data": [query],
    "anns_field": "sparse",
    "param": {
        "metric_type": "BM25",
        "params": {"drop_ratio_build": 0.0}
    },
    "limit": 3
}
request_2 = AnnSearchRequest(**search_param_2)

reqs = [request_1, request_2]
```

```
ranker = RRFRanker()

res = client.hybrid_search(
    collection_name="demo",
    output_fields=['text'],
    reqs=reqs,
    ranker=ranker,
    limit=3
)
for hit in res[0]:
    print(hit)
```

```
{'id': 454387371651630485, 'distance': 0.032786883413791656, 'entity': {'text': 'information retrieval is a field of study.'}}
{'id': 454387371651630486, 'distance': 0.032258063554763794, 'entity': {'text': 'information retrieval focuses on finding relevant information in large datasets.'}}
{'id': 454387371651630487, 'distance': 0.0317460335791111, 'entity': {'text': 'data mining and information retrieval overlap in research.'}}
```

As you may have noticed, this is no different from a hybrid search over two separate semantic fields (available since Milvus 2.4). The results are identical to full-text search in this simple example, but for larger databases and keyword-specific searches, hybrid search typically has higher recall.

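Under the hood, `RRFRanker` merges the two result lists with Reciprocal Rank Fusion: each list contributes 1/(k + rank) to a document's fused score, where k defaults to 60 in Milvus. A minimal sketch of the fusion step:

```
def rrf_fuse(rankings, k=60):
    # rankings: one ordered list of document ids per search request
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

This is consistent with the `distance` values in the output above: a document ranked first by both searches scores 2/(60 + 1) ≈ 0.0328, the second 2/62 ≈ 0.0323, and so on.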
## Summary

You're now equipped with all the knowledge needed to perform full-text and hybrid semantic/full-text search with Milvus 2.5. See the following articles for more discussion of how full-text search works and why it is complementary to semantic search:

- [Introducing Milvus 2.5: Full-Text Search, More Powerful Metadata Filtering, and Usability Improvements!](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md)
- [Semantic Search v.s. Full-Text Search: Which Do I Choose in Milvus 2.5?](https://milvus.io/blog/semantic-search-vs-full-text-search-which-one-should-i-choose-with-milvus-2-5.md)