As humans, we interpret the world through our senses. We hear sounds; we see images, video, and text, often layered on top of one another. We understand the world through these multiple modalities and the relationships between them. For artificial intelligence to truly match or exceed human capabilities, it must develop this same ability to understand the world through multiple lenses simultaneously.
<iframe width="1280" height="720" src="https://www.youtube.com/embed/bxE0_QYX_sU" title="Building Multimodal Search with Milvus: Combining Images and Text for Better Search Results" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
In this post and accompanying [video](https://www.youtube.com/watch?v=bxE0_QYX_sU) and [notebook](https://github.com/milvus-io/bootcamp/blob/master/bootcamp/tutorials/quickstart/multimodal_retrieval_amazon_reviews.ipynb), we'll showcase recent breakthroughs in models that can process both text and images together. We'll demonstrate this by building a semantic search application that goes beyond simple keyword matching - it understands the relationship between what users are asking for and the visual content they're searching through.
What makes this project particularly exciting is that it's built entirely with open-source tools: the Milvus vector database, HuggingFace's machine learning libraries, and a dataset of Amazon customer reviews. It's remarkable to think that just a decade ago, building something like this would have required significant proprietary resources. Today, these powerful components are freely available and can be combined in innovative ways by anyone with the curiosity to experiment.
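To make those pieces concrete, here is a minimal sketch of the core pipeline: embed images and text into a shared vector space with a HuggingFace model, store the vectors in Milvus, and search by text. CLIP is used here purely as a stand-in for the post's embedding model, and the collection name, file paths, and 512-dimensional vectors are illustrative assumptions tied to CLIP ViT-B/32; see the notebook for the exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from pymilvus import MilvusClient

# Open-source text+image embedding model from HuggingFace.
# (Stand-in for the embedding model used in the notebook.)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> list[float]:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        vec = model.get_image_features(**inputs)[0]
    return torch.nn.functional.normalize(vec, dim=0).tolist()

def embed_text(text: str) -> list[float]:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        vec = model.get_text_features(**inputs)[0]
    return torch.nn.functional.normalize(vec, dim=0).tolist()

# Milvus Lite stores the collection in a local file; no server needed.
client = MilvusClient("multimodal_demo.db")
client.create_collection("product_images", dimension=512)  # CLIP ViT-B/32 dim

# Index some images (hypothetical paths), then search them with text.
client.insert("product_images", [
    {"id": i, "vector": embed_image(p), "path": p}
    for i, p in enumerate(["img/blender.jpg", "img/red_dress.jpg"])
])
hits = client.search("product_images", data=[embed_text("a red summer dress")],
                     limit=3, output_fields=["path"])
print(hits)
```

Because both modalities land in the same space, the text query's nearest neighbors are images whose content matches its meaning, with no keyword overlap required.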
## Summary
In this post and the accompanying [video](https://www.youtube.com/watch?v=bxE0_QYX_sU) and [notebook](https://github.com/milvus-io/bootcamp/blob/master/bootcamp/tutorials/quickstart/multimodal_retrieval_amazon_reviews.ipynb), we have constructed an application for multimodal semantic search across text and images. The embedding model was able to embed text and images, jointly or separately, into the same space, and the foundation model was able to take text and images as input while generating text in response. _Importantly, the embedding model was able to relate the user's intent, expressed as an open-ended instruction, to the query image, and thereby specify how the user wanted the results to relate to the input image._
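As a rough illustration of that re-ranking step, here is one way to ask a multimodal foundation model to order the retrieved images against the user's instruction. This sketch assumes an OpenAI-style chat API with image inputs; the model name, prompt, and `to_data_url` helper are illustrative, not the notebook's exact code.

```python
import base64
from openai import OpenAI

oai = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_data_url(path: str) -> str:
    # Inline a local image as a base64 data URL the API accepts.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def rerank(instruction: str, query_image: str, candidates: list[str]) -> str:
    # One text block carrying the user's intent, followed by the query image
    # and the retrieved candidates, so the model can rank them.
    content = [
        {"type": "text",
         "text": f"Instruction: {instruction}\n"
                 "The first image is the query; rank the remaining images "
                 "from best to worst match and explain briefly."},
        {"type": "image_url", "image_url": {"url": to_data_url(query_image)}},
    ]
    for path in candidates:
        content.append({"type": "image_url",
                        "image_url": {"url": to_data_url(path)}})
    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(rerank("a phone case with this leopard pattern",
             "query.jpg", ["hit1.jpg", "hit2.jpg", "hit3.jpg"]))
```

Generating the ranking as text is what lets the model weigh the open-ended instruction against visual evidence, rather than relying on vector distance alone.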
This is just a taste of what is to come in the near future. We will see many applications of multimodal search, multimodal understanding and reasoning, and more across diverse modalities: image, video, audio, molecules, social networks, tabular data, and time-series. The potential is boundless.
- Notebook: [“Multimodal Search with Amazon Reviews and LLVM Reranking”](https://github.com/milvus-io/bootcamp/blob/master/bootcamp/tutorials/quickstart/multimodal_retrieval_amazon_reviews.ipynb)