The semantic search is deployed on idr-testing. It supports natural-language questions and can be used through a new endpoint or directly through the API.
The semantic search uses machine learning to understand the meaning behind words rather than just matching keywords.
It can find related concepts, handle natural language queries, and retrieve results based on context rather than exact words.
I’ve built a basic prototype for a feature that allows users to perform semantic search locally. It has been tested with IDR data.
I have run tests with various queries, and the results look promising. For instance, I used the following query:
Provide me with images related to cancer
The top result items are:
As you can see, it returns the metadata related to the query. This can be extended to automatically build a query that returns all the detailed results, as we have for the exact-match queries.
Under the hood, it uses the vector search capabilities in Elasticsearch, which require embedding the data using an NLP model.
This prototype feature uses the all-MiniLM-L6-v2 model from Sentence Transformers. It’s small and fast enough to run locally while providing good sentence embeddings.
This model is used to embed the searchengine cached data (searchengine metadata). The searchengine then sends the embeddings to Elasticsearch, where they are saved in dedicated fields that are pre-defined in the template. This has been implemented for the data source's cached data. The query term is also encoded with the same model before the query is sent to Elasticsearch.
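The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the field name `metadata_vector`, the helper names, and the `k` / `num_candidates` values are hypothetical, and it assumes an Elasticsearch version that supports `dense_vector` fields and the top-level `knn` search option. The only model-specific fact used is that all-MiniLM-L6-v2 produces 384-dimensional embeddings.

```python
from typing import Callable, List, Sequence

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
EMBEDDING_DIMS = 384


def vector_field_mapping(dims: int = EMBEDDING_DIMS) -> dict:
    """Mapping for a dedicated vector field (the field itself is hypothetical)."""
    return {
        "type": "dense_vector",
        "dims": dims,
        "index": True,
        "similarity": "cosine",
    }


def build_knn_query(
    query_vector: Sequence[float],
    field: str = "metadata_vector",  # hypothetical field name
    k: int = 10,
    num_candidates: int = 100,
) -> dict:
    """Build an Elasticsearch kNN search body from an already-encoded query."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(
            f"expected {EMBEDDING_DIMS} dimensions, got {len(query_vector)}"
        )
    return {
        "knn": {
            "field": field,
            "query_vector": [float(x) for x in query_vector],
            "k": k,
            "num_candidates": num_candidates,
        }
    }


def load_encoder() -> Callable[[str], List[float]]:
    """Return a text -> vector function backed by Sentence Transformers.

    The import is local so the query-building helpers above can be used
    without the model installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda text: model.encode(text).tolist()
```

The same encoder would be used both when indexing the cached metadata and when encoding the query text, so that documents and queries live in the same vector space.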
The user can access this feature through the following endpoint:
/api/v1/resources/semanticsearch/?query_text=Provide me with images related to cancer
I will deploy it to a server, allowing the team to collaborate and provide feedback.
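For anyone calling the endpoint from a script, a small sketch of how the request could be built is shown below. The base URL is a placeholder for wherever the searchengine is running, and the helper names are hypothetical. The response is returned as raw text because its format is not described here. The query text contains spaces, so it is percent-encoded before being sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

SEMANTIC_SEARCH_PATH = "/api/v1/resources/semanticsearch/"


def semantic_search_url(base_url: str, query_text: str) -> str:
    """Build the semantic search URL with a percent-encoded query_text."""
    params = urlencode({"query_text": query_text}, quote_via=quote)
    return base_url.rstrip("/") + SEMANTIC_SEARCH_PATH + "?" + params


def semantic_search(base_url: str, query_text: str, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urlopen(semantic_search_url(base_url, query_text), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder host and port; replace with the actual deployment.
    print(semantic_search_url(
        "http://localhost:5000",
        "Provide me with images related to cancer",
    ))
```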