Semantic search#109

Open
khaledk2 wants to merge 11 commits into ome:main from khaledk2:sematic_search

@khaledk2
Collaborator

Semantic search uses machine learning to understand the meaning behind words rather than just matching keywords.
It can find related concepts, handle natural-language queries, and retrieve results based on context rather than exact wording.

I’ve built a basic prototype for a feature that allows users to perform semantic search locally. It has been tested with IDR data.

I have tested it with various queries, and the results look promising. For instance, I used the following query:

Provide me with images related to cancer

The top result items are:

  • Number of images: 90642, Pathology is carcinoid, malignant, nos
  • Number of images: 111381, Pathology is carcinoma, embryonal, nos
  • Number of images: 37620, Pathology is carcinoma, nos
  • Number of images: 15984, Pathology is adenocarcinoma, metastatic, nos
  • Number of images: 1029, Pathology is glioma, malignant, nos
  • Number of images: 915, Pathology is neoplasm, malignant, nos
  • Number of images: 166130, Pathology is glioma, malignant, high grade
  • Number of images: 3, CIS - Tumors is 18798
  • Number of images: 75955, Pathology is glioma, malignant, low grade
  • Number of images: 1464774, Pathology is adenocarcinoma, nos

As you can see, it returns the metadata related to the query. This can be extended to automatically build a query that returns all the detailed results, as we already do for exact-match queries.
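A rough sketch of that extension, assuming the result items are available as strings in the form listed above. The filter dictionary built here (`name`, `value`, `operator`, `resource`, `or_filters`) is a hypothetical shape used for illustration only; it is not taken from the searchengine API.

```python
import re

# Matches items such as "Number of images: 915, Pathology is neoplasm, malignant, nos".
RESULT_PATTERN = re.compile(r"Number of images: (\d+), (.+?) is (.+)")


def parse_result(line):
    """Split one result item into its attribute name, value, and image count."""
    match = RESULT_PATTERN.search(line)
    if match is None:
        return None
    count, name, value = match.groups()
    return {"name": name.strip(), "value": value.strip(), "count": int(count)}


def to_exact_match_filters(lines, resource="image"):
    """Turn semantic-search hits into a hypothetical exact-match filter list."""
    filters = []
    for line in lines:
        parsed = parse_result(line)
        if parsed is None:
            continue  # skip anything that does not look like a result item
        filters.append(
            {
                "name": parsed["name"],
                "value": parsed["value"],
                "operator": "equals",  # hypothetical operator name
                "resource": resource,
            }
        )
    return {"or_filters": filters}
```

Each semantic hit becomes one exact-match condition, so a follow-up query could return the images behind every matched attribute value.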

Under the hood, it uses the vector search capabilities in Elasticsearch, which require embedding the data with an NLP model.
This prototype uses the all-MiniLM-L6-v2 model from Sentence Transformers. It is small and fast enough to run locally while still providing good sentence embeddings.
The model is used to embed the searchengine cached data (the searchengine metadata); the searchengine then sends the embeddings to Elasticsearch, where they are saved in dedicated fields that are pre-defined in the template. This has been implemented for the data source's cached data. The query text is encoded with the same model before the query is sent to Elasticsearch, which returns the results.
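The flow described here can be sketched roughly as follows. This is not the code in this PR: the field name `value_embedding` and the helper names are hypothetical, and the encoder is passed in as a plain callable so the sketch does not depend on any installed model. The 384 dimensions match the output size of all-MiniLM-L6-v2, and the `dense_vector` mapping and top-level `knn` search option are standard Elasticsearch 8.x features.

```python
from typing import Callable, Sequence

EMBEDDING_DIMS = 384  # output size of all-MiniLM-L6-v2
VECTOR_FIELD = "value_embedding"  # hypothetical field name, not from the PR

Encoder = Callable[[str], Sequence[float]]


def vector_field_mapping(dims: int = EMBEDDING_DIMS) -> dict:
    """Mapping for the dedicated embedding field in the index template."""
    return {
        "type": "dense_vector",
        "dims": dims,
        "index": True,
        "similarity": "cosine",
    }


def build_document(text: str, encode: Encoder) -> dict:
    """Attach the embedding of a cached metadata string to its document."""
    vector = [float(x) for x in encode(text)]
    return {"text": text, VECTOR_FIELD: vector}


def build_knn_query(query_text: str, encode: Encoder, k: int = 10,
                    num_candidates: int = 100) -> dict:
    """Encode the query with the same model and build a kNN search body."""
    return {
        "knn": {
            "field": VECTOR_FIELD,
            "query_vector": [float(x) for x in encode(query_text)],
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }
```

With the sentence-transformers package installed, `SentenceTransformer("all-MiniLM-L6-v2").encode` could be passed as `encode`. Using the same encoder for documents and queries is what makes the cosine similarity between them meaningful.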

The user can access this feature through the following endpoint:

/api/v1/resources/semanticsearch/?query_text=Provide me with images related to cancer

I will deploy it to a server, allowing the team to collaborate and provide feedback.

@khaledk2
Collaborator Author

The semantic search is now deployed on idr-testing.

It supports questions like this:

Provide me with images related to liver cancer

It uses a new endpoint, /semanticsearch, and the user must provide the query text in the query_text parameter.
It can be tested using the Swagger document:
https://idr-testing.openmicroscopy.org/searchengine/apidocs/#/semantic%20search/get_searchengine__api_v1_resources_semanticsearch_

or directly using the API:
https://idr-testing.openmicroscopy.org/searchengine//api/v1/resources/semanticsearch/?query_text=Provide%20me%20with%20images%20related%20to%20liver%20cancer
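For scripted testing, the request URL can be built with the query text percent-encoded. The following is a minimal sketch using only the standard library; the function name `semantic_search_url` is made up for illustration, and it joins the base URL and the path with a single slash.

```python
from urllib.parse import quote, urlencode

ENDPOINT_PATH = "api/v1/resources/semanticsearch/"


def semantic_search_url(base_url: str, query_text: str) -> str:
    """Build the semantic search URL for a natural-language query."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urlencode({"query_text": query_text}, quote_via=quote)
    return f"{base_url.rstrip('/')}/{ENDPOINT_PATH}?{params}"


url = semantic_search_url(
    "https://idr-testing.openmicroscopy.org/searchengine",
    "Provide me with images related to liver cancer",
)
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client.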

@joshmoore changed the title from "Sematic search" to "Semantic search" on Jun 23, 2025