Description
Problem: Azure AI Search Returns Results for Garbage or Random Words.
What is the right way to use query rewrite, or regular semantic hybrid search without query rewriting, so that results are automatically suppressed for meaningless or unrelated input such as 'aaaa' or 's*x'?
I thought of thresholding on the reranker score, but even for a word like 'xxxxxx' the reranker score is greater than 2.5. With a threshold of 2, these results still appear; with a threshold of 2.5, many matching results are lost even for good search queries.
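For reference, the thresholding idea described above can be sketched as a client-side filter. This is only an illustration of the approach, not a recommended fix: the helper name `filter_by_reranker_score` is hypothetical, the sample data is made up, and the sketch assumes each result behaves like a dict that exposes the `@search.reranker_score` key.

```python
from typing import Any, Iterable, Mapping


def filter_by_reranker_score(
    results: Iterable[Mapping[str, Any]],
    threshold: float,
) -> list[Mapping[str, Any]]:
    """Keep only results whose semantic reranker score meets the threshold.

    Hypothetical helper: assumes each result is dict-like and carries the
    "@search.reranker_score" key. Results without a score are dropped.
    """
    kept = []
    for result in results:
        score = result.get("@search.reranker_score")
        if score is not None and score >= threshold:
            kept.append(result)
    return kept


# Stand-in data for illustration only, not real search output.
sample = [
    {"experienceTitle": "A", "@search.reranker_score": 2.7},
    {"experienceTitle": "B", "@search.reranker_score": 2.2},
    {"experienceTitle": "C", "@search.reranker_score": 1.6},
]
print([r["experienceTitle"] for r in filter_by_reranker_score(sample, 2.0)])
print([r["experienceTitle"] for r in filter_by_reranker_score(sample, 2.5)])
```

Because one cut-off is applied to every query, raising it to exclude nonsense input also removes lower-scoring results for legitimate queries.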
Documents
I have 40 documents in the search index. Each document contains a product title and description.
Queries When Results Are Not Expected
Try 1: Using the Older Version of Azure AI Search Without Recent Query Rewrite
(Refer: [Azure AI Search Query Rewrite Documentation](https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-rewrite))
Scenarios Inside This Try
- Just Semantic Search
- Semantic Hybrid Search (Semantic + Vectorization)
Case A: Just Semantic Search
- Input: 'aaaaaaaaaaaaaaaaa' or 'S*x' or 'random'
- Code:
results = search_client.search(
    search_text=input_data,
    select=["experienceTitle", "experienceDescription"],
    semantic_configuration_name='barsv3',
    query_type="semantic",
    query_language="en-US",
    query_speller='lexicon',
    top=3
)
- Output: As expected, empty results.
Case B: Semantic Hybrid Search
- Input: 'aaaaaaaaaaaaaaaaa' or 'S*x' or 'random'
- Code:
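from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# `embedding` is the query vector for input_data, computed elsewhere
vector_query = VectorizedQuery(
    vector=embedding,
    k_nearest_neighbors=50,
    exhaustive=True,
    fields="experienceDescriptionVector,experienceTitleVector"
)
# `endpoint` and `credential` are defined elsewhere
search_client = SearchClient(
    endpoint=endpoint,
    index_name='bars-v3',
    credential=credential,
    api_version='2024-11-01-preview'
)
# Hybrid query: text search plus the vector query, then semantic reranking
results = search_client.search(
    search_text=input_data,
    vector_queries=[vector_query],
    select=["experienceTitle", "experienceDescription"],
    semantic_configuration_name='barsv3',
    query_type="semantic",
    query_language="en-US",
    query_speller='lexicon',
    top=3
)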
- Output: Not as expected. Results are returned even though none are expected.
- Search Results for 'aaaaaaa':
[
  {"productis": 0, "score": 0.0234118290245533, "reranker_score": 1.6579372882843018},
  {"productis": 1, "score": 0.026050420477986336, "reranker_score": 1.6370235681533813},
  {"productis": 2, "score": 0.025913622230291367, "reranker_score": 1.626389503479004},
  {"productis": 3, "score": 0.03205128386616707, "reranker_score": 1.618236780166626}
]
Decision: Use regular semantic search, because Semantic Hybrid Search returns results for nonsensical input.
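The score listings in this post all share one shape. As a rough sketch of how such a listing might be collected from a response (an assumption, since the collection code is not shown here), the helper below reads `@search.score` and `@search.reranker_score` from each dict-like result. The name `summarize_results` is hypothetical, and `productis` is taken to be the position of the result in the response.

```python
from typing import Any, Iterable, Mapping


def summarize_results(results: Iterable[Mapping[str, Any]]) -> list[dict[str, Any]]:
    """Collect position, search score, and reranker score for each result.

    Hypothetical helper: assumes dict-like results exposing the
    "@search.score" and "@search.reranker_score" keys.
    """
    summary = []
    for position, result in enumerate(results):
        summary.append({
            "productis": position,  # assumed to be the result's position
            "score": result.get("@search.score"),
            "reranker_score": result.get("@search.reranker_score"),
        })
    return summary


# Stand-in data for illustration only, not real search output.
fake_results = [
    {"experienceTitle": "A", "@search.score": 0.03, "@search.reranker_score": 1.6},
    {"experienceTitle": "B", "@search.score": 0.02},
]
for row in summarize_results(fake_results):
    print(row)
```

Printing both scores side by side makes it easier to compare the initial ranking score with the reranker score for the same query.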
Try 2: Newer Version of Azure AI Search Including Query Rewriting
(Refer: [Azure AI Search Query Rewrite Documentation](https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-rewrite))
Scenarios Inside This Try
- Just Semantic Search + Query Rewrite
- Semantic Hybrid Search + Query Rewrite
Case A: Just Semantic Search + Query Rewrite
- Input: 'aaaaaaaaaaaaaaaaa' or 'S*x' or 'random'
- Code:
results = search_client.search(
    search_text=input_data,
    select=["experienceTitle", "experienceDescription"],
    semantic_configuration_name='barsv3',
    query_type="semantic",
    query_language="en-US",
    query_speller='lexicon',
    query_rewrites="generative",
    debug="queryRewrites",
    top=4
)
- Output: Not as expected.
- Search Results for 'aaaaaaa':
[ "meaning of aaaaaaaa", "what does aaaaaaaa mean", "define aaaaaaa", "aaaaaaa meaning" ]
[
  {"productis": 0, "score": 0.7754897, "reranker_score": 1.6579372882843018},
  {"productis": 1, "score": 0.27041504, "reranker_score": 1.6370235681533813},
  {"productis": 2, "score": 1.0258656, "reranker_score": 1.618236780166626},
  {"productis": 3, "score": 0.20604418, "reranker_score": 1.524656891822815}
]
Case B: Semantic Hybrid Search + Query Rewrite
- Input: 'aaaaaaaaaaaaaaaaa' or 'S*x' or 'random'
- Code:
results = search_client.search(
    search_text=input_data,
    select=["experienceTitle", "experienceDescription"],
    semantic_configuration_name='barsv3',
    query_type="semantic",
    query_language="en-US",
    query_speller='lexicon',
    query_rewrites="generative",
    debug="queryRewrites",
    top=4
)
- Output: Not as expected. Results returned despite nonsensical input.