GitHub Issue: Inconsistent Citation Results Across Identical Queries with Multi-Document Sets
Description
When submitting the same query multiple times across a multi-document set, the system returns inconsistent citation results. For example:
- First query: Returns citations from Document A (searching across 5 documents)
- Second query (identical): Returns citations from Documents B and C, but excludes Document A (searching across the same 5 documents)
This inconsistency undermines user confidence in the retrieval system and makes results unpredictable.
Expected Behavior
Identical queries submitted against the same multi-document set should return consistent citation results, maintaining deterministic ranking and document selection across requests.
Current Behavior
- Same query produces different citation sets on subsequent executions
- Documents that appear in first query results may be excluded in second query results
- Citation ordering and relevance scores appear non-deterministic
Root Cause Analysis
After investigating the codebase, several factors likely contribute to this inconsistency:
1. Azure AI Search Semantic Ranking Non-Determinism
Location: functions_search.py (Lines 1-280)
The hybrid search uses Azure AI Search with semantic ranking enabled:
query_type="semantic",
semantic_configuration_name="nexus-user-index-semantic-configuration",
query_caption="extractive",
query_answer="extractive",
Issue: Azure AI Search's semantic ranker can produce slightly different scores across identical queries due to:
- Internal model variations
- Non-deterministic tie-breaking when scores are similar
- Semantic reranking behavior that may vary slightly per request
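A quick way to confirm this factor is to run the same query several times and diff the returned IDs. A minimal repeatability check, assuming a `search_fn` callable that returns result dicts with a `chunk_id` key (both names are illustrative, not part of the codebase):

```python
def check_repeatability(search_fn, query, runs=5):
    """Run the same query repeatedly and report whether the returned
    result IDs (and their order) are identical on every run."""
    baseline = None
    for i in range(runs):
        ids = [r["chunk_id"] for r in search_fn(query)]
        if baseline is None:
            baseline = ids
        elif ids != baseline:
            return False, i  # index of the first run that diverged
    return True, None
```

Running this against `hybrid_search` with semantic ranking on and then off would isolate how much of the drift comes from the reranker versus the merging/sorting issues described below.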
2. Multi-Index Result Merging Without Normalization
Location: functions_search.py (Lines 133-139)
When doc_scope="all", results are merged from three separate indexes:
user_results_final = extract_search_results(user_results, top_n)
group_results_final = extract_search_results(group_results, top_n)
public_results_final = extract_search_results(public_results, top_n)
results = user_results_final + group_results_final + public_results_final
Issue:
- Scores from different indexes may be on different scales
- No score normalization occurs before merging
- Final sorting (Line 258) treats all scores as directly comparable when they may not be
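The scale mismatch is easy to demonstrate with made-up numbers. In the sketch below, one index returns BM25-style scores (roughly 0-30) while another returns reranker-style scores (roughly 0-4); the IDs and values are illustrative only:

```python
# Hypothetical per-index results on different score scales.
user_results = [{"id": "u1", "score": 18.2}, {"id": "u2", "score": 9.7}]
group_results = [{"id": "g1", "score": 3.9}, {"id": "g2", "score": 2.1}]

# Raw merge + sort, as the current code does:
merged = sorted(user_results + group_results,
                key=lambda r: r["score"], reverse=True)

# Every user-index hit outranks every group-index hit,
# regardless of actual relevance.
print([r["id"] for r in merged])  # ['u1', 'u2', 'g1', 'g2']
```

When the per-index score distributions shift slightly between requests, documents near the boundary between scales can flip in and out of the top-N cut, which matches the reported symptom.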
3. Score-Based Sorting Without Secondary Sort Keys
Location: functions_search.py (Line 258)
results = sorted(results, key=lambda x: x['score'], reverse=True)[:top_n]
Issue:
- When multiple documents have identical or near-identical scores, their relative order is effectively arbitrary
- Python's sort is stable, but stability only preserves the incoming order of tied items, and that incoming order itself varies from request to request
- There is no secondary sort criterion (e.g., document_id, timestamp, file_name) to guarantee deterministic ordering
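The tie-ordering problem can be shown in a few lines. The `score` and `file_name` fields below mirror the ones used in Option 1 and are assumptions about the result schema:

```python
# Two tied results arriving in a different order on each "request".
run1 = [{"score": 0.9, "file_name": "b.pdf"}, {"score": 0.9, "file_name": "a.pdf"}]
run2 = [{"score": 0.9, "file_name": "a.pdf"}, {"score": 0.9, "file_name": "b.pdf"}]

def by_score(rs):
    # Current behavior: stable sort preserves each run's (different) input order.
    return [r["file_name"] for r in sorted(rs, key=lambda x: x["score"], reverse=True)]

def deterministic(rs):
    # Tuple key breaks ties by filename, independent of input order.
    return [r["file_name"] for r in sorted(rs, key=lambda x: (-x["score"], x["file_name"]))]

print(by_score(run1), by_score(run2))            # ['b.pdf', 'a.pdf'] ['a.pdf', 'b.pdf']
print(deterministic(run1), deterministic(run2))  # ['a.pdf', 'b.pdf'] ['a.pdf', 'b.pdf']
```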
4. No Result Caching or Request Deduplication
Location: route_backend_chats.py (Lines 559-639)
Each request executes a fresh search with no caching:
search_results = hybrid_search(**search_args)
Issue:
- Every request hits Azure AI Search independently
- No mechanism to detect and return cached results for identical queries
- No session-based consistency guarantee
Impact
- User Trust: Inconsistent results reduce confidence in the system
- Reproducibility: Users cannot reliably reference or share specific query results
- Testing/Validation: Difficult to validate system accuracy when results vary
- Enterprise Adoption: Organizations require consistent behavior for compliance and audit purposes
Affected Components
- functions_search.py - hybrid_search() function
- functions_search.py - extract_search_results() function
- route_backend_chats.py - Chat endpoint search integration
- functions_conversation_metadata.py - Citation tracking
- route_backend_documents.py - /api/get_citation endpoint
Proposed Solutions
Option 1: Deterministic Sorting (Quick Fix)
Add secondary sort keys to ensure consistent ordering when scores are equal:
# In functions_search.py, line 258
results = sorted(
    results,
    key=lambda x: (
        -x['score'],          # Primary: score (descending)
        x['file_name'],       # Secondary: filename (ascending)
        x['chunk_sequence']   # Tertiary: chunk order (ascending)
    )
)[:top_n]
Option 2: Score Normalization (Medium Fix)
Normalize scores from different indexes before merging:
def normalize_scores(results, min_score=0.0, max_score=1.0):
    """Min-max normalize search scores to a consistent range."""
    if not results:
        return results
    scores = [r['score'] for r in results]
    min_s, max_s = min(scores), max(scores)
    range_s = max_s - min_s if max_s > min_s else 1.0
    for r in results:
        r['normalized_score'] = min_score + ((r['score'] - min_s) / range_s) * (max_score - min_score)
    return results

# Apply normalization to each index result before merging
user_results_final = normalize_scores(extract_search_results(user_results, top_n))
group_results_final = normalize_scores(extract_search_results(group_results, top_n))
public_results_final = normalize_scores(extract_search_results(public_results, top_n))
Note that the final sort at line 258 would then need to key on normalized_score instead of score for the normalization to take effect.
Option 3: Result Caching (Comprehensive Fix)
Implement query-based caching to return identical results for identical queries within a session:
import hashlib
import time

def generate_search_cache_key(query, user_id, document_id, doc_scope,
                              active_group_id, active_public_workspace_id,
                              enable_file_sharing, top_n):
    """Generate a cache key covering every argument that affects the search."""
    key_data = (f"{query}|{user_id}|{document_id}|{doc_scope}|{active_group_id}|"
                f"{active_public_workspace_id}|{enable_file_sharing}|{top_n}")
    return hashlib.sha256(key_data.encode()).hexdigest()

# Implement time-based cache expiration (e.g., 5 minutes)
search_cache = {}
cache_ttl = 300  # seconds

def cached_hybrid_search(query, user_id, document_id=None, top_n=12, doc_scope="all",
                         active_group_id=None, active_public_workspace_id=None,
                         enable_file_sharing=True):
    cache_key = generate_search_cache_key(query, user_id, document_id, doc_scope,
                                          active_group_id, active_public_workspace_id,
                                          enable_file_sharing, top_n)
    # Check cache
    if cache_key in search_cache:
        cached_result, cached_time = search_cache[cache_key]
        if time.time() - cached_time < cache_ttl:
            return cached_result
    # Execute search
    results = hybrid_search(query, user_id, document_id, top_n, doc_scope,
                            active_group_id, active_public_workspace_id,
                            enable_file_sharing)
    # Store in cache
    search_cache[cache_key] = (results, time.time())
    return results
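The caching pattern above can be exercised end to end with a stubbed search function; everything below (the stub, its IDs, the query strings) is illustrative:

```python
import hashlib
import time

search_cache = {}
cache_ttl = 300  # seconds
call_count = {"n": 0}

def fake_hybrid_search(query, user_id):
    """Stand-in for the real hybrid_search; counts how often it is hit."""
    call_count["n"] += 1
    return [{"chunk_id": f"{query}-1", "score": 1.0}]

def cached_search(query, user_id):
    key = hashlib.sha256(f"{query}|{user_id}".encode()).hexdigest()
    if key in search_cache:
        results, cached_at = search_cache[key]
        if time.time() - cached_at < cache_ttl:
            return results
    results = fake_hybrid_search(query, user_id)
    search_cache[key] = (results, time.time())
    return results

first = cached_search("policy summary", "user-1")
second = cached_search("policy summary", "user-1")
print(first is second, call_count["n"])  # True 1 (second call served from cache)
```

One caveat: a module-level dict is per-process, so under multiple gunicorn/uwsgi workers each process would hold its own cache; a shared store (e.g., Redis) would be needed to guarantee cross-request consistency in that deployment.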