Commit ee72708

feat: Normalize embeddings to unit vectors for better cosine similarity
- Normalize embeddings in EmbeddingService.encode_text()
- Converts embeddings to unit vectors (norm=1.0)
- Improves cosine similarity calculations
- Reduces distance between semantically similar texts
- Should fix high distance issues in ChromaDB retrieval
1 parent ffbc8ae commit ee72708
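
The commit message's claim that unit vectors reduce distance can be checked numerically. The sketch below is illustrative only: the vectors are made up, the helper names are hypothetical, and it assumes the collection uses a squared-L2 style metric (the commit does not say which metric is configured). For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity, so magnitude differences no longer inflate the distance.

```python
import math

def l2_normalize(v):
    """Scale v to unit length; return a copy unchanged if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors: nearly the same direction, very different magnitudes.
a = [3.0, 4.0, 0.0]
b = [30.0, 41.0, 2.0]

raw_distance = squared_l2(a, b)            # dominated by the magnitude gap
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
unit_distance = squared_l2(unit_a, unit_b)  # depends only on the angle
cos = cosine_similarity(a, b)
```

After normalization, ranking by squared L2 distance and ranking by cosine similarity give the same order, which is one plausible reason the change would help retrieval.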

1 file changed

Lines changed: 16 additions & 1 deletion

stillme_core/rag/embeddings.py

@@ -409,9 +409,24 @@ def encode_text(self, text: Union[str, List[str]]) -> Union[List[float], List[Li
         # Generate embeddings
         embeddings = self.model.encode(text, convert_to_tensor=False)
 
+        # CRITICAL: Normalize embeddings to unit vectors for better cosine similarity
+        # This improves retrieval accuracy, especially for ChromaDB
+        import numpy as np
+        embeddings_array = np.array(embeddings)
+        if len(embeddings_array.shape) == 1:
+            # Single embedding vector
+            norm = np.linalg.norm(embeddings_array)
+            if norm > 0:
+                embeddings_array = embeddings_array / norm
+        else:
+            # Batch of embeddings
+            norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
+            norms = np.where(norms > 0, norms, 1.0)  # Avoid division by zero
+            embeddings_array = embeddings_array / norms
+
         # OPTIMIZATION: Cache single text embeddings (both in-memory and Redis)
         if isinstance(text, str):
-            embedding_list = embeddings.tolist() if hasattr(embeddings, 'tolist') else list(embeddings)
+            embedding_list = embeddings_array.tolist() if hasattr(embeddings_array, 'tolist') else list(embeddings_array)
             cache_key = self._get_cache_key(text)
 
             # Update in-memory cache