fix: Ensure StillMe mentions correct embedding model and improve tracking visibility

anhmtk · anhmtk · commit 086d5a322807 · 2025-12-02T13:00:43.000+07:00
Model Information:
- Create stillme_core/rag/model_info.py as single source of truth
- Add explicit warnings in prompts to prevent mentioning old model (all-MiniLM-L6-v2)
- Ensure StillMe always mentions paraphrase-multilingual-MiniLM-L12-v2

Self-Tracking Visibility:
- Log task tracking at debug level when the info-level self-estimate is not emitted, so every tracked task leaves a log record
- StillMe can now see its own tracking in logs
- Helps verify that self-tracking is working correctly
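
Whether a tracking line actually shows up depends on the level the logger is configured with. The following is a minimal sketch using only the standard `logging` module; the logger name, the messages, and the `capture_log_output` helper are illustrative assumptions and are not taken from the StillMe code:

```python
import io
import logging


def capture_log_output(level: int) -> str:
    """Emit one info and one debug record; return what a logger set to `level` shows."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # Hypothetical logger name, unique per level so repeated calls do not interfere.
    logger = logging.getLogger(f"self_tracking_demo_{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.handlers = [handler]

    logger.info("self-estimate: 2.50 min")
    logger.debug("tracking: demo task")
    return stream.getvalue()


if __name__ == "__main__":
    print(capture_log_output(logging.INFO))   # debug record is filtered out
    print(capture_log_output(logging.DEBUG))  # both records appear
```

With the logger at `INFO`, only the self-estimate line is captured; the tracking line appears once the level is lowered to `DEBUG`.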
diff --git a/backend/identity/prompt_builder.py b/backend/identity/prompt_builder.py
@@ -525,7 +525,7 @@ def _build_stillme_instruction(self, detected_lang: str) -> str:
 2. **KHÔNG có mechanical disclaimer**: Nếu bạn có foundational knowledge trong context, KHÔNG nói "Dựa trên kiến thức tổng quát (không từ cơ sở tri thức RAG)" - bạn ĐÃ CÓ StillMe knowledge trong context!
 3. **Mention cụ thể về StillMe features**: Khi giải thích sự khác biệt của StillMe, bạn PHẢI mention:
    - **RAG (Retrieval-Augmented Generation)**: StillMe sử dụng RAG với ChromaDB vector database
-   - **Embedding Model**: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, tối ưu cho multilingual Q&A retrieval, hỗ trợ 50+ ngôn ngữ)
+   - **Embedding Model**: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, tối ưu cho multilingual Q&A retrieval, hỗ trợ 50+ ngôn ngữ) - CRITICAL: Đây là model hiện tại, KHÔNG phải all-MiniLM-L6-v2
    - **Continuous Learning**: StillMe học tự động mỗi 4 giờ (6 lần/ngày) từ RSS feeds, arXiv, CrossRef, và Wikipedia
    - **Validation Chain**: Multi-layer validation (CitationRequired, EvidenceOverlap, ConfidenceValidator, FactualHallucinationValidator, FallbackHandler) để giảm hallucinations 80%
    - **Transcends Knowledge Cutoff**: StillMe KHÔNG bị giới hạn bởi training data cutoff dates - nó cập nhật knowledge liên tục qua RAG
@@ -536,7 +536,7 @@ def _build_stillme_instruction(self, detected_lang: str) -> str:
 
 **1. RAG Architecture:**
 - StillMe sử dụng RAG với ChromaDB làm vector database
-- Content được embed bằng paraphrase-multilingual-MiniLM-L12-v2 model
+- Content được embed bằng paraphrase-multilingual-MiniLM-L12-v2 model (KHÔNG phải all-MiniLM-L6-v2)
 - 384-dimensional embeddings tối ưu cho multilingual Q&A retrieval
 - Khi trả lời, StillMe tìm kiếm ChromaDB bằng semantic similarity
 
@@ -582,7 +582,7 @@ def _build_stillme_instruction(self, detected_lang: str) -> str:
 2. **NO mechanical disclaimer**: If you have foundational knowledge in context, DO NOT say "Based on general knowledge (not from StillMe's RAG knowledge base)" - you HAVE StillMe knowledge in context!
 3. **Mention SPECIFIC StillMe features**: When explaining StillMe's differences, you MUST mention:
    - **RAG (Retrieval-Augmented Generation)**: StillMe uses RAG with ChromaDB vector database
-   - **Embedding Model**: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, optimized for multilingual Q&A retrieval, supports 50+ languages)
+   - **Embedding Model**: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, optimized for multilingual Q&A retrieval, supports 50+ languages) - CRITICAL: This is the CURRENT model, NOT all-MiniLM-L6-v2
    - **Continuous Learning**: StillMe learns automatically every 4 hours (6 cycles/day) from RSS feeds, arXiv, CrossRef, and Wikipedia
    - **Validation Chain**: Multi-layer validation (CitationRequired, EvidenceOverlap, ConfidenceValidator, FactualHallucinationValidator, FallbackHandler) to reduce hallucinations by 80%
    - **Transcends Knowledge Cutoff**: StillMe is NOT limited by training data cutoff dates - it continuously updates knowledge through RAG
@@ -593,7 +593,7 @@ def _build_stillme_instruction(self, detected_lang: str) -> str:
 
 **1. RAG Architecture:**
 - StillMe uses RAG with ChromaDB as vector database
-- Content is embedded using paraphrase-multilingual-MiniLM-L12-v2 model
+- Content is embedded using paraphrase-multilingual-MiniLM-L12-v2 model (NOT all-MiniLM-L6-v2)
 - 384-dimensional embeddings optimized for multilingual Q&A retrieval
 - When answering, StillMe searches ChromaDB using semantic similarity
 
diff --git a/stillme_core/monitoring/self_tracking.py b/stillme_core/monitoring/self_tracking.py
@@ -80,6 +80,9 @@ def track_task_execution(
         estimate_text = estimator.format_estimate(estimate)
         logger.info(f"📊 StillMe self-estimate: {estimate_text}")
         logger.info(f"   I'm an AI system that tracks my own performance to improve estimates over time.")
+    else:
+        # Still log at debug level for internal tracking visibility
+        logger.debug(f"📊 StillMe tracking: {task_description} (estimate: {estimate.estimated_minutes:.2f} min, confidence: {estimate.confidence:.0%})")
     
     try:
         # Yield estimate for use in task
diff --git a/stillme_core/rag/model_info.py b/stillme_core/rag/model_info.py
@@ -0,0 +1,39 @@
+"""
+Model Information for StillMe
+
+Provides accurate model information that StillMe can use in responses.
+This ensures StillMe always mentions the correct model names and versions.
+"""
+
+# CRITICAL: This is the SINGLE SOURCE OF TRUTH for embedding model information
+# If the model changes, update this file and re-run the foundational knowledge update
+
+EMBEDDING_MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"
+EMBEDDING_MODEL_DIMENSIONS = 384
+EMBEDDING_MODEL_DESCRIPTION = "sentence-transformers model optimized for multilingual Q&A retrieval, supports 50+ languages"
+
+def get_embedding_model_info() -> dict:
+    """
+    Get current embedding model information.
+    
+    Returns:
+        Dictionary with model information:
+        - name: Model name
+        - dimensions: Embedding dimensions
+        - description: Human-readable description
+    """
+    return {
+        "name": EMBEDDING_MODEL_NAME,
+        "dimensions": EMBEDDING_MODEL_DIMENSIONS,
+        "description": EMBEDDING_MODEL_DESCRIPTION
+    }
+
+def get_embedding_model_display_name() -> str:
+    """
+    Get formatted model name for display in responses.
+    
+    Returns:
+        Formatted string: "paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions, ...)"
+    """
+    return f"{EMBEDDING_MODEL_NAME} ({EMBEDDING_MODEL_DIMENSIONS} dimensions, {EMBEDDING_MODEL_DESCRIPTION})"
+
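
The prompt strings in `prompt_builder.py` still spell the model name out by hand, so the constants in `model_info.py` are a single source of truth only if callers read from them. The sketch below shows one way a prompt line could be rendered from the shared constants and how hand-written text could be checked for stale mentions. `OLD_EMBEDDING_MODEL_NAME`, `build_embedding_prompt_line`, and `find_stale_mentions` are hypothetical names that do not appear in this commit, and the constants are repeated inline only so the example runs on its own:

```python
from typing import List

# In the repository these two values would be imported from
# stillme_core.rag.model_info; they are duplicated here for a standalone sketch.
EMBEDDING_MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"
EMBEDDING_MODEL_DIMENSIONS = 384

# Hypothetical constant: the previous model name that prompts must not present
# as the current one.
OLD_EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"


def build_embedding_prompt_line() -> str:
    """Render the RAG-architecture bullet from the shared constants."""
    return (
        f"- Content is embedded using {EMBEDDING_MODEL_NAME} model "
        f"(NOT {OLD_EMBEDDING_MODEL_NAME})"
    )


def find_stale_mentions(prompt_text: str) -> List[str]:
    """Return lines that name the old model without also naming the current one."""
    return [
        line
        for line in prompt_text.splitlines()
        if OLD_EMBEDDING_MODEL_NAME in line and EMBEDDING_MODEL_NAME not in line
    ]


if __name__ == "__main__":
    prompt = "StillMe uses all-MiniLM-L6-v2\n" + build_embedding_prompt_line()
    print(find_stale_mentions(prompt))
```

A check like `find_stale_mentions` could run in a unit test over the assembled prompt, so that a future model change only requires editing `model_info.py`.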