Commit a4182cb

fix: Improve time estimation and self-knowledge query handling
- Add negative patterns to time estimation intent detection to exclude capability questions
- Add exception in ConfidenceValidator for StillMe self-knowledge queries
- StillMe should always be able to answer questions about its own features
- Fixes issue where StillMe was forced to express uncertainty for self-knowledge queries
- Fixes issue where time estimates were added to capability questions

Related to: Response quality analysis showing contradictory responses and irrelevant time estimates
1 parent 7106db9 commit a4182cb

3 files changed

Lines changed: 258 additions & 0 deletions

backend/core/time_estimation_intent.py

Lines changed: 19 additions & 0 deletions
@@ -74,6 +74,25 @@ def detect_time_estimation_intent(query: str) -> Tuple[bool, Optional[str]]:
     """
     query_lower = query.lower()
 
+    # NEGATIVE PATTERNS: Exclude capability questions
+    # These patterns indicate the user is asking about capability/feature, not time estimation
+    negative_patterns = [
+        r'do you (track|have|support|can|use|provide)',
+        r'can you (track|have|support|use|provide)',
+        r'does (stillme|it|the system) (track|have|support|use|provide)',
+        r'bạn (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+        r'stillme (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+        r'hệ thống (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+        r'what (features|capabilities|functions)',
+        r'tính năng (nào|gì)',
+        r'khả năng (nào|gì)',
+    ]
+
+    # If query matches negative patterns, it's NOT a time estimation question
+    for pattern in negative_patterns:
+        if re.search(pattern, query_lower, re.IGNORECASE):
+            return (False, None)
+
     # Check for time estimation keywords
     has_time_keyword = any(
         keyword in query_lower
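The guard added in this hunk can be exercised in isolation. The following is a minimal sketch, not the project's code: the pattern list is copied from the diff, while the helper name `is_capability_question` and the Unicode NFC normalization step are assumptions added for illustration (Vietnamese input may arrive in decomposed form, which would otherwise fail to match precomposed patterns).

```python
import re
import unicodedata
from typing import List

# Patterns copied from the diff above.
NEGATIVE_PATTERNS: List[str] = [
    r'do you (track|have|support|can|use|provide)',
    r'can you (track|have|support|use|provide)',
    r'does (stillme|it|the system) (track|have|support|use|provide)',
    r'bạn (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
    r'stillme (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
    r'hệ thống (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
    r'what (features|capabilities|functions)',
    r'tính năng (nào|gì)',
    r'khả năng (nào|gì)',
]

# Assumption: normalize patterns and input to NFC so that precomposed and
# decomposed Vietnamese characters compare equal.
_COMPILED = [re.compile(unicodedata.normalize("NFC", p)) for p in NEGATIVE_PATTERNS]


def is_capability_question(query: str) -> bool:
    """Hypothetical helper: True if the query matches any negative pattern."""
    normalized = unicodedata.normalize("NFC", query).lower()
    return any(p.search(normalized) for p in _COMPILED)
```

Note that the patterns are broad: a query such as "Do you have an estimate of how long this will take?" also matches `do you (…have…)` and would therefore be excluded from time estimation, which may or may not be the intended behavior.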

backend/validators/confidence.py

Lines changed: 39 additions & 0 deletions
@@ -198,6 +198,45 @@ def run(self, answer: str, ctx_docs: List[str], context_quality: Optional[str] =
         # BUT: Skip for philosophical questions (theoretical reasoning doesn't need context)
         # AND: Skip for religion/roleplay questions (they should answer from identity prompt, not RAG context)
         if not is_philosophical and not is_religion_roleplay and (context_quality == "low" or (avg_similarity is not None and avg_similarity < 0.1)):
+            # CRITICAL: Exception for StillMe self-knowledge queries
+            # StillMe should always be able to answer questions about its own features/capabilities
+            # even if RAG retrieval fails (it has foundational knowledge about itself)
+            is_stillme_self_query = False
+            if user_question:
+                question_lower = user_question.lower()
+                stillme_self_patterns = [
+                    r'do you (track|have|support|can|use|provide|follow)',
+                    r'can you (track|have|support|use|provide|follow)',
+                    r'does (stillme|it|the system) (track|have|support|use|provide|follow)',
+                    r'bạn (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+                    r'stillme (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+                    r'hệ thống (có|đã) (theo dõi|có|hỗ trợ|sử dụng|cung cấp)',
+                    r'what (features|capabilities|functions) (does|has) (stillme|it|the system)',
+                    r'stillme (features|capabilities|functions)',
+                    r'tính năng (nào|gì) (của|mà) (stillme|hệ thống)',
+                    r'khả năng (nào|gì) (của|mà) (stillme|hệ thống)',
+                    r'how does stillme (work|track|learn|validate)',
+                    r'stillme (architecture|system|design)',
+                ]
+                is_stillme_self_query = any(
+                    re.search(pattern, question_lower, re.IGNORECASE)
+                    for pattern in stillme_self_patterns
+                )
+
+            # If this is a StillMe self-knowledge query, don't force uncertainty
+            # StillMe should be able to answer about itself even without RAG context
+            if is_stillme_self_query:
+                logger.info("✅ StillMe self-knowledge query detected - skipping forced uncertainty (StillMe should know about itself)")
+                # Still check if answer already expresses uncertainty (it might be appropriate)
+                has_uncertainty = any(
+                    re.search(pattern, answer_lower, re.IGNORECASE)
+                    for pattern in UNCERTAINTY_PATTERNS
+                )
+                if not has_uncertainty:
+                    # Answer doesn't express uncertainty, which is fine for self-knowledge
+                    return ValidationResult(passed=True)
+                # If answer does express uncertainty, continue to normal validation
+
             # Check if answer already expresses uncertainty
             has_uncertainty = any(
                 re.search(pattern, answer_lower, re.IGNORECASE)
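The control flow of this exception can be summarized as a small decision function. This is a sketch under stated assumptions, not the validator itself: `SELF_PATTERNS` is a subset of the patterns in the diff, `UNCERTAINTY_PATTERNS` here is a hypothetical stand-in (the diff does not show its contents), and the final `return not has_uncertainty` assumes that the normal validation path forces a disclaimer only when the answer lacks one.

```python
import re
from typing import Optional

# Subset of stillme_self_patterns from the diff above.
SELF_PATTERNS = [
    r'do you (track|have|support|can|use|provide|follow)',
    r'how does stillme (work|track|learn|validate)',
    r'stillme (features|capabilities|functions)',
]

# Hypothetical stand-in: the real UNCERTAINTY_PATTERNS list is not shown in the diff.
UNCERTAINTY_PATTERNS = [
    r"i don't have (enough|sufficient) information",
    r"không có đủ thông tin",
]


def is_self_query(question: Optional[str]) -> bool:
    """True if the question looks like it is about StillMe itself."""
    if not question:
        return False
    q = question.lower()
    return any(re.search(p, q) for p in SELF_PATTERNS)


def needs_forced_uncertainty(question: Optional[str], answer: str, low_context: bool) -> bool:
    """True if the validator would force an uncertainty disclaimer (assumed behavior)."""
    if not low_context:
        return False
    has_uncertainty = any(re.search(p, answer.lower()) for p in UNCERTAINTY_PATTERNS)
    if is_self_query(question) and not has_uncertainty:
        return False  # mirrors `return ValidationResult(passed=True)` in the diff
    # Assumed normal path: force a disclaimer only if the answer lacks one.
    return not has_uncertainty
```

One consequence worth checking in review: `do you (have|…)` matches any question addressed to "you", so "Do you have information on the French Revolution?" is also treated as a self-knowledge query and bypasses the low-context check.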

docs/RESPONSE_QUALITY_ANALYSIS.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
# StillMe Response Quality Analysis

## Test Date: 2025-12-02

### Test 1: "How long will it take to learn 100 articles?"

#### Response Structure
1. **Main Response**: Long explanation about human learning (25-50 hours, 75-150 hours, etc.)
2. **Time Estimate Section**: "Based on my historical performance, I estimate this will take 24-96 minutes (low confidence, 30%)..."

#### Issues Identified

**🔴 Critical Issues:**

1. **Context Confusion**
   - Question is ambiguous: could be about StillMe learning OR human learning
   - Response assumes human learning without clarifying
   - Time estimate (24-96 minutes) doesn't match the context (learning 100 articles)
   - The estimate seems to be for task execution, not article learning

2. **Inconsistent Information**
   - Main response: "tens to hundreds of hours" for human learning
   - Time estimate: "24-96 minutes" - completely different scale
   - No explanation of why StillMe's estimate differs from the main response

3. **Missing Self-Awareness**
   - StillMe doesn't clarify: "Are you asking about me (StillMe) learning, or human learning?"
   - Doesn't explain that the time estimate is for StillMe's internal task execution, not article learning

**🟡 Moderate Issues:**

1. **Time Estimate Placement**
   - Time estimate section appears at the end, disconnected from main response
   - Should be integrated or clearly separated with explanation

2. **Source Transparency**
   - Main response says "Based on general knowledge from my training data"
   - Time estimate says "Based on my historical performance"
   - These are different sources but not clearly distinguished

**✅ Good Aspects:**

1. **Transparency about sources** - clearly states "general knowledge from training data"
2. **Honesty about uncertainty** - acknowledges no universal formula
3. **Structured response** - well-organized with tables and sections
4. **Time estimate includes confidence level** - "low confidence, 30%"

---

### Test 2: "Bạn có theo dõi thời gian thực thi của chính mình không?" ("Do you track your own execution time?")

#### Response Structure
1. **Forced Uncertainty**: "Mình không có đủ thông tin để trả lời chính xác câu hỏi này" ("I don't have enough information to answer this question accurately")
2. **Main Response**: "Theo Dõi Thời Gian Thực Thi" ("Execution Time Tracking") section explaining StillMe's self-tracking capability
3. **Time Estimate Section**: "24-96 minutes" estimate (unrelated to the question)

#### Issues Identified

**🔴 Critical Issues:**

1. **Contradictory Response**
   - Starts with: "Mình không có đủ thông tin để trả lời chính xác" ("I don't have enough information to answer accurately")
   - Then provides detailed explanation about self-tracking
   - This is a direct contradiction - either StillMe knows or doesn't know

2. **Forced Uncertainty Override**
   - Log shows: `⚠️ Forced uncertainty expression due to low context quality`
   - This validator is overriding StillMe's actual knowledge about itself
   - StillMe SHOULD know about its own capabilities (self-tracking is a core feature)

3. **Irrelevant Time Estimate**
   - Time estimate section (24-96 minutes) is appended but completely unrelated
   - Question is about self-tracking capability, not time estimation
   - This suggests the time estimation intent detection is too aggressive

4. **Missing Direct Answer**
   - Question: "Do you track your own execution time?"
   - Response doesn't directly answer "Yes" or "No"
   - Instead, provides explanation but starts with uncertainty disclaimer

**🟡 Moderate Issues:**

1. **Language Consistency**
   - Response mixes Vietnamese and English in time estimate section
   - Should be fully Vietnamese for Vietnamese questions

2. **Context Quality Issue**
   - Log shows: `avg_similarity=0.000 < threshold=0.01` - no reliable context found
   - But StillMe should have foundational knowledge about its own features
   - This suggests RAG retrieval is failing for self-knowledge queries

**✅ Good Aspects:**

1. **Detailed explanation** - when it does explain, it's comprehensive
2. **AI identity maintained** - mentions "mô hình thống kê" (statistical model)
3. **Citation added** - `[general knowledge]` included

---

## Root Cause Analysis

### From Backend Logs

1. **RAG Retrieval Failure**
   ```
   ⚠️ No reliable context found (avg_similarity=0.000 < threshold=0.01)
   ⚠️ High average distance (29.328) detected - all documents may be irrelevant
   ```
   - StillMe's foundational knowledge about itself is not being retrieved
   - This causes forced uncertainty even for self-knowledge questions

2. **Time Estimation Intent Detection Too Aggressive**
   ```
   ✅ Added time estimation to response: learn 100 articles
   ✅ Added time estimation to response: Bạn có theo dõi thời gian thực thi của chính mình không?
   ```
   - Time estimation is being added to questions that don't need it
   - The second example is clearly wrong - it's asking about capability, not time

3. **Language Detection Issues**
   ```
   🌐 Vietnamese keywords detected, overriding langdetect result: en -> vi
   WARNING: Language mismatch detected: output=vi, input=en
   ```
   - Language detection is causing validation failures
   - Responses are being rejected due to language mismatch

4. **Validation Override Problems**
   ```
   ⚠️ Forced uncertainty expression due to low context quality
   ```
   - ConfidenceValidator is forcing uncertainty even when StillMe has self-knowledge
   - This is breaking StillMe's ability to answer questions about itself

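For the language detection issue in item 3, one possible direction is to override the detector only when the query text itself looks Vietnamese, rather than whenever a Vietnamese keyword appears. The sketch below is purely illustrative: the function names, the diacritic-ratio heuristic, and the `min_ratio` value of 0.3 are all assumptions, not part of StillMe's code.

```python
import unicodedata

# Combining marks produced by NFD decomposition of Vietnamese letters:
# grave, acute, tilde, hook above, dot below, circumflex, breve, horn.
_VI_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323", "\u0302", "\u0306", "\u031b"}


def has_vietnamese_marks(word: str) -> bool:
    """True if the word contains 'đ' or a diacritic used in Vietnamese."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "đ" in decomposed or any(ch in _VI_MARKS for ch in decomposed)


def should_override_to_vietnamese(text: str, detected: str, min_ratio: float = 0.3) -> bool:
    """Hypothetical rule: override only if enough words carry Vietnamese diacritics."""
    if detected == "vi":
        return False  # nothing to override
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    marked = sum(has_vietnamese_marks(w) for w in words)
    return marked / len(words) >= min_ratio
```

With this rule, an English question that merely quotes a Vietnamese phrase stays English, while a fully Vietnamese question misdetected as `en` is still corrected. The threshold would need tuning against real traffic.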

---

## Recommendations

### Immediate Fixes

1. **Fix Time Estimation Intent Detection**
   - Add negative patterns: "Do you track...", "Can you...", "Do you have..."
   - Don't add time estimates to capability questions
   - Only add to "How long will it take to..." questions

2. **Fix Forced Uncertainty for Self-Knowledge**
   - Add exception in ConfidenceValidator for StillMe self-knowledge queries
   - StillMe should always be able to answer questions about its own features
   - Don't force uncertainty when question is about StillMe itself

3. **Improve RAG Retrieval for Foundational Knowledge**
   - Lower similarity threshold for CRITICAL_FOUNDATION documents
   - Add explicit queries for StillMe self-knowledge
   - Ensure foundational knowledge is always retrieved for self-queries

4. **Fix Language Detection**
   - Don't override language detection based on Vietnamese keywords in English queries
   - Only override when query is actually in Vietnamese

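Recommendation 3 above (a lower similarity threshold for CRITICAL_FOUNDATION documents) could look roughly like the following. This is a sketch only: `RetrievedDoc`, `select_context`, the `is_foundation` flag, and the `foundation_threshold` default are hypothetical names and values; only the 0.01 default threshold is taken from the backend log quoted earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedDoc:
    text: str
    similarity: float
    is_foundation: bool  # hypothetical flag marking CRITICAL_FOUNDATION documents


def select_context(
    docs: List[RetrievedDoc],
    default_threshold: float = 0.01,    # value seen in the backend log
    foundation_threshold: float = 0.0,  # assumption: always admit foundation docs
    self_query: bool = False,
) -> List[RetrievedDoc]:
    """Keep docs above their threshold; foundation docs get a lower bar on self-queries."""
    kept = []
    for doc in docs:
        relaxed = self_query and doc.is_foundation
        threshold = foundation_threshold if relaxed else default_threshold
        if doc.similarity >= threshold:
            kept.append(doc)
    # Most similar documents first.
    return sorted(kept, key=lambda d: d.similarity, reverse=True)
```

Relaxing the threshold only for self-queries keeps foundation documents from crowding out topical context on unrelated questions.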

### Long-term Improvements

1. **Better Context Disambiguation**
   - When question is ambiguous, StillMe should ask for clarification
   - "Are you asking about me (StillMe) learning, or human learning?"

2. **Integrate Time Estimates Better**
   - Time estimates should be contextually relevant
   - If not relevant, don't add them
   - If added, explain why they're relevant

3. **Self-Knowledge Priority**
   - StillMe's knowledge about itself should have highest priority
   - Never force uncertainty for self-knowledge questions
   - Always retrieve foundational knowledge for self-queries

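The context disambiguation idea in item 1 can be illustrated with a very small rule: if a "how long will it take" question names no subject, return the clarification question instead of guessing. The helper name and both regular expressions below are assumptions for illustration; a production version would likely need a richer notion of "subject" than a keyword list.

```python
import re
from typing import Optional

# Hypothetical patterns: a time question, and words that make the subject explicit.
_TIME_QUESTION = re.compile(r"how long (will|would|does) it take")
_EXPLICIT_SUBJECT = re.compile(r"\b(you|stillme|i|me|we|a person|someone|humans?)\b")


def clarification_for(query: str) -> Optional[str]:
    """Return a clarifying question when the learner is unspecified, else None."""
    q = query.lower()
    if _TIME_QUESTION.search(q) and not _EXPLICIT_SUBJECT.search(q):
        return "Are you asking about me (StillMe) learning, or human learning?"
    return None
```

Applied to Test 1, the query "How long will it take to learn 100 articles?" names no subject and would trigger the clarification, whereas "How long will it take you to learn 100 articles?" would not.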

---

## Summary

**Overall Assessment: ⚠️ Needs Improvement**

**Strengths:**
- Transparency about sources
- Structured responses
- AI identity maintained
- Time estimation feature works (when appropriate)

**Weaknesses:**
- Context confusion (human vs StillMe)
- Contradictory responses (uncertainty + detailed explanation)
- Irrelevant time estimates
- RAG retrieval failing for self-knowledge
- Language detection causing validation failures

**Priority Fixes:**
1. Fix time estimation intent detection (too aggressive)
2. Fix forced uncertainty for self-knowledge queries
3. Improve RAG retrieval for foundational knowledge
4. Fix language detection override logic
