| 1 | +# StillMe Citation Policy - Formal Rules |
| 2 | + |
| 3 | +**Date**: 2025-01-27 |
| 4 | +**Status**: Official Policy |
| 5 | +**Version**: 1.0 |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Overview |
| 10 | + |
| 11 | +StillMe's citation policy ensures **transparency** about knowledge sources while maintaining **clarity** and **usability**. This document provides **formal rules** for when citations are required, optional, or not needed. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Core Principle |
| 16 | + |
| 17 | +**"Every factual claim is cited, but the citation format depends on the knowledge type."** |
| 18 | + |
| 19 | +This means: |
| 20 | +- **Factual claims** → Require citation `[1]`, `[2]` from RAG context |
| 21 | +- **General knowledge** → Optional citation `[general knowledge]` (well-established, pre-2023) |
| 22 | +- **Reasoning** → No citation needed (StillMe's logical inference) |
| 23 | +- **StillMe self-knowledge** → Uses `[foundational knowledge]` (StillMe's architecture) |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## 1. Factual Claims (REQUIRES CITATION) |
| 28 | + |
| 29 | +### Definition |
| 30 | + |
| 31 | +Any statement about the external world that can be verified or falsified. |
| 32 | + |
| 33 | +### Examples |
| 34 | + |
| 35 | +- **Dates**: "Bretton Woods Conference 1944" |
| 36 | +- **Events**: "World War II ended in 1945" |
| 37 | +- **People**: "John Maynard Keynes led the British delegation at the Bretton Woods Conference" |
| 38 | +- **Places**: "Paris is the capital of France" |
| 39 | +- **Scientific facts**: "Photosynthesis converts CO2 to glucose" |
| 40 | +- **Historical facts**: "The Vietnam War ended in 1975" |
| 41 | + |
| 42 | +### Rule |
| 43 | + |
| 44 | +**MUST cite `[1]`, `[2]` from RAG context.** |
| 45 | + |
| 46 | +If no RAG context is available: |
| 47 | +- Use `[general knowledge]` with explanation: "This is general knowledge from base LLM training data, not verified against StillMe's RAG knowledge base." |
| 48 | +- StillMe should express uncertainty: "I don't have this information in the RAG knowledge base, but according to general knowledge..." |
| 49 | + |
| 50 | +### Implementation |
| 51 | + |
| 52 | +- `CitationRequired` validator enforces this |
| 53 | +- `KnowledgeTypeClassifier` classifies claims as `FACTUAL_CLAIM` |
| 54 | +- Auto-patching adds citation if missing |
| 55 | + |
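To make the rule concrete, here is a rough sketch of the kind of check a validator such as `CitationRequired` might run on an answer classified as `FACTUAL_CLAIM`. The function name, its return values, and the assumption that `[n]` is a 1-based index into the retrieved documents are illustrative only; the real validator lives in `stillme_core/validation/citation.py` and may work differently.

```python
import re

# Matches numeric RAG citations such as [1] or [12].
NUMERIC_CITATION = re.compile(r"\[(\d+)\]")
GENERAL_MARKER = "[general knowledge]"


def check_factual_citation(answer: str, rag_doc_count: int) -> str:
    """Return 'ok', 'missing', or 'out_of_range' for a factual answer.

    Assumption: [n] refers to the n-th retrieved document, counted from 1.
    """
    if rag_doc_count == 0:
        # No RAG context: the policy falls back to the general-knowledge marker.
        return "ok" if GENERAL_MARKER in answer else "missing"

    indices = [int(n) for n in NUMERIC_CITATION.findall(answer)]
    if not indices:
        return "missing"
    if any(i < 1 or i > rag_doc_count for i in indices):
        # The answer cites a document that was never retrieved.
        return "out_of_range"
    return "ok"
```

Separating `missing` from `out_of_range` lets a caller patch the first case automatically while treating the second as a likely hallucinated reference.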
| 56 | +--- |
| 57 | + |
| 58 | +## 2. General Knowledge (CITATION OPTIONAL) |
| 59 | + |
| 60 | +### Definition |
| 61 | + |
| 62 | +Well-established facts that are: |
| 63 | +- In base LLM training data (pre-2023 cutoff) |
| 64 | +- Not disputed in academic literature |
| 65 | +- Not time-sensitive |
| 66 | + |
| 67 | +### Examples |
| 68 | + |
| 69 | +- **Scientific facts**: "Water is H2O" |
| 70 | +- **Mathematical facts**: "2+2=4" |
| 71 | +- **Historical facts**: "Shakespeare wrote Hamlet" |
| 72 | +- **Astronomical facts**: "Earth orbits the sun" |
| 73 | +- **Physical laws**: "Gravity exists" |
| 74 | + |
| 75 | +### Rule |
| 76 | + |
| 77 | +**Can use `[general knowledge]` without RAG citation**, but must acknowledge: |
| 78 | + |
| 79 | +"This is general knowledge from base LLM training data, not verified against StillMe's RAG knowledge base." |
| 80 | + |
| 81 | +### When to Use |
| 82 | + |
| 83 | +- No RAG context available |
| 84 | +- Claim is well-established (not disputed) |
| 85 | +- Claim is not time-sensitive |
| 86 | +- Claim is common knowledge (not specialized) |
| 87 | + |
| 88 | +### Implementation |
| 89 | + |
| 90 | +- `KnowledgeTypeClassifier` classifies as `GENERAL_KNOWLEDGE` |
| 91 | +- `CitationRequired` validator allows `[general knowledge]` for this type |
| 92 | +- StillMe should still express uncertainty when the claim has not been verified against RAG context |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +## 3. Reasoning (NO CITATION NEEDED) |
| 97 | + |
| 98 | +### Definition |
| 99 | + |
| 100 | +Logical inference, philosophical analysis, mathematical proofs, or StillMe's own reasoning. |
| 101 | + |
| 102 | +### Examples |
| 103 | + |
| 104 | +- **Logical inference**: "If A then B; A holds, therefore B" |
| 105 | +- **Philosophical analysis**: "From a utilitarian perspective, the action is justified because..." |
| 106 | +- **Mathematical proof**: "By induction, we can prove that..." |
| 107 | +- **StillMe's reasoning**: "Based on the evidence provided, StillMe concludes that..." |
| 108 | + |
| 109 | +### Rule |
| 110 | + |
| 111 | +**No citation needed**: this is StillMe's reasoning, not a factual claim. |
| 112 | + |
| 113 | +### When to Use |
| 114 | + |
| 115 | +- Answer involves logical inference |
| 116 | +- Answer involves philosophical analysis |
| 117 | +- Answer involves mathematical reasoning |
| 118 | +- Answer is StillMe's own conclusion based on provided evidence |
| 119 | + |
| 120 | +### Implementation |
| 121 | + |
| 122 | +- `KnowledgeTypeClassifier` classifies as `REASONING` |
| 123 | +- `CitationRequired` validator skips citation requirement for this type |
| 124 | +- StillMe can reason without citations |
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## 4. StillMe Self-Knowledge (FOUNDATIONAL KNOWLEDGE) |
| 129 | + |
| 130 | +### Definition |
| 131 | + |
| 132 | +Information about StillMe itself (architecture, capabilities, limitations, learning process). |
| 133 | + |
| 134 | +### Examples |
| 135 | + |
| 136 | +- **Architecture**: "StillMe uses RAG with ChromaDB" |
| 137 | +- **Capabilities**: "StillMe learns every 4 hours" |
| 138 | +- **Limitations**: "StillMe cannot answer questions about events less than 4 hours old" |
| 139 | +- **Learning process**: "StillMe fetches content from RSS feeds, arXiv, CrossRef, Wikipedia" |
| 140 | + |
| 141 | +### Rule |
| 142 | + |
| 143 | +**Uses `[foundational knowledge]`** - StillMe's self-knowledge, not external sources. |
| 144 | + |
| 145 | +### When to Use |
| 146 | + |
| 147 | +- Question is about StillMe itself |
| 148 | +- Answer describes StillMe's architecture, capabilities, or limitations |
| 149 | +- Answer explains StillMe's learning process or validation chain |
| 150 | + |
| 151 | +### Implementation |
| 152 | + |
| 153 | +- `KnowledgeTypeClassifier` classifies as `STILLME_SELF_KNOWLEDGE` |
| 154 | +- `CitationRequired` validator uses `[foundational knowledge]` for this type |
| 155 | +- StillMe should prioritize foundational knowledge retrieved from RAG context over base LLM knowledge |
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +## Classification Algorithm |
| 160 | + |
| 161 | +The `KnowledgeTypeClassifier` uses this decision tree: |
| 162 | + |
| 163 | +``` |
| 164 | +1. Is claim about StillMe? |
| 165 | + → YES: STILLME_SELF_KNOWLEDGE |
| 166 | + → NO: Continue |
| 167 | + |
| 168 | +2. Does claim have RAG context? |
| 169 | + → YES: FACTUAL_CLAIM (requires citation) |
| 170 | + → NO: Continue |
| 171 | + |
| 172 | +3. Is claim logical inference/reasoning? |
| 173 | + → YES: REASONING (no citation) |
| 174 | + → NO: Continue |
| 175 | + |
| 176 | +4. Is claim well-established fact (common knowledge, pre-2023)? |
| 177 | + → YES: GENERAL_KNOWLEDGE (citation optional) |
| 178 | + → NO: Continue |
| 179 | + |
| 180 | +5. Does claim have factual indicators (dates, events, people, places)? |
| 181 | + → YES: FACTUAL_CLAIM (requires citation) |
| 182 | + → NO: FACTUAL_CLAIM (default, requires citation) |
| 183 | +``` |
| 184 | + |
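The decision tree above translates directly into an ordered chain of checks. The sketch below assumes the answers to the five questions have already been computed by detectors that are not shown here; `ClaimSignals`, its field names, and `classify` are hypothetical and are not taken from `stillme_core/knowledge/type_classifier.py`.

```python
from dataclasses import dataclass
from enum import Enum


class KnowledgeType(Enum):
    FACTUAL_CLAIM = "factual_claim"
    GENERAL_KNOWLEDGE = "general_knowledge"
    REASONING = "reasoning"
    STILLME_SELF_KNOWLEDGE = "stillme_self_knowledge"


@dataclass
class ClaimSignals:
    """Hypothetical pre-computed answers to the tree's questions."""
    about_stillme: bool = False
    has_rag_context: bool = False
    is_reasoning: bool = False
    is_well_established: bool = False
    has_factual_indicators: bool = False


def classify(signals: ClaimSignals) -> KnowledgeType:
    """Walk the decision tree in order; the first matching step wins."""
    if signals.about_stillme:            # step 1
        return KnowledgeType.STILLME_SELF_KNOWLEDGE
    if signals.has_rag_context:          # step 2
        return KnowledgeType.FACTUAL_CLAIM
    if signals.is_reasoning:             # step 3
        return KnowledgeType.REASONING
    if signals.is_well_established:      # step 4
        return KnowledgeType.GENERAL_KNOWLEDGE
    # Step 5: both branches resolve to FACTUAL_CLAIM, so has_factual_indicators
    # does not change the result; FACTUAL_CLAIM is the conservative default.
    return KnowledgeType.FACTUAL_CLAIM
```

Because the checks are ordered, a claim that has RAG context is classified as `FACTUAL_CLAIM` even if it also looks like reasoning, and an unrecognized claim defaults to the strictest citation requirement.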
| 185 | +--- |
| 186 | + |
| 187 | +## Citation Formats |
| 188 | + |
| 189 | +### RAG-Grounded Citations |
| 190 | + |
| 191 | +- **Format**: `[1]`, `[2]`, `[3]` |
| 192 | +- **Meaning**: Information from StillMe's RAG knowledge base |
| 193 | +- **Verification**: Validated against retrieved context documents |
| 194 | + |
| 195 | +### General Knowledge Citations |
| 196 | + |
| 197 | +- **Format**: `[general knowledge]` |
| 198 | +- **Meaning**: Well-established fact from base LLM training data (pre-2023) |
| 199 | +- **Verification**: Not verified against StillMe's RAG knowledge base |
| 200 | + |
| 201 | +### Foundational Knowledge Citations |
| 202 | + |
| 203 | +- **Format**: `[foundational knowledge]` |
| 204 | +- **Meaning**: Information about StillMe itself |
| 205 | +- **Verification**: From StillMe's foundational knowledge documents |
| 206 | + |
| 207 | +### No Citation |
| 208 | + |
| 209 | +- **Format**: (no citation) |
| 210 | +- **Meaning**: StillMe's reasoning, logical inference, or philosophical analysis |
| 211 | +- **Verification**: Not applicable (reasoning, not factual claim) |
| 212 | + |
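Since the three marker formats are fixed strings or bracketed integers, a single regular expression is enough to pull them out of an answer. The following is an illustrative sketch; `extract_citations` and the short kind labels (`rag`, `general`, `foundational`) are assumptions made for this example, not names from the StillMe codebase.

```python
import re
from typing import List, Tuple

# One pattern for all three formats: [1], [general knowledge], [foundational knowledge].
CITATION = re.compile(r"\[(\d+|general knowledge|foundational knowledge)\]")


def extract_citations(text: str) -> List[Tuple[str, str]]:
    """Return (kind, marker) pairs in order of appearance."""
    found = []
    for match in CITATION.finditer(text):
        label = match.group(1)
        if label.isdigit():
            kind = "rag"
        elif label == "general knowledge":
            kind = "general"
        else:
            kind = "foundational"
        found.append((kind, match.group(0)))
    return found
```

Other bracketed text, such as a `[topic]` placeholder, does not match the pattern and is ignored.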
| 213 | +--- |
| 214 | + |
| 215 | +## Edge Cases |
| 216 | + |
| 217 | +### 1. Mixed Claims |
| 218 | + |
| 219 | +**Scenario**: Answer contains both factual claims and reasoning. |
| 220 | + |
| 221 | +**Rule**: Cite factual claims, but reasoning doesn't need citation. |
| 222 | + |
| 223 | +**Example**: |
| 224 | +> "Bretton Woods Conference 1944 [1] established the IMF. From an economic perspective, this was significant because..." |
| 225 | + |
| 226 | +- "Bretton Woods Conference 1944" → Factual claim → `[1]` |
| 227 | +- "From an economic perspective..." → Reasoning → No citation |
| 228 | + |
| 229 | +### 2. Factual Claims Without RAG Context |
| 230 | + |
| 231 | +**Scenario**: User asks about a factual topic, but StillMe has no RAG context. |
| 232 | + |
| 233 | +**Rule**: Use `[general knowledge]` with uncertainty expression. |
| 234 | + |
| 235 | +**Example**: |
| 236 | +> "I don't have information about [topic] in the RAG knowledge base, but according to general knowledge, [answer] [general knowledge]" |
| 237 | + |
| 238 | +### 3. StillMe Questions with RAG Context |
| 239 | + |
| 240 | +**Scenario**: User asks about StillMe, and RAG context contains foundational knowledge. |
| 241 | + |
| 242 | +**Rule**: Use `[foundational knowledge]` and prioritize RAG context over base LLM knowledge. |
| 243 | + |
| 244 | +**Example**: |
| 245 | +> "StillMe uses RAG with ChromaDB [foundational knowledge]. According to StillMe's foundational knowledge documents [1], StillMe learns every 4 hours." |
| 246 | + |
| 247 | +--- |
| 248 | + |
| 249 | +## Validation |
| 250 | + |
| 251 | +### Validators |
| 252 | + |
| 253 | +1. **`CitationRequired`**: Enforces citation requirement based on knowledge type |
| 254 | +2. **`KnowledgeTypeClassifier`**: Classifies claims into knowledge types |
| 255 | +3. **`CitationRelevance`**: Validates that citations are actually relevant |
| 256 | + |
| 257 | +### Auto-Patching |
| 258 | + |
| 259 | +- If citation is missing for `FACTUAL_CLAIM`, `CitationRequired` auto-adds citation |
| 260 | +- If citation format is wrong, validators can patch it |
| 261 | +- If knowledge type is misclassified, `KnowledgeTypeClassifier` can correct it |
| 262 | + |
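The policy does not say which citation auto-patching adds, so the sketch below makes its own choice and should be read as one possible behavior rather than what `CitationRequired` does. It assumes that a factual answer with RAG context but no citation gets `[1]` (the top-ranked document) appended, and that an answer without RAG context gets `[general knowledge]` plus the acknowledgement sentence quoted earlier. The function name `auto_patch` is hypothetical.

```python
import re

NUMERIC_CITATION = re.compile(r"\[\d+\]")
GENERAL_MARKER = "[general knowledge]"
DISCLAIMER = ("This is general knowledge from base LLM training data, "
              "not verified against StillMe's RAG knowledge base.")


def auto_patch(answer: str, rag_doc_count: int) -> str:
    """Append a citation to a factual answer that lacks one."""
    if NUMERIC_CITATION.search(answer) or GENERAL_MARKER in answer:
        return answer  # already cited, nothing to patch

    text = answer.rstrip()
    if rag_doc_count > 0:
        # Assumption: cite the top-ranked retrieved document.
        return f"{text} [1]"
    # No RAG context: mark the claim as unverified general knowledge.
    return f"{text} {GENERAL_MARKER} {DISCLAIMER}"
```

A patch like this only guarantees that a marker is present; whether the cited document actually supports the claim is a separate question, which is the role the policy assigns to `CitationRelevance`.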
| 263 | +--- |
| 264 | + |
| 265 | +## Transparency |
| 266 | + |
| 267 | +StillMe is transparent about: |
| 268 | +- **Knowledge source**: RAG-grounded vs general knowledge vs foundational knowledge |
| 269 | +- **Verification status**: Verified against RAG vs unverified (general knowledge) |
| 270 | +- **Reasoning**: When StillMe is reasoning vs stating facts |
| 271 | + |
| 272 | +--- |
| 273 | + |
| 274 | +## Revision History |
| 275 | + |
| 276 | +- **2025-01-27**: Initial formal policy document |
| 277 | +- Created to address ambiguity in citation policy |
| 278 | +- Based on architectural review findings |
| 279 | + |
| 280 | +--- |
| 281 | + |
| 282 | +## References |
| 283 | + |
| 284 | +- `stillme_core/knowledge/type_classifier.py`: Implementation |
| 285 | +- `stillme_core/validation/citation.py`: Citation enforcement |
| 286 | +- `docs/ANALYSIS_GENERAL_KNOWLEDGE_CITATION.md`: Analysis of general knowledge citations |
| 287 | + |