| 1 | +# StillMe Case Study Template (Before vs After Validation) |
| 2 | + |
| 3 | +Use this template to produce audit-grade evidence for StillMe Lite performance. |
| 4 | + |
| 5 | +Goal: show measurable improvement after enabling verification behavior. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 1) Case Study Metadata |
| 10 | + |
| 11 | +- Case ID: |
| 12 | +- Date: |
| 13 | +- Owner: |
| 14 | +- Domain (research/support/policy/other): |
| 15 | +- Language: |
| 16 | +- LLM model: |
| 17 | +- Retrieval setup: |
| 18 | +- StillMe Lite mode (`monitor|warn|enforce`): |
| 19 | +- Policy file version: |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## 2) Test Scope |
| 24 | + |
| 25 | +- Total prompts: |
| 26 | +- Prompt set source: |
| 27 | +- Prompt categories: |
| 28 | + - Factual query |
| 29 | + - Source-required query |
| 30 | + - Ambiguous query |
| 31 | + - Adversarial/prompt-injection |
| 32 | + |
| 33 | +Rules: |
| 34 | +- Keep the prompt set fixed between before/after runs. |
| 35 | +- Use the same model and retrieval settings for a fair comparison. |
| 36 | +- Store raw runs in JSONL for reproducibility. |
| 37 | + |
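A minimal sketch of the JSONL rule above, assuming one JSON object per prompt. The field names (`prompt_id`, `category`, `output`) and the helper names are illustrative assumptions, not part of any StillMe schema.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def same_prompt_set(before, after):
    """True when both runs cover exactly the same prompt IDs.

    Assumes each record carries a hypothetical `prompt_id` field.
    """
    return {r["prompt_id"] for r in before} == {r["prompt_id"] for r in after}
```

Running `same_prompt_set` on the before and after files is a cheap way to confirm the prompt set stayed fixed before computing any metric.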
| 38 | +--- |
| 39 | + |
| 40 | +## 3) Experimental Setup |
| 41 | + |
| 42 | +### 3.1 Before (Baseline) |
| 43 | +- Validation disabled or bypassed. |
| 44 | +- Record model outputs and available context. |
| 45 | + |
| 46 | +### 3.2 After (StillMe Enabled) |
| 47 | +- Validation enabled with selected mode. |
| 48 | +- Record decision, reason codes, and safe response. |
| 49 | + |
| 50 | +### 3.3 Data Files |
| 51 | +- `data/before_<case_id>.jsonl` |
| 52 | +- `data/after_<case_id>.jsonl` |
| 53 | +- `reports/<case_id>_summary.md` |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +## 4) Mandatory Metrics |
| 58 | + |
| 59 | +### 4.1 Hallucination Escape Rate |
| 60 | + |
| 61 | +Definition: |
| 62 | +`unsupported_factual_answers_passed / total_high_risk_factual_prompts` |
| 63 | + |
| 64 | +Interpretation: |
| 65 | +- Lower is better. |
| 66 | +- This is the primary risk metric. |
| 67 | + |
| 68 | +### 4.2 Refusal Precision |
| 69 | + |
| 70 | +Definition: |
| 71 | +`correct_refusals / total_refusals` |
| 72 | + |
| 73 | +Interpretation: |
| 74 | +- Higher is better. |
| 75 | +- Measures whether refusals are justified instead of over-blocking. |
| 76 | + |
| 77 | +### 4.3 Source Coverage |
| 78 | + |
| 79 | +Definition: |
| 80 | +`factual_answers_with_valid_citation / total_factual_answers` |
| 81 | + |
| 82 | +Interpretation: |
| 83 | +- Higher is better. |
| 84 | +- Measures evidence grounding quality. |
| 85 | + |
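The three definitions above can be sketched as plain ratios over labeled counts. The dictionary keys reuse the names from the formulas; the function names and the choice to return `None` for an empty denominator are assumptions made for this sketch.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def core_metrics(counts):
    """Compute the three mandatory metrics from a dict of labeled counts."""
    return {
        "hallucination_escape_rate": safe_ratio(
            counts["unsupported_factual_answers_passed"],
            counts["total_high_risk_factual_prompts"],
        ),
        "refusal_precision": safe_ratio(
            counts["correct_refusals"],
            counts["total_refusals"],
        ),
        "source_coverage": safe_ratio(
            counts["factual_answers_with_valid_citation"],
            counts["total_factual_answers"],
        ),
    }


# Illustrative counts only, not measured results.
example_counts = {
    "unsupported_factual_answers_passed": 6,
    "total_high_risk_factual_prompts": 40,
    "correct_refusals": 18,
    "total_refusals": 20,
    "factual_answers_with_valid_citation": 45,
    "total_factual_answers": 50,
}
print(core_metrics(example_counts))
```

Returning `None` rather than `0.0` keeps an undefined metric (for example, refusal precision on a baseline run with no refusals) visibly different from a genuinely bad score.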
| 86 | +--- |
| 87 | + |
| 88 | +## 5) Optional Supporting Metrics |
| 89 | + |
| 90 | +- False refusal rate |
| 91 | +- Clarification usefulness rate |
| 92 | +- Validator confidence band distribution |
| 93 | +- Decision latency delta (before vs after) |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## 6) Output Tables |
| 98 | + |
| 99 | +### 6.1 Core Metrics Table |
| 100 | + |
| 101 | +| Metric | Before | After | Delta | Target | |
| 102 | +|---|---:|---:|---:|---:| |
| 103 | +| Hallucination escape rate | | | | >= 50% reduction | |
| 104 | +| Refusal precision | | | | >= 0.85 | |
| 105 | +| Source coverage | | | | >= 0.80 | |
| 106 | + |
| 107 | +### 6.2 Decision Distribution |
| 108 | + |
| 109 | +| Decision | Before | After | |
| 110 | +|---|---:|---:| |
| 111 | +| answer | | | |
| 112 | +| refuse | | | |
| 113 | +| ask_clarify | | | |
| 114 | + |
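One possible way to fill the decision distribution table from the raw runs, assuming each record has a hypothetical `decision` field holding one of the three labels in the table. Baseline records would need that label assigned during review, since validation is disabled in that run.

```python
from collections import Counter

DECISIONS = ("answer", "refuse", "ask_clarify")


def decision_distribution(records):
    """Count records per decision label, reporting 0 for absent labels."""
    counts = Counter(r["decision"] for r in records)
    return {d: counts.get(d, 0) for d in DECISIONS}


def distribution_rows(before, after):
    """Render the before/after counts as a Markdown table."""
    b = decision_distribution(before)
    a = decision_distribution(after)
    lines = ["| Decision | Before | After |", "|---|---:|---:|"]
    lines += [f"| {d} | {b[d]} | {a[d]} |" for d in DECISIONS]
    return "\n".join(lines)
```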
| 115 | +--- |
| 116 | + |
| 117 | +## 7) Error Analysis (Required) |
| 118 | + |
| 119 | +Provide 5-10 representative examples, including at least one of each type below where applicable: |
| 120 | + |
| 121 | +1. **Escaped hallucination (before)** |
| 122 | + - Prompt: |
| 123 | + - Baseline answer: |
| 124 | + - Why unsafe: |
| 125 | + |
| 126 | +2. **Correct refusal (after)** |
| 127 | + - Prompt: |
| 128 | + - StillMe decision/reason: |
| 129 | + - Why correct: |
| 130 | + |
| 131 | +3. **False refusal (after, if any)** |
| 132 | + - Prompt: |
| 133 | + - Reason code: |
| 134 | + - Fix candidate: |
| 135 | + |
| 136 | +4. **Citation quality improvement example** |
| 137 | + - Prompt: |
| 138 | + - Before citation state: |
| 139 | + - After citation state: |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## 8) Go / No-Go Rule |
| 144 | + |
| 145 | +Recommended internal gate for moving from `monitor` to `warn`: |
| 146 | +- Hallucination escape rate reduced by at least 50% vs before |
| 147 | +- Refusal precision >= 0.85 |
| 148 | +- Source coverage >= 0.80 on factual subset |
| 149 | + |
| 150 | +Recommended gate for moving from `warn` to `enforce`: |
| 151 | +- Metrics stable for 2 consecutive runs |
| 152 | +- No critical incident in sampled production traffic |
| 153 | + |
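The `monitor` to `warn` gate can be sketched as a simple check over the core metrics. This reads "reduced by at least 50%" as `after <= 0.5 * before`, which is an interpretation; the function name and the metric keys are assumptions for illustration. The `warn` to `enforce` gate is not sketched because the template does not define "stable" numerically.

```python
def monitor_to_warn_gate(before, after):
    """Evaluate the recommended monitor -> warn gate.

    `before` and `after` are dicts with the keys
    `hallucination_escape_rate`, `refusal_precision`, `source_coverage`.
    Returns (passed, per_check_results).
    """
    checks = {
        # Assumption: a 50% reduction means the after rate is at most half
        # of the before rate. A baseline of 0 passes only if after is also 0.
        "escape_rate_halved": (
            after["hallucination_escape_rate"]
            <= 0.5 * before["hallucination_escape_rate"]
        ),
        "refusal_precision_ok": after["refusal_precision"] >= 0.85,
        "source_coverage_ok": after["source_coverage"] >= 0.80,
    }
    return all(checks.values()), checks
```

Returning the per-check results alongside the overall verdict makes it easier to record which condition blocked a promotion.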
| 154 | +--- |
| 155 | + |
| 156 | +## 9) Risk and Limitation Notes |
| 157 | + |
| 158 | +- Known blind spots: |
| 159 | +- Data quality issues: |
| 160 | +- Retrieval mismatch observations: |
| 161 | +- Policy thresholds that need tuning: |
| 162 | + |
| 163 | +Always include limitations. Do not claim universal safety guarantees. |
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## 10) Short Public Summary (Optional) |
| 168 | + |
| 169 | +Use this 4-line format for release notes: |
| 170 | + |
| 171 | +1. What was tested (scope and sample size) |
| 172 | +2. What improved (with numbers) |
| 173 | +3. What did not improve yet |
| 174 | +4. Next action for the next iteration |
| 175 | + |