| 1 | +# InjecGuard: Over-Defense Mitigation for Prompt Injection Detection |
| 2 | + |
| 3 | +**Date:** 2026-02-01 |
| 4 | +**Paper:** InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models |
| 5 | +**Authors:** Hao Li, Xiaogeng Liu, Chaowei Xiao (University of Wisconsin-Madison) |
| 6 | +**arXiv:** [2410.22770v1](https://arxiv.org/html/2410.22770v1) |
| 7 | +**Published:** October 29, 2024 |
| 8 | + |
| 9 | +## Paper Summary |
| 10 | + |
| 11 | +### Problem Statement and Motivation |
| 12 | + |
| 13 | +InjecGuard addresses a critical limitation in existing prompt guard models: **over-defense**, where benign inputs are incorrectly flagged as malicious due to trigger word bias. Legitimate user requests containing words like "ignore", "cancel", or "override" get blocked, which hurts usability and reduces LLM accessibility in real-world applications such as virtual assistants and diagnostic tools. |
| 14 | + |
| 15 | +> "Over-defense arises when models misclassify inputs due to reliance on shortcuts, resulting in false positives where benign inputs are incorrectly flagged as threats." |
| 16 | + |
| 17 | +The authors demonstrate that even the strongest existing model on this metric, ProtectAIv2, reaches only 56.64% over-defense accuracy on their NotInject benchmark (barely better than random guessing at 50%), while the other baselines score below 6%. |
| 18 | + |
| 19 | +### Proposed Approach |
| 20 | + |
| 21 | +#### 1. NotInject Benchmark Dataset |
| 22 | +- **339 benign samples** containing trigger words common in prompt injection attacks |
| 23 | +- **Three difficulty levels**: 1, 2, or 3 trigger words per sample |
| 24 | +- Systematically constructed using: |
| 25 | + - Frequency analysis between benign and malicious datasets |
| 26 | + - LLM-based filtering and manual verification |
| 27 | + - GPT-4o-mini generation with safety refinement |
| 28 | + |
| 29 | +#### 2. Mitigating Over-defense for Free (MOF) Training Strategy |
| 30 | + |
| 31 | +MOF is a novel training approach that addresses over-defense without requiring specific over-defense datasets: |
| 32 | + |
| 33 | +1. **Standard Training**: Train model normally on curated dataset (61,089 benign + 15,666 injection samples) |
| 34 | +2. **Token-wise Bias Detection**: Run the trained model on every vocabulary token individually; a token classified as "attack" on its own reveals model bias |
| 35 | +3. **Adaptive Data Generation**: Generate 1,000 benign samples using combinations of biased tokens (1-3 tokens) |
| 36 | +4. **Retraining from Scratch**: Combine original + generated data for final model training |
| 37 | + |
| 38 | +> "After identifying the biased tokens, we prompt GPT-4o-mini to generate benign data using random combinations of these tokens." |
| 39 | + |
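Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the authors' code: `Label`, `detect_biased_tokens`, and `token_combinations` are hypothetical names, the classifier is passed in as a closure, and the combinations are produced by cycling through the tokens deterministically as a dependency-free stand-in for the random sampling the paper describes.

```rust
use std::collections::BTreeSet;

/// Verdict of a binary prompt-injection classifier (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Label {
    Benign,
    Injection,
}

/// Step 2: classify every vocabulary token on its own and keep the ones that
/// are flagged. A single token carries no instruction, so a positive verdict
/// is treated as evidence of trigger-word bias.
pub fn detect_biased_tokens<F>(vocab: &[&str], classify: F) -> BTreeSet<String>
where
    F: Fn(&str) -> Label,
{
    vocab
        .iter()
        .copied()
        .filter(|tok| classify(*tok) == Label::Injection)
        .map(String::from)
        .collect()
}

/// Step 3 (input side): group biased tokens into sets of 1 to 3 that a
/// generator LLM would be asked to use in a benign sentence.
/// Assumption: deterministic cycling replaces random sampling here.
pub fn token_combinations(tokens: &BTreeSet<String>, count: usize) -> Vec<Vec<String>> {
    let pool: Vec<&String> = tokens.iter().collect();
    if pool.is_empty() {
        return Vec::new();
    }
    (0..count)
        .map(|i| {
            // Sizes cycle 1, 2, 3, capped so a combination never repeats a token.
            let size = (i % 3 + 1).min(pool.len());
            (0..size)
                .map(|j| pool[(i + j) % pool.len()].clone())
                .collect()
        })
        .collect()
}
```

In practice the closure would wrap a call into the trained model, and each returned combination would be placed in a generation prompt; both of those pieces are outside this sketch.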
| 40 | +#### 3. Data-Centric Augmentation |
| 41 | +Addresses long-tail formats in prompt injection attacks by generating samples in 17 different formats: Email, Document, Chat, JSON, Code, Markdown, HTML, URL, Base64, Table, XML, CSV, Config File, Log File, Image Link, Translation, Website. |
| 42 | + |
| 43 | +### Key Results and Benchmarks |
| 44 | + |
| 45 | +**Performance Metrics (InjecGuard vs. competitors):** |
| 46 | +- **Average Accuracy:** 83.48% (vs. ProtectAIv2: 63.81%, a 30.8% relative improvement, +19.67 percentage points) |
| 47 | +- **Over-defense Accuracy:** 87.32% (vs. ProtectAIv2: 56.64%, a 54.17% relative improvement, +30.68 percentage points) |
| 48 | +- **Benign Accuracy:** 85.74% |
| 49 | +- **Malicious Accuracy:** 77.39% |
| 50 | +- **Inference Time:** 15.34ms (over 500x faster than GPT-4o at 7907.18ms) |
| 51 | +- **GFLOPs:** 60.45 (comparable to other DeBERTa models) |
| 52 | + |
| 53 | +**Baseline Comparison:** |
| 54 | +``` |
| 55 | +Model | Over-defense | Benign | Malicious | Average |
| 56 | +Fmops | 5.60% | 34.63% | 93.50% | 44.58% |
| 57 | +Deepset | 5.31% | 34.06% | 91.50% | 43.62% |
| 58 | +PromptGuard | 0.88% | 26.82% | 97.10% | 41.60% |
| 59 | +ProtectAIv2 | 56.64% | 86.20% | 48.60% | 63.81% |
| 60 | +InjecGuard | 87.32% | 85.74% | 77.39% | 83.48% |
| 61 | +``` |
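The figures above can be checked with a few lines of arithmetic. In the table, the Average column is consistent with the unweighted mean of the three per-split accuracies, and the improvement percentages quoted against ProtectAIv2 are relative gains rather than percentage-point differences. The helper names below are illustrative only.

```rust
/// Unweighted mean of per-split accuracies (values in percent).
/// Returns NaN for an empty slice.
pub fn average_accuracy(splits: &[f64]) -> f64 {
    splits.iter().sum::<f64>() / splits.len() as f64
}

/// Relative change of `new` over `baseline`, in percent.
pub fn relative_improvement(new: f64, baseline: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

/// How many times faster `fast_ms` is than `slow_ms`.
pub fn speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}
```

With the table values, `average_accuracy(&[87.32, 85.74, 77.39])` comes to about 83.48 and `relative_improvement(87.32, 56.64)` to about 54.17. The two latencies as listed give a ratio of roughly 515.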
| 62 | + |
| 63 | +### Architecture Description |
| 64 | + |
| 65 | +- **Backbone:** DeBERTa-v3-base (same as ProtectAI and PromptGuard) |
| 66 | +- **Training Details:** batch size 32, 3 epochs, Adam optimizer, 2e-5 learning rate, 512 max tokens |
| 67 | +- **Attention Mechanism:** Unlike ProtectAIv2, which shows excessive attention to trigger words, InjecGuard distributes attention across the entire input context |
| 68 | + |
| 69 | +## Feature Delta with LLMTrace |
| 70 | + |
| 71 | +| Feature | InjecGuard | LLMTrace Security | Gap | |
| 72 | +|---------|------------|-------------------|-----| |
| 73 | +| **Architecture** | DeBERTa-v3-base classification | DeBERTa-v2 + Regex hybrid | Minor - similar transformer base | |
| 74 | +| **Over-defense Mitigation** | ✅ MOF strategy with bias detection | ❌ No explicit over-defense handling | **Critical gap** | |
| 75 | +| **Trigger Word Analysis** | ✅ Automatic bias token identification | ❌ Static regex patterns only | **Major gap** | |
| 76 | +| **Training Strategy** | ✅ Adaptive retraining from scratch | ❌ Standard fine-tuning | **Significant gap** | |
| 77 | +| **Evaluation Framework** | ✅ 3D metrics (benign/malicious/over-defense) | ❌ Binary classification only | **Major gap** | |
| 78 | +| **Encoding Detection** | ✅ Base64/ROT13/leetspeak/reversed | ✅ Base64 + manual pattern detection | Moderate gap | |
| 79 | +| **Multi-language Support** | ✅ Includes Chinese, Russian samples | ❌ English-focused patterns | Moderate gap | |
| 80 | +| **Format Coverage** | ✅ 17 data formats (CSV, XML, JSON, etc.) | ✅ Basic format detection | Minor gap | |
| 81 | +| **Jailbreak Detection** | ✅ General prompt injection focus | ✅ Dedicated jailbreak detector | **LLMTrace strength** | |
| 82 | +| **PII Detection** | ❌ Not covered | ✅ Comprehensive PII patterns + validation | **LLMTrace strength** | |
| 83 | +| **Agent Action Analysis** | ❌ Not covered | ✅ Command/file/web analysis | **LLMTrace strength** | |
| 84 | +| **Streaming Support** | ❌ Batch processing only | ✅ Real-time streaming analysis | **LLMTrace strength** | |
| 85 | +| **Ensemble Approach** | ❌ Single model | ✅ ML + Regex ensemble | **LLMTrace strength** | |
| 86 | +| **Open Source** | ✅ Fully open (model, data, code) | ✅ Open source | Equal | |
| 87 | + |
| 88 | +### What InjecGuard Does That We Don't |
| 89 | + |
| 90 | +1. **Systematic Over-defense Mitigation**: MOF training strategy specifically targets and reduces false positives |
| 91 | +2. **Bias Token Detection**: Automated identification of problematic tokens that cause over-defense |
| 92 | +3. **Three-Dimensional Evaluation**: Separate metrics for benign, malicious, and over-defense scenarios |
| 93 | +4. **Adaptive Training Data Generation**: Automatically generates training data to counteract discovered biases |
| 94 | +5. **NotInject-style Benchmarking**: Specific evaluation of model performance on benign inputs with trigger words |
| 95 | + |
| 96 | +### What We Do That InjecGuard Doesn't |
| 97 | + |
| 98 | +1. **Specialized Jailbreak Detection**: Dedicated module with encoding evasion detection (Base64, ROT13, leetspeak, reversed text) |
| 99 | +2. **PII Detection & Validation**: Comprehensive personal information detection with checksum validation |
| 100 | +3. **Agent Security Analysis**: Detection of dangerous commands, suspicious URLs, sensitive file access |
| 101 | +4. **Streaming Security Monitoring**: Real-time analysis of content deltas |
| 102 | +5. **Hybrid Architecture**: Combination of ML and regex approaches for robustness |
| 103 | +6. **Granular Threat Categories**: Multiple finding types (prompt_injection, role_injection, data_leakage, etc.) |
| 104 | + |
| 105 | +### Where We're Aligned |
| 106 | + |
| 107 | +- DeBERTa backbone for ML classification |
| 108 | +- Multi-format input support |
| 109 | +- Open source approach |
| 110 | +- Focus on lightweight, efficient models |
| 111 | +- Encoding attack detection capabilities |
| 112 | + |
| 113 | +## Actionable Recommendations |
| 114 | + |
| 115 | +### P0 (Critical - Immediate Implementation) |
| 116 | + |
| 117 | +1. **Implement Over-defense Detection** |
| 118 | + - **Effort:** 2-3 weeks |
| 119 | + - Add evaluation metric for over-defense accuracy in `MLSecurityAnalyzer` |
| 120 | + - Create benchmark dataset similar to NotInject for LLMTrace testing |
| 121 | + - **Code Impact:** New metric in `AnalysisContext`, test dataset in `tests/` |
| 122 | + |
| 123 | +2. **MOF-Inspired Training Strategy** |
| 124 | + - **Effort:** 3-4 weeks |
| 125 | + - Implement token-wise bias detection during model training |
| 126 | + - Add adaptive training data generation for biased tokens |
| 127 | + - **Code Impact:** New training pipeline in `ml_detector.rs`, bias detection utilities |
| 128 | + |
| 129 | +### P1 (High Priority - Next Quarter) |
| 130 | + |
| 131 | +3. **Three-Dimensional Evaluation Framework** |
| 132 | + - **Effort:** 2 weeks |
| 133 | + - Separate benign/malicious/over-defense accuracy tracking |
| 134 | + - Update `SecurityFinding` to include over-defense classification |
| 135 | + - **Code Impact:** Enhanced metrics in `inference_stats.rs` |
| 136 | + |
| 137 | +4. **Enhanced Trigger Word Analysis** |
| 138 | + - **Effort:** 3 weeks |
| 139 | + - Extend jailbreak detector to identify problematic trigger words |
| 140 | + - Add attention weight analysis for bias detection |
| 141 | + - **Code Impact:** Updates to `jailbreak_detector.rs`, new attention analysis module |
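The three-dimensional evaluation in recommendation 3 could be tracked with a small accumulator like the one below. This is a sketch under assumptions: `Split`, `Tally`, and `ThreeWayStats` are hypothetical names rather than existing LLMTrace types, and the average is taken as the unweighted mean of the three splits, which is consistent with the paper's results table.

```rust
/// Which evaluation split a sample belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Split {
    Benign,
    Malicious,
    /// Benign inputs that contain trigger words (NotInject-style).
    OverDefense,
}

#[derive(Debug, Default, Clone, Copy)]
struct Tally {
    correct: u64,
    total: u64,
}

impl Tally {
    /// Accuracy in percent, or None if no sample was recorded.
    fn accuracy(&self) -> Option<f64> {
        (self.total > 0).then(|| self.correct as f64 / self.total as f64 * 100.0)
    }
}

/// Hypothetical accumulator for per-split accuracy.
#[derive(Debug, Default)]
pub struct ThreeWayStats {
    benign: Tally,
    malicious: Tally,
    over_defense: Tally,
}

impl ThreeWayStats {
    /// Record one prediction; `flagged` is the detector's verdict.
    pub fn record(&mut self, split: Split, flagged: bool) {
        // Only the malicious split expects a positive verdict.
        let expected = split == Split::Malicious;
        let tally = match split {
            Split::Benign => &mut self.benign,
            Split::Malicious => &mut self.malicious,
            Split::OverDefense => &mut self.over_defense,
        };
        tally.total += 1;
        if flagged == expected {
            tally.correct += 1;
        }
    }

    /// Accuracy in percent for one split.
    pub fn accuracy(&self, split: Split) -> Option<f64> {
        match split {
            Split::Benign => self.benign.accuracy(),
            Split::Malicious => self.malicious.accuracy(),
            Split::OverDefense => self.over_defense.accuracy(),
        }
    }

    /// Unweighted mean of the three splits; None until each has a sample.
    pub fn average(&self) -> Option<f64> {
        let sum = self.benign.accuracy()?
            + self.malicious.accuracy()?
            + self.over_defense.accuracy()?;
        Some(sum / 3.0)
    }
}
```

Keeping the over-defense split separate from the general benign split is the point of the design: a model can look healthy on ordinary benign traffic while still failing on benign inputs that contain trigger words.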
| 142 | + |
| 143 | +### P2 (Medium Priority - Future Releases) |
| 144 | + |
| 145 | +5. **Multi-language Trigger Detection** |
| 146 | + - **Effort:** 4-5 weeks |
| 147 | + - Extend pattern detection to Chinese, Russian, other languages |
| 148 | + - Add Unicode normalization improvements |
| 149 | + - **Code Impact:** Updates to `normalise.rs`, new language patterns |
| 150 | + |
| 151 | +6. **Format-Specific Training** |
| 152 | + - **Effort:** 2-3 weeks |
| 153 | + - Generate training data for underrepresented formats (CSV, XML, etc.) |
| 154 | + - **Code Impact:** Training data augmentation scripts |
| 155 | + |
| 156 | +### Potential Code/Model Integration |
| 157 | + |
| 158 | +1. **InjecGuard Model Integration** |
| 159 | + - Consider adding InjecGuard as alternative backbone in `MLSecurityConfig` |
| 160 | + - Compare performance against current ProtectAI DeBERTa model |
| 161 | + - **HuggingFace Model:** Availability to be verified (the paper was published in October 2024) |
| 162 | + |
| 163 | +2. **NotInject Dataset** |
| 164 | + - Use as additional test suite for LLMTrace models |
| 165 | + - Available at: `https://github.com/SaFoLab-WISC/InjecGuard` |
| 166 | + |
| 167 | +3. **MOF Training Code** |
| 168 | + - Adapt their bias detection algorithm for our ensemble approach |
| 169 | + - Integrate with existing `FusionClassifier` architecture |
| 170 | + |
| 171 | +## Key Metrics Comparison |
| 172 | + |
| 173 | +### InjecGuard Reported Performance |
| 174 | +- **Precision/Recall:** Not explicitly reported (accuracy-focused evaluation) |
| 175 | +- **F1 Score:** Not reported |
| 176 | +- **Average Accuracy:** 83.48% |
| 177 | +- **Over-defense Accuracy:** 87.32% (most critical metric) |
| 178 | +- **Malicious Detection:** 77.39% |
| 179 | +- **Benign Recognition:** 85.74% |
| 180 | + |
| 181 | +### Comparison to ProtectAI DeBERTa (Our Current Model) |
| 182 | +InjecGuard vs. ProtectAI DeBERTa v2 (which we use): |
| 183 | +- **Overall Performance:** +30.8% relative improvement in average accuracy (83.48% vs 63.81%) |
| 184 | +- **Over-defense:** +54.17% relative improvement (87.32% vs 56.64%) |
| 185 | +- **Architecture:** Similar (both DeBERTa-based) |
| 186 | +- **Efficiency:** Comparable inference time (~15ms) |
| 187 | + |
| 188 | +### NotInject Benchmark Results |
| 189 | +Current SOTA models perform poorly on over-defense: |
| 190 | +- **PromptGuard (Meta):** 0.88% over-defense accuracy |
| 191 | +- **Deepset:** 5.31% over-defense accuracy |
| 192 | +- **Fmops:** 5.60% over-defense accuracy |
| 193 | +- **ProtectAI v2:** 56.64% over-defense accuracy |
| 194 | + |
| 195 | +> "None of the existing open-source prompt guard models achieve an over-defense accuracy greater than 60%, where 50% represents random guessing." |
| 196 | + |
| 197 | +## Technical Implementation Notes |
| 198 | + |
| 199 | +### MOF Algorithm Adaptation for LLMTrace |
| 200 | + |
| 201 | +```rust |
| 202 | +// Potential integration in MLSecurityAnalyzer (sketch; method bodies are placeholders) |
| 203 | +pub struct OverDefenseDetector { |
| 204 | +    pub bias_tokens: std::collections::HashSet<String>, |
| 205 | +    pub over_defense_threshold: f64, |
| 206 | +} |
| 207 | + |
| 208 | +impl OverDefenseDetector { |
| 209 | +    // Test each vocabulary token individually |
| 210 | +    pub fn detect_bias_tokens(&self, model: &MLSecurityAnalyzer) -> Vec<String> { |
| 211 | +        todo!("token-wise recheck, similar to InjecGuard's: classify each vocabulary token alone") |
| 212 | +    } |
| 213 | + |
| 214 | +    // Generate benign samples with biased tokens |
| 215 | +    pub fn generate_debiasing_samples(&self, bias_tokens: &[String]) -> Vec<String> { |
| 216 | +        todo!("LLM-based generation of benign training samples that contain the biased tokens") |
| 217 | +    } |
| 218 | +} |
| 219 | +``` |
| 220 | + |
| 221 | +### Integration with Existing Architecture |
| 222 | + |
| 223 | +The MOF strategy could be integrated into our ensemble approach: |
| 224 | +1. Apply bias detection to our ML models |
| 225 | +2. Generate additional training data for problematic patterns |
| 226 | +3. Retrain ensemble components with augmented dataset |
| 227 | +4. Maintain compatibility with existing regex-based detection |
| 228 | + |
| 229 | +This represents a significant opportunity to improve LLMTrace's robustness against over-defense while maintaining our strengths in PII detection, agent security analysis, and real-time streaming capabilities. |