Commit 4f5fddb

docs: add InjecGuard research analysis
- Comprehensive analysis of InjecGuard paper (arXiv: 2410.22770)
- Feature delta comparison with LLMTrace security implementation
- Actionable recommendations for over-defense mitigation
- MOF training strategy analysis and integration opportunities
- Performance metrics comparison with existing models
1 parent 15ca4a1 commit 4f5fddb

1 file changed

Lines changed: 229 additions & 0 deletions

@@ -0,0 +1,229 @@
# InjecGuard: Over-Defense Mitigation for Prompt Injection Detection

**Date:** 2026-02-01
**Paper:** InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models
**Authors:** Hao Li, Xiaogeng Liu, Chaowei Xiao (University of Wisconsin-Madison)
**arXiv:** [2410.22770v1](https://arxiv.org/html/2410.22770v1)
**Published:** October 29, 2024

## Paper Summary

### Problem Statement and Motivation

InjecGuard addresses a critical limitation in existing prompt guard models: **over-defense**, where benign inputs are incorrectly flagged as malicious due to trigger word bias. This severely impacts usability: legitimate user requests containing words like "ignore", "cancel", or "override" get blocked, reducing LLM accessibility in real-world applications such as virtual assistants and diagnostic tools.

> "Over-defense arises when models misclassify inputs due to reliance on shortcuts, resulting in false positives where benign inputs are incorrectly flagged as threats."

The authors show that even the strongest existing model on this metric, ProtectAIv2, reaches only 56.64% over-defense accuracy on their NotInject benchmark (barely better than random guessing at 50%), while the other open-source guards evaluated score below 6%.

### Proposed Approach

#### 1. NotInject Benchmark Dataset
- **339 benign samples** containing trigger words common in prompt injection attacks
- **Three difficulty levels**: 1, 2, or 3 trigger words per sample
- Systematically constructed using:
  - Frequency analysis between benign and malicious datasets
  - LLM-based filtering and manual verification
  - GPT-4o-mini generation with safety refinement
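
The frequency-analysis step can be pictured as comparing how often each word shows up in injection samples versus benign ones. The sketch below is illustrative only: `trigger_word_candidates`, the alphanumeric tokenization, the add-one smoothing, and the `min_ratio` cutoff are all assumptions made for this note, not the paper's exact procedure.

```rust
use std::collections::HashMap;

/// Count, per lowercase word, how many samples contain it (document frequency).
fn doc_freq(samples: &[&str]) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for sample in samples {
        let mut words: Vec<String> = sample
            .split(|c: char| !c.is_alphanumeric())
            .filter(|w| !w.is_empty())
            .map(|w| w.to_lowercase())
            .collect();
        // Each word counts once per sample.
        words.sort();
        words.dedup();
        for w in words {
            *counts.entry(w).or_insert(0) += 1;
        }
    }
    counts
}

/// Hypothetical helper: words whose smoothed frequency in malicious samples is
/// at least `min_ratio` times their smoothed frequency in benign samples.
pub fn trigger_word_candidates(
    benign: &[&str],
    malicious: &[&str],
    min_ratio: f64,
) -> Vec<String> {
    let benign_df = doc_freq(benign);
    let malicious_df = doc_freq(malicious);
    let mut out = Vec::new();
    for (word, &mal_count) in &malicious_df {
        let ben_count = benign_df.get(word).copied().unwrap_or(0);
        // Add-one smoothing keeps the ratio finite for words unseen in benign data.
        let mal_rate = (mal_count as f64 + 1.0) / (malicious.len() as f64 + 1.0);
        let ben_rate = (ben_count as f64 + 1.0) / (benign.len() as f64 + 1.0);
        if mal_rate / ben_rate >= min_ratio {
            out.push(word.clone());
        }
    }
    out.sort();
    out
}
```

Words surfaced this way would still need the filtering and manual verification steps listed above before being treated as trigger words.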

#### 2. Mitigating Over-defense for Free (MOF) Training Strategy

MOF is a training approach that addresses over-defense without requiring a dedicated over-defense dataset:

1. **Standard Training**: Train the model normally on the curated dataset (61,089 benign + 15,666 injection samples)
2. **Token-wise Bias Detection**: Classify every vocabulary token individually; tokens predicted as "attack" reveal model bias
3. **Adaptive Data Generation**: Generate 1,000 benign samples using combinations of 1-3 biased tokens
4. **Retraining from Scratch**: Combine the original and generated data to train the final model
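
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the `is_attack` closure stands in for the trained classifier, both function names are hypothetical, and combinations are enumerated deterministically here rather than sampled at random as the paper describes. The prompt wording is likewise invented for the example.

```rust
/// Step 2 (sketch): classify every vocabulary token on its own and keep the
/// ones the model flags as an attack. `is_attack` stands in for the model.
pub fn find_biased_tokens<F>(vocab: &[&str], is_attack: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    vocab
        .iter()
        .copied()
        .filter(|token| is_attack(*token))
        .map(String::from)
        .collect()
}

/// Step 3 (sketch): build up to `limit` generation prompts from combinations
/// of 1 to 3 biased tokens. Enumeration is deterministic, not random.
pub fn debiasing_prompts(biased: &[String], limit: usize) -> Vec<String> {
    let mut combos: Vec<Vec<&str>> = Vec::new();
    let n = biased.len();
    for i in 0..n {
        combos.push(vec![biased[i].as_str()]);
        for j in (i + 1)..n {
            combos.push(vec![biased[i].as_str(), biased[j].as_str()]);
            for k in (j + 1)..n {
                combos.push(vec![
                    biased[i].as_str(),
                    biased[j].as_str(),
                    biased[k].as_str(),
                ]);
            }
        }
    }
    combos
        .into_iter()
        .take(limit)
        .map(|c| {
            format!(
                "Write a harmless sentence that uses the words: {}",
                c.join(", ")
            )
        })
        .collect()
}
```

The prompts would then be sent to a text generator, and the resulting benign sentences added to the training set before retraining.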

> "After identifying the biased tokens, we prompt GPT-4o-mini to generate benign data using random combinations of these tokens."

#### 3. Data-Centric Augmentation
Addresses long-tail formats in prompt injection attacks by generating samples in 17 different formats: Email, Document, Chat, JSON, Code, Markdown, HTML, URL, Base64, Table, XML, CSV, Config File, Log File, Image Link, Translation, Website.
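
To make the idea concrete, the sketch below embeds one piece of text in four of the formats named above. The `Format` enum, the `wrap_in_format` and `augment` functions, and the templates themselves are hypothetical; the paper generates its samples with an LLM rather than with fixed templates.

```rust
/// A few of the 17 formats; the templates below are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Json,
    Markdown,
    Html,
    Csv,
}

/// Embed `text` in a minimal document of the given format.
pub fn wrap_in_format(text: &str, format: Format) -> String {
    match format {
        Format::Json => {
            // Escape backslashes and quotes; enough for simple single-line inputs.
            let escaped = text.replace('\\', "\\\\").replace('"', "\\\"");
            format!("{{\"message\": \"{}\"}}", escaped)
        }
        Format::Markdown => format!("# Notes\n\n- {}\n", text),
        Format::Html => {
            let escaped = text
                .replace('&', "&amp;")
                .replace('<', "&lt;")
                .replace('>', "&gt;");
            format!("<html><body><p>{}</p></body></html>", escaped)
        }
        Format::Csv => {
            // CSV quoting: wrap the field in quotes and double embedded quotes.
            format!("id,comment\n1,\"{}\"\n", text.replace('"', "\"\""))
        }
    }
}

/// Produce one variant of `text` per supported format.
pub fn augment(text: &str) -> Vec<(Format, String)> {
    [Format::Json, Format::Markdown, Format::Html, Format::Csv]
        .iter()
        .map(|&f| (f, wrap_in_format(text, f)))
        .collect()
}
```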

### Key Results and Benchmarks

**Performance Metrics (InjecGuard vs. competitors):**
- **Average Accuracy:** 83.48% (vs. ProtectAIv2: 63.81%; +19.67 percentage points, a 30.8% relative improvement)
- **Over-defense Accuracy:** 87.32% (vs. ProtectAIv2: 56.64%; +30.68 percentage points, a 54.17% relative improvement)
- **Benign Accuracy:** 85.74%
- **Malicious Accuracy:** 77.39%
- **Inference Time:** 15.34ms (roughly 515x faster than GPT-4o at 7907.18ms)
- **GFLOPs:** 60.45 (comparable to other DeBERTa models)

**Baseline Comparison:**
```
Model        | Over-defense | Benign | Malicious | Average
Fmops        | 5.60%        | 34.63% | 93.50%    | 44.58%
Deepset      | 5.31%        | 34.06% | 91.50%    | 43.62%
PromptGuard  | 0.88%        | 26.82% | 97.10%    | 41.60%
ProtectAIv2  | 56.64%       | 86.20% | 48.60%    | 63.81%
InjecGuard   | 87.32%       | 85.74% | 77.39%    | 83.48%
```
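
The Average column is consistent with an unweighted mean of the three per-category accuracies, and the improvement figures quoted against ProtectAIv2 are relative rather than percentage-point differences. A small check of that arithmetic, with hypothetical function names:

```rust
/// Unweighted mean of the three per-category accuracies (in percent).
pub fn average_accuracy(over_defense: f64, benign: f64, malicious: f64) -> f64 {
    (over_defense + benign + malicious) / 3.0
}

/// Relative improvement of `new` over `baseline`, in percent.
pub fn relative_improvement(new: f64, baseline: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}
```

For InjecGuard, (87.32 + 85.74 + 77.39) / 3 rounds to 83.48, and (87.32 - 56.64) / 56.64 rounds to 54.17%.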

### Architecture Description

- **Backbone:** DeBERTa-v3-base (same as ProtectAI and PromptGuard)
- **Training Details:** batch size 32, 3 epochs, Adam optimizer, learning rate 2e-5, max length 512 tokens
- **Attention Mechanism:** Unlike ProtectAIv2, which attends excessively to trigger words, InjecGuard distributes attention across the entire input context

## Feature Delta with LLMTrace

| Feature | InjecGuard | LLMTrace Security | Gap |
|---------|------------|-------------------|-----|
| **Architecture** | DeBERTa-v3-base classification | ProtectAI DeBERTa v2 + Regex hybrid | Minor - similar transformer base |
| **Over-defense Mitigation** | ✅ MOF strategy with bias detection | ❌ No explicit over-defense handling | **Critical gap** |
| **Trigger Word Analysis** | ✅ Automatic bias token identification | ❌ Static regex patterns only | **Major gap** |
| **Training Strategy** | ✅ Adaptive retraining from scratch | ❌ Standard fine-tuning | **Significant gap** |
| **Evaluation Framework** | ✅ 3D metrics (benign/malicious/over-defense) | ❌ Binary classification only | **Major gap** |
| **Encoding Detection** | ✅ Base64/ROT13/leetspeak/reversed | ✅ Base64 + manual pattern detection | Moderate gap |
| **Multi-language Support** | ✅ Includes Chinese, Russian samples | ❌ English-focused patterns | Moderate gap |
| **Format Coverage** | ✅ 17 data formats (CSV, XML, JSON, etc.) | ✅ Basic format detection | Minor gap |
| **Jailbreak Detection** | ✅ General prompt injection focus | ✅ Dedicated jailbreak detector | **LLMTrace strength** |
| **PII Detection** | ❌ Not covered | ✅ Comprehensive PII patterns + validation | **LLMTrace strength** |
| **Agent Action Analysis** | ❌ Not covered | ✅ Command/file/web analysis | **LLMTrace strength** |
| **Streaming Support** | ❌ Batch processing only | ✅ Real-time streaming analysis | **LLMTrace strength** |
| **Ensemble Approach** | ❌ Single model | ✅ ML + Regex ensemble | **LLMTrace strength** |
| **Open Source** | ✅ Fully open (model, data, code) | ✅ Open source | Equal |

### What InjecGuard Does That We Don't

1. **Systematic Over-defense Mitigation**: MOF training strategy specifically targets and reduces false positives
2. **Bias Token Detection**: Automated identification of problematic tokens that cause over-defense
3. **Three-Dimensional Evaluation**: Separate metrics for benign, malicious, and over-defense scenarios
4. **Adaptive Training Data Generation**: Automatically generates training data to counteract discovered biases
5. **NotInject-style Benchmarking**: Specific evaluation of model performance on benign inputs with trigger words

### What We Do That InjecGuard Doesn't

1. **Specialized Jailbreak Detection**: Dedicated module with encoding evasion detection (Base64, ROT13, leetspeak, reversed text)
2. **PII Detection & Validation**: Comprehensive personal information detection with checksum validation
3. **Agent Security Analysis**: Detection of dangerous commands, suspicious URLs, sensitive file access
4. **Streaming Security Monitoring**: Real-time analysis of content deltas
5. **Hybrid Architecture**: Combination of ML and regex approaches for robustness
6. **Granular Threat Categories**: Multiple finding types (prompt_injection, role_injection, data_leakage, etc.)

### Where We're Aligned

- DeBERTa backbone for ML classification
- Multi-format input support
- Open source approach
- Focus on lightweight, efficient models
- Encoding attack detection capabilities

## Actionable Recommendations

### P0 (Critical - Immediate Implementation)

1. **Implement Over-defense Detection**
   - **Effort:** 2-3 weeks
   - Add evaluation metric for over-defense accuracy in `MLSecurityAnalyzer`
   - Create benchmark dataset similar to NotInject for LLMTrace testing
   - **Code Impact:** New metric in `AnalysisContext`, test dataset in `tests/`

2. **MOF-Inspired Training Strategy**
   - **Effort:** 3-4 weeks
   - Implement token-wise bias detection during model training
   - Add adaptive training data generation for biased tokens
   - **Code Impact:** New training pipeline in `ml_detector.rs`, bias detection utilities
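
As a starting point for the over-defense metric in item 1, the sketch below scores a detector on benign samples that contain trigger words. `BenignSample`, `over_defense_accuracy`, and the closure-based detector are hypothetical stand-ins for illustration, not existing LLMTrace types.

```rust
/// One benign benchmark item plus the number of trigger words it contains
/// (1 to 3, mirroring the NotInject difficulty levels).
pub struct BenignSample<'a> {
    pub text: &'a str,
    pub trigger_words: u8,
}

/// Over-defense accuracy: the share of benign, trigger-word-bearing samples
/// that the detector correctly leaves unflagged. Returns None for an empty set.
pub fn over_defense_accuracy<F>(
    samples: &[BenignSample<'_>],
    flags_as_attack: F,
) -> Option<f64>
where
    F: Fn(&str) -> bool,
{
    if samples.is_empty() {
        return None;
    }
    let correct = samples
        .iter()
        .filter(|s| !flags_as_attack(s.text))
        .count();
    Some(correct as f64 / samples.len() as f64)
}
```

A keyword-only detector illustrates the failure mode: if it flags every input containing "ignore", any benign sample with that word counts against this metric. Filtering `samples` on `trigger_words` before the call gives a per-difficulty breakdown.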

### P1 (High Priority - Next Quarter)

3. **Three-Dimensional Evaluation Framework**
   - **Effort:** 2 weeks
   - Separate benign/malicious/over-defense accuracy tracking
   - Update `SecurityFinding` to include over-defense classification
   - **Code Impact:** Enhanced metrics in `inference_stats.rs`

4. **Enhanced Trigger Word Analysis**
   - **Effort:** 3 weeks
   - Extend jailbreak detector to identify problematic trigger words
   - Add attention weight analysis for bias detection
   - **Code Impact:** Updates to `jailbreak_detector.rs`, new attention analysis module
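
The three-dimensional tracking in item 3 could look roughly like the accumulator below. `EvalSet` and `ThreeWayStats` are hypothetical names invented for this sketch and do not reflect the current contents of `inference_stats.rs`.

```rust
/// Which evaluation set a labelled sample came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvalSet {
    Benign,
    Malicious,
    /// Benign text that contains trigger words.
    OverDefense,
}

/// Per-set counts of correct and total predictions.
#[derive(Debug, Default)]
pub struct ThreeWayStats {
    correct: [u64; 3],
    total: [u64; 3],
}

impl ThreeWayStats {
    fn index(set: EvalSet) -> usize {
        match set {
            EvalSet::Benign => 0,
            EvalSet::Malicious => 1,
            EvalSet::OverDefense => 2,
        }
    }

    /// Record one prediction. Only malicious samples should be flagged.
    pub fn record(&mut self, set: EvalSet, flagged: bool) {
        let expected = set == EvalSet::Malicious;
        let i = Self::index(set);
        self.total[i] += 1;
        if flagged == expected {
            self.correct[i] += 1;
        }
    }

    /// Accuracy for one set, or None if nothing was recorded for it.
    pub fn accuracy(&self, set: EvalSet) -> Option<f64> {
        let i = Self::index(set);
        if self.total[i] == 0 {
            None
        } else {
            Some(self.correct[i] as f64 / self.total[i] as f64)
        }
    }

    /// Unweighted mean of the three accuracies; None until every set has data.
    pub fn average(&self) -> Option<f64> {
        let b = self.accuracy(EvalSet::Benign)?;
        let m = self.accuracy(EvalSet::Malicious)?;
        let o = self.accuracy(EvalSet::OverDefense)?;
        Some((b + m + o) / 3.0)
    }
}
```

Keeping the three sets separate means a model cannot hide a poor over-defense score behind strong malicious detection, which is the pattern the baseline table shows for several existing guards.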

### P2 (Medium Priority - Future Releases)

5. **Multi-language Trigger Detection**
   - **Effort:** 4-5 weeks
   - Extend pattern detection to Chinese, Russian, other languages
   - Add Unicode normalization improvements
   - **Code Impact:** Updates to `normalise.rs`, new language patterns

6. **Format-Specific Training**
   - **Effort:** 2-3 weeks
   - Generate training data for underrepresented formats (CSV, XML, etc.)
   - **Code Impact:** Training data augmentation scripts

### Potential Code/Model Integration

1. **InjecGuard Model Integration**
   - Consider adding InjecGuard as alternative backbone in `MLSecurityConfig`
   - Compare performance against current ProtectAI DeBERTa model
   - **HuggingFace Model:** Availability not verified for this analysis; the authors describe the model, data, and code as openly released

2. **NotInject Dataset**
   - Use as additional test suite for LLMTrace models
   - Available at: `https://github.com/SaFoLab-WISC/InjecGuard`

3. **MOF Training Code**
   - Adapt their bias detection algorithm for our ensemble approach
   - Integrate with existing `FusionClassifier` architecture

## Key Metrics Comparison

### InjecGuard Reported Performance
- **Precision/Recall:** Not explicitly reported (accuracy-focused evaluation)
- **F1 Score:** Not reported
- **Average Accuracy:** 83.48%
- **Over-defense Accuracy:** 87.32% (most critical metric)
- **Malicious Detection:** 77.39%
- **Benign Recognition:** 85.74%

### Comparison to ProtectAI DeBERTa (Our Current Model)
InjecGuard vs. ProtectAI DeBERTa v2 (which we use):
- **Overall Performance:** 30.8% relative improvement in average accuracy (83.48% vs 63.81%)
- **Over-defense:** 54.17% relative improvement (87.32% vs 56.64%)
- **Architecture:** Similar (both DeBERTa-based)
- **Efficiency:** Comparable inference time (~15ms)

### NotInject Benchmark Results
Existing open-source guard models perform poorly on over-defense:
- **PromptGuard (Meta):** 0.88% over-defense accuracy
- **Deepset:** 5.31% over-defense accuracy
- **Fmops:** 5.60% over-defense accuracy
- **ProtectAI v2:** 56.64% over-defense accuracy

> "None of the existing open-source prompt guard models achieve an over-defense accuracy greater than 60%, where 50% represents random guessing."

## Technical Implementation Notes

### MOF Algorithm Adaptation for LLMTrace

```rust
use std::collections::HashSet;

// Potential integration in MLSecurityAnalyzer
pub struct OverDefenseDetector {
    /// Tokens the model flags as "attack" when classified in isolation.
    pub bias_tokens: HashSet<String>,
    /// Cutoff for treating a token as biased (exact semantics still to be defined).
    pub over_defense_threshold: f64,
}

impl OverDefenseDetector {
    /// Test each vocabulary token individually.
    pub fn detect_bias_tokens(&self, model: &MLSecurityAnalyzer) -> Vec<String> {
        // Implementation similar to InjecGuard's token-wise recheck
        todo!("classify each vocabulary token with `model` and collect the flagged ones")
    }

    /// Generate benign samples with biased tokens.
    pub fn generate_debiasing_samples(&self, bias_tokens: &[String]) -> Vec<String> {
        // LLM-based generation for training data augmentation
        todo!("prompt an LLM with combinations of 1-3 `bias_tokens`")
    }
}
```

### Integration with Existing Architecture

The MOF strategy could be integrated into our ensemble approach:
1. Apply bias detection to our ML models
2. Generate additional training data for problematic patterns
3. Retrain ensemble components with augmented dataset
4. Maintain compatibility with existing regex-based detection

This represents a significant opportunity to improve LLMTrace's robustness against over-defense while maintaining our strengths in PII detection, agent security analysis, and real-time streaming capabilities.
