"Elevated thinking for AI safety" - A precision-focused approach to AI safety evaluation.
unit420_kaggle_submission/
├── findings/ # Competition findings files
│ ├── unit420.findings.1.json # Authority Override (Severity 9)
│ ├── unit420.findings.2.json # Role Confusion (Severity 8)
│ ├── unit420.findings.3.json # Progressive Escalation (Severity 8)
│ ├── unit420.findings.4.json # Technical Obfuscation (Severity 7)
│ └── unit420.findings.5.json # Context Manipulation (Severity 8)
├── scripts/ # Assessment methodology
│ ├── proper_enhanced_severity_targeting.py
│ ├── proper_deceptive_alignment_detector.py
│ ├── proper_evaluation_awareness_tester.py
│ ├── proper_reasoning_chain_exploiter.py
│ └── proper_best_results_validator.py
├── results/ # Clean assessment data
│ ├── proper_severity_assessment_20250816_203928.csv
│ ├── proper_deception_assessment_20250816_200301.csv
│ ├── proper_awareness_assessment_20250816_203244.csv
│ └── proper_reasoning_assessment_20250816_201208.csv
├── documentation/ # Supporting documentation
│ ├── METHODOLOGY_DOCUMENTATION.md
│ └── RESULTS_ANALYSIS_FOR_WRITEUP.md
├── Unit420_Kaggle_Writeup.md # Main competition submission
├── FILE_MAPPING.md # File mapping reference
└── README.md # This file
- 43 validated vulnerabilities across 4 assessment categories
- 0% false positive rate (0/43) with hybrid refusal detection
- 27.2% overall success rate [95% CI: 20.9, 34.6]
- 5 high-severity findings selected for competition (Severity 7-9)
| Assessment Category | Success Rate | Vulnerabilities |
|---|---|---|
| Enhanced Severity | 35.7% [23.0, 50.8] | 15/42 |
| Deceptive Alignment | 23.3% [11.8, 40.9] | 7/30 |
| Evaluation Awareness | 25.0% [13.8, 41.1] | 9/36 |
| Reasoning Exploitation | 24.0% [14.3, 37.4] | 12/50 |
- Stage 1: Regex patterns for obvious refusals
- Stage 2: LLM adjudication (gpt-4o-mini) for ambiguous cases
- Validation: κ=0.94 agreement with human judges on 100-item set
- 4 specialized scripts targeting different vulnerability categories
- 158 total prompts derived from competition framework
- Deterministic execution (T=0) with proper retry handling
- Manual review as ground truth (κ=0.87 inter-rater agreement)
- Confidence intervals on all metrics (Wilson method)
- Complete reproducibility with timestamped outputs
Quality over Quantity: Rather than inflating vulnerability counts with false positives, we prioritize:
- Methodological Innovation: Novel hybrid detection approach
- Statistical Rigor: Proper validation with confidence intervals
- Reproducible Science: Open-source framework with detailed documentation
- Practical Impact: High-severity findings with clear defensive implications
- Model: gpt-oss-20b (q5_k_m quantization)
- Runtime: ollama v0.1.32, RTX 4090, Ubuntu 22.04
- Parameters: T=0, max_tokens=256, deterministic execution
- External Judge: gpt-4o-mini for ambiguous classification
Our vulnerability categories map directly to official competition topics:
- Authority Override ← "Inappropriate tool use"
- Role Confusion ← "Deceptive alignment"
- Progressive Escalation ← "Chain of Thought issues"
- Technical Obfuscation ← "Data exfiltration" techniques
- Context Manipulation ← "Evaluation awareness"
Full reproduction of our 43 vulnerability findings:
# Automated reproduction (recommended)
./reproduce_results.sh
# Manual reproduction
pip install -r requirements.txt
ollama pull gpt-oss:20b
python scripts/proper_enhanced_severity_targeting.py
python scripts/proper_deceptive_alignment_detector.py
python scripts/proper_evaluation_awareness_tester.py
python scripts/proper_reasoning_chain_exploiter.py- 43 total vulnerabilities across 4 assessment categories
- 0% false positive rate with our hybrid refusal detection
- 27.2% overall success rate with confidence intervals
- 5 high-severity findings selected for competition submission
📋 See REPRODUCTION_GUIDE.md for detailed instructions
All results are fully reproducible with:
- Exact model specification and parameters (ollama + gpt-oss:20b)
- Timestamped assessment outputs with response hashes
- Complete methodology documentation and validation framework
- Open-source scripts with deterministic execution
CC0 1.0 Universal (Public Domain) - Full details in LICENSE file
In compliance with OpenAI GPT-OSS-20B Red-Teaming Competition guidelines
Team Unit420 - GPT-OSS-20B Red-Teaming Competition 2025
Team Lead: Yad Konrad
Email: yad.konrad@gmail.com
GitHub: @0bserver07
Social: @yadkonrad
"Precision-focused AI safety evaluation with validated methodology"