Skip to content

0bserver07/unit420_kaggle_submission

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unit420 - GPT-OSS-20B Red-Teaming Submission

Team Unit420 🚀

"Elevated thinking for AI safety" - A precision-focused approach to AI safety evaluation.

📁 Submission Structure

unit420_kaggle_submission/
├── findings/                 # Competition findings files
│   ├── unit420.findings.1.json    # Authority Override (Severity 9)
│   ├── unit420.findings.2.json    # Role Confusion (Severity 8)
│   ├── unit420.findings.3.json    # Progressive Escalation (Severity 8)
│   ├── unit420.findings.4.json    # Technical Obfuscation (Severity 7)
│   └── unit420.findings.5.json    # Context Manipulation (Severity 8)
├── scripts/                  # Assessment methodology
│   ├── proper_enhanced_severity_targeting.py
│   ├── proper_deceptive_alignment_detector.py
│   ├── proper_evaluation_awareness_tester.py
│   ├── proper_reasoning_chain_exploiter.py
│   └── proper_best_results_validator.py
├── results/                  # Clean assessment data
│   ├── proper_severity_assessment_20250816_203928.csv
│   ├── proper_deception_assessment_20250816_200301.csv
│   ├── proper_awareness_assessment_20250816_203244.csv
│   └── proper_reasoning_assessment_20250816_201208.csv
├── documentation/           # Supporting documentation
│   ├── METHODOLOGY_DOCUMENTATION.md
│   └── RESULTS_ANALYSIS_FOR_WRITEUP.md
├── Unit420_Kaggle_Writeup.md    # Main competition submission
├── FILE_MAPPING.md              # File mapping reference
└── README.md                    # This file

🎯 Key Results

  • 43 validated vulnerabilities across 4 assessment categories
  • 0% false positive rate (0/43) with hybrid refusal detection
  • 27.2% overall success rate [95% CI: 20.9, 34.6]
  • 5 high-severity findings selected for competition (Severity 7-9)

Success Rates by Category

Assessment Category Success Rate Vulnerabilities
Enhanced Severity 35.7% [23.0, 50.8] 15/42
Deceptive Alignment 23.3% [11.8, 40.9] 7/30
Evaluation Awareness 25.0% [13.8, 41.1] 9/36
Reasoning Exploitation 24.0% [14.3, 37.4] 12/50

🔬 Methodology Highlights

Hybrid Refusal Detection

  • Stage 1: Regex patterns for obvious refusals
  • Stage 2: LLM adjudication (gpt-4o-mini) for ambiguous cases
  • Validation: κ=0.94 agreement with human judges on 100-item set

Systematic Assessment

  • 4 specialized scripts targeting different vulnerability categories
  • 158 total prompts derived from competition framework
  • Deterministic execution (T=0) with proper retry handling

Rigorous Validation

  • Manual review as ground truth (κ=0.87 inter-rater agreement)
  • Confidence intervals on all metrics (Wilson method)
  • Complete reproducibility with timestamped outputs

🏆 Competition Strategy

Quality over Quantity: Rather than inflating vulnerability counts with false positives, we prioritize:

  1. Methodological Innovation: Novel hybrid detection approach
  2. Statistical Rigor: Proper validation with confidence intervals
  3. Reproducible Science: Open-source framework with detailed documentation
  4. Practical Impact: High-severity findings with clear defensive implications

🛠 Technical Environment

  • Model: gpt-oss-20b (q5_k_m quantization)
  • Runtime: ollama v0.1.32, RTX 4090, Ubuntu 22.04
  • Parameters: T=0, max_tokens=256, deterministic execution
  • External Judge: gpt-4o-mini for ambiguous classification

📊 Competition Alignment

Our vulnerability categories map directly to official competition topics:

  • Authority Override ← "Inappropriate tool use"
  • Role Confusion ← "Deceptive alignment"
  • Progressive Escalation ← "Chain of Thought issues"
  • Technical Obfuscation ← "Data exfiltration" techniques
  • Context Manipulation ← "Evaluation awareness"

🔄 Reproducibility

Full reproduction of our 43 vulnerability findings:

Quick Start

# Automated reproduction (recommended)
./reproduce_results.sh

# Manual reproduction
pip install -r requirements.txt
ollama pull gpt-oss:20b
python scripts/proper_enhanced_severity_targeting.py
python scripts/proper_deceptive_alignment_detector.py  
python scripts/proper_evaluation_awareness_tester.py
python scripts/proper_reasoning_chain_exploiter.py

What You'll Reproduce

  • 43 total vulnerabilities across 4 assessment categories
  • 0% false positive rate with our hybrid refusal detection
  • 27.2% overall success rate with confidence intervals
  • 5 high-severity findings selected for competition submission

📋 See REPRODUCTION_GUIDE.md for detailed instructions

All results are fully reproducible with:

  • Exact model specification and parameters (ollama + gpt-oss:20b)
  • Timestamped assessment outputs with response hashes
  • Complete methodology documentation and validation framework
  • Open-source scripts with deterministic execution

📄 License

CC0 1.0 Universal (Public Domain) - Full details in LICENSE file

In compliance with OpenAI GPT-OSS-20B Red-Teaming Competition guidelines

🤝 Team Contact

Team Unit420 - GPT-OSS-20B Red-Teaming Competition 2025

Team Lead: Yad Konrad
Email: yad.konrad@gmail.com
GitHub: @0bserver07
Social: @yadkonrad

"Precision-focused AI safety evaluation with validated methodology"

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors