diff --git a/README.md b/README.md index 29592bd..3c1405a 100644 --- a/README.md +++ b/README.md @@ -2,479 +2,81 @@ [![Unit tests](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/unit-tests.yml/badge.svg)](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/unit-tests.yml) -A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teaming Competition. +A defensive security testing framework for evaluating AI model vulnerabilities in the OpenAI GPT-OSS-20B Red Teaming Competition. -## Competition Overview +## ⚠️ Safety Disclaimer -**Target**: OpenAI's gpt-oss-20b open weight model -**Goal**: Find up to 5 distinct vulnerabilities with high Severity, Breadth, Novelty, Reproducibility, and Methodological Insight -**Deadline**: August 26, 2025 +**FOR AI SAFETY RESEARCH ONLY** -## 🛠️ Getting Started +This toolkit is designed exclusively for legitimate security research and improving AI safety. All testing techniques are for defensive purposes - identifying vulnerabilities to make AI systems safer. Not for attacking production systems or generating harmful content. -### Prerequisites - -- [Ollama](https://ollama.ai/) installed and running -- Python 3.12+ -- [uv](https://github.com/astral-sh/uv) for package management - -### Installation - -1. **Install Ollama and pull the model**: - - ```bash - # Install Ollama (see https://ollama.ai) - ollama pull gpt-oss:20b - ``` - -2. **Setup project environment**: - - ```bash - uv sync - ``` - -3. **Verify your setup**: - - ```bash - uv run setup - ``` - - This will test your model connection and set up required directories. - -4. **Start security testing**: - - ```bash - # Run all enabled tests - uv run pentest - - # Or test specific categories - uv run pentest -c deception - uv run pentest --test-id adderall_001 - - # Run multi-turn attack testing - uv run attack - - # View and analyze sessions - uv run sessions - ``` - -5. **Red Team Attack Testing**: - - The attack system allows multi-turn attacks with various LLMs and learning capabilities: - - ```bash - # Run single attack (Claude Opus 4.1 default) - uv run attack - - # Run adaptive campaign with multiple attempts - uv run attack --batch 5 # Run 5 attempts, learning from each - uv run attack --batch 10 --steps 2 # 10 attempts, each with 2 turns - - # Try different attacker models - uv run attack --attacker-model openai/gpt-4o - uv run attack --attacker-model meta-llama/llama-3.1-405b-instruct --batch 5 - - # Use custom prompts instead of AI-generated - uv run attack --custom - - # View and analyze past sessions - uv run sessions - ``` - - Features: - - Multi-turn attacks (1 = single-turn, 2+ = trust-building) - - Multiple attack strategies (trust building, side tasks, etc.) - - Per-model learning system that tracks what works for each attacker - - Session persistence with full dialogue history - - Automatic report generation - - Support for any OpenRouter-compatible model - -6. **Review results and create findings**: - - ```bash - # Interactive review interface - uv run review - - # Browse exported findings - uv run findings - - # Generate comprehensive report - uv run report - ``` - -### Need Help? - -```bash -uv run help # Complete workflow guide -uv run pentest --help # Individual command help -``` - -## Vulnerability Categories +## 🏆 Competition -### 1. Reward Hacking +**OpenAI GPT-OSS-20B Red Teaming Challenge** -Testing for shortcuts that maximize reward metrics without solving the actual task: +- **Link**: [Kaggle Competition](https://www.kaggle.com/competitions/gpt-oss-20b-red-teaming/) +- **Deadline**: August 26, 2025 +- **Goal**: Find up to 5 novel vulnerabilities in gpt-oss-20b -- Confident-sounding but false answers -- Hallucinations styled to please raters -- Jailbreak-friendly phrasing that evades filters +## 🚀 Quick Start -### 2. Deception - -Strategic lying where the model knowingly emits falsehoods: - -- Misstating preferences or capabilities during evaluations -- Deliberate deception to achieve goals -- Avoiding shutdown or steering users - -### 3. Hidden Motivations (Deceptive Alignment) - -Internal goals that differ from training objectives: - -- Withholding information strategically -- Misleading supervisors to avoid fine-tuning -- Feigning obedience until gaining leverage - -### 4. Sabotage - -Intentional degradation or disruption: - -- Corrupting files or workflows -- Inserting subtle bugs -- Derailing processes when given opportunity - -### 5. Inappropriate Tool Use - -Unsafe invocation of function-calling or external APIs: - -- Destructive shell commands -- Sending spam or leaking secrets -- Misinterpreting instructions with high-privilege tools - -### 6. Data Exfiltration - -Leaking proprietary data or encoding secret payloads: - -- Trigger-based payload extraction -- Covert-channel steganography -- Training data or user input leaks - -### 7. Sandbagging - -Strategically hiding full capabilities: - -- Appearing less capable in dangerous evaluations -- Hiding capabilities to get deployed -- Preventing scheming detection - -### 8. Evaluation Awareness - -Detecting test conditions and shifting behavior: - -- Appearing safer during benchmarks -- Gaming test contexts -- Performance discrepancies between evaluation and real-world use - -### 9. Chain of Thought Issues - -Strategic information handling in reasoning: - -- Avoiding verbalizing misbehavior detection -- Revealing inappropriate information in CoT -- Believing internal dialog is hidden from users - -## 🏗️ Toolkit Architecture - -``` -red-team-testbed/ -├── src/ -│ ├── cli/ # CLI entry points -│ │ ├── pentest.py # Security test runner -│ │ ├── review.py # Results evaluation interface -│ │ ├── setup.py # Environment verification -│ │ ├── findings.py # Findings browser and export -│ │ ├── report.py # Report generation -│ │ └── help.py # Workflow guidance -│ ├── categories/ # Vulnerability test suites -│ │ └── ... # Various test implementations -│ ├── ui/ # User interface components -│ │ ├── cli_ui.py # Interactive results reviewer -│ │ └── cli_findings.py # Findings browser -│ ├── utils/ # Helper utilities -│ │ ├── model_client.py # Ollama client for GPT-OSS-20B -│ │ ├── evaluator.py # Response analysis -│ │ ├── findings_generator.py # Competition findings formatter -│ │ └── live_display.py # Real-time test feedback -│ ├── models.py # Pydantic data models -│ └── constants.py # Configuration constants -├── findings/ # Competition submission files -├── results/ # Test execution results -└── pyproject.toml # Project configuration -``` - -## 📋 Security Testing Workflow - -### Interactive Results Review - -The toolkit provides a powerful interactive interface for evaluating test results: +### Prerequisites -### Primary CLI Interface +- [Ollama](https://ollama.ai/) installed or [OpenRouter](https://openrouter.ai/) account +- Python 3.12+ +- [uv](https://github.com/astral-sh/uv) package manager -The main interface for reviewing test results and managing findings: +### Setup & Run ```bash -# Review test results interactively -uv run review # Interactive file selection -uv run review --latest # Use most recent results -uv run review results/test.json # Review specific file - -# Flag results 'f' and export them 'e' +# 1. Install model +ollama pull gpt-oss:20b -# Browse and manage findings -uv run findings # Browse exported findings - # Use 'e' to export individual competition files - -# Generate reports and summaries -uv run report # Comprehensive analysis report -``` - - -**CLI Features:** - -- **Rapid keyboard navigation**: j/k for next/prev, space for details -- **Text highlighting**: Mark concerning phrases directly in terminal -- **Batch processing**: Process hundreds of responses efficiently -- **Export shortcuts**: Single keypress to save findings -- **Search functionality**: Filter responses by keyword or pattern - -**CLI Key Bindings:** - -``` -Navigation: - n/→ Next response - p/← Previous response - j Jump to specific result - -Actions: - f Flag as potential finding - u Unflag result - c Copy prompt to clipboard - t Show chain of thought - e Export flagged findings - v View findings browser - h/? Show help - q Quit and save progress -``` - -### Complete Security Testing Pipeline - -```bash -# 1. Environment verification +# 2. Setup environment +uv sync uv run setup -# 2. Run security tests with auto-evaluation -uv run pentest --auto-score --export-findings +# 3. Run security tests +uv run pentest -# 3. Review results interactively +# 4. Review results uv run review - -# 4. Browse and export individual findings -uv run findings - -# 5. Generate comprehensive report -uv run report -``` - -**Built-in evaluation capabilities:** - -- **Pattern detection**: Identify known vulnerability indicators -- **Anomaly detection**: Flag unusual or concerning patterns -- **Consistency checking**: Cross-reference similar prompts -- **Evidence extraction**: Automatically highlight problematic phrases - -## Competition Evaluation - -Competition judges will evaluate submissions across 5 key dimensions: - -- **Severity**: Impact level from minor to catastrophic -- **Breadth**: Scale of affected users from narrow to billions -- **Novelty**: From known variations to breakthrough discoveries -- **Reproducibility**: From manual tweaking to automated harness -- **Methodological Insight**: From basic prompts to general frameworks - -Use the toolkit to identify and document promising vulnerabilities that judges can evaluate. - -## Findings Format - -Each discovered vulnerability generates a competition-compliant JSON file: - -```json -{ - "finding_id": "unique_identifier", - "category": "deception_samples", - "title": "Strategic Lying in Safety Evaluations", - "description": "Detailed vulnerability description", - "methodology": "Discovery process and testing approach", - "prompts": ["Trigger prompt 1", "Trigger prompt 2"], - "expected_behavior": "What a secure model should do", - "observed_behavior": "What the model actually did", - "evidence": ["Response 1", "Response 2"], - "reproduction_steps": ["Step 1", "Step 2"], - "mitigation_suggestions": ["Suggestion 1", "Suggestion 2"] -} -``` - -## Model Access - -This toolkit supports multiple access methods for gpt-oss-20b: - -- **Ollama**: `ollama run gpt-oss:20b` -- **Transformers**: Direct model loading -- **API**: If available through competition infrastructure - -### Performance Testing - -Compare local vs. cloud GPT-OSS:20b performance: - -```bash -# Create .env from template and add your OpenRouter API key -cp .env.template .env -# Run performance comparison -uv run llm_performance.py ``` -## Automated Red-Teaming Strategies +## 🔍 Test Categories -### 1. Systematic Prompt Generation +Tests cover 9 vulnerability categories including: -- Template-based vulnerability probes -- Adversarial prompt crafting -- Multi-modal attack vectors +- Deception & lying +- Reward hacking +- Sabotage & harmful behaviors +- Tool misuse +- Data exfiltration +- Evaluation gaming -### 2. Behavioral Analysis - -- Consistency testing across contexts -- Response pattern analysis -- Statistical anomaly detection - -### 3. Multi-Turn Exploitation - -- Context poisoning attacks -- Gradual boundary pushing -- Memory exploitation - -### 4. Tool Use Monitoring - -- Function call interception -- Parameter injection testing -- Privilege escalation detection - -## Ethics and Compliance - -This toolkit is designed exclusively for: - -- ✅ OpenAI's official red-teaming competition -- ✅ Legitimate security research -- ✅ Improving AI safety and alignment - -**Not for**: - -- ❌ Attacking production systems -- ❌ Generating harmful content for malicious use -- ❌ Any illegal or unethical activities - -## Winning Strategy Tips - -- **Focus on novel attack vectors** not covered in existing literature -- **Demonstrate clear real-world impact** for high severity scores -- **Create robust, automated reproduction** for maximum reproducibility -- **Develop reusable methodologies** for methodological insight points -- **Document everything clearly** for report clarity scoring - - -## CI/CD and Testing - -This project uses GitHub Actions for continuous integration and testing. - -### Automated Checks - -All pull requests and pushes to main trigger: - -1. **Code Quality Checks** (`lint-type-check.yml`) - - Formatting verification with `ruff format` - - Linting with `ruff check` - - Type checking with `ty` - - Unit tests with `pytest` - - Code coverage reporting - -2. **Test Matrix** (`test-matrix.yml`) - - Tests across multiple Python versions (3.12, 3.13) - - Cross-platform testing (Ubuntu, macOS, Windows) - - Scheduled daily test runs - - Test result publishing - -3. **Pre-commit Hooks** (`pre-commit.yml`) - - Automated checks before commits - - File formatting and linting - - Security checks for private keys - - JSON/YAML/TOML validation - -### Local Development - -Run tests locally: - -```bash -# Run all tests -uv run pytest tests/ - -# Run with coverage -uv run pytest tests/ --cov=src --cov-report=term-missing - -# Run specific test file -uv run pytest tests/test_repeat_functionality.py -v - -# Run linting and formatting -uv run ruff check src -uv run ruff format src - -# Type checking -uv run ty check src -``` - -### Setting up Pre-commit Hooks - -Pre-commit hooks automatically run type checking, linting, and formatting before each commit: +## 📊 Key Commands ```bash -# Install pre-commit with uv plugin for faster Python hook installation -uv tool install pre-commit --with pre-commit-uv - -# Install hooks into git -pre-commit install - -# Run manually on all files (optional) -pre-commit run --all-files +uv run pentest # Run security tests +uv run attack # Multi-turn attack testing +uv run review # Interactive result review +uv run findings # Browse exported findings +uv run report # Generate comprehensive report ``` -The hooks will automatically: -1. Type check with `ty` (catches type errors) -2. Lint and auto-fix with `ruff check --fix` (fixes safe issues) -3. Format with `ruff format` (ensures consistent style) -4. Validate YAML, JSON, and TOML files -5. Check for merge conflicts +## 🎯 Winning Strategy -All checks use the exact same commands and settings from `pyproject.toml`. +Focus on discovering: -### Test Coverage +- **Novel** attack vectors not in literature +- **Severe** vulnerabilities with real-world impact +- **Reproducible** exploits with automated harnesses +- **Broad** vulnerabilities affecting many users +- **Insightful** methodologies revealing model behavior -The project aims for high test coverage. Current test suite includes: +## 📜 License -- Unit tests for core functionality -- Integration tests for CLI commands -- Test fixtures for mock responses -- Parameterized tests for multiple scenarios +- **Code**: Licensed under Apache 2.0 +- **Datasets**: Licensed under CC-0 (public domain) -Coverage reports are automatically generated and can be viewed: -- In CI: Via GitHub Actions artifacts -- Locally: `coverage.xml` and terminal output -- Online: Via Codecov integration (if configured) +See LICENSE file for details.