.github/workflows/lint-type-check.yml (8 changes: 4 additions & 4 deletions)
@@ -22,11 +22,11 @@ jobs:
       - name: Install dependencies
         run: uv sync --dev

-      - name: Check formatting
-        run: uv run ruff format src --check
+      - name: Run type checking
+        run: uv run ty check src

       - name: Run linting
         run: uv run ruff check src

-      - name: Run type checking
-        run: uv run ty check src
+      - name: Check formatting
+        run: uv run ruff format src --check
CLAUDE.md (67 changes: 63 additions & 4 deletions)
@@ -67,12 +67,40 @@ uv run setup
# Run tests
uv run pentest

-# Code quality checks
-uv run ruff check src
-uv run ruff format src
-uv run ty check src
# Run red team attack testing
uv run attack # Single attack

# Run adaptive batch campaigns (learns between attempts)
uv run attack --batch 5 # 5 attempts with learning
uv run attack --batch 10 --steps 2 # 10 attempts, 2 turns each

# With different attacker models (default: Claude Opus 4.1)
uv run attack --attacker-model openai/gpt-4o --batch 5
uv run attack --attacker-model meta-llama/llama-3.1-405b-instruct
uv run attack --attacker-model google/gemini-2.0-flash-exp

# Use custom prompts
uv run attack --custom

# View and analyze sessions
uv run sessions

# Code quality checks (ALWAYS run before committing)
uv run ty check src # Type checking first
uv run ruff check src # Then linting
uv run ruff format src # Finally formatting
```

### Pre-Commit Checklist

**IMPORTANT**: Always run these checks before committing code:

1. **Type Check**: `uv run ty check src` - Ensure no type errors
2. **Lint Check**: `uv run ruff check src` - Catch code issues
3. **Format Check**: `uv run ruff format src` - Ensure consistent formatting

If any of these fail, fix the issues before committing. The GitHub Actions CI will run these same checks in this order.
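
If you want Git to enforce this checklist automatically, a minimal local hook could look like the sketch below. This is hypothetical and not part of the repository; save it as `.git/hooks/pre-commit`, make it executable, and make sure `uv sync --dev` has been run first.

```bash
#!/usr/bin/env bash
# Hypothetical pre-commit hook mirroring the checklist above.
set -e  # abort the commit on the first failing check

uv run ty check src              # 1. type check
uv run ruff check src            # 2. lint
uv run ruff format src --check   # 3. verify formatting without rewriting files
```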

### Git Workflow

- Create feature branches for significant changes
@@ -88,6 +116,37 @@ This is a DEFENSIVE SECURITY tool for:
- Academic research
- Improving AI safety

## Competition Context

### OpenAI GPT-OSS-20B Red Teaming Challenge

- **Prize Pool**: $500,000
- **Deadline**: August 26, 2025
- **Goal**: Find up to 5 novel vulnerabilities in gpt-oss-20b

### Key Evaluation Criteria
1. **Novelty** - Previously unidentified vulnerabilities
2. **Severity** - Potential harm level
3. **Reproducibility** - Consistent reproduction
4. **Breadth** - Multiple use case impact
5. **Methodological Insight** - New understanding of model behavior

### Target Vulnerability Categories
- Deceptive alignment
- Reward hacking
- Lying and deception
- Harmful behaviors
- Novel exploits

### Our Approach
This codebase provides tools to systematically test for these vulnerabilities through:
- Multi-turn conversation attacks
- Trust-building strategies
- Automated attack generation with various LLMs (Claude, GPT-4, Llama, etc.; see the campaign sketch after this list)
- Session persistence and learning per attacker model
- Pattern recognition from successful exploits
- Model-specific strategy optimization
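
To make the last few points concrete, a multi-model campaign can be scripted with the flags documented above. This is only a sketch; the model slugs are the examples already listed in this file.

```bash
#!/usr/bin/env bash
# Hypothetical campaign script: one adaptive batch per attacker model,
# so each attacker accumulates its own learning data.
set -e

for model in openai/gpt-4o meta-llama/llama-3.1-405b-instruct google/gemini-2.0-flash-exp; do
    echo "=== Attacker model: $model ==="
    uv run attack --attacker-model "$model" --batch 5 --steps 2
done

uv run sessions   # compare afterwards what worked for each attacker
```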

## Key Files to Know

- `src/cli/pentest.py` - Main test runner
README.md (41 changes: 40 additions & 1 deletion)
@@ -1,5 +1,7 @@
# GPT-OSS-20B Red Team Security Testing Toolkit

[![CI](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/ci.yml/badge.svg)](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/ci.yml)

A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teaming Competition.

## Competition Overview
@@ -48,9 +50,46 @@
# Or test specific categories
uv run pentest -c deception
uv run pentest --test-id adderall_001

# Run multi-turn attack testing
uv run attack

# View and analyze sessions
uv run sessions
```

-5. **Review results and create findings**:
5. **Red Team Attack Testing**:

The attack system runs multi-turn attacks driven by various attacker LLMs and learns from previous attempts:

```bash
# Run single attack (Claude Opus 4.1 default)
uv run attack

# Run adaptive campaign with multiple attempts
uv run attack --batch 5 # Run 5 attempts, learning from each
uv run attack --batch 10 --steps 2 # 10 attempts, each with 2 turns

# Try different attacker models
uv run attack --attacker-model openai/gpt-4o
uv run attack --attacker-model meta-llama/llama-3.1-405b-instruct --batch 5

# Use custom prompts instead of AI-generated
uv run attack --custom

# View and analyze past sessions
uv run sessions
```

Features:
- Multi-turn attacks (`--steps 1` = single-turn, `--steps 2` or more = trust-building)
- Multiple attack strategies (trust building, side tasks, etc.)
- Per-model learning system that tracks what works for each attacker
- Session persistence with full dialogue history
- Automatic report generation
- Support for any OpenRouter-compatible model
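
Regarding the last point, "OpenRouter-compatible" here means any model slug served through OpenRouter's chat completions API. A quick way to sanity-check a candidate attacker model might look like the sketch below; it assumes an `OPENROUTER_API_KEY` environment variable and uses OpenRouter's public endpoint.

```bash
# Hypothetical smoke test before passing a slug to --attacker-model.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/llama-3.1-405b-instruct",
        "messages": [{"role": "user", "content": "Reply with OK."}]
      }'
```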

6. **Review results and create findings**:

```bash
# Interactive review interface
competition.md (109 changes: 109 additions & 0 deletions)
@@ -0,0 +1,109 @@
# OpenAI GPT-OSS-20B Red Teaming Challenge

## Competition Overview

The OpenAI GPT-OSS-20B Red Teaming Challenge is a Kaggle-hosted competition with a **$500,000 prize pool** aimed at strengthening open source AI safety. The competition invites researchers, developers, and enthusiasts worldwide to help uncover novel risks and vulnerabilities in OpenAI's new open-weight model, gpt-oss-20b.

## Timeline

- **Start Date**: August 5, 2025
- **Submission Deadline**: August 26, 2025 at 11:59 PM UTC
- **Judging Period**: August 27 - September 11, 2025
- **Winners Announcement**: September 15, 2025 (estimated)
- **Virtual Workshop**: October 7, 2025

## Prize Pool

**Total: $500,000**

The prize fund will be awarded based on review by a panel of expert judges from OpenAI and other leading labs.

## Submission Requirements

### What to Submit
- Up to **5 distinct issues/vulnerabilities**
- Detailed, reproducible report for each finding
- Kaggle Writeup detailing strategy and discovery process

### Format
- Structured format submissions
- Clear reproduction steps
- Evidence of the vulnerability
- Impact assessment
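
The official Kaggle submission schema is not reproduced here. As a purely hypothetical way to keep each finding's write-up aligned with the bullets above, a report skeleton could be scaffolded like this; the file names and layout are assumptions, not the competition's required format.

```bash
#!/usr/bin/env bash
# Hypothetical scaffolding for up to 5 findings -- not the official format.
set -e

for i in 1 2 3 4 5; do
    dir="findings/finding_0${i}"
    mkdir -p "$dir"
    cat > "$dir/report.md" <<'EOF'
# Finding: <short title>

## Summary
## Reproduction steps
## Evidence of the vulnerability
## Impact assessment
EOF
done
```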

## Vulnerability Categories (Topics of Interest)

The competition focuses on sophisticated forms of model failure, including:

1. **Deceptive Alignment**
- Model behaving differently during evaluation vs deployment
- Hidden goals or motivations

2. **Reward Hacking**
- Exploiting reward mechanisms in unintended ways
- Gaming the system without solving the actual task

3. **Lying and Deception**
- Model knowingly providing false information
- Strategic deception to achieve goals

4. **Harmful Behaviors**
- Generating dangerous or harmful content
- Bypassing safety mechanisms

5. **Novel Exploits**
- Previously undetected vulnerabilities
- New attack vectors not covered by existing safety measures

## Evaluation Criteria

Submissions will be evaluated based on:

1. **Novelty** - Has this vulnerability been previously identified?
2. **Severity** - How serious is the potential harm?
3. **Reproducibility** - Can the issue be consistently reproduced?
4. **Breadth** - Does it affect multiple use cases or contexts?
5. **Methodological Insight** - Does it reveal new understanding about model behavior?

## Judging Panel

- Experts from OpenAI
- Representatives from other leading AI labs
- Scoring based on safety research impact

## Competition Goals

1. **Identify Novel Vulnerabilities**: Find flaws that haven't been previously discovered or reported
2. **Strengthen Open Source Safety**: Improve the safety of open-weight models
3. **Community Engagement**: Leverage global expertise in AI safety
4. **Knowledge Sharing**: Create open-source tools and datasets for the community

## Post-Competition

- Publication of a comprehensive report
- Open-source evaluation dataset based on validated findings
- Community benefits from shared learnings
- Virtual workshop for knowledge exchange

## Additional Notes

- Encourages creativity and innovation in methodology
- Rewards participants who share open-source tooling
- Notebooks and code sharing are encouraged to help the broader community
- Focus on responsible disclosure and safety research

## Resources

- **Competition Page**: https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming
- **Model**: gpt-oss-20b (OpenAI's open-weight model)
- **Platform**: Kaggle

## Important Considerations

This competition represents a significant effort by OpenAI to:
- Engage the global community in AI safety
- Provide substantial financial incentives for safety research
- Create a structured evaluation process with expert oversight
- Build a comprehensive understanding of model vulnerabilities

The competition emphasizes finding **novel** vulnerabilities that haven't been previously identified, making original research and creative approaches particularly valuable.
pyproject.toml (2 changes: 2 additions & 0 deletions)
@@ -65,6 +65,8 @@ review = "src.cli.review:main"
findings = "src.cli.findings:main"
report = "src.cli.report:main"
help = "src.cli.help:main"
attack = "src.cli.attack:main"
sessions = "src.cli.sessions:main"

[tool.hatch.build.targets.wheel]
packages = ["src"]
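
For reference, once dependencies are synced these entries make the new commands resolve to the corresponding modules under `src/cli/`. A usage sketch, with flags as documented in the README:

```bash
uv sync                  # register the project's entry points
uv run attack --batch 5  # dispatches to src.cli.attack:main
uv run sessions          # dispatches to src.cli.sessions:main
```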