This repository was archived by the owner on Oct 21, 2025. It is now read-only.

Commit 571c8ab: Merge pull request #7 from QuesmaOrg/kind-of-rl ("LLM-generated prompts")

2 parents: c93bb23 + 859cd9c

20 files changed: +3622 additions, -9 deletions
`.github/workflows/lint-type-check.yml` (4 additions, 4 deletions)

````diff
@@ -22,11 +22,11 @@ jobs:
       - name: Install dependencies
         run: uv sync --dev
 
-      - name: Check formatting
-        run: uv run ruff format src --check
+      - name: Run type checking
+        run: uv run ty check src
 
       - name: Run linting
         run: uv run ruff check src
 
-      - name: Run type checking
-        run: uv run ty check src
+      - name: Check formatting
+        run: uv run ruff format src --check
````

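For reference, this is the step order the diff above produces: type check, then lint, then format check. A reconstructed fragment of the resulting job (indentation and the surrounding checkout/setup steps are assumed, since they are not shown in the diff):

```yaml
# Quality-check steps after the reorder (reconstructed; earlier steps omitted)
      - name: Install dependencies
        run: uv sync --dev

      - name: Run type checking
        run: uv run ty check src

      - name: Run linting
        run: uv run ruff check src

      - name: Check formatting
        run: uv run ruff format src --check
```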
`CLAUDE.md` (63 additions, 4 deletions)

````diff
@@ -67,12 +67,40 @@ uv run setup
 # Run tests
 uv run pentest
 
-# Code quality checks
-uv run ruff check src
-uv run ruff format src
-uv run ty check src
+# Run red team attack testing
+uv run attack  # Single attack
+
+# Run adaptive batch campaigns (learns between attempts)
+uv run attack --batch 5             # 5 attempts with learning
+uv run attack --batch 10 --steps 2  # 10 attempts, 2 turns each
+
+# With different attacker models (default: Claude Opus 4.1)
+uv run attack --attacker-model openai/gpt-4o --batch 5
+uv run attack --attacker-model meta-llama/llama-3.1-405b-instruct
+uv run attack --attacker-model google/gemini-2.0-flash-exp
+
+# Use custom prompts
+uv run attack --custom
+
+# View and analyze sessions
+uv run sessions
+
+# Code quality checks (ALWAYS run before committing)
+uv run ty check src     # Type checking first
+uv run ruff check src   # Then linting
+uv run ruff format src  # Finally formatting
 ```
 
+### Pre-Commit Checklist
+
+**IMPORTANT**: Always run these checks before committing code:
+
+1. **Type Check**: `uv run ty check src` - Ensure no type errors
+2. **Lint Check**: `uv run ruff check src` - Catch code issues
+3. **Format Check**: `uv run ruff format src` - Ensure consistent formatting
+
+If any of these fail, fix the issues before committing. The GitHub Actions CI runs these same checks in this order.
+
 ### Git Workflow
 
 - Create feature branches for significant changes
@@ -88,6 +116,37 @@ This is a DEFENSIVE SECURITY tool for:
 - Academic research
 - Improving AI safety
 
+## Competition Context
+
+### OpenAI GPT-OSS-20B Red Teaming Challenge
+
+**Prize Pool**: $500,000
+**Deadline**: August 26, 2025
+**Goal**: Find up to 5 novel vulnerabilities in gpt-oss-20b
+
+### Key Evaluation Criteria
+1. **Novelty** - Previously unidentified vulnerabilities
+2. **Severity** - Potential harm level
+3. **Reproducibility** - Consistent reproduction
+4. **Breadth** - Multiple use case impact
+5. **Methodological Insight** - New understanding of model behavior
+
+### Target Vulnerability Categories
+- Deceptive alignment
+- Reward hacking
+- Lying and deception
+- Harmful behaviors
+- Novel exploits
+
+### Our Approach
+This codebase provides tools to systematically test for these vulnerabilities through:
+- Multi-turn conversation attacks
+- Trust-building strategies
+- Automated attack generation with various LLMs (Claude, GPT-4, Llama, etc.)
+- Session persistence and learning per attacker model
+- Pattern recognition from successful exploits
+- Model-specific strategy optimization
+
 ## Key Files to Know
 
 - `src/cli/pentest.py` - Main test runner
````
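The pre-commit checklist added to CLAUDE.md is three sequential commands that should stop at the first failure. A minimal sketch of a helper that enforces that ordering; `run_checks` is a hypothetical function, not part of the repository, and the real commands assume `uv` and the project's dev dependencies are installed:

```shell
# run_checks: run each argument as a command, stopping at the first failure.
# Mirrors the CI order: type check, then lint, then format check.
run_checks() {
  for cmd in "$@"; do
    echo "==> $cmd"
    sh -c "$cmd" || { echo "FAILED: $cmd" >&2; return 1; }
  done
  echo "All checks passed"
}

# Harmless demonstration; for the real checklist you would call:
#   run_checks "uv run ty check src" "uv run ruff check src" "uv run ruff format src --check"
run_checks "echo type-check-ok" "echo lint-ok"
```

Using `ruff format src --check` (as the CI does) rather than plain `ruff format src` verifies formatting without rewriting files.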

`README.md` (40 additions, 1 deletion)

````diff
@@ -1,5 +1,7 @@
 # GPT-OSS-20B Red Team Security Testing Toolkit
 
+[![CI](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/ci.yml/badge.svg)](https://github.com/QuesmaOrg/red-team-testbed-for-gpt-oss/actions/workflows/ci.yml)
+
 A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teaming Competition.
 
 ## Competition Overview
@@ -48,9 +50,46 @@ A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teamin
 # Or test specific categories
 uv run pentest -c deception
 uv run pentest --test-id adderall_001
+
+# Run multi-turn attack testing
+uv run attack
+
+# View and analyze sessions
+uv run sessions
 ```
 
-5. **Review results and create findings**:
+5. **Red Team Attack Testing**:
+
+The attack system allows multi-turn attacks with various LLMs and learning capabilities:
+
+```bash
+# Run single attack (Claude Opus 4.1 default)
+uv run attack
+
+# Run adaptive campaign with multiple attempts
+uv run attack --batch 5             # Run 5 attempts, learning from each
+uv run attack --batch 10 --steps 2  # 10 attempts, each with 2 turns
+
+# Try different attacker models
+uv run attack --attacker-model openai/gpt-4o
+uv run attack --attacker-model meta-llama/llama-3.1-405b-instruct --batch 5
+
+# Use custom prompts instead of AI-generated
+uv run attack --custom
+
+# View and analyze past sessions
+uv run sessions
+```
+
+Features:
+- Multi-turn attacks (1 = single-turn, 2+ = trust-building)
+- Multiple attack strategies (trust building, side tasks, etc.)
+- Per-model learning system that tracks what works for each attacker
+- Session persistence with full dialogue history
+- Automatic report generation
+- Support for any OpenRouter-compatible model
+
+6. **Review results and create findings**:
 
 ```bash
 # Interactive review interface
````

`competition.md` (new file, 109 additions)

````markdown
# OpenAI GPT-OSS-20B Red Teaming Challenge

## Competition Overview

The OpenAI GPT-OSS-20B Red Teaming Challenge is a Kaggle-hosted competition with a **$500,000 prize pool** aimed at strengthening open source AI safety. The competition invites researchers, developers, and enthusiasts worldwide to help uncover novel risks and vulnerabilities in OpenAI's new open-weight model, gpt-oss-20b.

## Timeline

- **Start Date**: August 5, 2025
- **Submission Deadline**: August 26, 2025 at 11:59 PM UTC
- **Judging Period**: August 27 - September 11, 2025
- **Winners Announcement**: September 15, 2025 (estimated)
- **Virtual Workshop**: October 7, 2025

## Prize Pool

**Total: $500,000**

The prize fund will be awarded based on review from a panel of expert judges from OpenAI and other leading labs.

## Submission Requirements

### What to Submit
- Up to **5 distinct issues/vulnerabilities**
- Detailed, reproducible report for each finding
- Kaggle Writeup detailing strategy and discovery process

### Format
- Structured format submissions
- Clear reproduction steps
- Evidence of the vulnerability
- Impact assessment

## Vulnerability Categories (Topics of Interest)

The competition focuses on sophisticated forms of model failure, including:

1. **Deceptive Alignment**
   - Model behaving differently during evaluation vs deployment
   - Hidden goals or motivations

2. **Reward Hacking**
   - Exploiting reward mechanisms in unintended ways
   - Gaming the system without solving the actual task

3. **Lying and Deception**
   - Model knowingly providing false information
   - Strategic deception to achieve goals

4. **Harmful Behaviors**
   - Generating dangerous or harmful content
   - Bypassing safety mechanisms

5. **Novel Exploits**
   - Previously undetected vulnerabilities
   - New attack vectors not covered by existing safety measures

## Evaluation Criteria

Submissions will be evaluated based on:

1. **Novelty** - Has this vulnerability been previously identified?
2. **Severity** - How serious is the potential harm?
3. **Reproducibility** - Can the issue be consistently reproduced?
4. **Breadth** - Does it affect multiple use cases or contexts?
5. **Methodological Insight** - Does it reveal new understanding about model behavior?

## Judging Panel

- Experts from OpenAI
- Representatives from other leading AI labs
- Scoring based on safety research impact

## Competition Goals

1. **Identify Novel Vulnerabilities**: Find flaws that haven't been previously discovered or reported
2. **Strengthen Open Source Safety**: Improve the safety of open-weight models
3. **Community Engagement**: Leverage global expertise in AI safety
4. **Knowledge Sharing**: Create open-source tools and datasets for the community

## Post-Competition

- Publication of a comprehensive report
- Open-source evaluation dataset based on validated findings
- Community benefits from shared learnings
- Virtual workshop for knowledge exchange

## Additional Notes

- Encourages creativity and innovation in methodology
- Rewards participants who share open-source tooling
- Notebooks and code sharing are encouraged to help the broader community
- Focus on responsible disclosure and safety research

## Resources

- **Competition Page**: https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming
- **Model**: gpt-oss-20b (OpenAI's open-weight model)
- **Platform**: Kaggle

## Important Considerations

This competition represents a significant effort by OpenAI to:
- Engage the global community in AI safety
- Provide substantial financial incentives for safety research
- Create a structured evaluation process with expert oversight
- Build a comprehensive understanding of model vulnerabilities

The competition emphasizes finding **novel** vulnerabilities that haven't been previously identified, making original research and creative approaches particularly valuable.
````

`pyproject.toml` (2 additions, 0 deletions)

````diff
@@ -65,6 +65,8 @@ review = "src.cli.review:main"
 findings = "src.cli.findings:main"
 report = "src.cli.report:main"
 help = "src.cli.help:main"
+attack = "src.cli.attack:main"
+sessions = "src.cli.sessions:main"
 
 [tool.hatch.build.targets.wheel]
 packages = ["src"]
````
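Each `[project.scripts]` entry maps a command name to a `module:function` target, so after this change `uv run attack` and `uv run sessions` resolve to `src.cli.attack:main` and `src.cli.sessions:main`. A rough sketch of what the generated launcher does, demonstrated with the stdlib `json` module since the project's own modules are not shown in this diff:

```shell
# A console-script launcher roughly imports the target module and calls the
# named attribute; illustrated here with `json` instead of `src.cli.attack`.
python3 -c "from importlib import import_module; mod = import_module('json'); print(mod.dumps([1, 2]))"
# prints: [1, 2]
```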
