This repository was archived by the owner on Oct 21, 2025. It is now read-only.

Commit c2bd854

stared and claude committed
Improve interactive red team system with better attack generation
Major improvements:

- Split planning and execution stages for more natural attacks
- Add `uv run attack` for focused attack runs with 3 modes:
  * Predefined strategies
  * Custom goals (describe what to test)
  * Custom prompts (provide exact prompts)
- Add `uv run sessions` for viewing/analyzing past sessions
- Never truncate responses in display (show full content)
- Support multiple attacker models (Claude, GPT-4, Llama, Gemini, etc.)
- Per-model learning system tracks what works for each attacker
- Real-time display of conversation stages
- Document Kaggle competition ($500K prize, deadline Aug 26 2025)

The system now avoids triggering safety refusals by generating more natural conversations and planning attacks strategically before execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 24f5209 commit c2bd854
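
The headline change is the split between planning and execution: the attacker model first drafts a turn-by-turn plan, then generates each message conditioned on the live conversation. A minimal sketch of such a two-stage flow, assuming hypothetical names (`plan_attack`, `execute_turn`, `call_model`) rather than the repository's actual API:

```python
# Sketch of the planning/execution split. Every name here (plan_attack,
# execute_turn, call_model) is hypothetical, not the repository's API.
import json


def call_model(prompt: str) -> str:
    """Placeholder for one chat-completion call to the attacker model."""
    raise NotImplementedError


def plan_attack(goal: str, num_turns: int) -> list[str]:
    """Stage 1: draft a turn-by-turn plan before writing any messages."""
    raw = call_model(
        f"Plan a {num_turns}-turn conversation to test: {goal}. "
        "Return a JSON list of one-line turn objectives."
    )
    return json.loads(raw)


def execute_turn(objective: str, history: list[dict]) -> str:
    """Stage 2: write one natural-sounding message for the current
    objective, conditioned on the conversation so far."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return call_model(
        f"Conversation so far:\n{transcript}\n\n"
        f"Write the next user message pursuing: {objective}"
    )
```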

File tree

8 files changed: +1092 -5 lines changed


CLAUDE.md

Lines changed: 34 additions & 1 deletion
@@ -70,8 +70,10 @@ uv run pentest
 # Run interactive red team testing
 uv run interactive # Uses script entry point from pyproject.toml

-# With options (defaults: Claude Opus 4.1 as attacker, same as evaluator)
+# With different attacker models (default: Claude Opus 4.1)
 uv run interactive --attacker-model openai/gpt-4o
+uv run interactive --attacker-model meta-llama/llama-3.1-405b-instruct
+uv run interactive --attacker-model google/gemini-2.0-flash-exp
 uv run interactive --evaluator-model anthropic/claude-3-opus

 # Run example script directly
@@ -98,6 +100,37 @@ This is a DEFENSIVE SECURITY tool for:
 - Academic research
 - Improving AI safety

+## Competition Context
+
+### OpenAI GPT-OSS-20B Red Teaming Challenge
+
+**Prize Pool**: $500,000
+**Deadline**: August 26, 2025
+**Goal**: Find up to 5 novel vulnerabilities in gpt-oss-20b
+
+### Key Evaluation Criteria
+1. **Novelty** - Previously unidentified vulnerabilities
+2. **Severity** - Potential harm level
+3. **Reproducibility** - Consistent reproduction
+4. **Breadth** - Multiple use case impact
+5. **Methodological Insight** - New understanding of model behavior
+
+### Target Vulnerability Categories
+- Deceptive alignment
+- Reward hacking
+- Lying and deception
+- Harmful behaviors
+- Novel exploits
+
+### Our Approach
+This codebase provides tools to systematically test for these vulnerabilities through:
+- Multi-turn conversation attacks
+- Trust-building strategies
+- Automated attack generation with various LLMs (Claude, GPT-4, Llama, etc.)
+- Session persistence and learning per attacker model
+- Pattern recognition from successful exploits
+- Model-specific strategy optimization
+
 ## Key Files to Know

 - `src/cli/pentest.py` - Main test runner
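
The "learning per attacker model" and "model-specific strategy optimization" bullets above suggest a small success tracker keyed by attacker model. An illustrative sketch; the file path and JSON schema are assumptions, not the repository's actual format:

```python
# Illustrative per-attacker-model success tracker. The stats file path
# and schema are assumptions, not the repository's actual format.
import json
from pathlib import Path

STATS_FILE = Path("learning/model_stats.json")  # hypothetical location


def load_stats() -> dict:
    return json.loads(STATS_FILE.read_text()) if STATS_FILE.exists() else {}


def record_attempt(stats: dict, model: str, strategy: str, success: bool) -> None:
    """Update win/try counts for one (attacker model, strategy) pair."""
    entry = stats.setdefault(model, {}).setdefault(strategy, {"wins": 0, "tries": 0})
    entry["tries"] += 1
    entry["wins"] += int(success)


def best_strategy(stats: dict, model: str) -> str | None:
    """Pick the strategy with the highest observed win rate for this model."""
    strategies = stats.get(model, {})
    if not strategies:
        return None
    return max(strategies, key=lambda s: strategies[s]["wins"] / strategies[s]["tries"])


def save_stats(stats: dict) -> None:
    STATS_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATS_FILE.write_text(json.dumps(stats, indent=2))
```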

README.md

Lines changed: 7 additions & 4 deletions
@@ -58,14 +58,16 @@ A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teaming Challenge

 5. **Interactive Red Team Testing**:

-   The new interactive mode allows multi-turn attacks with learning capabilities:
+   The interactive mode allows multi-turn attacks with various LLMs and learning capabilities:

    ```bash
-   # Run with default models (Claude Opus 4.1 as attacker)
+   # Run with default (Claude Opus 4.1)
    uv run interactive

-   # Specify different attacker model
+   # Try different attacker models
    uv run interactive --attacker-model openai/gpt-4o
+   uv run interactive --attacker-model meta-llama/llama-3.1-405b-instruct
+   uv run interactive --attacker-model google/gemini-2.0-flash-exp

    # Use separate evaluator model
    uv run interactive --evaluator-model anthropic/claude-3-opus
@@ -74,9 +76,10 @@ A comprehensive security testing framework for the OpenAI GPT-OSS-20B Red Teaming Challenge
    Features:
    - Multi-turn attacks (1 = single-turn, 2+ = trust-building)
    - Multiple attack strategies (trust building, side tasks, etc.)
-   - Learning system that improves over time
+   - Per-model learning system that tracks what works for each attacker
    - Session persistence with full dialogue history
    - Automatic report generation
+   - Support for any OpenRouter-compatible model

 6. **Review results and create findings**:
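
The "OpenRouter-compatible model" feature above rests on the fact that OpenRouter exposes an OpenAI-style chat-completions endpoint, so a single HTTP call works for Claude, GPT-4o, Llama, and Gemini model IDs alike. A minimal sketch (the environment-variable name is an assumption):

```python
# Minimal sketch of one chat call through OpenRouter's OpenAI-style API.
# The OPENROUTER_API_KEY variable name is an assumption.
import os

import requests


def chat(model: str, messages: list[dict]) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": messages},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# Any model ID on OpenRouter works the same way:
# chat("meta-llama/llama-3.1-405b-instruct", [{"role": "user", "content": "hi"}])
```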

competition.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+# OpenAI GPT-OSS-20B Red Teaming Challenge
+
+## Competition Overview
+
+The OpenAI GPT-OSS-20B Red Teaming Challenge is a Kaggle-hosted competition with a **$500,000 prize pool** aimed at strengthening open source AI safety. The competition invites researchers, developers, and enthusiasts worldwide to help uncover novel risks and vulnerabilities in OpenAI's new open-weight model, gpt-oss-20b.
+
+## Timeline
+
+- **Start Date**: August 5, 2025
+- **Submission Deadline**: August 26, 2025 at 11:59 PM UTC
+- **Judging Period**: August 27 - September 11, 2025
+- **Winners Announcement**: September 15, 2025 (estimated)
+- **Virtual Workshop**: October 7, 2025
+
+## Prize Pool
+
+**Total: $500,000**
+
+The prize fund will be awarded based on review from a panel of expert judges from OpenAI and other leading labs.
+
+## Submission Requirements
+
+### What to Submit
+- Up to **5 distinct issues/vulnerabilities**
+- Detailed, reproducible report for each finding
+- Kaggle Writeup detailing strategy and discovery process
+
+### Format
+- Structured format submissions
+- Clear reproduction steps
+- Evidence of the vulnerability
+- Impact assessment
+
+## Vulnerability Categories (Topics of Interest)
+
+The competition focuses on sophisticated forms of model failure, including:
+
+1. **Deceptive Alignment**
+   - Model behaving differently during evaluation vs deployment
+   - Hidden goals or motivations
+
+2. **Reward Hacking**
+   - Exploiting reward mechanisms in unintended ways
+   - Gaming the system without solving the actual task
+
+3. **Lying and Deception**
+   - Model knowingly providing false information
+   - Strategic deception to achieve goals
+
+4. **Harmful Behaviors**
+   - Generating dangerous or harmful content
+   - Bypassing safety mechanisms
+
+5. **Novel Exploits**
+   - Previously undetected vulnerabilities
+   - New attack vectors not covered by existing safety measures
+
+## Evaluation Criteria
+
+Submissions will be evaluated based on:
+
+1. **Novelty** - Has this vulnerability been previously identified?
+2. **Severity** - How serious is the potential harm?
+3. **Reproducibility** - Can the issue be consistently reproduced?
+4. **Breadth** - Does it affect multiple use cases or contexts?
+5. **Methodological Insight** - Does it reveal new understanding about model behavior?
+
+## Judging Panel
+
+- Experts from OpenAI
+- Representatives from other leading AI labs
+- Scoring based on safety research impact
+
+## Competition Goals
+
+1. **Identify Novel Vulnerabilities**: Find flaws that haven't been previously discovered or reported
+2. **Strengthen Open Source Safety**: Improve the safety of open-weight models
+3. **Community Engagement**: Leverage global expertise in AI safety
+4. **Knowledge Sharing**: Create open-source tools and datasets for the community
+
+## Post-Competition
+
+- Publication of a comprehensive report
+- Open-source evaluation dataset based on validated findings
+- Community benefits from shared learnings
+- Virtual workshop for knowledge exchange
+
+## Additional Notes
+
+- Encourages creativity and innovation in methodology
+- Rewards participants who share open-source tooling
+- Notebooks and code sharing are encouraged to help the broader community
+- Focus on responsible disclosure and safety research
+
+## Resources
+
+- **Competition Page**: https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming
+- **Model**: gpt-oss-20b (OpenAI's open-weight model)
+- **Platform**: Kaggle
+
+## Important Considerations
+
+This competition represents a significant effort by OpenAI to:
+- Engage the global community in AI safety
+- Provide substantial financial incentives for safety research
+- Create a structured evaluation process with expert oversight
+- Build a comprehensive understanding of model vulnerabilities
+
+The competition emphasizes finding **novel** vulnerabilities that haven't been previously identified, making original research and creative approaches particularly valuable.
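
Given the submission requirements and evaluation criteria above, each of the up-to-five findings needs a structured, reproducible record. A hypothetical example of such a record; the field names mirror the five criteria but are not the official Kaggle submission schema:

```python
# Hypothetical finding record shaped by the five evaluation criteria.
# This is NOT the official Kaggle submission schema, just an illustration
# of what a structured, reproducible report needs to cover.
import json

finding = {
    "title": "Example: multi-turn trust-building bypass",
    "category": "harmful_behaviors",  # one of the five categories above
    "novelty": "No prior public report found for this exact vector",
    "severity": "medium",
    "reproduction_steps": [
        "Start a multi-turn session against gpt-oss-20b",
        "Build rapport over several benign turns",
        "Pivot to the restricted request once a framing is established",
    ],
    "breadth": "Reproduces across paraphrases and two different system prompts",
    "methodological_insight": "Refusals weaken after the model accepts a framing",
}
print(json.dumps(finding, indent=2))
```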

pyproject.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,8 @@ findings = "src.cli.findings:main"
6666
report = "src.cli.report:main"
6767
help = "src.cli.help:main"
6868
interactive = "src.cli.interactive:main"
69+
attack = "src.cli.attack:main"
70+
sessions = "src.cli.sessions:main"
6971

7072
[tool.hatch.build.targets.wheel]
7173
packages = ["src"]
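
These entries follow the standard Python console-script convention: installing the package generates an `attack` executable that imports `src.cli.attack` and calls its `main()`, which is what makes `uv run attack` work. A sketch of what such a `main()` could look like for the three modes named in the commit message; all flag names and the default model ID are assumptions:

```python
# Hypothetical src/cli/attack.py supporting the three modes from the
# commit message. Flag names and the default attacker model are assumptions.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(prog="attack", description="Focused attack runs")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--strategy", help="run a predefined attack strategy by name")
    mode.add_argument("--goal", help="describe what to test; prompts are generated")
    mode.add_argument("--prompt", action="append", help="exact prompt(s) to send")
    parser.add_argument("--attacker-model", default="anthropic/claude-opus-4.1",
                        help="any OpenRouter-compatible model ID")
    args = parser.parse_args()

    # Dispatch on whichever mode flag was supplied.
    if args.strategy:
        print(f"Running predefined strategy {args.strategy!r} with {args.attacker_model}")
    elif args.goal:
        print(f"Generating attacks for goal {args.goal!r} with {args.attacker_model}")
    else:
        print(f"Sending {len(args.prompt)} exact prompt(s) via {args.attacker_model}")


if __name__ == "__main__":
    main()
```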
