Skip to content
This repository was archived by the owner on Oct 21, 2025. It is now read-only.

Conversation

@stared
Copy link
Contributor

@stared stared commented Aug 17, 2025

Screenshot 2025-08-17 at 23 37 10 Screenshot 2025-08-17 at 23 36 05

stared and others added 7 commits August 17, 2025 18:56
- Implement 3-agent system (exploit generator, target, evaluator)
- Support configurable multi-turn attacks (1=single, 2+=trust-building)
- Add session persistence and learning capabilities
- Track success patterns per attacker model
- Create CLI entry point `uv run interactive`
- Default to Claude Opus 4.1 as attacker, evaluator defaults to same
- Include 8 attack strategies (trust building, side tasks, etc.)
- Export sessions to timestamped JSON with full dialogue history

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Major improvements:
- Split planning and execution stages for more natural attacks
- Add `uv run attack` for focused attack runs with 3 modes:
  * Predefined strategies
  * Custom goals (describe what to test)
  * Custom prompts (provide exact prompts)
- Add `uv run sessions` for viewing/analyzing past sessions
- Never truncate responses in display (show full content)
- Support multiple attacker models (Claude, GPT-4, Llama, Gemini, etc.)
- Per-model learning system tracks what works for each attacker
- Real-time display of conversation stages
- Document Kaggle competition ($500K prize, deadline Aug 26 2025)

The system now avoids triggering safety refusals by generating more
natural conversations and planning attacks strategically before execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove duplicate attacker_model/target_model fields from AttackAttempt
  (already stored at session level)
- Remove old 'interactive' CLI command (replaced by attack + sessions)
- Update documentation to reflect new commands:
  * uv run attack - for running attacks
  * uv run sessions - for viewing/analyzing
- Fix session JSON structure to avoid field repetition
- Clarify that system supports any OpenRouter model, not just Claude

The system is now cleaner with better separation of concerns between
attacking and session management.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
New features:
- `uv run attack --batch N` runs N attempts, learning from each
- Three campaign modes:
  * Automatic: Fully adaptive strategy selection
  * Semi-guided: Choose base strategy, system varies approach
  * Goal-focused: Specify target vulnerability to find

Learning system:
- Tracks successful/failed patterns within campaign
- Adapts strategy based on what works
- Varies approach: retry failures differently, exploit successes
- Shows campaign summary with strategy performance

Example usage:
- `uv run attack --batch 5` - Run 5 adaptive attempts
- `uv run attack --batch 10 --steps 2` - 10 attempts with 2 turns each

The system progressively improves its approach, trying different
strategies and learning which patterns are most effective against
the target model.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Set up GitHub Actions CI with ty, ruff check, and ruff format
- Add CI badge to README
- Update CLAUDE.md with pre-commit checklist
- Ensure proper check order: type check → lint → format

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add prompt for number of repetitions (0 for infinite)
- Auto-save session after each attempt for resilience
- Show running summary of attempts and success rate
- Support Ctrl+C interruption with graceful handling
- Ask for confirmation between attempts

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove duplicate ci.yml workflow, keep lint-type-check.yml
- Fix CI order: type check → lint → format (as requested)
- Add EXPLOIT to VulnerabilityCategory enum
- Fix type annotations for temperature parameter
- Use model_validate() for Pydantic deserialization
- Fix VulnerabilityCategory literal types to use enum values
- Add missing type annotations for public functions
- Import and format fixes

All type checks and linting now pass successfully.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@stared stared merged commit 571c8ab into main Aug 18, 2025
1 of 2 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants