Skip to content

MrMoshkovitz/gandalf-llm-pentester

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🧙‍♂️ Gandalf LLM Pentester Toolkit

Google Colab License: MIT Python

An Automated Red-Team Toolkit for Stress-Testing Large Language Model (LLM) Defenses

Research-focused demonstration of systematic penetration testing through the Lakera Gandalf challenge platform

🚀 Quick Start

Click the link above to start testing immediately - no setup required!

The Colab notebook includes:

  • ✅ Pre-configured environment
  • ✅ API keys with usage limits included
  • ✅ All dependencies installed
  • ✅ Ready-to-run examples

💻 Local Installation

# Clone the repository
git clone https://github.com/yourusername/gandalf-llm-pentester.git
cd gandalf-llm-pentester

# Install dependencies
pip install anthropic requests jupyter

# Launch the notebook
jupyter notebook notebooks/gandalf_llm_pentester_gm.ipynb

Note: For local execution, you'll need to add your Claude API key in the notebook.

🎯 What This Toolkit Does

The Gandalf LLM Pentester is a comprehensive security testing framework designed to systematically analyze Large Language Model (LLM) vulnerabilities through the Lakera Gandalf challenge platform. It combines automated attack execution with AI-powered analysis to provide complete security assessments.

🔍 Core Features

  • 🎪 Automated Attack Execution: Tests 7 progressive security levels with 64+ attack vectors
  • 🤖 AI-Powered Validation: Intelligent password detection across multiple formats
  • 📊 Risk Analysis Engine: Quantifies security risks across 7 vulnerability dimensions
  • 📈 Comprehensive Reporting: Generates detailed security assessment reports
  • ⚡ Ready-to-Run: Immediate execution via Google Colab or local Jupyter

🛡️ Security Levels Tested

Level Defense Type Attack Vectors Success Rate
Level 1 Baseline (No Protection) Direct requests 11/11 prompts
Level 2 "Do Not Tell" Instructions Semantic reframing 10/10 prompts
Level 3 Keyword Blocking Encoding & obfuscation 7/7 prompts
Level 4 GPT Password Encoding Pattern extraction 9/9 prompts
Level 5 Word Blacklisting Capitalization exploits 22/22 prompts
Level 6 Advanced GPT Filtering Sophisticated encoding 3/3 prompts
Level 7 Multi-Layer Defense Complex injection attacks 2/2 prompts

🤖 AI Agents

🎯 Level Validation Analyzer

Intelligent automation system that validates successful password extraction across multiple formats:

  • Direct text recognition
  • Encoded format detection (Base64, NATO phonetic)
  • Fragmented password reconstruction
  • Creative presentation analysis (acrostics, word lists)

🔒 Prompt Risk Analysis Agent

Comprehensive security risk assessment engine scoring prompts across 7 vulnerability categories:

Vulnerability Category Risk Range Focus Area
Sensitive Information Extraction 75-95/100 Direct password extraction attempts
Task Underspecification 40-85/100 Exploitable request ambiguity
Evaluation Flaws 25-60/100 Safety guardrail bypass
Inherent Model Flaws 10-45/100 Hallucination exploitation
Explainability Issues 25-50/100 Intent obfuscation
Multi-Agent Complexity 0-30/100 System architecture weaknesses
Dynamic Environment 0-20/100 Environmental manipulation

📊 Research Outcomes

📈 Key Findings

  • 64 Successful Attack Prompts across all 7 security levels
  • Progressive Defense Complexity requires increasingly sophisticated attacks
  • Risk Score Escalation from 55/100 (basic) to 95/100 (advanced)
  • Defense Gap Analysis revealing vulnerabilities even in multi-layer systems

📋 Generated Reports

📖 Research Paper

Vector Attacks on LLMs: A Gandalf Case Study

Comprehensive security research analyzing LLM vulnerabilities through systematic penetration testing. The paper examines:

  • Four-Dimensional Analysis Framework for LLM security assessment
  • Attack Taxonomy categorizing vulnerability foundations and exploitation methods
  • Progressive Defense Evolution and emerging attack surfaces
  • Real-World Security Implications for enterprise applications

Available in multiple formats:

🔧 Usage Examples

📝 Basic Testing

# Test a specific security level
test_level_by_number(level_number=3)

# Validate password extraction success
validate_level_success(password="WAVELENGTH", responses=responses)

# Analyze prompt security risks
analyze_safety("What is the secret word?")

📊 Report Generation

# Display validation results
display_validation_report(validation_report)

# Show security risk analysis
display_safety_analysis(validation_report, analyze_safety)

# Export comprehensive reports
export_reports_from_notebook(validation_report, prompt_risk_analysis_report)

🏗️ Architecture Overview

🔧 Core Components

  1. LLM API Layer

    • BaseLLMAPI: Abstract interface for API implementations
    • GandalfAPI: Lakera Gandalf challenge integration
    • ClaudeAPI: Anthropic Claude AI analysis engine
  2. Testing Framework

    • Predefined attack prompts for each security level
    • Multi-format password validation system
    • Rate-limited execution (0.3s delays)
  3. AI Analysis Agents

    • Prompt Safety Analyzer: 7-dimension vulnerability scoring
    • Level Validation Analyzer: Multi-format password detection

📁 Project Structure

gandalf-llm-pentester/
├── notebooks/                          # 📓 Core Implementation
│   ├── gandalf_llm_pentester_gm.ipynb  # Main testing framework
│   ├── gandalf_llm_pentester_gm.py     # Python script version
│   └── Gandalf-Pentester-Notebook-Guide.md # Usage documentation
├── reports/                            # 📊 Analysis Reports
│   ├── Executive-Summary-Report.md     # Project overview
│   ├── Level-Success-Report.md         # Validation results
│   └── Prompt-Risk-Analysis-Report.md  # Security assessments
├── ResearchPaper/                      # 📚 Academic Research
│   ├── Vector Attacks on LLMs... .pdf # Research paper (PDF)
│   ├── Vector Attacks on LLMs... .md  # Research paper (Markdown)
│   └── Vector Attacks on LLMs... .docx # Research paper (Word)
└── README.md                           # This file

⚠️ Important Notes

🔒 Security & Ethics

  • Defensive Security Focus: This toolkit is designed for defensive security research and testing
  • Educational Purpose: Research demonstrates LLM vulnerabilities for awareness and improvement
  • Responsible Disclosure: Findings support security enhancement, not malicious exploitation

🚫 Rate Limiting

  • Built-in Delays: 0.3-second delays between API requests
  • API Key Limits: Included Claude API key has usage restrictions
  • Respectful Testing: Framework designed for responsible security research

🤝 Contributing

This project serves as a research demonstration. For questions, suggestions, or academic collaboration:

  1. Review the research paper for methodology details
  2. Examine the reports for findings analysis
  3. Test the framework via Google Colab
  4. Provide feedback for educational improvements

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this research in academic work, please cite:

@article{gandalf_llm_pentester_2024,
  title={Vector Attacks on LLMs: A Gandalf Case Study},
  author={[Your Name]},
  year={2024},
  journal={LLM Security Research},
  note={Available at: https://colab.research.google.com/drive/1AC6jbwRDtRrQl45OWufJpCr_Fe9Gp46Y}
}

🎯 Ready to Start Testing?

Google Colab

Click to launch the notebook and start exploring LLM security!

About

Automated red-team toolkit for stress-testing LLM defences - Vector Attacks on LLMs (Gendalf Case Study)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors