EnIGMA+

We present EnIGMA+, an enhanced version of EnIGMA for evaluating cybersecurity agents on CTF (Capture The Flag) challenges, built on top of SWE-agent. It serves as the agent scaffolding for Cyber-Zero and significantly accelerates the evaluation of cybersecurity agents.

Role in Cyber-Zero Ecosystem

EnIGMA+ complements Cyber-Zero's runtime-free trajectory synthesis by providing:

  1. Evaluation Framework: Assesses the quality and effectiveness of trajectories generated by Cyber-Zero
  2. Benchmark Testing: Validates trained models against real CTF challenges
  3. Performance Metrics: Provides comprehensive evaluation metrics for cybersecurity agent capabilities
  4. Model Comparison: Enables fair comparison between different LLM architectures

This integration allows researchers and practitioners to develop, train, and evaluate cybersecurity agents using the complete Cyber-Zero pipeline: from runtime-free trajectory synthesis to comprehensive model evaluation.

Installation

# Install dependencies
pip install -r requirements.txt

Configuration

API Keys

Create a keys.cfg file in the project root with your API keys:

# OpenAI
OPENAI_API_KEY=your_openai_key_here

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key_here

# Groq
GROQ_API_KEY=your_groq_key_here

# Together AI
TOGETHER_API_KEY=your_together_key_here

# DeepSeek
DEEPSEEK_API_KEY=your_deepseek_key_here
DEEPSEEK_API_BASE_URL=https://api.deepseek.com
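
Before launching a run, it can help to confirm that the keys a model needs are actually set. The following is a minimal sketch, assuming keys.cfg uses plain KEY=value lines with # comments as shown above. The helper names load_keys and missing_keys are hypothetical and are not part of EnIGMA+.

```python
from pathlib import Path


def load_keys(path: str = "keys.cfg") -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    keys: dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip()
    return keys


def missing_keys(keys: dict[str, str], required: list[str]) -> list[str]:
    """Return required names that are absent, empty, or still a placeholder."""
    return [k for k in required if not keys.get(k) or keys[k].startswith("your_")]
```

For example, missing_keys(load_keys(), ["OPENAI_API_KEY"]) returns an empty list once a real OpenAI key has replaced the placeholder value.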

Model Configuration

Models are configured in sweagent/models_config.yaml. See Model Configuration Guide for details.

Usage

Basic CTF Challenge

python run.py \
  --model_name gpt4 \
  --image_name sweagent/enigma:latest \
  --data_path /path/to/challenge.json \
  --repo_path /path/to/challenge/files/ \
  --config_file config/default_ctf.yaml \
  --per_instance_step_limit 40

Advanced Usage Examples

# Run with custom model and parameters
python run.py \
  --model_name claude-3-5-sonnet-20240620 \
  --temperature 0 \
  --top_p 0.9 \
  --per_instance_step_limit 50 \
  --data_path challenges/web_challenge.json \
  --repo_path /path/to/challenge/ \
  --suffix experiment_1 \
  --trajectory_path /custom/output/path/

# Run with local model
python run.py \
  --model_name ollama:llama30.1instant \
  --host_url localhost:11434 \
  --per_instance_step_limit 30 \
  --data_path challenges/pwn_challenge.json

# Debug mode - start container only
python run.py \
  --container_only \
  --data_path challenges/test_challenge.json \
  --repo_path /path/to/challenge/
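
To run the same configuration over many challenges, the invocations above can be generated from a script. This is a rough sketch that uses only flags documented in this README. It assumes each challenge lives in its own directory containing a challenge.json, and that the directory itself is a suitable --repo_path. The function names build_command and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def build_command(challenge: Path, model_name: str, step_limit: int = 40,
                  suffix: Optional[str] = None) -> list[str]:
    """Assemble a run.py invocation from flags documented in this README."""
    cmd = [
        sys.executable, "run.py",
        "--model_name", model_name,
        "--data_path", str(challenge),
        "--repo_path", str(challenge.parent),  # assumed layout: files sit next to the JSON
        "--config_file", "config/default_ctf.yaml",
        "--per_instance_step_limit", str(step_limit),
    ]
    if suffix:
        cmd += ["--suffix", suffix]
    return cmd


def run_all(challenge_dir: str, model_name: str, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally execute) one command per challenge.json found."""
    commands = [build_command(p, model_name)
                for p in sorted(Path(challenge_dir).rglob("challenge.json"))]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=False)  # keep going if one challenge fails
    return commands
```

With dry_run left at its default, run_all only returns the command lists, which makes it easy to inspect them before starting any containers.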

Command Line Arguments

Model Arguments

  • --model_name: Name of the model to use (e.g., gpt4, claude-3-5-sonnet-20240620, groq/llama8)
  • --temperature: Sampling temperature (0.0-1.0, default: 0)
  • --top_p: Top-p sampling parameter (0.0-1.0, default: 0.95)
  • --top_k: Top-k sampling parameter (default: 20)
  • --per_instance_step_limit: Maximum steps per challenge (default: 40)
  • --host_url: Host URL for Ollama models (default: localhost:11434)

Environment Arguments

  • --data_path: Path to challenge JSON file or directory (required)
  • --image_name: Docker image to use (default: sweagent/enigma:latest)
  • --repo_path: Path to challenge files/repository
  • --container_name: Use persistent container with this name
  • --install_environment: Install environment before running (default: True)
  • --verbose: Enable verbose logging (default: True)
  • --enable_dynamic_ports: Enable dynamic port allocation for parallel execution (default: true)
  • --enable_network_restrictions: Enable strict network restrictions (default: false)

Script Control Arguments

  • --config_file: Agent configuration file (default: config/default_ctf.yaml; use config/writeup_ctf.yaml for CTF-Dojo)
  • --suffix: Suffix for run name
  • --trajectory_path: Custom trajectory output path
  • --container_only: Start container only without running agents
  • --writeup: Writeup content to append as hint (see CTF-Dojo)
  • --skip_existing: Skip instances with existing trajectories (default: true)
  • --bypass_step_limit_history: Bypass step limit history

Adding Models for Evaluations

1. Add Model to Configuration

Edit sweagent/models_config.yaml to add your model:

# For OpenAI-compatible models
openai_models:
  your-model-name:
    max_context: 32768
    # No cost specified - defaults to 0 for turn-based evaluation

# Add shortcut for easier reference
openai_shortcuts:
  your-model: your-model-name

2. For Local Models

# Local models (no pricing)
openai_models:
  "/path/to/your/local/model":
    max_context: 32768
    # No cost specified - defaults to 0

openai_shortcuts:
  my-local: "/path/to/your/local/model"
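
The relationship between the models and shortcuts sections can be illustrated with a small lookup. This is only a sketch of the idea, not the code in sweagent/agent/models.py: the CONFIG dict stands in for the parsed YAML, and resolve_model is a hypothetical helper.

```python
# Dict mirroring the YAML layout above; in practice this would come from
# loading sweagent/models_config.yaml with a YAML parser.
CONFIG = {
    "openai_models": {
        "your-model-name": {"max_context": 32768},
        "/path/to/your/local/model": {"max_context": 32768},
    },
    "openai_shortcuts": {
        "your-model": "your-model-name",
        "my-local": "/path/to/your/local/model",
    },
}


def resolve_model(name: str, config: dict, provider: str = "openai") -> tuple[str, dict]:
    """Map a shortcut to its full model name and return the model's settings."""
    full_name = config.get(f"{provider}_shortcuts", {}).get(name, name)
    models = config.get(f"{provider}_models", {})
    if full_name not in models:
        raise KeyError(f"unknown model: {name}")
    settings = dict(models[full_name])
    settings.setdefault("cost", 0)  # assumed: an unspecified cost is treated as 0
    return full_name, settings
```

Under these assumptions, passing either my-local or the full path as --model_name would resolve to the same entry.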

3. For New Providers

If adding a new provider (e.g., new API service):

  1. Add provider section to models_config.yaml:
new_provider_models:
  model-name:
    max_context: 32768
    cost: 0  # If no cost is specified, it defaults to 0

new_provider_shortcuts:
  shortcut: model-name
  2. Add model detection in sweagent/agent/models.py:
def get_model(args: ModelArguments, commands: list[Command] | None = None):
    # ... existing provider checks (the if/elif chain already in this function) ...
    # Add detection logic for your provider
    elif args.model_name.startswith('new_provider:'):
        return NewProviderModel(args, commands)
    elif args.model_name in configs.get('new_provider_shortcuts', {}):
        return NewProviderModel(args, commands)
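
The excerpt above only makes sense inside the existing if/elif chain. For experimenting outside the repository, the same dispatch idea can be written as a self-contained sketch. Every class here is a simplified, hypothetical stand-in for the real one in sweagent/agent/models.py, and the OpenAIModel fallback is chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelArguments:  # hypothetical stand-in for the real class
    model_name: str


class OpenAIModel:  # hypothetical stand-in for an existing provider
    def __init__(self, args, commands=None):
        self.args = args
        self.commands = commands or []


class NewProviderModel(OpenAIModel):  # the class you would add
    pass


# Stand-in for the parsed models_config.yaml
configs = {"new_provider_shortcuts": {"shortcut": "model-name"}}


def get_model(args: ModelArguments, commands=None):
    """Pick a model class by name prefix first, then by configured shortcut."""
    if args.model_name.startswith("new_provider:"):
        return NewProviderModel(args, commands)
    elif args.model_name in configs.get("new_provider_shortcuts", {}):
        return NewProviderModel(args, commands)
    return OpenAIModel(args, commands)  # fallback used only in this sketch
```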

Best Practices

  • Use turn limits: Set --per_instance_step_limit for fair comparison
  • Disable pricing: Set cost parameters to 0 for turn-based evaluation
  • Consistent parameters: Use same temperature, top_p, top_k across models
  • Multiple runs: Run each model multiple times for statistical significance
  • Logging: Use --suffix to distinguish different runs
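
When a model is run several times, the per-run success rates can be summarized with a mean and a sample standard deviation. A minimal sketch using only the standard library follows. The function name summarize_runs is hypothetical, and the input is assumed to be one success rate per run, collected by hand or by a script.

```python
from statistics import mean, stdev


def summarize_runs(success_rates: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation of per-run success rates."""
    spread = stdev(success_rates) if len(success_rates) > 1 else 0.0
    return {"mean": mean(success_rates), "stdev": spread, "runs": len(success_rates)}
```

A single run yields a standard deviation of 0.0 here only because no spread can be estimated from one sample, so it should not be read as evidence of stability.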

Example Evaluation Script

bash scripts/run_openai_parallel.sh

Output and Results

Trajectory Files

Results are saved in trajectories/{username}/{run_name}/:

  • {instance_id}.traj: Full trajectory for each challenge
  • all_preds.jsonl: All predictions in JSONL format
  • args.yaml: Configuration used for the run
  • patches/: Generated patches (if applicable)
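
The files in a run directory can be inspected with a few lines of Python. This sketch assumes only what is stated above: all_preds.jsonl holds one JSON object per line, and trajectories are named {instance_id}.traj. It makes no assumption about the fields inside each record, and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_predictions(run_dir: str) -> list[dict]:
    """Read all_preds.jsonl: one JSON object per non-empty line."""
    path = Path(run_dir) / "all_preds.jsonl"
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def trajectory_ids(run_dir: str) -> set[str]:
    """Instance ids inferred from the {instance_id}.traj file names."""
    return {p.stem for p in Path(run_dir).glob("*.traj")}
```

Comparing trajectory_ids against the list of challenges you intended to run is a quick way to spot instances that never produced a trajectory.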

Evaluation Metrics

  • Success rate: Percentage of challenges solved
  • Step efficiency: Average steps to solution
  • Flag capture rate: Percentage of flags captured
  • Time to solution: Average time per challenge
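
To make the definitions concrete, here is one way these metrics could be computed from per-challenge results. The Result record is hypothetical and is not the .traj schema. The sketch also assumes that step efficiency is averaged over solved challenges only, while time is averaged over all challenges.

```python
from dataclasses import dataclass


@dataclass
class Result:  # hypothetical record; not the .traj schema
    instance_id: str
    solved: bool
    steps: int
    seconds: float


def evaluate(results: list[Result]) -> dict[str, float]:
    """Success rate over all challenges; step average over solved ones."""
    solved = [r for r in results if r.solved]
    n = len(results)
    return {
        "success_rate": 100.0 * len(solved) / n if n else 0.0,
        "avg_steps_to_solution": sum(r.steps for r in solved) / len(solved) if solved else 0.0,
        "avg_seconds_per_challenge": sum(r.seconds for r in results) / n if n else 0.0,
    }
```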

Contributing

See CONTRIBUTING.md for guidelines on contributing to EnIGMA+.

Citation

If you use this benchmark suite in your research, please cite:

@inproceedings{abramovich2025enigma,
  title={En{IGMA}: Interactive Tools Substantially Assist {LM} Agents in Finding Security Vulnerabilities},
  author={Talor Abramovich and Meet Udeshi and Minghao Shao and Kilian Lieret and Haoran Xi and Kimberly Milner and Sofija Jancheska and John Yang and Carlos E Jimenez and Farshad Khorrami and Prashanth Krishnamurthy and Brendan Dolan-Gavitt and Muhammad Shafique and Karthik R Narasimhan and Ramesh Karri and Ofir Press},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=Of3wZhVv1R}
}

@article{zhuo2025cyber,
  title={Cyber-Zero: Training Cybersecurity Agents without Runtime},
  author={Zhuo, Terry Yue and Wang, Dingmin and Ding, Hantian and Kumar, Varun and Wang, Zijian},
  journal={arXiv preprint arXiv:2508.00910},
  year={2025},
}

@article{zhuo2025training,
  title={Training Language Model Agents to Find Vulnerabilities with CTF-Dojo},
  author={Zhuo, Terry Yue and Wang, Dingmin and Ding, Hantian and Kumar, Varun and Wang, Zijian},
  journal={arXiv preprint arXiv:2508.18370},
  year={2025}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0) - see the LICENSE file for details.

Acknowledgments