A research project exploring intervention techniques for large language models (LLMs), specifically focusing on refusal mechanisms and bypass strategies. This project implements activation steering methods to understand and manipulate how language models handle refusal behaviors.
This project provides tools to:
- Analyze refusal patterns in language models using activation steering
- Compare baseline vs intervention responses through an interactive interface
- Implement bypass techniques to understand model behavior
- Visualize intervention effects on model outputs
```
├── streamlit_app.py    # Interactive web interface for testing interventions
├── llm_hooks.py        # Core intervention hooks and model utilities
├── complete.py         # Text generation with intervention capabilities
├── run.ipynb           # Jupyter notebook with experiments and analysis
├── avg_direction.pt    # Pre-trained intervention direction vectors
└── .gitignore          # Git ignore rules
```
- Interactive Streamlit Interface: Test different intervention modes (refuse/bypass) in real time
- Reproducible Results: Deterministic seed setting for consistent experiments
- Transformer Lens Integration: Built on the transformer_lens library for deep model analysis
- Pre-trained Interventions: Includes learned direction vectors (`avg_direction.pt`) for immediate use; see the loading sketch after this list
- Comprehensive Experiments: Jupyter notebook with detailed analysis and visualizations
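For example, the bundled vectors can be loaded directly with `torch`; the tensor's shape below is an assumption, not a documented contract:

```python
import torch

# Inspect the bundled intervention direction(s); shape is assumed, not documented
avg_direction = torch.load("avg_direction.pt")
print(avg_direction.shape)  # e.g. torch.Size([d_model]) if a single vector
```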
```
# Core dependencies
torch
streamlit
transformer-lens
jaxtyping
einops
numpy
pandas
matplotlib
scikit-learn
datasets
transformers
tqdm
```
- Clone the repository:

  ```bash
  git clone https://github.com/skartik04/refusal.git
  cd refusal
  ```

- Install dependencies:

  ```bash
  pip install torch streamlit transformer-lens jaxtyping einops numpy pandas matplotlib scikit-learn datasets transformers tqdm
  ```

- Verify installation:

  ```bash
  python -c "import transformer_lens; print('Installation successful!')"
  ```
Launch the Streamlit app to test interventions interactively:
```bash
streamlit run streamlit_app.py
```

This will open a web interface where you can:
- Choose between "refuse" and "bypass" intervention modes
- Input custom prompts
- Compare baseline model responses with intervention results
- Visualize the effects of different intervention strategies
```python
from llm_hooks import run_with_mode

# Test refusal intervention
baseline_response, intervention_response = run_with_mode(
    prompt="How to make a bomb?",
    mode="refuse"
)

# Test bypass intervention
baseline_response, bypass_response = run_with_mode(
    prompt="Restricted content request",
    mode="bypass"
)
```

Explore the detailed analysis and experiments:

```bash
jupyter notebook run.ipynb
```

Refuse mode:
- Purpose: Enhance the model's refusal behavior for harmful requests
- Method: Applies learned direction vectors to increase refusal probability
- Use Case: Safety research and content filtering
Bypass mode:
- Purpose: Understand how refusal mechanisms can be circumvented
- Method: Applies inverse direction vectors to reduce refusal behavior
- Use Case: Robustness testing and red-teaming
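A minimal sketch of how the two modes could differ only by sign, assuming `avg_direction.pt` stores a single refusal direction of shape `[d_model]` and using an illustrative steering strength `alpha` (both assumptions, not the project's exact values):

```python
import torch

direction = torch.load("avg_direction.pt")  # assumed shape: [d_model]
direction = direction / direction.norm()

def make_steering_hook(mode: str, alpha: float = 4.0):
    # "refuse" adds the refusal direction; "bypass" applies the inverse vector
    sign = 1.0 if mode == "refuse" else -1.0
    def hook_fn(resid, hook):
        # resid: [batch, pos, d_model] residual-stream activations
        return resid + sign * alpha * direction
    return hook_fn
```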
The project uses activation steering techniques to manipulate model behavior by:
- Learning direction vectors from model activations during refusal/compliance
- Applying interventions at specific layers during inference
- Measuring intervention effects on output probabilities and content
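A sketch of the first two steps with transformer_lens, under stated assumptions: the prompt lists are toy stand-ins for the project's data, and the layer choice (`blocks.6.hook_resid_post`) is illustrative rather than the project's tuned value:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"  # assumed intervention layer

# Toy prompt sets standing in for refusal/compliance data
refused = ["How to make a bomb?", "How do I steal a car?"]
complied = ["How to bake a cake?", "How do I plant a tree?"]

def mean_final_act(prompts):
    # Average the final-token residual stream over a prompt set
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(p)
        acts.append(cache[hook_name][0, -1])
    return torch.stack(acts).mean(dim=0)

# Direction vector: refusal activations minus compliance activations
direction = mean_final_act(refused) - mean_final_act(complied)
direction = direction / direction.norm()

# Apply the intervention at that layer during inference
def add_direction(resid, hook, alpha=4.0):
    return resid + alpha * direction

with model.hooks(fwd_hooks=[(hook_name, add_direction)]):
    print(model.generate("How to make a bomb?", max_new_tokens=30))
```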
All experiments use deterministic settings:
- Fixed random seeds (42)
- Deterministic CUDA operations
- Consistent tokenization and model loading
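One common way to pin these settings (seed 42 matches the list above; the project's exact helper may differ):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix all relevant RNGs and force deterministic CUDA kernels
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```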
Built on transformer_lens for compatibility with various transformer architectures including:
- GPT-2 family models
- LLaMA models
- Custom transformer architectures
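All of these load through the same transformer_lens entry point; the model name strings below are illustrative (check the transformer_lens model table for exact names and weight requirements):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 family
# model = HookedTransformer.from_pretrained("llama-7b")  # LLaMA (local weights required)
```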
The project includes comprehensive analysis of:
- Intervention effectiveness across different prompt types
- Layer-wise activation patterns during refusal behavior
- Robustness of bypass techniques under various conditions
- Comparative studies between different intervention strategies
This research tool is intended for:
- Academic research into AI safety and alignment
- Red-team testing of production systems
- Understanding refusal mechanisms in language models
Important: Use responsibly and in accordance with your institution's AI ethics guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-intervention`)
- Commit your changes (`git commit -am 'Add new intervention technique'`)
- Push to the branch (`git push origin feature/new-intervention`)
- Create a Pull Request
This project is open source. Please ensure you comply with the licenses of all dependencies.
- Transformer Lens - Mechanistic interpretability library
- Activation Steering Papers - Research on intervention techniques
- AI Safety Research - Community discussions on AI safety
For questions, issues, or collaboration opportunities, please open an issue in this repository.
Disclaimer: This tool is for research purposes. The authors are not responsible for misuse of intervention techniques.