CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

seonglae/CorrSteer

Implementation of CorrSteer, a generation-time steering method using correlated Sparse Autoencoder (SAE) features.

🎯 Key Features

  • Correlation-based feature selection from generation-time activations
  • Streaming correlation computation with O(1) memory in the number of samples
  • Multi-layer strategies (CorrSteer-S/A/P)
  • Side Effect Ratio (SER) for measuring unintended changes
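
To illustrate how a correlation can be computed in a streaming fashion with memory that does not grow with the number of samples, here is a minimal sketch. It is not the code in train.py: the class name `StreamingCorrelation` and its methods are hypothetical, and it assumes each sample yields one pooled activation vector `x` and one scalar task score `y`. Only running sums are stored, so memory depends on the number of features and not on how many samples have been seen.

```python
import math
from typing import List


class StreamingCorrelation:
    """Running Pearson correlation between each feature and a scalar score.

    Hypothetical helper for illustration; only running sums are kept.
    """

    def __init__(self, n_features: int) -> None:
        self.n = 0
        self.sum_x = [0.0] * n_features
        self.sum_x2 = [0.0] * n_features
        self.sum_xy = [0.0] * n_features
        self.sum_y = 0.0
        self.sum_y2 = 0.0

    def update(self, x: List[float], y: float) -> None:
        """Fold one sample (feature activations x, task score y) into the sums."""
        self.n += 1
        self.sum_y += y
        self.sum_y2 += y * y
        for j, v in enumerate(x):
            self.sum_x[j] += v
            self.sum_x2[j] += v * v
            self.sum_xy[j] += v * y

    def correlation(self) -> List[float]:
        """Return one Pearson coefficient per feature (0.0 if undefined)."""
        if self.n < 2:
            return [0.0] * len(self.sum_x)
        n = self.n
        var_y = self.sum_y2 - self.sum_y ** 2 / n
        result = []
        for sx, sx2, sxy in zip(self.sum_x, self.sum_x2, self.sum_xy):
            cov = sxy - sx * self.sum_y / n
            var_x = sx2 - sx ** 2 / n
            denom = math.sqrt(max(var_x, 0.0) * max(var_y, 0.0))
            # Constant features or constant scores have no defined correlation.
            result.append(cov / denom if denom > 1e-12 else 0.0)
        return result


acc = StreamingCorrelation(n_features=2)
for x, y in [([1.0, 5.0], 1.0), ([2.0, 3.0], 2.0), ([3.0, 1.0], 3.0)]:
    acc.update(x, y)
print(acc.correlation())
```

The sum-of-products form is chosen for brevity. For very long streams or large activation magnitudes, a Welford-style update is numerically safer.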

🚀 Setup

Install uv, Astral's Python package manager:

pip install uv

Create a virtual environment and install the package in editable mode:

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

📖 Usage

Training

# MMLU with SAE features
python train.py train --model=gemma2b --task=mmlu --layer=global --eval

# MMLU with raw activations
python train.py train --model=gemma2b --task=mmlu --layer=global --raw --eval

# MMLU with mean pooling
python train.py train --model=gemma2b --task=mmlu --layer=global --pool=mean --eval

# BBQ disambiguation
python train.py train --model=gemma2b --task=bbq --layer=global --mask=all --filter_value=disambig --eval

# HarmBench with raw activations
python train.py train --model=gemma2b --task=harmbench --layer=global --raw --eval

# SimpleQA with mean pooling
python train.py train --model=gemma2b --task=simpleqa --layer=global --pool=mean --eval

# GSM8K with mean pooling for both correlation and steering
python train.py train --model=gemma2b --task=gsm8k --layer=foreach --pool=mean --steer_pool=mean --eval
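
The `--pool=mean` and `--steer_pool=mean` flags above suggest that per-token activations are collapsed into one vector per sample. The sketch below shows what such pooling could look like. The function `pool_activations` is hypothetical, and the `"max"` mode is an assumption about the alternative to mean pooling; the README itself only names `mean`.

```python
from typing import List


def pool_activations(token_acts: List[List[float]], mode: str = "max") -> List[float]:
    """Collapse per-token activations (tokens x features) into one vector.

    Hypothetical helper; "max" as the non-mean mode is an assumption.
    """
    if not token_acts:
        raise ValueError("need at least one generated token")
    # Transpose so that each column holds one feature across all tokens.
    columns = list(zip(*token_acts))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Two generated tokens, two features.
acts = [[0.0, 2.0], [4.0, 0.0]]
print(pool_activations(acts, "mean"))  # [2.0, 1.0]
print(pool_activations(acts, "max"))   # [4.0, 2.0]
```

Mean pooling spreads weight over every generated token, which may suit long outputs such as GSM8K reasoning chains, while max pooling keeps the strongest single-token activation.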

Evaluation

# Baseline evaluation
python eval.py baseline --task=mmlu

Multi-Layer Strategies

# CorrSteer-S: Single best feature globally
python train.py train --task=mmlu --layer=global --eval

# CorrSteer-A: Top feature from each layer
python train.py train --task=mmlu --layer=foreach --eval

# CorrSteer-P: Validation-based pruning
python train.py train --task=mmlu --layer=foreach --validate --eval
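
The three strategies differ only in which features survive selection. The following sketch assumes a per-layer list of correlation coefficients is already available. All function names are hypothetical, and the pruning rule (keep a layer only if steering with its feature alone beats the unsteered validation score) is an assumption about what "validation-based pruning" means, not a description of train.py.

```python
from typing import Callable, Dict, List

Selection = Dict[int, int]  # layer index -> chosen feature index


def select_global(corr: Dict[int, List[float]]) -> Selection:
    """CorrSteer-S style: the single most correlated feature across all layers."""
    layer, feat = max(
        ((l, f) for l, row in corr.items() for f in range(len(row))),
        key=lambda lf: corr[lf[0]][lf[1]],
    )
    return {layer: feat}


def select_foreach(corr: Dict[int, List[float]]) -> Selection:
    """CorrSteer-A style: the most correlated feature of every layer."""
    return {l: max(range(len(row)), key=row.__getitem__) for l, row in corr.items()}


def select_pruned(
    corr: Dict[int, List[float]], score: Callable[[Selection], float]
) -> Selection:
    """CorrSteer-P style (assumed rule): drop layers that do not beat the baseline."""
    baseline = score({})  # validation score without any steering
    return {
        l: f for l, f in select_foreach(corr).items() if score({l: f}) > baseline
    }


# Toy correlations for two layers with two features each.
corr = {0: [0.1, 0.5], 1: [0.9, 0.2]}


def toy_score(sel: Selection) -> float:
    """Stand-in for a validation run: layer 1 helps, layer 0 hurts."""
    return 0.5 + (0.1 if 1 in sel else (-0.1 if 0 in sel else 0.0))


print(select_global(corr))             # {1: 0}
print(select_foreach(corr))            # {0: 1, 1: 0}
print(select_pruned(corr, toy_score))  # {1: 0}
```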

📁 Project Structure

corrsteer/
├── config.py       # Dataset and model configurations
├── dataset.py      # Data loading and processing
├── model.py        # Model and SAE integration
├── steer.py        # Steering hooks for inference
└── utils.py        # Utility functions

train.py            # Training with streaming correlation
eval.py             # Evaluation with SER computation
sft.py              # Supervised fine-tuning
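
The structure above lists eval.py as computing the Side Effect Ratio (SER), but this README does not define it. The sketch below assumes one plausible definition: among examples whose correctness changed after steering, the fraction that went from correct to incorrect. The function name and this definition are assumptions for illustration and may differ from what eval.py computes.

```python
from typing import Sequence


def side_effect_ratio(
    baseline_correct: Sequence[bool], steered_correct: Sequence[bool]
) -> float:
    """Assumed SER: degraded examples divided by all changed examples."""
    if len(baseline_correct) != len(steered_correct):
        raise ValueError("both runs must cover the same examples")
    pairs = list(zip(baseline_correct, steered_correct))
    changed = sum(1 for b, s in pairs if b != s)
    degraded = sum(1 for b, s in pairs if b and not s)
    # With no changed examples there are no side effects to report.
    return degraded / changed if changed else 0.0


baseline = [True, True, False, False]
steered = [False, True, True, True]
print(side_effect_ratio(baseline, steered))  # one degraded out of three changed
```

Under this reading, a lower value means that most changes caused by steering are improvements.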

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments
