MOSAIC (Multilingual and Multimodal Observations of Sparse Autoencoders for Interpretable Classification) is a simple, efficient pipeline for extracting model activations, letting you fit linear probes on raw activations or use sparse autoencoder features to train explainable classifiers such as decision trees.
We use a straightforward YAML file structure to allow extraction across layers, different pooling methods, and even across languages or modalities.
- VLM/LLM Loading: Load pre-trained Vision-Language or Language Models to obtain hidden states from specified layers.
- SAE Feature Extraction: Pass the hidden states through a Sparse Autoencoder (SAE) to produce sparse, interpretable feature representations, so that classification can rely on only a small subset of the extracted features.
- Classifier Integration: Use the extracted SAE features as inputs to explainable classifiers, with visualizations planned to interpret how each feature contributes to the final predictions.
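At a high level, the extraction step looks roughly like the sketch below. The model and layer choices mirror the example config further down; the `sae` object is a placeholder for whichever pre-trained SAE you load, since the real logic lives in `src/step1_extract_all.py`:

```python
# Rough sketch of extraction; not the actual MOSAIC implementation.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "google/gemma-2b"  # matches the example config below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Example input text.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states from one configured layer, mean-pooled over tokens.
hidden = outputs.hidden_states[12].mean(dim=1)  # shape: (1, d_model)

# `sae` is a placeholder for a pre-trained sparse autoencoder; encoding the
# pooled hidden state yields the sparse, interpretable feature vector.
# features = sae.encode(hidden)  # shape: (1, n_sae_features)
```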
- Binarizing SAE features improves classification performance.
- Features transfer across languages.
- Features transfer from text-only to image tasks, e.g., from Gemma to LLaVA/PaliGemma.
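For intuition, binarization here means collapsing continuous SAE activations into presence indicators before fitting a classifier. A minimal sketch follows; the exact rule MOSAIC applies is controlled by `binarize_values` in the config, so the mapping below is an illustrative assumption:

```python
import numpy as np

def binarize_features(features: np.ndarray, value: float = 1.0) -> np.ndarray:
    """Collapse continuous SAE activations into presence indicators.

    Nonzero activations are mapped to `value` (illustrative; MOSAIC's exact
    rule is set via `binarize_values` in config.yaml).
    """
    return np.where(features > 0, value, 0.0)
```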
To set up the required environment, follow these steps:

1. Create the environment from the `environment.yml` file:

   ```bash
   mamba env create -f environment.yml
   ```

2. Activate the environment:

   ```bash
   mamba activate saefari
   ```
Update the `config.yaml` file with the desired settings to configure the model, datasets, and classifier parameters:
```yaml
settings:
  base_save_dir: "./output/activations"
  base_classify_dir: "./output/classifications"
  batch_size: 1
  model_type: "llm"
  sae_location: "res"
  test_size: 0.2
  tree_depth: 5
  act_only: True
  cuda_visible_devices: "0"

models:
  - name: "google/gemma-2b"
    layers: [6, 12, 17]
    widths: ["16k"]
  # Additional models can be added here

datasets:
  - name: "Anthropic/election_questions"
    config_name: ""
    split: "test"
    text_field: "question"
    label_field: "label"
  # Additional datasets can be added here

classification_params:
  top_n_values: [0, 20, 50]
  binarize_values: [null, 1.0]
  extra_top_n: -1
  extra_binarize_value: null
```
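As a sanity check on the structure above, here is a minimal sketch of how the config could be consumed with standard PyYAML. The actual parsing is handled by the `run_experiments.py` script described next, so treat this as illustrative only:

```python
# Illustrative only: iterate the (model, layer, width) grid defined in config.yaml.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for model_cfg in config["models"]:
    for layer in model_cfg["layers"]:
        for width in model_cfg["widths"]:
            print(f"Would extract {model_cfg['name']} at layer {layer} (SAE width {width})")
```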
Use the `run_experiments.py` script to run feature extraction and classification experiments. It extracts hidden states from models and uses them to train explainable classifiers, reading its configuration from the `config.yaml` file and automating both the extraction and classification processes.

```bash
python run_experiments.py [--extract-only] [--classify-only]
```

- `--extract-only`: Run only the feature extraction process.
- `--classify-only`: Run only the classification process.
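For example, to run the two stages separately (extraction first, then classification over the saved activations):

```bash
python run_experiments.py --extract-only
python run_experiments.py --classify-only
```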
The script performs the following tasks:
- Feature Extraction: Extracts hidden states from the specified layers of each model, processes them through the SAE, and saves the resulting activations.
- Classification: Uses the extracted SAE features to train explainable classifiers. Various configurations of top-N values and binarization settings are explored to identify the optimal feature representations.
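To make the top-N and binarization search concrete, here is a hedged sketch of one classification configuration: rank the SAE features (mean activation is one plausible criterion; MOSAIC's actual ranking may differ), keep the top N, and fit a depth-limited decision tree matching `tree_depth: 5`. All data here is a synthetic stand-in:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for SAE features (16k width) and binary labels.
rng = np.random.default_rng(0)
X = rng.random((200, 16384))
y = rng.integers(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # test_size from config

# Keep the top-N features; mean activation is one plausible ranking criterion.
top_n = 20
top_idx = np.argsort(X_train.mean(axis=0))[-top_n:]

clf = DecisionTreeClassifier(max_depth=5)  # matches tree_depth in config.yaml
clf.fit(X_train[:, top_idx], y_train)
print(accuracy_score(y_test, clf.predict(X_test[:, top_idx])))
```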
All data is saved into the `activations` or `classifications` directories.
You can visualize the results using a simple Dash app that searches the output directories defined in your YAML config. You can either inspect specific configurations and their top feature activations, or compare performance across models and hyperparameters.

To run the visualization app, simply execute:

```bash
python app/main.py
```

This will open the app in a new browser tab.
1. Define model, dataset, and classification settings in `config.yaml`.
2. Run `run_experiments.py` to extract features and classify them:

   ```bash
   python run_experiments.py
   ```

3. Run `app/main.py` to visualize the results.
- `src/`: Contains scripts for feature extraction (`step1_extract_all.py`) and classification (`step2_dataset_classify.py`).
- `output/`: Stores extracted activations (`activations`) and classification results (`classifications`).
- `app/`: Contains the Dash app (`main.py`) for visualization.
This project is licensed under the Apache 2.0 License. See the `LICENSE` file for details.
```bibtex
@misc{gallifant2025sparseautoencoderfeaturesclassifications,
      title={Sparse Autoencoder Features for Classifications and Transferability},
      author={Jack Gallifant and Shan Chen and Kuleen Sasse and Hugo Aerts and Thomas Hartvigsen and Danielle S. Bitterman},
      year={2025},
      eprint={2502.11367},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.11367},
}
```