MOSAIC (Multilingual and Multimodal Observations of Sparse Autoencoders for Interpretable Classification) is a simple, efficient pipeline for extracting model activations, letting you fit linear probes on raw activations or use sparse autoencoder features to train explainable classifiers such as decision trees.
We use a straightforward YAML file structure to allow extraction across layers, different pooling methods, and even across languages or modalities.
- VLM/LLM Loading: Load pre-trained Vision-Language or Language Models to obtain hidden states from specified layers.
- SAE Feature Extraction: Pass the hidden states through a Sparse Autoencoder (SAE) to produce sparse, interpretable feature representations, so that classification can rely on only a small subset of the extracted features.
- Classifier Integration: Use the extracted SAE features as inputs to explainable classifiers, with visualizations planned to interpret how each feature contributes to the final predictions.
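At a high level, the extraction step looks roughly like the sketch below. The model and layer choices mirror the example config further down; the `sae` object is a placeholder for whichever pre-trained SAE you load, since the real logic lives in `src/step1_extract_all.py`:

```python
# Rough sketch of extraction; not the actual MOSAIC implementation.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "google/gemma-2b"  # matches the example config below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Example input text.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states from one configured layer, mean-pooled over tokens.
hidden = outputs.hidden_states[12].mean(dim=1)  # shape: (1, d_model)

# `sae` is a placeholder for a pre-trained sparse autoencoder; encoding the
# pooled hidden state yields the sparse, interpretable feature vector.
# features = sae.encode(hidden)  # shape: (1, n_sae_features)
```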
- Binarizing SAE features improves classification performance.
- Features transfer across languages.
- Features transfer from text-only to image tasks, e.g., from Gemma to LLaVA/PaliGemma.
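For intuition, binarization here means collapsing continuous SAE activations into presence indicators before fitting a classifier. A minimal sketch follows; the exact rule MOSAIC applies is controlled by `binarize_values` in the config, so the mapping below is an illustrative assumption:

```python
import numpy as np

def binarize_features(features: np.ndarray, value: float = 1.0) -> np.ndarray:
    """Collapse continuous SAE activations into presence indicators.

    Nonzero activations are mapped to `value` (illustrative; MOSAIC's exact
    rule is set via `binarize_values` in config.yaml).
    """
    return np.where(features > 0, value, 0.0)
```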
To set up the required environment, follow these steps:

1. Create the environment from the `environment.yml` file:

   ```bash
   mamba env create -f environment.yml
   ```

2. Activate the environment:

   ```bash
   mamba activate saefari
   ```
Update the `config.yaml` file with the desired settings to configure the model, datasets, and classifier parameters:
```yaml
settings:
  base_save_dir: "./output/activations"
  base_classify_dir: "./output/classifications"
  batch_size: 1
  model_type: "llm"
  sae_location: "res"
  test_size: 0.2
  tree_depth: 5
  act_only: True
  cuda_visible_devices: "0"

models:
  - name: "google/gemma-2b"
    layers: [6, 12, 17]
    widths: ["16k"]
  # Additional models can be added here

datasets:
  - name: "Anthropic/election_questions"
    config_name: ""
    split: "test"
    text_field: "question"
    label_field: "label"
  # Additional datasets can be added here

classification_params:
  top_n_values: [0, 20, 50]
  binarize_values: [null, 1.0]
  extra_top_n: -1
  extra_binarize_value: null
```
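As a sanity check on the structure above, here is a minimal sketch of how the config could be consumed with standard PyYAML. The actual parsing is handled by the `run_experiments.py` script described next, so treat this as illustrative only:

```python
# Illustrative only: iterate the (model, layer, width) grid defined in config.yaml.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for model_cfg in config["models"]:
    for layer in model_cfg["layers"]:
        for width in model_cfg["widths"]:
            print(f"Would extract {model_cfg['name']} at layer {layer} (SAE width {width})")
```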
Use the `run_experiments.py` script to run feature extraction and classification experiments. It extracts hidden states from models and uses them to train explainable classifiers, reading its configuration from the `config.yaml` file and automating both the extraction and classification processes.

```bash
python run_experiments.py [--extract-only] [--classify-only]
```

- `--extract-only`: Run only the feature extraction process.
- `--classify-only`: Run only the classification process.
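For example, to run the two stages separately (extraction first, then classification over the saved activations):

```bash
python run_experiments.py --extract-only
python run_experiments.py --classify-only
```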
The script performs the following tasks:
- Feature Extraction: Extracts hidden states from the specified layers of each model, processes them through the SAE, and saves the resulting activations.
- Classification: Uses the extracted SAE features to train explainable classifiers. Various configurations of top-N values and binarization settings are explored to identify the optimal feature representations.
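To make the top-N and binarization search concrete, here is a hedged sketch of one classification configuration: rank the SAE features (mean activation is one plausible criterion; MOSAIC's actual ranking may differ), keep the top N, and fit a depth-limited decision tree matching `tree_depth: 5`. All data here is a synthetic stand-in:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for SAE features (16k width) and binary labels.
rng = np.random.default_rng(0)
X = rng.random((200, 16384))
y = rng.integers(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # test_size from config

# Keep the top-N features; mean activation is one plausible ranking criterion.
top_n = 20
top_idx = np.argsort(X_train.mean(axis=0))[-top_n:]

clf = DecisionTreeClassifier(max_depth=5)  # matches tree_depth in config.yaml
clf.fit(X_train[:, top_idx], y_train)
print(accuracy_score(y_test, clf.predict(X_test[:, top_idx])))
```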
All data is saved into the `activations` or `classifications` directories.
You can visualize the results using a simple Dash app that searches the output directories defined in your YAML config. You can either inspect specific configurations and their top feature activations, or compare performance across models and hyperparameters.

To run the visualization app, simply execute:

```bash
python app/main.py
```

This will open the app in a new browser tab.
1. Define model, dataset, and classification settings in `config.yaml`.
2. Run `run_experiments.py` to extract features and classify them:

   ```bash
   python run_experiments.py
   ```

3. Run `app/main.py` to visualize the results.
- `src/`: Contains scripts for feature extraction (`step1_extract_all.py`) and classification (`step2_dataset_classify.py`).
- `output/`: Stores extracted activations (`activations`) and classification results (`classifications`).
- `app/`: Contains the Dash app (`main.py`) for visualization.
This project is licensed under the Apache 2.0 License. See the `LICENSE` file for details.
```bibtex
@misc{gallifant2025sparseautoencoderfeaturesclassifications,
      title={Sparse Autoencoder Features for Classifications and Transferability},
      author={Jack Gallifant and Shan Chen and Kuleen Sasse and Hugo Aerts and Thomas Hartvigsen and Danielle S. Bitterman},
      year={2025},
      eprint={2502.11367},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.11367},
}
```