A project for training and evaluating linear probes on language model activations to predict model errors (hallucinations). This codebase supports multiple datasets, models, and probe configurations, with capabilities for mixed-dataset training, cross-dataset evaluation, and activation steering.
The results of the project are presented in the Report (PDF).
This project implements a pipeline for:
- Extracting and caching activations from language models on various tasks
- Training linear probes (Lasso/Linear Regression/Logistic Regression) on these activations to predict model errors
- Evaluating probe performance across layers, token positions, and datasets
- Analyzing results with comprehensive visualization tools
- Steering model behavior using trained probes (optional)
The probes learn to predict error metrics (like softmax error or cross-entropy) from hidden layer activations, enabling analysis of where and how models encode uncertainty and potential errors.
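For intuition, fitting such a probe on already-cached activations boils down to a standard regularized regression fit. The sketch below (scikit-learn, with hypothetical file names) only illustrates the idea and is not the project's training code:

```python
# Minimal sketch (not the project's code): fit a Lasso probe that predicts
# a per-example error score from one layer's hidden activations.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Hypothetical inputs: X holds activations for one layer/token position,
# y holds the model's error metric (e.g. softmax error) per example.
X = np.load("activations_layer16.npy")   # shape: (n_examples, hidden_dim)
y = np.load("softmax_error.npy")          # shape: (n_examples,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=52)

scaler = StandardScaler().fit(X_train)    # standardize features (cf. --normalize-features)
probe = Lasso(alpha=0.05).fit(scaler.transform(X_train), y_train)

pred = probe.predict(scaler.transform(X_test))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```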
The project supports the following model aliases (defined in submit.sh):
- `apertus-instruct` → `swiss-ai/Apertus-8B-Instruct-2509`
- `apertus-base` → `swiss-ai/Apertus-8B-2509`
- `llama-instruct` → `meta-llama/Llama-3.1-8B-Instruct`
- `llama-base` → `meta-llama/Llama-3.1-8B`
You can also use full model names directly with `--model_name`.
- `sms_spam` - SMS spam classification
- `mmlu_high_school` - MMLU high school level questions
- `mmlu_professional` - MMLU professional level questions
- `ARC-Easy` - ARC Easy science questions
- `ARC-Challenge` - ARC Challenge science questions
- `sujet_finance_yesno_5k` - Finance yes/no questions
1. Environment Setup: The project uses a container environment. Copy the sample config:

   ```bash
   cp probes.toml ~/.edf/probes.toml
   ```

2. Load Compute Node (if using SLURM):

   ```bash
   srun -A infra01 --environment=$HOME/.edf/probes.toml --pty bash
   ```

3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```
The typical workflow consists of three main steps:
Extract and save activations from a language model on a dataset:
```bash
./submit.sh cache --model apertus-instruct --dataset_name mmlu_professional
```

Or run locally:

```bash
./submit.sh --local cache --model apertus-instruct --dataset_name mmlu_professional
```

This extracts activations for all layers and saves them to `$SCRATCH/mera-runs/`.
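For intuition, here is a minimal sketch of how per-layer activations can be extracted for a single prompt with plain Hugging Face `transformers`; it is an illustration of the mechanism, not the project's `cache_run.py`:

```python
# Minimal sketch (not the project's cache_run.py): extract per-layer hidden
# states for one prompt with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda:0")
model.eval()

inputs = tok("Which planet is known as the Red Planet?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); here we keep the last-token activation per layer.
last_token_acts = torch.stack([h[0, -1] for h in out.hidden_states])
print(last_token_acts.shape)  # (num_layers + 1, hidden_dim)
```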
Aggregate and organize the cached activations:
```bash
./submit.sh postprocess --model apertus-instruct --dataset_name mmlu_professional
```

Train linear probes on the cached activations. You can train on:
- Single datasets
- Mixed datasets (multiple datasets combined)
- Cross-dataset (train on one, test on another)
Example: Single dataset
```bash
./submit.sh run_probes --model apertus-instruct --datasets mmlu_professional
```

Example: Mixed datasets

```bash
./submit.sh run_probes --model apertus-instruct --datasets mmlu_professional ARC-Challenge sms_spam --save-name linear_intercept --alphas 0.02 0.05
```

Example: Cross-dataset probes

```bash
./submit.sh cross_dataset_probes --model apertus-instruct --train-dataset mmlu_professional --test-dataset ARC-Challenge
```

Use the Jupyter notebooks in `src/probes/` to analyze results:
- `analyze.ipynb` - Main analysis notebook with plotting utilities
- `plot_utils.py` - Comprehensive plotting functions for RMSE and accuracy comparisons
The notebooks support:
- Comparing probe performance across layers
- Evaluating different probe models (Lasso with various alpha values)
- Analyzing exact vs. last token positions
- Visualizing mixed dataset results
- Cross-dataset generalization analysis
The primary interface for running experiments. It handles:
- Model and dataset validation
- SLURM job submission (or local execution)
- Automatic job naming
- Parameter forwarding
Usage:
```bash
./submit.sh <task> --model <alias> --dataset_name <dataset> [options]
```

Tasks:

- `cache` - Extract and cache model activations
- `postprocess` - Postprocess cached activations
- `run_probes` - Train probes (single or mixed datasets)
- `cross_dataset_probes` - Train cross-dataset probes
Common Options:
- `--local` - Run locally instead of submitting to SLURM
- `--time <HH:MM:SS>` - Time limit for SLURM jobs
- `--gpus <N>` - Number of GPUs
- `--list-models` - List available model aliases
- `--list-datasets` - List available datasets
- `--show-defaults <task>` - Show default parameters for a task
Examples:
```bash
# List available options
./submit.sh --list-models
./submit.sh --list-datasets
./submit.sh --show-defaults run_probes

# Run experiments
./submit.sh cache --model apertus-instruct --dataset_name mmlu_professional
./submit.sh run_probes --model llama-instruct --datasets ARC-Challenge ARC-Easy --alphas 0.02 0.05 --max-workers 40
```

`train_probes_all.sh` is an example script showing various probe training commands (commented out). It is useful as a reference for different experiment configurations.
- `--datasets` - One or more dataset names (space-separated)
- `--alphas` - Lasso regularization values (e.g., `0.02 0.05 0.1`)
- `--save-name` - Suffix for output files (e.g., `linear_intercept`, `logit_intercept`)
- `--token-pos` - Token positions: `exact`, `last`, or `both` (default: `exact`)
- `--error-type` - Error metric: `SM` (softmax) or `CE` (cross-entropy) (default: `SM`)
- `--max-workers` - Parallel workers for training (default: 25)
- `--seed` - Random seed (default: 52)
- `--nr-attempts` - Number of train/test splits per layer (default: 5)
- `--transform-targets` - Apply logit transformation to targets (default: enabled)
- `--normalize-features` - Standardize features (default: enabled)
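For intuition, the following sketch shows what `--transform-targets` and `--normalize-features` style preprocessing can look like; it is an assumption about the general idea, not the exact `probes_train.py` logic:

```python
# Illustrative sketch (not the exact probes_train.py logic): logit-transform
# bounded targets and standardize features before fitting a linear probe.
import numpy as np
from sklearn.preprocessing import StandardScaler

def logit_transform(y, eps=1e-4):
    """Map targets in (0, 1), e.g. softmax error, onto the real line."""
    y = np.clip(y, eps, 1 - eps)          # avoid infinities at exactly 0 or 1
    return np.log(y / (1 - y))

y = np.array([0.01, 0.3, 0.97])
y_t = logit_transform(y)                   # roughly what --transform-targets might do

X = np.random.randn(3, 4096)
X_std = StandardScaler().fit_transform(X)  # roughly what --normalize-features might do
```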
- `--batch-size` - Batch size for processing (default: 1)
- `--device` - Device to use (default: `cuda:0`)
- `--nr-samples` - Number of samples to process (default: 2000)
```
apertus-probes/
├── src/
│   ├── cache/                   # Activation caching and extraction
│   │   ├── cache_run.py
│   │   ├── cache_postprocess.py
│   │   └── cache_utils.py
│   ├── probes/                  # Probe training and evaluation
│   │   ├── probes_train.py
│   │   ├── probes_core.py
│   │   ├── run_probes.py
│   │   ├── run_cross_dataset_probes.py
│   │   ├── plot_utils.py
│   │   └── analyze.ipynb
│   ├── steering/                # Activation steering (optional)
│   │   └── steering_run.py
│   └── tasks/                   # Dataset and task handlers
│       └── task_handler.py
├── scripts/                     # Dataset-specific scripts
├── submit.sh                    # Main submission script
├── train_probes_all.sh          # Example commands
├── probes.toml                  # Environment configuration
└── requirements.txt             # Python dependencies
```
All outputs are saved to `$SCRATCH/mera-runs/`:

- Cached activations: `$SCRATCH/mera-runs/<dataset>/<model>/`
- Probe results: `$SCRATCH/mera-runs/mix/<dataset>/<model>/df_probes_*.pkl`
- Cross-dataset probes: `$SCRATCH/mera-runs/cross_dataset/<train>_to_<test>/<model>/`
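For example, a results pickle can be inspected with pandas. This is only a sketch: the exact DataFrame columns depend on the project's output format, so check `df.columns` first:

```python
# Sketch: load probe-results pickles and peek at them. Column names are not
# guaranteed; inspect df.columns for the actual schema.
import glob
import os
import pandas as pd

run_dir = os.path.expandvars("$SCRATCH/mera-runs/mix/mmlu_professional/apertus-instruct")
for path in glob.glob(os.path.join(run_dir, "df_probes_*.pkl")):
    df = pd.read_pickle(path)
    print(path, df.shape)
    print(df.columns.tolist())
```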
The project supports several probe model types:
- Lasso Regression (`L-<alpha>`) - For regression tasks (predicting error values)
  - Examples: `L-0.02`, `L-0.05`, `L-0.1`, `L-0.25`, `L-0.5`
- Linear Regression (`L-0`) - Unregularized regression
- Logistic Regression (`LogReg-l1`) - For classification tasks
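A plausible mapping from these probe names to scikit-learn estimators (an assumption for illustration, not necessarily how `probes_core.py` constructs them) is:

```python
# Illustrative mapping (assumption, not probes_core.py): probe name -> estimator.
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression

def make_probe(name: str):
    if name == "L-0":
        return LinearRegression()                      # unregularized regression
    if name.startswith("L-"):
        return Lasso(alpha=float(name[2:]))            # e.g. "L-0.05" -> Lasso(alpha=0.05)
    if name == "LogReg-l1":
        return LogisticRegression(penalty="l1", solver="liblinear")  # L1-penalized classifier
    raise ValueError(f"Unknown probe name: {name}")

print(make_probe("L-0.05"))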
Probes can predict:
- Regression targets: Softmax error (SM), Cross-entropy (CE)
- Classification targets: Accuracy, AUC-ROC (for classification tasks)
- Train probes using `submit.sh run_probes`
- Load results in `analyze.ipynb`:

  ```python
  from plot_utils import plot_rmse_comparison_multi, plot_rmse_on_axis
  ```

- Visualize using plotting functions:
  - Single dataset comparisons
  - Multi-dataset comparisons
  - Layer-wise performance
  - Token position analysis (exact vs. last)
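If `plot_utils` is not at hand, a comparable layer-wise RMSE comparison can be sketched with plain matplotlib. The file name and column names below (`layer`, `probe`, `rmse`) are assumptions; adapt them to the actual `df_probes_*.pkl` schema:

```python
# Sketch with assumed column names ("layer", "probe", "rmse"); adapt to the
# real df_probes_*.pkl schema before use.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_pickle("df_probes_linear_intercept.pkl")   # hypothetical file name

fig, ax = plt.subplots(figsize=(8, 4))
for probe, grp in df.groupby("probe"):
    by_layer = grp.groupby("layer")["rmse"].mean()      # average over train/test splits
    ax.plot(by_layer.index, by_layer.values, marker="o", label=probe)
ax.set_xlabel("Layer")
ax.set_ylabel("RMSE")
ax.legend()
plt.show()
```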
- Use the `--local` flag for quick testing before submitting large jobs
- Check job status with `squeue -u $USER`
- View logs in the `logs/` directory
logs/directory - For mixed datasets, the probe is trained on concatenated data from all specified datasets
- Cross-dataset probes evaluate generalization: train on one dataset, test on another
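Conceptually, mixed-dataset training and cross-dataset evaluation look roughly like this. The loader below is a stand-in returning random data, not the project's `run_probes.py` / `run_cross_dataset_probes.py`:

```python
# Conceptual sketch (stand-in loader, not the project's code): mixed-dataset
# training concatenates examples; cross-dataset evaluation trains on one
# dataset and tests on another.
import numpy as np
from sklearn.linear_model import Lasso

def load_activations(dataset, n=200, d=256):
    """Stand-in loader: random (X, y); replace with real cached activations."""
    rng = np.random.default_rng(abs(hash(dataset)) % 1000)
    return rng.standard_normal((n, d)), rng.uniform(0, 1, size=n)

# Mixed datasets: concatenate features and targets, then fit one probe.
parts = [load_activations(d) for d in ["mmlu_professional", "ARC-Challenge", "sms_spam"]]
X_mix = np.concatenate([X for X, _ in parts])
y_mix = np.concatenate([y for _, y in parts])
mixed_probe = Lasso(alpha=0.05).fit(X_mix, y_mix)

# Cross-dataset: train on one dataset, evaluate generalization on another.
X_tr, y_tr = load_activations("mmlu_professional")
X_te, y_te = load_activations("ARC-Challenge")
cross_probe = Lasso(alpha=0.05).fit(X_tr, y_tr)
print("held-out R^2:", cross_probe.score(X_te, y_te))
```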
The project is designed for use on CSCS systems with:
- Container environment (specified in `probes.toml`)
- SLURM job scheduler
- Access to `/iopsstor` and `/capstor` storage
For local development, use the `--local` flag with `submit.sh`.
[Add license information if applicable]
[Add citation information if applicable]