A project for training and evaluating linear probes on language model activations to predict model errors (hallucinations). This codebase supports multiple datasets, models, and probe configurations, with capabilities for mixed-dataset training, cross-dataset evaluation, and activation steering.
The results of the project are presented in the Report (PDF).
This project implements a pipeline for:
- Extracting and caching activations from language models on various tasks
- Training linear probes (Lasso/Linear Regression/Logistic Regression) on these activations to predict model errors
- Evaluating probe performance across layers, token positions, and datasets
- Analyzing results with comprehensive visualization tools
- Steering model behavior using trained probes (optional)
The probes learn to predict error metrics (like softmax error or cross-entropy) from hidden layer activations, enabling analysis of where and how models encode uncertainty and potential errors.
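For intuition, fitting such a probe on already-cached activations boils down to a standard regularized regression fit. The sketch below (scikit-learn, with hypothetical file names) only illustrates the idea and is not the project's training code:

```python
# Minimal sketch (not the project's code): fit a Lasso probe that predicts
# a per-example error score from one layer's hidden activations.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Hypothetical inputs: X holds activations for one layer/token position,
# y holds the model's error metric (e.g. softmax error) per example.
X = np.load("activations_layer16.npy")   # shape: (n_examples, hidden_dim)
y = np.load("softmax_error.npy")          # shape: (n_examples,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=52)

scaler = StandardScaler().fit(X_train)    # standardize features (cf. --normalize-features)
probe = Lasso(alpha=0.05).fit(scaler.transform(X_train), y_train)

pred = probe.predict(scaler.transform(X_test))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```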
The project supports the following model aliases (defined in submit.sh):
- `apertus-instruct` → `swiss-ai/Apertus-8B-Instruct-2509`
- `apertus-base` → `swiss-ai/Apertus-8B-2509`
- `llama-instruct` → `meta-llama/Llama-3.1-8B-Instruct`
- `llama-base` → `meta-llama/Llama-3.1-8B`
You can also use full model names directly with `--model_name`.
- `sms_spam` - SMS spam classification
- `mmlu_high_school` - MMLU high school level questions
- `mmlu_professional` - MMLU professional level questions
- `ARC-Easy` - ARC Easy science questions
- `ARC-Challenge` - ARC Challenge science questions
- `sujet_finance_yesno_5k` - Finance yes/no questions
1. Environment Setup: The project uses a container environment. Copy the sample config:

   ```bash
   cp probes.toml ~/.edf/probes.toml
   ```

2. Load Compute Node (if using SLURM):

   ```bash
   srun -A infra01 --environment=$HOME/.edf/probes.toml --pty bash
   ```

3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```
The typical workflow consists of three main steps:
Extract and save activations from a language model on a dataset:
```bash
./submit.sh cache --model apertus-instruct --dataset_name mmlu_professional
```

Or run locally:

```bash
./submit.sh --local cache --model apertus-instruct --dataset_name mmlu_professional
```

This extracts activations for all layers and saves them to `$SCRATCH/mera-runs/`.
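For intuition, here is a minimal sketch of how per-layer activations can be extracted for a single prompt with plain Hugging Face `transformers`; it is an illustration of the mechanism, not the project's `cache_run.py`:

```python
# Minimal sketch (not the project's cache_run.py): extract per-layer hidden
# states for one prompt with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda:0")
model.eval()

inputs = tok("Which planet is known as the Red Planet?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); here we keep the last-token activation per layer.
last_token_acts = torch.stack([h[0, -1] for h in out.hidden_states])
print(last_token_acts.shape)  # (num_layers + 1, hidden_dim)
```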
Aggregate and organize the cached activations:
```bash
./submit.sh postprocess --model apertus-instruct --dataset_name mmlu_professional
```

Train linear probes on the cached activations. You can train on:
- Single datasets
- Mixed datasets (multiple datasets combined)
- Cross-dataset (train on one, test on another)
Example: Single dataset
```bash
./submit.sh run_probes --model apertus-instruct --datasets mmlu_professional
```

Example: Mixed datasets

```bash
./submit.sh run_probes --model apertus-instruct --datasets mmlu_professional ARC-Challenge sms_spam --save-name linear_intercept --alphas 0.02 0.05
```

Example: Cross-dataset probes

```bash
./submit.sh cross_dataset_probes --model apertus-instruct --train-dataset mmlu_professional --test-dataset ARC-Challenge
```

Use the Jupyter notebooks in `src/probes/` to analyze results:
- `analyze.ipynb` - Main analysis notebook with plotting utilities
- `plot_utils.py` - Comprehensive plotting functions for RMSE and accuracy comparisons
The notebooks support:
- Comparing probe performance across layers
- Evaluating different probe models (Lasso with various alpha values)
- Analyzing exact vs. last token positions
- Visualizing mixed dataset results
- Cross-dataset generalization analysis
The primary interface for running experiments. It handles:
- Model and dataset validation
- SLURM job submission (or local execution)
- Automatic job naming
- Parameter forwarding
Usage:
```bash
./submit.sh <task> --model <alias> --dataset_name <dataset> [options]
```

Tasks:

- `cache` - Extract and cache model activations
- `postprocess` - Postprocess cached activations
- `run_probes` - Train probes (single or mixed datasets)
- `cross_dataset_probes` - Train cross-dataset probes
Common Options:
- `--local` - Run locally instead of submitting to SLURM
- `--time <HH:MM:SS>` - Time limit for SLURM jobs
- `--gpus <N>` - Number of GPUs
- `--list-models` - List available model aliases
- `--list-datasets` - List available datasets
- `--show-defaults <task>` - Show default parameters for a task
Examples:
```bash
# List available options
./submit.sh --list-models
./submit.sh --list-datasets
./submit.sh --show-defaults run_probes

# Run experiments
./submit.sh cache --model apertus-instruct --dataset_name mmlu_professional
./submit.sh run_probes --model llama-instruct --datasets ARC-Challenge ARC-Easy --alphas 0.02 0.05 --max-workers 40
```

`train_probes_all.sh` is an example script showing various probe training commands (commented out). It is useful as a reference for different experiment configurations.
- `--datasets` - One or more dataset names (space-separated)
- `--alphas` - Lasso regularization values (e.g., `0.02 0.05 0.1`)
- `--save-name` - Suffix for output files (e.g., `linear_intercept`, `logit_intercept`)
- `--token-pos` - Token positions: `exact`, `last`, or `both` (default: `exact`)
- `--error-type` - Error metric: `SM` (softmax) or `CE` (cross-entropy) (default: `SM`)
- `--max-workers` - Parallel workers for training (default: 25)
- `--seed` - Random seed (default: 52)
- `--nr-attempts` - Number of train/test splits per layer (default: 5)
- `--transform-targets` - Apply logit transformation to targets (default: enabled)
- `--normalize-features` - Standardize features (default: enabled)
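For intuition, the following sketch shows what `--transform-targets` and `--normalize-features` style preprocessing can look like; it is an assumption about the general idea, not the exact `probes_train.py` logic:

```python
# Illustrative sketch (not the exact probes_train.py logic): logit-transform
# bounded targets and standardize features before fitting a linear probe.
import numpy as np
from sklearn.preprocessing import StandardScaler

def logit_transform(y, eps=1e-4):
    """Map targets in (0, 1), e.g. softmax error, onto the real line."""
    y = np.clip(y, eps, 1 - eps)          # avoid infinities at exactly 0 or 1
    return np.log(y / (1 - y))

y = np.array([0.01, 0.3, 0.97])
y_t = logit_transform(y)                   # roughly what --transform-targets might do

X = np.random.randn(3, 4096)
X_std = StandardScaler().fit_transform(X)  # roughly what --normalize-features might do
```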
- `--batch-size` - Batch size for processing (default: 1)
- `--device` - Device to use (default: `cuda:0`)
- `--nr-samples` - Number of samples to process (default: 2000)
```
apertus-probes/
├── src/
│   ├── cache/                   # Activation caching and extraction
│   │   ├── cache_run.py
│   │   ├── cache_postprocess.py
│   │   └── cache_utils.py
│   ├── probes/                  # Probe training and evaluation
│   │   ├── probes_train.py
│   │   ├── probes_core.py
│   │   ├── run_probes.py
│   │   ├── run_cross_dataset_probes.py
│   │   ├── plot_utils.py
│   │   └── analyze.ipynb
│   ├── steering/                # Activation steering (optional)
│   │   └── steering_run.py
│   └── tasks/                   # Dataset and task handlers
│       └── task_handler.py
├── scripts/                     # Dataset-specific scripts
├── submit.sh                    # Main submission script
├── train_probes_all.sh          # Example commands
├── probes.toml                  # Environment configuration
└── requirements.txt             # Python dependencies
```
All outputs are saved to `$SCRATCH/mera-runs/`:

- Cached activations: `$SCRATCH/mera-runs/<dataset>/<model>/`
- Probe results: `$SCRATCH/mera-runs/mix/<dataset>/<model>/df_probes_*.pkl`
- Cross-dataset probes: `$SCRATCH/mera-runs/cross_dataset/<train>_to_<test>/<model>/`
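For example, a results pickle can be inspected with pandas. This is only a sketch: the exact DataFrame columns depend on the project's output format, so check `df.columns` first:

```python
# Sketch: load probe-results pickles and peek at them. Column names are not
# guaranteed; inspect df.columns for the actual schema.
import glob
import os
import pandas as pd

run_dir = os.path.expandvars("$SCRATCH/mera-runs/mix/mmlu_professional/apertus-instruct")
for path in glob.glob(os.path.join(run_dir, "df_probes_*.pkl")):
    df = pd.read_pickle(path)
    print(path, df.shape)
    print(df.columns.tolist())
```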
The project supports several probe model types:
- Lasso Regression (`L-<alpha>`) - For regression tasks (predicting error values)
  - Examples: `L-0.02`, `L-0.05`, `L-0.1`, `L-0.25`, `L-0.5`
- Linear Regression (`L-0`) - Unregularized regression
- Logistic Regression (`LogReg-l1`) - For classification tasks
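A plausible mapping from these probe names to scikit-learn estimators (an assumption for illustration, not necessarily how `probes_core.py` constructs them) is:

```python
# Illustrative mapping (assumption, not probes_core.py): probe name -> estimator.
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression

def make_probe(name: str):
    if name == "L-0":
        return LinearRegression()                      # unregularized regression
    if name.startswith("L-"):
        return Lasso(alpha=float(name[2:]))            # e.g. "L-0.05" -> Lasso(alpha=0.05)
    if name == "LogReg-l1":
        return LogisticRegression(penalty="l1", solver="liblinear")  # L1-penalized classifier
    raise ValueError(f"Unknown probe name: {name}")

print(make_probe("L-0.05"))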
Probes can predict:
- Regression targets: Softmax error (SM), Cross-entropy (CE)
- Classification targets: Accuracy, AUC-ROC (for classification tasks)
- Train probes using `submit.sh run_probes`
- Load results in `analyze.ipynb`:

  ```python
  from plot_utils import plot_rmse_comparison_multi, plot_rmse_on_axis
  ```

- Visualize using plotting functions:
  - Single dataset comparisons
  - Multi-dataset comparisons
  - Layer-wise performance
  - Token position analysis (exact vs. last)
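If `plot_utils` is not at hand, a comparable layer-wise RMSE comparison can be sketched with plain matplotlib. The file name and column names below (`layer`, `probe`, `rmse`) are assumptions; adapt them to the actual `df_probes_*.pkl` schema:

```python
# Sketch with assumed column names ("layer", "probe", "rmse"); adapt to the
# real df_probes_*.pkl schema before use.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_pickle("df_probes_linear_intercept.pkl")   # hypothetical file name

fig, ax = plt.subplots(figsize=(8, 4))
for probe, grp in df.groupby("probe"):
    by_layer = grp.groupby("layer")["rmse"].mean()      # average over train/test splits
    ax.plot(by_layer.index, by_layer.values, marker="o", label=probe)
ax.set_xlabel("Layer")
ax.set_ylabel("RMSE")
ax.legend()
plt.show()
```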
- Use the `--local` flag for quick testing before submitting large jobs
- Check job status with `squeue -u $USER`
- View logs in the `logs/` directory
logs/directory - For mixed datasets, the probe is trained on concatenated data from all specified datasets
- Cross-dataset probes evaluate generalization: train on one dataset, test on another
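Conceptually, mixed-dataset training and cross-dataset evaluation look roughly like this. The loader below is a stand-in returning random data, not the project's `run_probes.py` / `run_cross_dataset_probes.py`:

```python
# Conceptual sketch (stand-in loader, not the project's code): mixed-dataset
# training concatenates examples; cross-dataset evaluation trains on one
# dataset and tests on another.
import numpy as np
from sklearn.linear_model import Lasso

def load_activations(dataset, n=200, d=256):
    """Stand-in loader: random (X, y); replace with real cached activations."""
    rng = np.random.default_rng(abs(hash(dataset)) % 1000)
    return rng.standard_normal((n, d)), rng.uniform(0, 1, size=n)

# Mixed datasets: concatenate features and targets, then fit one probe.
parts = [load_activations(d) for d in ["mmlu_professional", "ARC-Challenge", "sms_spam"]]
X_mix = np.concatenate([X for X, _ in parts])
y_mix = np.concatenate([y for _, y in parts])
mixed_probe = Lasso(alpha=0.05).fit(X_mix, y_mix)

# Cross-dataset: train on one dataset, evaluate generalization on another.
X_tr, y_tr = load_activations("mmlu_professional")
X_te, y_te = load_activations("ARC-Challenge")
cross_probe = Lasso(alpha=0.05).fit(X_tr, y_tr)
print("held-out R^2:", cross_probe.score(X_te, y_te))
```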
The project is designed for use on CSCS systems with:
- Container environment (specified in `probes.toml`)
- SLURM job scheduler
- Access to `/iopsstor` and `/capstor` storage
For local development, use the `--local` flag with `submit.sh`.
[Add license information if applicable]
[Add citation information if applicable]