This directory contains model-specific implementations of Sparse Autoencoders following the NVIDIA BioNeMo pattern. Each recipe is a self-contained package that builds on the generic sae core package.
```
recipes/
├── README.md                 # This file
└── esm2/                     # ESM2 protein language model recipe
    ├── pyproject.toml        # Package configuration
    ├── README.md             # ESM2-specific documentation
    ├── src/
    │   └── esm2_sae/         # ESM2-specific implementation
    ├── scripts/              # Training scripts
    ├── configs/              # Hydra configuration files
    └── data/                 # Data directory
```
Sparse Autoencoders for ESM2 protein language models.
Features:
- ESM2 model wrappers (8M to 3B parameters)
- Protein dataset loaders (FASTA, SwissProt)
- F1 evaluation against Swiss-Prot annotations
- Visualization pipeline for feature analysis
- Hydra-based training configs
Quick Start:
```bash
cd recipes/esm2
python scripts/train.py --config-name config_production
```

See recipes/esm2/README.md for detailed documentation.
Each recipe follows these principles:
- Self-Contained: Can be installed and used independently
- Depends on Core: Imports from the generic sae package for SAE implementations
- Domain-Specific: Contains only model/domain-specific code
- Organized: Following NVIDIA BioNeMo structure (src/, scripts/, configs/, data/)
- Documented: Comprehensive README with examples
To add a new recipe (e.g., for a different model):
1. Create the directory structure:

   ```bash
   mkdir -p recipes/mymodel/src/mymodel_sae
   mkdir -p recipes/mymodel/{scripts,configs,data}
   ```

2. Create pyproject.toml:

   ```toml
   [project]
   name = "mymodel-sae"
   dependencies = [
       "sae>=0.1.0",  # Depend on core SAE package
       # Add model-specific dependencies
   ]
   ```

3. Implement model-specific code:
   - src/mymodel_sae/models/: Model wrappers
   - src/mymodel_sae/data/: Dataset loaders
   - src/mymodel_sae/eval/: Domain-specific evaluation
   - scripts/: Training scripts
   - configs/: Hydra configs

4. Update the workspace: add the new recipe to the root pyproject.toml:

   ```toml
   [tool.uv.workspace]
   members = ["sae", "recipes/esm2", "recipes/mymodel"]
   ```

5. Document: create recipes/mymodel/README.md with usage examples.
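The model wrapper in step 3 is the main integration point: it hides model-specific loading and inference behind a small embedding interface that the core trainer can consume. A minimal sketch (all names here are hypothetical, and a random embedding table stands in for the real model so the example is self-contained):

```python
import numpy as np

class MyModelWrapper:
    """Hypothetical recipe-side wrapper exposing embeddings to the core SAE trainer.

    A real implementation would load the underlying model and return its
    hidden states; a random embedding table stands in here.
    """

    def __init__(self, vocab_size=32, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        self._embed_table = rng.normal(size=(vocab_size, hidden_dim))

    def embed(self, token_ids):
        """Return per-token hidden states, shape (len(token_ids), hidden_dim)."""
        return self._embed_table[np.asarray(token_ids)]
```

Keeping the wrapper's surface this small means the core package never needs to know which model produced the activations.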
Core SAE Package (sae/):
- Generic SAE architectures (ReLU-L1, Top-K)
- Training loop and configuration
- Generic evaluation metrics
- No model/domain-specific code
Recipe Packages (recipes/*/):
- Model wrappers for embedding extraction
- Domain-specific data loaders
- Domain-specific evaluation metrics
- Training scripts with configs
- Visualization pipelines
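To make the split concrete: the core package's Top-K architecture amounts to a ReLU encode followed by keeping only the k largest activations per sample. A toy numpy sketch of the encoder (not the package's actual implementation):

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k):
    """Top-K SAE encoder: ReLU, then keep only the k largest activations per row."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # pre-activations, shape (batch, n_features)
    drop = np.argsort(z, axis=1)[:, :-k]     # indices of all but the top k per row
    out = z.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out
```

Because the Top-K constraint fixes the number of active features per sample, no L1 sparsity penalty is needed during training.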
```bash
# From repository root
uv sync

# Install core first
pip install -e sae/

# Then install recipe
pip install -e recipes/esm2/
```
1. Recipe provides model and data:

   ```python
   from esm2_sae.models import ESM2Model
   from esm2_sae.data import download_swissprot, read_fasta
   ```

2. Core provides SAE and training:

   ```python
   from sae.architectures import TopKSAE
   from sae.training import Trainer, TrainingConfig
   ```

3. Recipe provides domain-specific evaluation:

   ```python
   from esm2_sae.eval import compute_f1_scores
   ```
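The three steps can be wired together end to end. The toy sketch below uses numpy stand-ins for both sides (random vectors instead of ESM2 embeddings, a bare gradient loop instead of the Trainer) purely to illustrate the data flow between recipe and core:

```python
import numpy as np

# Recipe side (stand-in): supplies embeddings. A real recipe would run
# e.g. ESM2 over protein sequences; random vectors stand in here.
def get_embeddings(n=256, d=16, seed=0):
    return np.random.default_rng(seed).normal(size=(n, d))

# Core side (stand-in): a generic tied-weight ReLU SAE with a minimal
# gradient-descent loop on mean squared reconstruction error.
def train_sae(x, n_features=32, lr=1e-2, steps=200, seed=1):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    W = rng.normal(scale=0.1, size=(d, n_features))  # tied encoder/decoder weights
    losses = []
    for _ in range(steps):
        z = np.maximum(x @ W, 0.0)       # encode with ReLU
        x_hat = z @ W.T                  # decode
        err = x_hat - x
        losses.append(float((err ** 2).mean()))
        gz = (err @ W) * (z > 0)         # backprop through decoder and ReLU
        grad = x.T @ gz + err.T @ z      # encoder term + decoder term
        W -= lr * grad / len(x)
    return W, losses
```

In the real workflow the recipe's data loaders and the core Trainer replace both stand-ins, but the interface is the same: the recipe hands a matrix of activations to the core, and gets a trained dictionary back.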
This separation keeps the core package minimal and domain-agnostic while allowing rich, domain-specific functionality in recipes.
Potential recipes to add:
- recipes/geneformer/: SAEs for Geneformer gene expression models
- recipes/amplify/: SAEs for NVIDIA BioNeMo AmplifyProt
- recipes/esmc/: SAEs for ESM-C protein language models
- recipes/vision/: SAEs for vision transformers (non-bio example)
When adding a new recipe:
- Follow the structure of recipes/esm2/
- Keep domain-specific code in the recipe
- Contribute generic improvements to sae/
- Include a comprehensive README
- Add example scripts
- Include configs for reproducibility
MIT License