122 changes: 122 additions & 0 deletions bionemo-recipes/interpretability/sparse_autoencoders/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv/
venv/
ENV/
env/

# UV package manager
.python-version

# Jupyter Notebooks
.ipynb_checkpoints/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~
.DS_Store

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/

# Weights & Biases
wandb/

# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
.npm
.yarn

# Model weights and checkpoints
*.pt
*.pth
*.ckpt
*.safetensors
*.bin
checkpoints/

# Data files (large)
*.parquet
*.arrow
*.h5
*.hdf5
*.fasta
*.fasta.gz
*.tsv.gz
features.json
interpretations.json
cluster_labels.json

# Activation stores
activations_store/
**/activations_store/

# Output directories
outputs/
**/outputs/
**/outputs_*/
analysis_output/
**/analysis_output/

# Data directories (but not source code data/ modules)
/data/
datasets/
recipes/*/data/
!**/src/**/data/

# Viewer public data (generated)
**/public/features.json
**/public/features_atlas.parquet
**/public/*.parquet
**/public/features/
**/public/features/*.json

# Logs
*.log
logs/

# Temporary files
*.tmp
*.temp
.cache/

# Environment variables
.env
.env.local
.env.*.local

# Keep specific files
!recipes/**/configs/**
!**/config.yaml
# Note: uv.lock is NOT ignored - it's useful for reproducibility
@@ -0,0 +1,20 @@
# UV Workspace Configuration
# This is a monorepo workspace that contains multiple Python packages.
# For development: `uv sync` installs all packages
# For external users: Each package can be installed independently

[project]
name = "biosae-workspace"
version = "0.1.0"
requires-python = ">=3.10,<3.14"
dependencies = [
"duckdb>=1.4.3",
"hydra-core>=1.3.2",
"huggingface-hub>=0.25,<1.0",
"transformers>=4.44,<5.0",
"umap-learn>=0.5.11",
"wandb>=0.24.0",
]

[tool.uv.workspace]
members = ["sae", "recipes/esm2", "recipes/codonfm"]
@@ -0,0 +1,161 @@
# Recipes

This directory contains model-specific implementations of Sparse Autoencoders following the NVIDIA BioNeMo pattern. Each recipe is a self-contained package that builds on the generic `sae` core package.

## Structure

```
recipes/
├── README.md              # This file
└── esm2/                  # ESM2 protein language model recipe
    ├── pyproject.toml     # Package configuration
    ├── README.md          # ESM2-specific documentation
    ├── src/
    │   └── esm2_sae/      # ESM2-specific implementation
    ├── scripts/           # Training scripts
    ├── configs/           # Hydra configuration files
    └── data/              # Data directory
```

## Available Recipes

### ESM2 (`recipes/esm2/`)

Sparse Autoencoders for ESM2 protein language models.

**Features:**
- ESM2 model wrappers (8M to 3B parameters)
- Protein dataset loaders (FASTA, Swiss-Prot)
- F1 evaluation against Swiss-Prot annotations
- Visualization pipeline for feature analysis
- Hydra-based training configs
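
The F1 evaluation listed above can be pictured as scoring how well a single SAE feature, once thresholded, predicts a binary per-residue annotation. The following is a minimal sketch under that assumption. `feature_f1` and its fixed `threshold` are hypothetical names for illustration, and the recipe's own evaluation code may differ in signature and detail.

```python
def feature_f1(activations, labels, threshold=0.5):
    """F1 of a thresholded feature against binary labels (hypothetical helper)."""
    if len(activations) != len(labels):
        raise ValueError("activations and labels must have the same length")
    # A residue counts as "predicted" when the feature fires above the threshold.
    predicted = [a > threshold for a in activations]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: six residues, three of them annotated (made-up values).
acts = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2]
site = [1, 0, 1, 0, 0, 1]
score = feature_f1(acts, site)
```

In practice the threshold would likely be swept per feature rather than fixed, since activation scales vary between features.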

**Quick Start:**
```bash
cd recipes/esm2
python scripts/train.py --config-name config_production
```

See `recipes/esm2/README.md` for detailed documentation.

## Recipe Philosophy

Each recipe follows these principles:

1. **Self-Contained**: Can be installed and used independently
2. **Depends on Core**: Imports from generic `sae` package for SAE implementations
3. **Domain-Specific**: Contains only model/domain-specific code
4. **Organized**: Following NVIDIA BioNeMo structure (src/, scripts/, configs/, data/)
5. **Documented**: Comprehensive README with examples

## Adding a New Recipe

To add a new recipe (e.g., for a different model):

1. **Create directory structure:**
```bash
mkdir -p recipes/mymodel/src/mymodel_sae
mkdir -p recipes/mymodel/{scripts,configs,data}
```

2. **Create `pyproject.toml`:**
```toml
[project]
name = "mymodel-sae"
dependencies = [
"sae>=0.1.0", # Depend on core SAE package
# Add model-specific dependencies
]
```

3. **Implement model-specific code:**
- `src/mymodel_sae/models/`: Model wrappers
- `src/mymodel_sae/data/`: Dataset loaders
- `src/mymodel_sae/eval/`: Domain-specific evaluation
- `scripts/`: Training scripts
- `configs/`: Hydra configs

4. **Update workspace:**
Add to root `pyproject.toml`:
```toml
[tool.uv.workspace]
members = ["sae", "recipes/esm2", "recipes/codonfm", "recipes/mymodel"]
```

5. **Document:**
Create `recipes/mymodel/README.md` with usage examples
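
After updating the workspace in step 4, a quick sanity check is that every path listed under `members` actually contains a `pyproject.toml`. The sketch below shows one way to do that. `missing_members` is a hypothetical helper, and it assumes members are literal paths rather than glob patterns.

```python
from pathlib import Path
import tempfile


def missing_members(root, members):
    """Return workspace members that lack a pyproject.toml under root."""
    root = Path(root)
    return [m for m in members if not (root / m / "pyproject.toml").is_file()]


# Demonstrate on a throwaway directory tree instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    for member in ["sae", "recipes/esm2"]:
        pkg = Path(tmp) / member
        pkg.mkdir(parents=True)
        (pkg / "pyproject.toml").write_text('[project]\nname = "demo"\n')
    # "recipes/mymodel" was never created, so it is reported as missing.
    missing = missing_members(tmp, ["sae", "recipes/esm2", "recipes/mymodel"])
```

Run against the repository root, an empty result would suggest the new recipe directory and the workspace list agree.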

## Recipe vs. Core

**Core SAE Package (`sae/`):**
- Generic SAE architectures (ReLU-L1, Top-K)
- Training loop and configuration
- Generic evaluation metrics
- No model/domain-specific code

**Recipe Packages (`recipes/*/`):**
- Model wrappers for embedding extraction
- Domain-specific data loaders
- Domain-specific evaluation metrics
- Training scripts with configs
- Visualization pipelines
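
To make the split concrete: the Top-K architecture named in the core list keeps only the k largest latent pre-activations per input and zeros the rest before decoding. The following dependency-free sketch illustrates that forward pass. It is not the `sae` package's implementation, and `topk_sae_forward` and its weight layout are assumptions. Some Top-K variants also apply a ReLU to the kept values, which is omitted here.

```python
def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep the k largest latents, decode (illustrative only)."""
    # Encoder: one pre-activation per latent (one row of w_enc each).
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    # Indices of the k largest pre-activations.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    latents = [p if i in keep else 0.0 for i, p in enumerate(pre)]
    # Decoder: each row of w_dec maps the latents back to one input dimension.
    recon = [sum(w * z for w, z in zip(row, latents)) + b for row, b in zip(w_dec, b_dec)]
    return latents, recon


# Toy dimensions: 2 inputs, 3 latents, k = 1 (made-up weights).
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]
latents, recon = topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k=1)
```

Nothing in this sketch depends on proteins or ESM2, which is the property that lets such code live in the core package rather than in a recipe.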

## Installation

### Development (all recipes)
```bash
# From repository root
uv sync
```

### Individual recipe
```bash
# Install core first
pip install -e sae/

# Then install recipe
pip install -e recipes/esm2/
```

## Example: Training Pipeline

1. **Recipe provides model and data:**
```python
from esm2_sae.models import ESM2Model
from esm2_sae.data import download_swissprot, read_fasta
```

2. **Core provides SAE and training:**
```python
from sae.architectures import TopKSAE
from sae.training import Trainer, TrainingConfig
```

3. **Recipe provides domain-specific evaluation:**
```python
from esm2_sae.eval import compute_f1_scores
```

This separation keeps the core package minimal and domain-agnostic while allowing rich, domain-specific functionality in recipes.
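
The `read_fasta` import in step 1 implies a loader that turns FASTA text into header and sequence records. A minimal stand-in is sketched below. `parse_fasta` is a hypothetical function that works on an in-memory string, whereas the recipe's own loader may take a file path and return a different type.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up sequences for illustration only.
example = """>seq1 toy protein
MKTAYIAK
QRQISFVK
>seq2 toy protein
MSEQ
"""
records = parse_fasta(example)
```

The resulting sequences are what a model wrapper would consume to produce the embeddings an SAE is trained on.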

## Future Recipes

Potential recipes to add:
- `recipes/geneformer/`: SAEs for Geneformer gene expression models
- `recipes/amplify/`: SAEs for AMPLIFY protein language models (NVIDIA BioNeMo)
- `recipes/esmc/`: SAEs for ESM-C protein language models
- `recipes/vision/`: SAEs for vision transformers (non-bio example)

## Contributing

When adding a new recipe:
1. Follow the structure of `recipes/esm2/`
2. Keep domain-specific code in the recipe
3. Contribute generic improvements to `sae/`
4. Include comprehensive README
5. Add example scripts
6. Include configs for reproducibility

## License

MIT License
@@ -0,0 +1,3 @@
.cache/
outputs/
wandb/