This directory contains self-contained training examples that demonstrate best practices for scaling biological foundation models using TransformerEngine and megatron-FSDP. Each recipe is a complete Docker environment with benchmarked training scripts that users can learn from and adapt for their own research.
Recipes are designed as educational reference implementations rather than production training frameworks. Our guiding principles:
Users will read your code far more often than they execute it directly. Prioritize clarity and educational value:
- Write code that demonstrates training features and techniques
- Include detailed comments explaining design decisions
- Make trade-offs explicit in your implementation
- Minimize branching logic and complex abstractions
Each recipe is a completely isolated environment:
- No shared dependencies between recipes
- No imports from other parts of bionemo-recipes
- Everything needed to run training is included in the recipe directory
- Pinned dependencies for reproducible results. Eventually we will use a uv lockfile to make automated package updates easier.
Prioritize readability and educational value over code reuse:
- Duplicate code between recipes is fine if it makes the training script more readable
- Keep it simple - users should understand the full training loop at a glance
- One concept per recipe - don't try to demonstrate every feature in one script. If multiple features share most of their infrastructure, use a single recipe folder with multiple train_{feature}.py entrypoints.
Follow this naming pattern to clearly communicate what your recipe demonstrates:
{model_name}_{training_framework}_{key_features}/
Examples:
- esm2_native_te_mfsdp/ - ESM-2 with vanilla PyTorch, TransformerEngine, and megatron-fsdp
- geneformer_lightning_context_parallel/ - Geneformer with PyTorch Lightning and context parallelism
Each recipe must follow this example layout:
recipes/{recipe_name}/
├── README.md # Recipe-specific documentation
├── Dockerfile # Self-contained training environment
├── .dockerignore # Docker build optimization
├── .ruff.toml # Code formatting configuration
├── requirements.txt # Pinned Python dependencies
├── hydra_config/ # Training configuration management
│ ├── defaults.yaml # Shared configuration defaults
│ ├── L0_sanity.yaml # Fast CI/CD test config
│ ├── L1_{model_size}_perf.yaml # Performance benchmark config
│ ├── L1_{model_size}_partial_conv.yaml # Partial convergence test config (optional)
│ └── L2_{model_size}_full_conv.yaml # Full convergence test config (optional)
├── train.py # Main training entrypoint
├── train_{feature}.py # Alternative training scripts (optional)
├── test_train.py # L0 and L1 test suite
├── modeling_{model}.py # Model definition (if needed)
├── dataset.py # Data loading utilities
├── train.parquet # Small test dataset (< 5MB)
└── slurm.sh # Example multi-node SLURM script
Your Dockerfile should create a complete, reproducible training environment:
FROM nvcr.io/nvidia/pytorch:26.02-py3
# Install dependencies with caching for faster builds
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements.txt,target=/requirements.txt \
PIP_CONSTRAINT= pip install -r /requirements.txt
# Set workspace to avoid pytest conflicts
WORKDIR /workspace/bionemo
COPY . .
# Default command for interactive development
CMD ["/bin/bash"]Key requirements:
- Use the latest NVIDIA PyTorch base image unless specific requirements dictate otherwise
- Pin all dependencies in requirements.txt for reproducibility
- Include everything needed to run training without external dependencies
- Optimize for Docker layer caching
Your train.py should be educational and self-explanatory:
#!/usr/bin/env python3
"""
ESM-2 training with TransformerEngine and megatron-fsdp.
This script demonstrates how to:
1. Load and prepare biological sequence data
2. Initialize ESM-2 with TransformerEngine layers
3. Configure megatron-fsdp for memory-efficient multi-GPU training
4. Implement a training loop with proper checkpointing
Key design decisions:
- We use megatron-fsdp ZeRO-3 for maximum memory efficiency
- TransformerEngine FP8 is enabled for H100+ hardware
- Context parallelism handles long biological sequences
"""
import hydra
from omegaconf import DictConfig
import torch
from torch.distributed import init_process_group, destroy_process_group
@hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
def main(args: DictConfig):
    """Main training entrypoint."""
    # 1. Initialize distributed training
    init_process_group(backend="nccl")

    # 2. Load model and data (with clear explanations)
    model = load_model(args.model_config)
    dataloader = create_dataloader(args.data_config)

    # 3. Configure optimization strategy
    optimizer = setup_optimizer(model, args.optimizer_config)

    # 4. Training loop with checkpointing
    train_loop(model, dataloader, optimizer, args)

    destroy_process_group()


if __name__ == "__main__":
    main()

Key principles:
- Comprehensive docstrings explaining what the recipe demonstrates
- Inline comments for non-obvious design decisions
- Modular functions with clear responsibilities
- Error handling for common failure modes
- Progress logging so users understand training status (see the sketch below)
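A minimal sketch of what such a loop might look like, assuming an HF-style model whose forward pass returns an object with a .loss attribute; the save_checkpoint helper, the logging cadence, and the args.trainer.output_dir field are illustrative choices, not a required API:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def save_checkpoint(model, optimizer, step, output_dir):
    """Save a resumable checkpoint (minimal illustration)."""
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        f"{output_dir}/checkpoint_{step}.pt",
    )


def train_loop(model, dataloader, optimizer, args):
    """Minimal loop showing progress logging and periodic checkpointing."""
    model.train()
    for step, batch in enumerate(dataloader):
        if step >= args.training.num_train_steps:
            break

        loss = model(**batch).loss  # assumes the model returns an output with a .loss attribute
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Progress logging so users can follow training status.
        if step % 10 == 0:
            logger.info("step=%d loss=%.4f", step, loss.item())

        # Periodic checkpointing at the configured interval.
        if step > 0 and step % args.training.save_interval == 0:
            save_checkpoint(model, optimizer, step, args.trainer.output_dir)
```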
Use Hydra for clean, hierarchical configuration management:
# hydra_config/defaults.yaml
# Model configuration
model:
  name: esm2_t12_35M_UR50D
  use_te: true
  fp8_training: false

# Training configuration
training:
  micro_batch_size: 4
  gradient_accumulation_steps: 1
  num_train_steps: 1000
  save_interval: 100

# Optimization configuration
optimizer:
  type: adamw
  lr: 1e-4
  weight_decay: 0.01
  betas: [0.9, 0.98]

# Distributed training
distributed:
  backend: nccl
  mfsdp:
    enable: true
    sharding_strategy: zero3

# Logging
wandb:
  project: bionemo-recipes
  mode: online

# hydra_config/L0_sanity.yaml
defaults:
  - defaults

# Override for fast testing
model:
  name: esm2_t6_8M_UR50D # Smallest model for speed

training:
  micro_batch_size: 2
  num_train_steps: 5 # Minimal steps for CI/CD
  save_interval: 5

wandb:
  mode: offline # No external logging in CI

# hydra_config/L1_{model_size}_perf.yaml
defaults:
  - defaults

# Configuration for performance benchmarking
model:
  name: esm2_t12_35M_UR50D
  fp8_training: true # Enable FP8 for performance

training:
  micro_batch_size: 8
  num_train_steps: 100 # Enough steps for stable metrics

wandb:
  name: "esm2_mfsdp_benchmark"
  tags: ["L1", "benchmark", "performance"]

Whenever possible, initialize objects by directly passing config objects as **kwargs:
model = MyModel(**config.model_kwargs)
optimizer = AdamW(model.parameters(), **config.optimizer_kwargs)
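For illustration, here is the pattern end to end with OmegaConf; the optimizer_kwargs key name and the values below are placeholders rather than a required schema:

```python
import torch
from omegaconf import OmegaConf
from torch.optim import AdamW

# Stand-in for a config that Hydra would normally load from hydra_config/*.yaml.
config = OmegaConf.create({"optimizer_kwargs": {"lr": 1e-4, "weight_decay": 0.01}})

model = torch.nn.Linear(8, 8)  # placeholder model for the example
optimizer = AdamW(model.parameters(), **config.optimizer_kwargs)
```

Keeping the config keys aligned with constructor signatures avoids an extra layer of argument-translation code between the YAML files and the objects they configure.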
Ensure the following tests are implemented in each recipe. Your test_train.py must include L0 tests that run quickly in CI/CD. Include both tests that execute main in the same process and tests that launch train.py via torchrun or accelerate launch, as a user would.
import subprocess
import sys
from pathlib import Path

import pytest
from hydra import compose, initialize_config_dir

from train import main


def test_train_main_in_same_process(monkeypatch, session_temp_dir: Path):
    """Test that train.py runs successfully with sanity config and creates expected outputs."""
    # Get the recipe directory
    recipe_dir = Path(__file__).parent

    # Set required environment variables for distributed training
    monkeypatch.setenv("LOCAL_RANK", "0")
    monkeypatch.setenv("RANK", "0")
    monkeypatch.setenv("WORLD_SIZE", "1")
    monkeypatch.setenv("MASTER_ADDR", "localhost")
    monkeypatch.setenv("MASTER_PORT", "29500")
    monkeypatch.setenv("WANDB_MODE", "disabled")
    monkeypatch.setenv("ACCELERATE_MIXED_PRECISION", "fp8")
    monkeypatch.setenv("ACCELERATE_FP8_BACKEND", "TE")

    with initialize_config_dir(
        config_dir=str(recipe_dir / "hydra_config"), version_base="1.2"
    ):
        sanity_config = compose(
            config_name="L0_sanity",
            overrides=[f"trainer.output_dir={session_temp_dir}"],
        )
        main(sanity_config)
@pytest.mark.parametrize(
    "accelerate_config", ["some_config.yaml", "some_other_config.yaml"]
)
def test_accelerate_launch(accelerate_config, tmp_path):
    """Test that accelerate launch runs successfully."""
    recipe_dir = Path(__file__).parent
    accelerate_config_path = recipe_dir / "accelerate_config" / accelerate_config

    # Run 'accelerate launch train.py' as a subprocess
    subprocess.run(
        [
            sys.executable,
            "-m",
            "accelerate.commands.launch",
            "--config_file",
            str(accelerate_config_path),
            "train.py",
            "--config-name",
            "L0_sanity",
            f"trainer.output_dir={tmp_path}",
        ],
        cwd=recipe_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
        timeout=240,
    )

L1 tests should be specified using an L1_{model_name}_{test_type}.yaml config file. These should be
workflows that will be launched using a common SLURM or Lepton batch script, complete in under 4
hours, and have a clear set of performance metrics to validate.
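One workable shape for that validation is to have the training script write a small metrics summary at the end of the run and assert against documented thresholds in the L1 test. The metrics.json filename, its keys, and the threshold handling below are hypothetical, not a required interface:

```python
import json
from pathlib import Path


def validate_l1_throughput(output_dir: Path, min_tokens_per_sec: float) -> None:
    """Fail the L1 run if throughput regressed below the documented benchmark."""
    metrics = json.loads((output_dir / "metrics.json").read_text())
    observed = metrics["tokens_per_sec"]
    assert observed >= min_tokens_per_sec, (
        f"Throughput regression: {observed:.0f} tokens/sec < {min_tokens_per_sec:.0f}"
    )
```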
Provide a reference SLURM script that works across different cluster configurations:
#!/bin/bash
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=1 # one task per node; the launcher spawns one process per GPU <required>
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00 # wall time
#SBATCH --mem=0 # all mem avail
set -x -e
ulimit -c 0
export GPUS_PER_NODE=8
export CMD="accelerate launch \
--config_file accelerate_config/some_config.yaml \
--machine_rank "\$SLURM_NODEID" \
--num_machines "$SLURM_NNODES" \
--main_process_ip "\$SLURM_SRUN_COMM_HOST" \
--main_process_port 12340 \
--num_processes "$(( $SLURM_NNODES * $GPUS_PER_NODE ))" \
train.py
"
# Mount a persistent cache directory to cache dataset downloads and transformations.
export CACHE_DIR=<cache_dir>
srun \
--container-image=<image_name> \
--container-mounts=${PWD}:/workspace/bionemo,$HOME/.netrc:/root/.netrc,$CACHE_DIR:/root/.cache/huggingface \
bash -c "$CMD"SLURM Script Requirements:
- Don't expose any internal cluster-specific hardware details, environment variables, or file paths.
- Demonstrate how to properly format and launch a multi-node SLURM job with the given framework's launcher (e.g., accelerate launch, torchrun, etc.).
Each recipe must include a detailed README.md covering:
- What it demonstrates: Clear statement of the training techniques shown
- Hardware requirements: Minimum and recommended GPU configurations
- Performance expectations: Benchmark results on reference hardware
- Configuration options: How to modify the recipe for different use cases
- Troubleshooting: Common issues and solutions
Document performance metrics for your recipe, for example:
## Performance Benchmarks
### Single Node (8x H100)
- **Throughput**: 2,500 tokens/sec
- **Memory Usage**: 45GB per GPU
- **Model**: ESM-2 650M parameters
- **Batch Size**: 32 (micro_batch_size=4, gradient_accumulation=8)
### Multi Node (2x8 H100)
- **Throughput**: 4,800 tokens/sec
- **Scaling Efficiency**: 96%
- **Network**: InfiniBand

Each recipe must pass the standard test contract, which is a simple pytest invocation:
cd recipes/my_recipe
docker build -t my_recipe .
docker run --rm -it --gpus all my_recipe pytest -v .

L0 tests run automatically on every PR and must complete in under 10 minutes. L1 tests run nightly on dedicated hardware and can take up to 4 hours.
For reference implementations, examine existing recipes:
- esm2_native_te_mfsdp/: Comprehensive example showing vanilla PyTorch with TE and megatron-fsdp
- geneformer_lightning_context_parallel/: PyTorch Lightning with context parallelism for long sequences
A strong recipe delivers:

- Educational value: Users learn something new about scaling biological models
- Production relevance: Techniques are applicable to real research workflows
- Performance validation: Benchmarked results demonstrate clear benefits
- Adaptation friendly: Users can easily modify for their specific needs
Avoid these common pitfalls:

- Over-abstraction: Don't hide important details behind complex abstractions
- Feature creep: Resist adding every possible feature to one recipe
- Brittle dependencies: Pin versions and test with fresh environments
Remember: A great recipe teaches users how to scale their biological foundation models effectively. Focus on clarity, education, and practical applicability over comprehensive use case coverage or error handling.