# BioNemo Recipes

BioNemo Recipes provides an easy path for the biological foundation model training community to scale up transformer-based models efficiently. Rather than offering a batteries-included training framework, we provide **model checkpoints** with TransformerEngine layers and **training recipes** that demonstrate how to achieve maximum throughput with popular open-source frameworks.

## Overview

The biological AI community is actively prototyping model architectures and needs tooling that prioritizes extensibility, interoperability, and ease of use alongside performance. BioNemo Recipes addresses this by offering:

- **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configuration
- **Framework compatibility**: Works with popular frameworks such as Hugging Face Accelerate, PyTorch Lightning, and vanilla PyTorch
- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
- **Research-friendly code**: Hackable, readable implementations that researchers can easily adapt for their experiments

### Use Cases

- **Foundation Model Developers**: AI researchers and ML engineers developing novel biological foundation models who need to scale up prototypes efficiently
- **Foundation Model Customizers**: Domain scientists looking to fine-tune existing models on proprietary data for drug discovery and biological research

## Repository Structure

This repository contains two types of components:

### Models (`models/`)

Hugging Face-compatible `PreTrainedModel` classes that use TransformerEngine layers internally. These are designed to be:

- **Distributed via the Hugging Face Hub**: Pre-converted checkpoints available at [huggingface.co/nvidia](https://huggingface.co/nvidia)
- **Drop-in replacements**: Compatible with `AutoModel.from_pretrained()` without additional dependencies
- **Performance-optimized**: Leverage TransformerEngine features such as FP8 training and context parallelism

Example models include ESM-2, Geneformer, and AMPLIFY.

### Recipes (`recipes/`)

Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:

- **Framework examples**: Vanilla PyTorch, Hugging Face Accelerate, PyTorch Lightning
- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
- **Scaling strategies**: Single-GPU to multi-node training patterns
- **Benchmarked performance**: Validated throughput and convergence metrics

Recipes are **not pip-installable packages**; they serve as reference implementations that users can adapt for their own research.
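
One of those features, sequence packing, is easy to sketch in isolation: a greedy first-fit packer that groups variable-length sequences into bins with a fixed token budget, so each batch wastes fewer padding tokens. This is an illustrative simplification, not the recipes' actual implementation:

```python
def pack_sequences(lengths, max_tokens):
    """Greedy first-fit packing: group variable-length sequences into
    bins holding at most max_tokens tokens each. Returns lists of indices."""
    bins, bin_loads = [], []
    # Visit sequences longest-first so large sequences claim bins early.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, load in enumerate(bin_loads):
            if load + lengths[idx] <= max_tokens:
                bins[b].append(idx)
                bin_loads[b] += lengths[idx]
                break
        else:
            # No existing bin fits; open a new one.
            bins.append([idx])
            bin_loads.append(lengths[idx])
    return bins
```

For example, packing sequences of lengths `[512, 300, 200, 100]` into 512-token bins pairs the 300- and 200-token sequences while the 512-token sequence fills a bin alone.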

## Quick Start

### Using Models

```python
from transformers import AutoModel, AutoTokenizer

# Load a BioNemo model directly from Hugging Face
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
```

### Running Recipes

```bash
# Navigate to a recipe
cd recipes/esm2_native_te_nvfsdp

# Build and run
docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```

______________________________________________________________________

## Developer Guide

### Setting Up Development Environment

1. **Install pre-commit hooks:**

   ```bash
   pre-commit install
   ```

   Run the hooks manually with:

   ```bash
   pre-commit run --all-files
   ```

2. **Test your changes:**
   Each model and recipe has its own build and test setup following this pattern:

   ```bash
   cd models/my_model  # or recipes/my_recipe
   docker build . -t my_tag
   docker run --rm -it --gpus all my_tag pytest -v .
   ```

### Coding Guidelines

We prioritize **readability and simplicity** over comprehensive feature coverage:

- **KISS over DRY**: Clear, duplicated code is better than complex abstractions
- **Do one thing well**: Each recipe should demonstrate a specific set of features clearly rather than trying to cover everything
- **Self-contained**: Recipes must not depend on unreleased code from other parts of the repository

### Testing Strategy

We use a three-tier testing approach:

#### L0 Tests (Pre-merge)

- **Purpose**: Fast validation that the code works
- **Runtime**: Under 10 minutes, single GPU
- **Frequency**: Run automatically on PRs
- **Scope**: Basic functionality, checkpoint creation/loading

#### L1 Tests (Performance Monitoring)

- **Purpose**: Performance benchmarking and partial convergence validation
- **Runtime**: Up to 4 hours, up to 16 GPUs
- **Frequency**: Nightly or weekly
- **Scope**: Throughput metrics, scaling validation

#### L2 Tests (Release Validation)

- **Purpose**: Full convergence and large-scale validation
- **Runtime**: Multiple days, hundreds of GPUs
- **Frequency**: Monthly or before releases
- **Scope**: Complete model convergence, cross-platform validation
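
One way a tiered scheme like this can be wired up is by tagging each test with its tier and filtering by the CI run's budget. The decorator below is a hypothetical sketch for illustration, not the repository's actual mechanism (which may use pytest markers instead):

```python
TIER_BUDGET = {"l0": 0, "l1": 1, "l2": 2}  # cheaper tiers have lower rank

def tier(name):
    """Decorator that tags a test function with its tier (l0/l1/l2)."""
    def wrap(fn):
        fn.tier = name
        return fn
    return wrap

def select(tests, max_tier):
    """Return only the tests whose tier fits within the run's budget."""
    limit = TIER_BUDGET[max_tier]
    return [t for t in tests if TIER_BUDGET[getattr(t, "tier", "l0")] <= limit]

@tier("l0")
def test_checkpoint_roundtrip(): ...

@tier("l1")
def test_throughput(): ...

@tier("l2")
def test_full_convergence(): ...
```

A nightly run with `select(tests, "l1")` would pick up the L0 and L1 tests while leaving the multi-day L2 convergence test for release validation.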

### Adding New Components

#### Adding a New Model

Models should be pip-installable packages that can export checkpoints to Hugging Face. See the [models README](models/README.md) for detailed guidelines on:

- Package structure and conventions
- Checkpoint export procedures
- Testing requirements
- CI/CD integration

#### Adding a New Recipe

Recipes should be self-contained Docker environments demonstrating specific training patterns. See the [recipes README](recipes/README.md) for guidance on:

- Directory structure and naming
- Hydra configuration management
- Docker best practices
- SLURM integration examples
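
Hydra's configuration management centers on `key=value` command-line overrides addressed by dotted path (e.g. `trainer.lr=3e-4`). The merge behavior can be sketched in plain Python; this is an illustration of the convention only, not Hydra itself:

```python
import ast

def apply_overrides(config, overrides):
    """Merge Hydra-style 'a.b=value' override strings into a nested dict."""
    for item in overrides:
        path, _, raw = item.partition("=")
        keys = path.split(".")
        node = config
        # Walk (creating as needed) down to the parent of the target key.
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        # Best-effort literal parsing (ints, floats, bools); else keep string.
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw
        node[keys[-1]] = value
    return config
```

Running `apply_overrides({"trainer": {"lr": 1e-4}}, ["trainer.lr=3e-4", "model.hidden=768"])` replaces the learning rate and creates the missing `model` subtree, which mirrors how a recipe's defaults can be tweaked from the command line without editing YAML.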

### CI/CD Contract

All components must pass this basic validation:

```bash
docker build -t {component_tag} .
docker run --rm -it --gpus all {component_tag} pytest -v .
```

#### Running CI/CD

Run the CI/CD pipeline locally with:

```bash
./ci/build_and_test.py
```
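
A driver like this typically discovers component directories by their Dockerfile and applies the contract above to each. The sketch below shows that shape under stated assumptions; the actual `ci/build_and_test.py` may differ:

```python
import subprocess
from pathlib import Path

def find_components(repo_root):
    """Discover buildable components: any models/* or recipes/* directory
    that contains a Dockerfile."""
    root = Path(repo_root)
    return sorted(
        p.parent
        for group in ("models", "recipes")
        for p in (root / group).glob("*/Dockerfile")
    )

def build_and_test(component):
    """Apply the CI contract: docker build, then run pytest inside the image."""
    tag = component.name
    subprocess.run(["docker", "build", "-t", tag, "."], cwd=component, check=True)
    subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all", tag, "pytest", "-v", "."],
        check=True,
    )
```

`find_components` returns an empty list when neither directory exists, so the same driver works in partial checkouts.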

### Performance Expectations

We aim to provide the fastest available training implementations for biological foundation models, with documented benchmarks across NVIDIA hardware (A100, H100, H200, B100, B200, etc.).

## Contributing

We welcome contributions that advance the state of biological foundation model training. Please ensure your contributions:

1. Follow our coding guidelines emphasizing clarity
2. Include appropriate tests (L0 minimum; L1/L2 as applicable)
3. Provide clear documentation and examples
4. Maintain compatibility with our supported frameworks

For detailed contribution guidelines, see the individual component READMEs:

- [Models Development Guide](models/README.md)
- [Recipes Development Guide](recipes/README.md)

## License

[Add appropriate license information]

## Support

For technical support and questions:

- Check existing issues before opening a new one
- Review our training recipes for implementation examples
- Consult the TransformerEngine and nvFSDP documentation for the underlying technologies