Commit 5a09039

use changed-files, remove old uv stuff
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

initial recipe add
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

updating changed-files, adding new CI script for recipes
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

1 parent 23b0bbf · commit 5a09039

104 files changed: +35631 additions, −7180 deletions

Lines changed: 70 additions & 0 deletions (new file)

```yaml
name: "BioNeMo Recipes CI"

on:
  push:
    branches:
      - main
      - "pull-request/[0-9]+"
      - "dependabot/**"
  merge_group:
    types: [checks_requested]
  schedule:
    - cron: "0 7 * * *" # Runs at 7 AM UTC daily (12 AM MST)

defaults:
  run:
    shell: bash -x -e -u -o pipefail {0}

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  changed-files:
    if: github.event_name != 'schedule'
    runs-on: ubuntu-latest
    outputs:
      any_changed: ${{ steps.changed-files.outputs.changed_directories }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: step-security/changed-files@v46
        id: changed-files
        with:
          base_sha: main
          dir_names: true
          dir_names_max_depth: 2
          files: |
            'models/**'
            'recipes/**'
      - name: List all changed files
        env:
          ALL_CHANGED_DIRECTORIES: ${{ steps.changed-files.outputs.all_changed_directories }}
        run: |
          for directory in ${ALL_CHANGED_DIRECTORIES}; do
            echo "$directory was changed"
          done

  pre-commit:
    runs-on: ubuntu-latest
    needs: changed-files
    if: needs.changed-files.outputs.any_changed == 'true'
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
          cache: "pip"
      - name: Setup UV
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true
      - run: |
          uv tool install pre-commit --with pre-commit-uv --force-reinstall
          uv tool install tach>=0.9.0
          uv tool update-shell
      - run: ./ci/scripts/static_checks.sh
```
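With `dir_names: true` and `dir_names_max_depth: 2`, the changed-files step reports changed *directories* (e.g. `recipes/esm2_native_te_nvfsdp`) rather than individual file paths. As an illustrative sketch (not part of the commit), the directory-collapsing behavior can be modeled like this:

```python
# Illustrative model of a dir_names / dir_names_max_depth: 2 style output:
# collapse changed file paths to their parent directories, truncated to
# at most two path components.
from pathlib import PurePosixPath

def changed_directories(paths, max_depth=2):
    """Return sorted unique parent directories, at most max_depth deep."""
    dirs = set()
    for path in paths:
        parts = PurePosixPath(path).parent.parts[:max_depth]
        if parts:  # skip files at the repository root
            dirs.add("/".join(parts))
    return sorted(dirs)

print(changed_directories([
    "recipes/esm2_native_te_nvfsdp/train.py",
    "recipes/esm2_native_te_nvfsdp/Dockerfile",
    "models/amplify/src/model.py",
]))
# → ['models/amplify', 'recipes/esm2_native_te_nvfsdp']
```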

.github/workflows/unit-tests.yml

Lines changed: 38 additions & 3 deletions

```diff
@@ -20,18 +20,51 @@ concurrency:
   cancel-in-progress: true
 
 jobs:
+  changed-files:
+    if: github.event_name != 'schedule'
+    runs-on: ubuntu-latest
+    outputs:
+      any_changed: ${{ steps.changed-files.outputs.any_changed }}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - uses: step-security/changed-files@v46
+        id: changed-files
+        with:
+          base_sha: main
+          files: |
+            '!models/**'
+            '!recipes/**'
+            '!**.md'
+      - name: List all changed files
+        env:
+          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
+        run: |
+          for file in ${ALL_CHANGED_FILES}; do
+            echo "$file was changed"
+          done
+
   pre-commit:
     runs-on: ubuntu-latest
+    needs: changed-files
+    if: needs.changed-files.outputs.any_changed == 'true'
     steps:
       - uses: actions/checkout@v4
         with:
           fetch-depth: 0
-          submodules: "recursive"
       - uses: actions/setup-python@v5
         with:
           python-version: "3.13"
           cache: "pip"
-      - run: pip install -r requirements-dev.txt
+      - name: Setup UV
+        uses: astral-sh/setup-uv@v6
+        with:
+          enable-cache: true
+      - run: |
+          uv tool install pre-commit --with pre-commit-uv --force-reinstall
+          uv tool install tach>=0.9.0
+          uv tool update-shell
       - run: ./ci/scripts/static_checks.sh
 
 # With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata.
@@ -65,8 +98,9 @@ jobs:
     needs:
       - pre-commit
       - get-pr-labels
+      - changed-files
     runs-on: linux-amd64-cpu16
-    if: ${{ !contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'SKIP_CI') }}
+    if: ${{ !contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'SKIP_CI') && (needs.changed-files.outputs.any_changed == 'true' || needs.changed-files.result == 'skipped') }}
     steps:
       - name: Login to Docker Hub
         uses: docker/login-action@v3
@@ -185,6 +219,7 @@ jobs:
     needs:
       - build-bionemo-image
       - get-pr-labels
+      - changed-files
     runs-on: linux-amd64-gpu-l4-latest-1
     if: |
       github.event_name == 'schedule' || github.event_name == 'merge_group' ||
```
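The updated `if:` expression gates the build job on both the `SKIP_CI` label and the changed-files result. Modeled in Python as an illustrative sketch (ignoring GitHub Actions' own skip-propagation rules):

```python
# Illustrative model of the build job's gating expression:
# skip when the PR carries SKIP_CI; otherwise run when relevant files
# changed, or when the changed-files job was skipped (e.g. scheduled runs).
def should_run_build(labels, any_changed, changed_files_result):
    if "SKIP_CI" in labels:
        return False
    return any_changed == "true" or changed_files_result == "skipped"

print(should_run_build([], "true", "success"))           # PR touching CI files
print(should_run_build(["SKIP_CI"], "true", "success"))  # label wins
print(should_run_build([], "", "skipped"))               # scheduled run
```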

.gitignore

Lines changed: 12 additions & 0 deletions

```diff
@@ -194,3 +194,15 @@ coverage.xml
 Thumbs.db
 
 .python_history
+
+# Any training results
+results/
+job_output/
+wandb/
+
+# Any model checkpoints
+*.safetensors
+checkpoint_export/
+
+# Hydra outputs
+outputs/
```

README.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -15,6 +15,12 @@ expert-level support.
 
 BioNeMo Framework is part of a larger ecosystem of NVIDIA Biopharma products. Get notified of new releases, bug fixes, critical security updates, and more for biopharma. [Subscribe.](https://www.nvidia.com/en-us/clara/biopharma/product-updates/)
 
+> [!NOTE]
+> BioNeMo Recipes are now available, which demonstrate high-performance model training outside of the NeMo Framework.
+> The recipes show how to train models that derive from HuggingFace `PreTrainedModel` classes, and use
+> [NVIDIA TransformerEngine](https://github.com/NVIDIA/TransformerEngine) layers for optimized attention kernels. For
+> more information, see the [BioNeMo Recipes README](./bionemo-recipes.md).
+
 ## Structure of the Framework
 
 The `bionemo-framework` is organized into independently installable namespace packages. These are located under the
```

bionemo-recipes.md

Lines changed: 194 additions & 0 deletions (new file)

# BioNeMo Recipes

BioNeMo Recipes provide an easy path for the biological foundation model training community to scale up transformer-based models efficiently. Rather than offering a batteries-included training framework, we provide **model checkpoints** with TransformerEngine layers and **training recipes** that demonstrate how to achieve maximum throughput with popular open-source frameworks.

## Overview

The biological AI community is actively prototyping model architectures and needs tooling that prioritizes extensibility, interoperability, and ease of use alongside performance. BioNeMo Recipes address this by offering:

- **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
- **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
- **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments

### Use Cases

- **Foundation Model Developers**: AI researchers and ML engineers developing novel biological foundation models who need to scale up prototypes efficiently
- **Foundation Model Customizers**: Domain scientists looking to fine-tune existing models with proprietary data for drug discovery and biological research

## Repository Structure

This repository contains two types of components:

### Models (`models/`)

Hugging Face-compatible `PreTrainedModel` classes that use TransformerEngine layers internally. These are designed to be:

- **Distributed via the Hugging Face Hub**: Pre-converted checkpoints are available at [huggingface.co/nvidia](https://huggingface.co/nvidia)
- **Drop-in replacements**: Compatible with `AutoModel.from_pretrained()` without additional dependencies
- **Performance-optimized**: Leverage TransformerEngine features like FP8 training and context parallelism

Example models include ESM-2, Geneformer, and AMPLIFY.

### Recipes (`recipes/`)

Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:

- **Framework examples**: Vanilla PyTorch, HuggingFace Accelerate, PyTorch Lightning
- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
- **Scaling strategies**: Single-GPU to multi-node training patterns
- **Benchmarked performance**: Validated throughput and convergence metrics

Recipes are **not pip-installable packages**; they serve as reference implementations that users can adapt for their own research.
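One of the features the recipes demonstrate is sequence packing. As a rough, illustrative sketch (the real recipes use optimized implementations), greedy first-fit packing of variable-length sequences into fixed-size token buffers looks like this:

```python
# Illustrative sketch of sequence packing: greedily pack variable-length
# sequences into fixed-size buffers so fewer pad tokens are wasted.
# Assumes each sequence fits in one buffer (length <= max_tokens).
def pack_sequences(lengths, max_tokens):
    """Greedy first-fit packing; returns lists of sequence indices per buffer."""
    bins = []       # each bin holds the indices of the sequences packed into it
    remaining = []  # free token capacity per bin
    for i, n in enumerate(lengths):
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:  # no existing bin has room: open a new one
            bins.append([i])
            remaining.append(max_tokens - n)
    return bins

print(pack_sequences([300, 200, 500, 100], max_tokens=512))
# → [[0, 1], [2], [3]]
```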
## Quick Start

### Using Models

```python
from transformers import AutoModel, AutoTokenizer

# Load a BioNeMo model directly from Hugging Face
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
```

### Running Recipes

```bash
# Navigate to a recipe
cd recipes/esm2_native_te_nvfsdp

# Build and run
docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```

---

## Developer Guide

### Setting Up Development Environment

1. **Install pre-commit hooks:**

   ```bash
   pre-commit install
   ```

   Run the hooks manually with:

   ```bash
   pre-commit run --all-files
   ```

2. **Test your changes:**
   Each model and recipe has its own build and test setup following this pattern:

   ```bash
   cd models/my_model  # or recipes/my_recipe
   docker build . -t my_tag
   docker run --rm -it --gpus all my_tag pytest -v .
   ```

### Coding Guidelines

We prioritize **readability and simplicity** over comprehensive feature coverage:

- **KISS over DRY**: It's better to have clear, duplicated code than complex abstractions
- **One thing well**: Each recipe should demonstrate specific features clearly rather than trying to cover everything
- **Self-contained**: Recipes cannot depend on cutting-edge code from other parts of the repository

### Testing Strategy

We use a three-tier testing approach:

#### L0 Tests (Pre-merge)

- **Purpose**: Fast validation that code works
- **Runtime**: <10 minutes, single GPU
- **Frequency**: Run automatically on PRs
- **Scope**: Basic functionality, checkpoint creation/loading

#### L1 Tests (Performance Monitoring)

- **Purpose**: Performance benchmarking and partial convergence validation
- **Runtime**: Up to 4 hours, up to 16 GPUs
- **Frequency**: Nightly/weekly
- **Scope**: Throughput metrics, scaling validation

#### L2 Tests (Release Validation)

- **Purpose**: Full convergence and large-scale validation
- **Runtime**: Multiple days, hundreds of GPUs
- **Frequency**: Monthly or before releases
- **Scope**: Complete model convergence, cross-platform validation
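The three tiers can also be summarized as data. The encoding below is an illustrative sketch, with L2's concrete budget and GPU ceiling (3 days, 512 GPUs) assumed from "multiple days, hundreds of GPUs":

```python
# Illustrative encoding of the three CI tiers. L0/L1 numbers restate the
# text above; the L2 figures are assumptions made for this sketch.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    budget_minutes: int
    max_gpus: int
    cadence: str

TIERS = {
    "L0": Tier(budget_minutes=10, max_gpus=1, cadence="per-PR"),
    "L1": Tier(budget_minutes=4 * 60, max_gpus=16, cadence="nightly/weekly"),
    "L2": Tier(budget_minutes=3 * 24 * 60, max_gpus=512, cadence="monthly/release"),
}

def select_tier(minutes, gpus):
    """Smallest tier whose budget covers a test's runtime and GPU count."""
    for name, tier in TIERS.items():
        if minutes <= tier.budget_minutes and gpus <= tier.max_gpus:
            return name
    return None

print(select_tier(5, 1))    # prints L0
print(select_tier(120, 8))  # prints L1
```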
### Adding New Components

#### Adding a New Model

Models should be pip-installable packages that can export checkpoints to Hugging Face. See the
[models README](models/README.md) for detailed guidelines on:

- Package structure and conventions
- Checkpoint export procedures
- Testing requirements
- CI/CD integration

#### Adding a New Recipe

Recipes should be self-contained Docker environments demonstrating specific training patterns. See
the [recipes README](recipes/README.md) for guidance on:

- Directory structure and naming
- Hydra configuration management
- Docker best practices
- SLURM integration examples

### CI/CD Contract

All components must pass this basic validation:

```bash
docker build -t {component_tag} .
docker run --rm -it --gpus all {component_tag} pytest -v .
```

#### Running CI/CD

To run the CI/CD pipeline locally, run:

```bash
./ci/build_and_test.py
```
### Performance Expectations

We aim to provide the fastest available training implementations for biological foundation models, with documented benchmarks across NVIDIA hardware (A100, H100, H200, B100, B200, etc.).

## Contributing

We welcome contributions that advance the state of biological foundation model training. Please ensure your contributions:

1. Follow our coding guidelines emphasizing clarity
2. Include appropriate tests (L0 minimum, L1/L2 as applicable)
3. Provide clear documentation and examples
4. Maintain compatibility with our supported frameworks

For detailed contribution guidelines, see our individual component READMEs:

- [Models Development Guide](models/README.md)
- [Recipes Development Guide](recipes/README.md)

## License

[Add appropriate license information]

## Support

For technical support and questions:

- Check existing issues before opening a new one
- Review our training recipes for implementation examples
- Consult the TransformerEngine and nvFSDP documentation for the underlying technologies
