Merged
2 changes: 1 addition & 1 deletion .github/pull_request_template.md
@@ -20,7 +20,7 @@ Configure CI behavior by applying the relevant labels:
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

- > \[!NOTE\]
+ > [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs
2 changes: 1 addition & 1 deletion .github/workflows/unit-tests.yml
@@ -29,7 +29,7 @@ jobs:
submodules: "recursive"
- uses: actions/setup-python@v5
with:
- python-version: "3.12"
+ python-version: "3.13"
cache: "pip"
- run: pip install -r requirements-dev.txt
- run: ./ci/scripts/static_checks.sh
3 changes: 3 additions & 0 deletions .mdformat.toml
@@ -1 +1,4 @@
number = true # options: {false, true}
+ exclude = [
+     "docs/docs/index.md", # Exclude highly formatted index page
+ ]
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -14,9 +14,10 @@ repos:
args: ["--fix"]
- id: ruff-format
- repo: https://github.com/executablebooks/mdformat
- rev: 0.7.17 # Use the latest stable version
+ rev: 0.7.22 # Use the latest stable version
hooks:
- id: mdformat
+ language_version: python3.13
additional_dependencies:
- mdformat-tables
- mdformat-gfm
2 changes: 1 addition & 1 deletion README.md
@@ -82,7 +82,7 @@ We distribute a [development container](https://devcontainers.github.io/) config
(`.devcontainer/devcontainer.json`) that simplifies the process of local testing and development. Opening the
bionemo-framework folder with VSCode should prompt you to re-open the folder inside the devcontainer environment.

- > \[!NOTE\]
+ > [!NOTE]
> The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally
> (using the command shown above) will ensure that most of the layers are present in the local docker cache.

37 changes: 21 additions & 16 deletions docs/docs/index.md
@@ -3,45 +3,50 @@ hide:
- navigation
---


**NVIDIA BioNeMo Framework** is a collection of programming tools, libraries, and models for computational drug
discovery. It accelerates the most time-consuming and costly stages of building and adapting biomolecular AI models by
providing domain-specific, optimized models and tooling that are easily integrated into GPU-based computational
resources for the fastest performance on the market. You can access BioNeMo Framework as a free community resource or
learn more about getting an enterprise license for improved expert-level support at the
[BioNeMo homepage](https://www.nvidia.com/en-us/clara/bionemo/).


<div class="grid cards" markdown>

- :material-book-open-variant:{ .lg } __User Guide__
- :material-book-open-variant:{ .lg } __User Guide__

---

Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.

[Get Started](main/about/overview/){ .md-button .md-button }

______________________________________________________________________
- :material-code-greater-than:{ .lg } __API Reference__

Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.
---

[Get Started](main/about/overview/){ .md-button .md-button }
Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.

- :material-code-greater-than:{ .lg } __API Reference__
[API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }

______________________________________________________________________
- :material-cube-outline:{ .lg } __Models__

Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.
---

[API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }
Explore detailed instructions and best practices for using BioNeMo models in your research.

- :material-cube-outline:{ .lg } __Models__
[Explore Models](models){ .md-button .md-button }

______________________________________________________________________

Explore detailed instructions and best practices for using BioNeMo models in your research.

[Explore Models](models){ .md-button .md-button }
- :material-database-outline:{ .lg } __Datasets__

- :material-database-outline:{ .lg } __Datasets__
---

______________________________________________________________________
Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.

Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.
[Explore Datasets](main/datasets/){ .md-button .md-button }

[Explore Datasets](main/datasets/){ .md-button .md-button }

</div>
12 changes: 6 additions & 6 deletions docs/docs/main/about/background/megatron_datasets.md
@@ -13,10 +13,10 @@ would apply different random masks or different data augmentation strategies eac
provides some utilities that make multi-epoch training easier, while obeying the determinism requirements of
megatron.

- The \[MultiEpochDatasetResampler\]\[bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler\] class simplifies the
+ The [MultiEpochDatasetResampler][bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler] class simplifies the
process of multi-epoch training, where the data should both be re-shuffled each epoch with different random effects
applied each time the data is seen. To be compatible with this resampler, the provided dataset class's `__getitem__`
- method should accept a \[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] tuple that contains both an epoch
+ method should accept a [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] tuple that contains both an epoch
and index value. Random effects can then be performed by setting the torch random seed based on the epoch value:

```python
@@ -37,9 +37,9 @@ details.
```
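Outside the diff, the epoch-seeding pattern described above can be sketched as follows. All names here are illustrative stand-ins (the real `EpochIndex` lives in `bionemo.core.data.multi_epoch_dataset`), and a locally seeded `random.Random` stands in for `torch.manual_seed` to keep the sketch dependency-free:

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Illustrative stand-in for bionemo's (epoch, idx) tuple."""

    epoch: int
    idx: int


class MaskedDataset:
    """Masks tokens differently each epoch, but deterministically per (epoch, idx)."""

    def __init__(self, sequences, mask_prob: float = 0.15):
        self.sequences = sequences
        self.mask_prob = mask_prob

    def __len__(self) -> int:
        return len(self.sequences)

    def __getitem__(self, index: EpochIndex):
        # Seed on (epoch, idx) so repeated lookups of the same pair return the
        # same mask (megatron's determinism requirement), while the mask still
        # changes from one epoch to the next.
        rng = random.Random(index.epoch * 1_000_003 + index.idx)
        return ["<mask>" if rng.random() < self.mask_prob else tok
                for tok in self.sequences[index.idx]]
```

Because the generator is seeded per call rather than drawn from global state, the output depends only on the `(epoch, idx)` pair, never on iteration order.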

For deterministic datasets that still want to train for multiple epochs with epoch-level shuffling, the
- \[IdentityMultiEpochDatasetWrapper\]\[bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper\] class can
+ [IdentityMultiEpochDatasetWrapper][bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper] class can
simplify this process by wrapping a dataset that accepts integer indices and passes along the
- \[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] index values from the resampled dataset.
+ [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] index values from the resampled dataset.

```python
class MyDeterministicDataset:
@@ -53,7 +53,7 @@ for sample in MultiEpochDatasetResampler(dataset, num_epochs=3, shuffle=True):

## Training Resumption

- To ensure identical behavior with and without job interruption, BioNeMo provides \[MegatronDataModule\]\[bionemo.llm.data.datamodule.MegatronDataModule\] to save and load state dict for training resumption, and provides \[WrappedDataLoader\]\[nemo.lightning.data.WrappedDataLoader\] to add a `mode` attribute to \[DataLoader\]\[torch.utils.data.DataLoader\].
+ To ensure identical behavior with and without job interruption, BioNeMo provides [MegatronDataModule][bionemo.llm.data.datamodule.MegatronDataModule] to save and load state dict for training resumption, and provides [WrappedDataLoader][nemo.lightning.data.WrappedDataLoader] to add a `mode` attribute to [DataLoader][torch.utils.data.DataLoader].

```python
class MyDataModule(MegatronDataModule):
@@ -100,7 +100,7 @@ WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not s
## Testing Datasets for Megatron Compatibility

BioNeMo also provides utility functions for test suites to validate that datasets conform to the megatron data model.
The \[assert_dataset_compatible_with_megatron\]\[bionemo.testing.data_utils.assert_dataset_compatible_with_megatron\]
The [assert_dataset_compatible_with_megatron][bionemo.testing.data_utils.assert_dataset_compatible_with_megatron]
function calls the dataset with identical indices and ensures the outputs are identical, while also checking to see if
`torch.manual_seed` was used.
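The core of that determinism check can be sketched without the library. This re-implements the idea (query each index twice and compare the outputs), not the actual signature of `assert_dataset_compatible_with_megatron`:

```python
import random


def check_dataset_determinism(dataset, indices) -> None:
    """Raise AssertionError if the dataset returns different outputs for the same index."""
    for idx in indices:
        first = dataset[idx]
        second = dataset[idx]
        assert first == second, f"non-deterministic output at index {idx}"


class ToyDataset:
    """Deterministic per-index randomness via a locally seeded generator."""

    def __getitem__(self, index: int):
        rng = random.Random(index)  # seed on the index, not on global state
        return [rng.random() for _ in range(4)]


# A dataset that seeded global state instead would fail this check.
check_dataset_determinism(ToyDataset(), range(5))
```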

32 changes: 16 additions & 16 deletions docs/docs/main/about/releasenotes-fw.md
@@ -169,26 +169,26 @@

### New Features

- - \[Documentation\] Updated, executable ESM-2nv notebooks demonstrating: Data preprocessing and model training with custom datasets, Fine-tuning on FLIP data, Inference on OAS sequences, Pre-training from scratch and continuing training
- - \[Documentation\] New notebook demonstrating Zero-Shot Protein Design Using ESM-2nv. Thank you to @awlange from A-Alpha Bio for contributing the original version of this recipe!
+ - [Documentation] Updated, executable ESM-2nv notebooks demonstrating: Data preprocessing and model training with custom datasets, Fine-tuning on FLIP data, Inference on OAS sequences, Pre-training from scratch and continuing training
+ - [Documentation] New notebook demonstrating Zero-Shot Protein Design Using ESM-2nv. Thank you to @awlange from A-Alpha Bio for contributing the original version of this recipe!

### Bug fixes and Improvements

- - \[Geneformer\] Fixed bug in preprocessing due to a relocation of dependent artifacts.
- - \[Geneformer\] Fixes bug in finetuning to use the newer preprocessing constructor.
+ - [Geneformer] Fixed bug in preprocessing due to a relocation of dependent artifacts.
+ - [Geneformer] Fixes bug in finetuning to use the newer preprocessing constructor.

## BioNeMo Framework v1.8

### New Features

- - \[Documentation\] Updated, executable MolMIM notebooks demonstrating: Training on custom data, Inference and downstream prediction, ZINC15 dataset preprocesing, and CMA-ES optimization
- - \[Dependencies\] Upgraded the framework to [NeMo v1.23](https://github.com/NVIDIA/NeMo/tree/v1.23.0), which updates PyTorch to version 2.2.0a0+81ea7a4 and CUDA to version 12.3.
+ - [Documentation] Updated, executable MolMIM notebooks demonstrating: Training on custom data, Inference and downstream prediction, ZINC15 dataset preprocesing, and CMA-ES optimization
+ - [Dependencies] Upgraded the framework to [NeMo v1.23](https://github.com/NVIDIA/NeMo/tree/v1.23.0), which updates PyTorch to version 2.2.0a0+81ea7a4 and CUDA to version 12.3.

### Bug fixes and Improvements

- - \[ESM2\] Fixed a bug in gradient accumulation in encoder fine-tuning
- - \[MegaMolBART\] Make MegaMolBART encoder finetuning respect random seed set by user
- - \[MegaMolBART\] Finetuning with val_check_interval=1 bug fix
+ - [ESM2] Fixed a bug in gradient accumulation in encoder fine-tuning
+ - [MegaMolBART] Make MegaMolBART encoder finetuning respect random seed set by user
+ - [MegaMolBART] Finetuning with val_check_interval=1 bug fix

### Known Issues

@@ -204,8 +204,8 @@

### New Features

- - \[EquiDock\] Remove steric clashes as a post-processing step after equidock inference.
- - \[Documentation\] Updated Getting Started section which sequentially describes prerequisites, BioNeMo Framework access, startup instructions, and next steps.
+ - [EquiDock] Remove steric clashes as a post-processing step after equidock inference.
+ - [Documentation] Updated Getting Started section which sequentially describes prerequisites, BioNeMo Framework access, startup instructions, and next steps.

### Known Issues

@@ -215,11 +215,11 @@

### New Features

- - \[Model Fine-tuning\] `model.freeze_layers` fine-tuning config parameter added to freeze a specified number of layers. Thank you to github user [@nehap25](https://github.com/nehap25)!
- - \[ESM2\] Loading pre-trained ESM-2 weights and continue pre-training on the MLM objective on a custom FASTA dataset is now supported.
- - \[OpenFold\] MLPerf feature 3.2 bug (mha_fused_gemm) fix has merged.
- - \[OpenFold\] MLPerf feature 3.10 integrated into bionemo framework.
- - \[DiffDock\] Updated data loading module for DiffDock model training, changing from sqlite3 backend to webdataset.
+ - [Model Fine-tuning] `model.freeze_layers` fine-tuning config parameter added to freeze a specified number of layers. Thank you to github user [@nehap25](https://github.com/nehap25)!
+ - [ESM2] Loading pre-trained ESM-2 weights and continue pre-training on the MLM objective on a custom FASTA dataset is now supported.
+ - [OpenFold] MLPerf feature 3.2 bug (mha_fused_gemm) fix has merged.
+ - [OpenFold] MLPerf feature 3.10 integrated into bionemo framework.
+ - [DiffDock] Updated data loading module for DiffDock model training, changing from sqlite3 backend to webdataset.

## BioNeMo Framework v1.5

2 changes: 1 addition & 1 deletion docs/docs/main/contributing/code-review.md
@@ -410,7 +410,7 @@ a fruitful interaction across the team members.
that will allow other platforms to continue working.

- Don't write commit messages that are vague or wouldn't make sense to
- partners that read the logs. For example, do not write "\[topic\]
+ partners that read the logs. For example, do not write "[topic]
Bugfix" as your header in the commit message. Keep links to videos
out of the commit message. Again, partners are going to see these
logs and it does not make sense to link to something they will not
4 changes: 2 additions & 2 deletions docs/docs/main/contributing/contributing.md
@@ -12,7 +12,7 @@ sufficient code review before being merged.
## Developer Certificate of Origin (DCO)

We require that all contributors "sign-off" on their commits (not GPG signing, just adding the `-s | --signoff`
argument, or follow the instructions below for auto-signing). This sign-off certifies that you adhere to the Developer
Certificate of Origin (DCO) ([full text](https://developercertificate.org/)); in short that the contribution is your
original work, or you have rights to submit it under the same license or a compatible license.

@@ -171,7 +171,7 @@ For both internal and external developers, the next step is opening a PR:
Note that versioned releases of TensorRT OSS are posted to `release/` branches of the upstream repo.
- Creation of a PR kicks off the code review process.
- At least one TensorRT engineer will be assigned for the review.
- - While under review, mark your PRs as work-in-progress by prefixing the PR title with \[WIP\].
+ - While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].
2. Once ready, CI can be started by a developer with permissions when they add a `/build-ci` comment. This must pass
prior to merging.

Expand Down
10 changes: 5 additions & 5 deletions docs/docs/main/datasets/uniprot.md
@@ -1,7 +1,7 @@
# UniProt Dataset

- The UniProt Knowledgebase (UniProtKB) is an open database of protein sequences curated from translated genomic data \[1\].
- The UniProt Reference Cluster (UniRef) databases provide clustered sets of sequences from UniProtKB \[2\], which have been
+ The UniProt Knowledgebase (UniProtKB) is an open database of protein sequences curated from translated genomic data [1].
+ The UniProt Reference Cluster (UniRef) databases provide clustered sets of sequences from UniProtKB [2], which have been
used in previous large language model training studies to improve diversity in protein training data. UniRef clusters
proteins hierarchically. At the highest level, UniRef100 groups proteins with identical primary sequences from the
UniProt Archive (UniParc). UniRef90 clusters these unique sequences into buckets with 90% sequence similarity, selecting
@@ -10,7 +10,7 @@ UniRef90 representative sequences into groups with 50% sequence similarity.

## Data Used for ESM-2 Pre-training

- Since the original train/test splits from ESM-2 were not available \[3\], we replicated the ESM-2 pre-training experiments
+ Since the original train/test splits from ESM-2 were not available [3], we replicated the ESM-2 pre-training experiments
with UniProt's 2024_03 release. Following the approach described by the ESM-2 authors, we removed artificial sequences
and reserved 0.5% of UniRef50 clusters for validation. From the 65,672,139 UniRef50 clusters, this resulted in 328,360
validation sequences. We then ran MMSeqs to further ensure no contamination of the training set with sequences similar
@@ -22,7 +22,7 @@ randomly chosen UniRef90 sequence from each.
## Data Availability

Two versions of the dataset are distributed, a full training dataset (~80GB) and a 10,000 UniRef50 cluster random slice
- (~150MB). To load and use the sanity dataset, use the \[bionemo.core.data.load\]\[bionemo.core.data.load.load\] function
+ (~150MB). To load and use the sanity dataset, use the [bionemo.core.data.load][bionemo.core.data.load.load] function
to materialize the sanity dataset in the BioNeMo2 cache directory:

```python
@@ -34,7 +34,7 @@ sanity_data_dir = load("esm2/testdata_esm2_pretrain:2.0")
### NGC Resource Links

- [Sanity Dataset](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/esm2_pretrain_nemo2_testdata/files)
- - \[Full Dataset\]
+ - [Full Dataset]

## References

2 changes: 1 addition & 1 deletion docs/docs/main/getting-started/initialization-guide.md
@@ -210,7 +210,7 @@ Below we explain some common `docker run` options and how to use them as part of

### Mounting Volumes with the `-v` Option

The `-v` allows you to mount a host machine's directory as a volume inside the
container. This enables data persistence even after the container is deleted or restarted. In the context of machine
learning workflows, leveraging the `-v` option is essential for maintaining a local cache of datasets, model weights, and
results on the host machine such that they can persist after the container terminates and be reused across container
2 changes: 1 addition & 1 deletion docs/docs/main/getting-started/training-models.md
@@ -25,7 +25,7 @@ bionemo-esm2-train

### Running

First off, we have a utility function for downloading full/test data and model checkpoints called `download_bionemo_data` that our following examples currently use. This will download the object if it is not already on your local system, and then return the path either way. For example if you run this twice in a row, you should expect the second time you run it to return the path almost instantly.

**NOTE**: NVIDIA employees should use `pbss` rather than `ngc` for the data source.

10 changes: 5 additions & 5 deletions docs/docs/models/ESM-2/index.md
@@ -15,17 +15,17 @@ These models are ready for commercial use.
### Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
- for this application and use case \[1\]; see link to [Non-NVIDIA Model Card for ESM-2 3B model](https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [Non-NVIDIA Model Card for ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D)
+ for this application and use case [1]; see link to [Non-NVIDIA Model Card for ESM-2 3B model](https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [Non-NVIDIA Model Card for ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D)

### References

- \[1\] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos
+ [1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos
Santos Costa, A., 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science,
379(6637), pp.1123-1130.

- \[2\] "UniProt: the universal protein knowledgebase in 2021." Nucleic acids research 49, no. D1 (2021): D480-D489.
+ [2] "UniProt: the universal protein knowledgebase in 2021." Nucleic acids research 49, no. D1 (2021): D480-D489.

- \[3\] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for
+ [3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.

### Model Architecture
@@ -65,7 +65,7 @@ acid.
- NVIDIA Hopper
- NVIDIA Volta

- **\[Preferred/Supported\] Operating System(s)**
+ **[Preferred/Supported] Operating System(s)**

- Linux
