[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework

### Problem & Motivation

The BioNeMo Framework's [README](https://github.com/NVIDIA/bionemo-framework/blob/main/README.md) mainly describes how to launch a single Docker container for model training. However, the [codebase and benchmarks](https://github.com/NVIDIA/bionemo-framework/blob/main/ci/benchmarks/partial-conv/esm2_pretrain.yaml#L36) (e.g. `--num-nodes=${nodes}`) suggest that multi-node training may be supported. Training large biomolecular models is computationally demanding, and running on a self-managed cluster could make better use of existing hardware and lower costs. It would therefore help to know whether, and how, the framework supports multi-node training on a private cluster.

### BioNeMo Framework Version

v2.3

### Category

Model/Training

### Proposed Solution

I propose extending the BioNeMo Framework documentation to include:

- **Detailed Instructions:** Step-by-step guidance on configuring and launching multi-node training in a self-managed cluster environment.
- **Configuration Examples:** Sample configuration files and command-line parameters for multi-node setups.
- **Best Practices:** Recommendations for optimizing performance and ensuring reliable communication between nodes during training.

### Expected Benefits

Implementing this enhancement would:

- **Broaden Accessibility:** Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.
- **Enhance Flexibility:** Allow researchers to tailor training environments to their specific hardware configurations.
- **Improve Resource Efficiency:** Make effective use of existing infrastructure, potentially reducing operational costs.

With comprehensive guidance on multi-node training, the BioNeMo Framework could serve a wider range of users and use cases.



### Code Example
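
To make the request concrete, here is a minimal sketch of the per-process settings a multi-node guide would need to cover. It is not BioNeMo's documented procedure. It only assumes the generic PyTorch distributed convention of `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` environment variables. `ClusterSpec` and `process_env` are hypothetical names used for illustration, and whether BioNeMo's training entry points read these variables directly is part of what this issue asks to have documented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of a self-managed cluster (illustrative only)."""

    master_addr: str    # hostname or IP of the node that hosts global rank 0
    master_port: int    # free TCP port on that node, used for rendezvous
    num_nodes: int      # same quantity as the --num-nodes value in the benchmark config
    gpus_per_node: int  # one training process per GPU


def process_env(spec: ClusterSpec, node_rank: int, local_rank: int) -> dict[str, str]:
    """Return the environment variables for one training process.

    Follows the generic PyTorch convention: global rank is the node index
    times the number of processes per node, plus the index within the node.
    """
    if not 0 <= node_rank < spec.num_nodes:
        raise ValueError(f"node_rank {node_rank} outside [0, {spec.num_nodes})")
    if not 0 <= local_rank < spec.gpus_per_node:
        raise ValueError(f"local_rank {local_rank} outside [0, {spec.gpus_per_node})")

    world_size = spec.num_nodes * spec.gpus_per_node
    global_rank = node_rank * spec.gpus_per_node + local_rank
    return {
        "MASTER_ADDR": spec.master_addr,
        "MASTER_PORT": str(spec.master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
    }


if __name__ == "__main__":
    # Example values are made up: two nodes with eight GPUs each.
    spec = ClusterSpec("10.0.0.1", 29500, num_nodes=2, gpus_per_node=8)
    for key, value in process_env(spec, node_rank=1, local_rank=3).items():
        print(f"{key}={value}")
```

Documentation that states which of these values the user must set by hand on each node, and which are derived by the launcher, would answer most of this request.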



[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework #700
