[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework #700

@hbsun2113

Problem & Motivation

The BioNeMo Framework's README mainly describes how to launch a single Docker container for model training. However, the codebase and benchmarks (e.g., --num-nodes=${nodes}) suggest that multi-node training may be supported. Training large biomolecular models is computationally demanding, so running on a self-managed cluster could make better use of existing resources and lower costs. Understanding what the framework supports for multi-node training on a private cluster is therefore important for efficient model development.

BioNeMo Framework Version

v2.3

Category

Model/Training

Proposed Solution

I propose an enhancement to the BioNeMo Framework's documentation to include:

Detailed Instructions: Step-by-step guidance on configuring and initiating multi-node training sessions within a self-managed cluster environment.

Configuration Examples: Sample configuration files and command-line parameters tailored for multi-node setups.

Best Practices: Recommendations on optimizing performance and ensuring seamless communication between nodes during training.

Expected Benefits

Implementing this enhancement would:

Broaden Accessibility: Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.

Enhance Flexibility: Allow researchers to tailor training environments to their specific hardware configurations.

Improve Resource Efficiency: Facilitate the effective use of existing infrastructure, potentially reducing operational costs.

By providing comprehensive guidance on multi-node training, the BioNeMo Framework can better serve a wider range of users and use cases.

Code Example
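
To make the request concrete, here is a minimal sketch of the kind of per-node launch example the documentation could include. It assumes a generic PyTorch `torchrun` launch; the script name `train.py`, the address `10.0.0.1`, the GPU count, and the idea that a BioNeMo training entry point can be started this way are all assumptions for illustration, not confirmed behavior of the framework. Only the `--num-nodes` argument comes from the benchmarks mentioned above.

```python
import shlex


def build_torchrun_cmd(num_nodes, node_rank, gpus_per_node, master_addr,
                       master_port=29500, script="train.py", script_args=()):
    """Build the torchrun command to run on one node of a multi-node job.

    `script` and `script_args` are hypothetical placeholders; the real
    BioNeMo entry point and its flags are what this issue asks to document.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank must be in [0, {num_nodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={num_nodes}",              # total number of nodes in the job
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU
        f"--node_rank={node_rank}",           # index of this node, 0-based
        f"--master_addr={master_addr}",       # address of the rank-0 node
        f"--master_port={master_port}",       # free TCP port on the rank-0 node
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command each of two nodes would run (assumed 8 GPUs per node).
    for rank in range(2):
        cmd = build_torchrun_cmd(
            num_nodes=2,
            node_rank=rank,
            gpus_per_node=8,
            master_addr="10.0.0.1",           # assumed address of node 0
            script_args=["--num-nodes=2"],    # flag seen in the benchmarks
        )
        print(shlex.join(cmd))
```

Each node would run the command printed for its own rank. Guidance on whether this launcher, a scheduler such as Slurm, or some other mechanism is the supported path is exactly what I am hoping the documentation can clarify.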
