Description
Problem & Motivation
The BioNeMo Framework's README primarily describes launching a single Docker container for model training. However, the codebase and benchmarks (e.g., --num-nodes=${nodes}) suggest that multi-node training may be supported. Training large biomolecular models is computationally demanding, so running on a self-managed cluster could improve resource utilization and reduce cost. Understanding whether and how the framework supports multi-node training on a private cluster is therefore important for efficient model development.
BioNeMo Framework Version
v2.3
Category
Model/Training
Proposed Solution
I propose an enhancement to the BioNeMo Framework's documentation to include:
Detailed Instructions: Step-by-step guidance on configuring and initiating multi-node training sessions within a self-managed cluster environment.
Configuration Examples: Sample configuration files and command-line parameters tailored for multi-node setups.
Best Practices: Recommendations on optimizing performance and ensuring seamless communication between nodes during training.
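To illustrate the kind of configuration example requested above, here is a minimal sketch that builds the per-node launch command for a two-node job. It assumes PyTorch's torchrun launcher with a static rendezvous (a fixed head-node address and port). Whether BioNeMo's training entry points can be launched this way on a self-managed cluster is exactly what this issue asks the documentation to confirm. The script name train.py, the address 10.0.0.1, and the GPU count are hypothetical placeholders; only --num-nodes is taken from the existing benchmarks.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    num_nodes: int,
    gpus_per_node: int,
    node_rank: int,
    head_addr: str,
    head_port: int = 29500,
    train_args: Sequence[str] = ("train.py",),  # hypothetical entry point
) -> List[str]:
    """Build the torchrun command that one node would run.

    Every node runs the same command except for --node-rank, and all nodes
    must be able to reach head_addr:head_port over the network.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank must be in [0, {num_nodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc-per-node={gpus_per_node}",
        f"--node-rank={node_rank}",
        f"--master-addr={head_addr}",
        f"--master-port={head_port}",
        *train_args,
    ]


if __name__ == "__main__":
    # Assumed cluster shape: 2 nodes with 8 GPUs each (placeholder values).
    for rank in range(2):
        cmd = build_torchrun_command(
            num_nodes=2,
            gpus_per_node=8,
            node_rank=rank,
            head_addr="10.0.0.1",
            train_args=("train.py", "--num-nodes=2"),
        )
        print(shlex.join(cmd))
```

Documentation along these lines, stating which launcher BioNeMo actually expects and which arguments must match across nodes, would answer most of the questions raised here.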
Expected Benefits
Implementing this enhancement would:
Broaden Accessibility: Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.
Enhance Flexibility: Allow researchers to tailor training environments to their specific hardware configurations.
Improve Resource Efficiency: Facilitate the effective use of existing infrastructure, potentially reducing operational costs.
By providing comprehensive guidance on multi-node training, the BioNeMo Framework can better serve a wider range of users and use cases.