This directory holds code required for triggering automated partial-convergence/performance benchmarking runs in Lepton.
The dashboards may be viewed at the (internal-only) URL: nv/bionemo-dashboards.
The benchmarks currently run on this schedule:
┌─────────────────────┬───────────────────────┐
│ Model               │ Schedule              │
├─────────────────────┼───────────────────────┤
│ esm2_native_te_650m │ Mon/Wed/Fri (1am PST) │
├─────────────────────┼───────────────────────┤
│ esm2_native_te_15b  │ Mon/Wed/Fri (1am PST) │
├─────────────────────┼───────────────────────┤
│ llama3_native_te_1b │ Tue/Thu (1am PST)     │
├─────────────────────┼───────────────────────┤
│ codonfm_ptl_te      │ Tue/Thu (1am PST)     │
└─────────────────────┴───────────────────────┘
In addition, scdl-dataloader runs nightly on a CPU runner.
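To check which convergence runs to expect on a given day, the schedule can be written as a small lookup. This is only an illustrative sketch: the `SCHEDULE` mapping and the `models_for` helper are hypothetical names that mirror the schedule above, not the actual trigger logic (which presumably lives in the GitHub Actions workflows). Dates are assumed to be interpreted in PST, and the nightly scdl-dataloader run is left out.

```python
from datetime import date

# Weekday numbers follow datetime.date.weekday(): Monday is 0, Sunday is 6.
# Hypothetical mapping that mirrors the schedule; not the real trigger logic.
SCHEDULE = {
    "esm2_native_te_650m": {0, 2, 4},  # Mon/Wed/Fri
    "esm2_native_te_15b": {0, 2, 4},   # Mon/Wed/Fri
    "llama3_native_te_1b": {1, 3},     # Tue/Thu
    "codonfm_ptl_te": {1, 3},          # Tue/Thu
}


def models_for(day: date) -> list[str]:
    """Return the models whose convergence run is scheduled on the given day."""
    return sorted(name for name, days in SCHEDULE.items() if day.weekday() in days)


if __name__ == "__main__":
    # 2024-01-01 was a Monday.
    print(models_for(date(2024, 1, 1)))
```

No model is scheduled on weekends in this mapping, so `models_for` returns an empty list for a Saturday or Sunday.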
Currently, there are two ongoing benchmark runs, each triggered nightly:
- model_convergence: Partial convergence runs for bionemo-recipes. Uses GPU resources on Lepton.
- scdl_performance: Performance benchmarking runs for bionemo-scdl. Uses CPU resources on Lepton.
The code is organized as follows:
ci/lepton
├── core
│ ├── __init__.py
│ ├── launch_job.py
│ ├── lepton_utils.py
│ └── utils.py
├── model_convergence
│ ├── configs
│ └── launchers
├── README.md
├── requirements.txt
└── scdl_performance
    ├── configs
    └── launchers
- core/: Holds the core logic for triggering jobs on Lepton. It makes use of Hydra configs.
- model_convergence/:
  - configs/: Model-specific configs.
  - launchers/: Logic to grab job-specific data and upload it to kratos.
- scdl_performance/:
  - configs/: Hydra configs detailing performance benchmarking.
  - launchers/: Logic to grab job-specific data and upload it to kratos.
Each type of benchmark can be run in either of two ways:
- Triggered locally from Python.
- Triggered manually from GitHub Actions.
In addition, each job runs each morning at 1am PST on a schedule.
Note: if you run the Python code locally, you will have to edit the secrets reference in the configs/base.yaml files. For this reason, creating a branch and triggering it from the GitHub Action is the preferred method of development.
To run model_convergence locally, call core/launch_job.py, providing the path to the config directory and the config name:
# call launch_job with specified config
python ci/lepton/core/launch_job.py \
    --config-path="../model_convergence/configs" \
    --config-name="recipes/codonfm_ptl_te"
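The same invocation can be scripted from Python, for example to launch several configs in a loop. The following is a minimal sketch, assuming it is run from the repository root; `build_launch_command` is a hypothetical helper and not part of core/. Note that Hydra resolves a relative `--config-path` against the directory of the script that declares the Hydra entry point, which would explain the leading `../`.

```python
import shlex
import sys


def build_launch_command(config_path: str, config_name: str) -> list[str]:
    """Assemble the argv for launch_job.py with its two config flags."""
    return [
        sys.executable,                  # the current Python interpreter
        "ci/lepton/core/launch_job.py",  # path relative to the repository root
        f"--config-path={config_path}",
        f"--config-name={config_name}",
    ]


if __name__ == "__main__":
    cmd = build_launch_command("../model_convergence/configs", "recipes/codonfm_ptl_te")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to actually launch it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting issues when it is handed to `subprocess.run`.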
The GitHub Action is defined in .github/workflows/convergence-tests.yml. To run it, trigger the workflow manually from GitHub and supply the requested inputs.
If you are developing a new config, create it (following the structure of the existing ones) and provide your branch to the GitHub Action. If you created a new config file, you will also have to add it as an option in the convergence-tests.yml dropdown. (If you do edit convergence-tests.yml, make sure to select that branch in the "Use workflow from" option.)
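A manual trigger can also be issued from a terminal with the GitHub CLI instead of the web UI. The sketch below only assembles the command and assumes `gh` is installed and authenticated. The branch name and the input name `config` are hypothetical; check the `workflow_dispatch` inputs in the workflow file for the real names. The `--ref` flag plays the role of the "Use workflow from" option.

```python
import shlex


def build_dispatch_command(workflow: str, branch: str, inputs: dict[str, str]) -> list[str]:
    """Assemble a `gh workflow run` argv that dispatches `workflow` from `branch`."""
    cmd = ["gh", "workflow", "run", workflow, "--ref", branch]
    for key, value in inputs.items():
        # -f passes one workflow_dispatch input as key=value.
        cmd += ["-f", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    cmd = build_dispatch_command(
        "convergence-tests.yml",
        "my-feature-branch",           # hypothetical branch name
        {"config": "codonfm_ptl_te"},  # hypothetical input name and value
    )
    print(shlex.join(cmd))
```

The same helper would work for scdl-performance-tests.yml by changing the workflow file name and inputs.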
The job also runs every night on a schedule.
To run scdl_performance locally, call core/launch_job.py, providing the path to the config directory and the config name:
# call launch_job with specified config
python ci/lepton/core/launch_job.py \
    --config-path="../scdl_performance/configs" \
    --config-name="scdl"
The GitHub Action is defined in .github/workflows/scdl-performance-tests.yml. To run it, trigger the workflow manually from GitHub and supply the requested inputs.
If you are developing a new config, create it (following the structure of the existing ones) and provide your branch to the GitHub Action. If you created a new config file, you will also have to add it as an option in the scdl-performance-tests.yml dropdown. (If you do edit scdl-performance-tests.yml, make sure to select that branch in the "Use workflow from" option.)
The job also runs every night on a schedule.