Lepton CI

This directory holds code required for triggering automated partial-convergence/performance benchmarking runs in Lepton.

The dashboards can be viewed at the internal-only URL nv/bionemo-dashboards.

They currently run on this schedule:

┌─────────────────────┬───────────────────────┐
│ Model               │ Schedule              │
├─────────────────────┼───────────────────────┤
│ esm2_native_te_650m │ Mon/Wed/Fri (1am PST) │
├─────────────────────┼───────────────────────┤
│ esm2_native_te_15b  │ Mon/Wed/Fri (1am PST) │
├─────────────────────┼───────────────────────┤
│ llama3_native_te_1b │ Tue/Thu (1am PST)     │
├─────────────────────┼───────────────────────┤
│ codonfm_ptl_te      │ Tue/Thu (1am PST)     │
└─────────────────────┴───────────────────────┘

In addition, scdl-dataloader runs nightly on a CPU runner.
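GitHub Actions cron triggers are expressed in UTC, so if the schedule above is implemented as workflow cron entries (an assumption; the workflow files are not reproduced here), the 1am PST times have to be shifted. The following is a minimal sketch of that conversion. It assumes "PST" means a fixed UTC-8 offset with no daylight saving adjustment, and the helper name `cron_for` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" is a fixed UTC-8 offset (no daylight saving adjustment).
PST = timezone(timedelta(hours=-8), name="PST")

# Cron day-of-week numbers (0 = Sunday).
CRON_DAYS = {"Sun": 0, "Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6}


def cron_for(days, hour_pst=1):
    """Return a UTC cron expression for the given PST weekdays and hour."""
    # Any reference date works; only the hour shift and day rollover matter.
    local = datetime(2024, 1, 1, hour_pst, 0, tzinfo=PST)
    utc = local.astimezone(timezone.utc)
    # Number of calendar days the conversion moves forward (0 or 1 for UTC-8).
    day_shift = (utc.date() - local.date()).days
    dow = sorted((CRON_DAYS[d] + day_shift) % 7 for d in days)
    return f"{utc.minute} {utc.hour} * * {','.join(map(str, dow))}"


print(cron_for(["Mon", "Wed", "Fri"]))  # esm2_native_te_650m, esm2_native_te_15b
print(cron_for(["Tue", "Thu"]))         # llama3_native_te_1b, codonfm_ptl_te
```

Under these assumptions, the Mon/Wed/Fri rows map to `0 9 * * 1,3,5` and the Tue/Thu rows to `0 9 * * 2,4`. The `day_shift` term only matters for late-evening PST hours, where the UTC time falls on the following day.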

Overview

There are currently two benchmark suites, each triggered on a schedule:

  • model_convergence: Partial-convergence runs for bionemo-recipes. Uses GPU resources on Lepton.
  • scdl_performance: Performance benchmarking runs for bionemo-scdl. Uses CPU resources on Lepton.

The code is organized as follows:

ci/lepton
├── core
│   ├── __init__.py
│   ├── launch_job.py
│   ├── lepton_utils.py
│   └── utils.py
├── model_convergence
│   ├── configs
│   └── launchers
├── README.md
├── requirements.txt
└── scdl_performance
    ├── configs
    └── launchers
  • core/: Holds the core logic for submitting jobs to Lepton. It uses Hydra configs.
  • model_convergence/:
    • configs/: Model-specific configs.
    • launchers/: Logic to grab job-specific data and upload it to kratos.
  • scdl_performance/:
    • configs/: Hydra configs describing the performance benchmarks.
    • launchers/: Logic to grab job-specific data and upload it to kratos.

Triggering Jobs

Each type of benchmark can be triggered in one of two ways:

  • Triggered locally from Python.
  • Triggered manually from GitHub Actions.

In addition, each job runs on a schedule at 1am PST.

Note: if you run the Python code locally, you will have to edit the secrets reference in the configs/base.yaml files. For this reason, the preferred development workflow is to create a branch and trigger it from the GitHub Action.

Model Convergence

Python Trigger

To run locally, call core/launch_job.py, providing the path to the config directory and the config name:

# call launch_job with specified config
python ci/lepton/core/launch_job.py \
    --config-path="../model_convergence/configs" \
    --config-name="recipes/codonfm_ptl_te"
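When several configs need to be launched in a row, the command above can be assembled programmatically. The sketch below is illustrative only: `list_config_names` and `build_launch_command` are hypothetical helpers, not part of core/. It assumes configs are `.yaml` files whose config name is their path relative to the config directory, without the extension (as in `recipes/codonfm_ptl_te`).

```python
import sys
from pathlib import Path

# Script path taken from the command above.
LAUNCHER = "ci/lepton/core/launch_job.py"


def list_config_names(config_dir):
    """Return config names (relative path, no .yaml suffix) found under config_dir.

    Note: this lists every .yaml file, including shared ones such as base.yaml
    that may not be meant to be launched directly.
    """
    root = Path(config_dir)
    return sorted(
        p.relative_to(root).with_suffix("").as_posix() for p in root.rglob("*.yaml")
    )


def build_launch_command(config_path, config_name, python=sys.executable):
    """Build the argv list for one launch_job.py invocation."""
    return [
        python,
        LAUNCHER,
        f"--config-path={config_path}",
        f"--config-name={config_name}",
    ]


cmd = build_launch_command("../model_convergence/configs", "recipes/codonfm_ptl_te")
# Print the command; pass cmd to subprocess.run(cmd, check=True) to actually launch.
print(" ".join(cmd))
```

Note that `list_config_names` takes an ordinary filesystem path (for example `ci/lepton/model_convergence/configs` from the repository root), whereas the `--config-path` value is written relative to launch_job.py, which is how Hydra resolves relative config paths if the script is a standard Hydra entry point.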

GitHub Actions Trigger

The GH Action is defined in .github/workflows/convergence-tests.yml. To run it, trigger the workflow manually from GitHub and supply the requested inputs.

If you are developing a new config, create it (following the structure of the existing ones) and provide your branch to the GitHub Action. If you created a new config file, you will also have to add it as an option in the convergence-tests.yml dropdown. If you edit convergence-tests.yml, make sure to select your branch in the "Use workflow from" option.

The job also runs on a schedule (see the table above).

SCDL Performance

Python Trigger

To run locally, call core/launch_job.py, providing the path to the config directory and the config name:

# call launch_job with specified config
python ci/lepton/core/launch_job.py \
    --config-path="../scdl_performance/configs" \
    --config-name="scdl"

GitHub Actions Trigger

The GH Action is defined in .github/workflows/scdl-performance-tests.yml. To run it, trigger the workflow manually from GitHub and supply the requested inputs.

If you are developing a new config, create it (following the structure of the existing ones) and provide your branch to the GitHub Action. If you created a new config file, you will also have to add it as an option in the scdl-performance-tests.yml dropdown. If you edit scdl-performance-tests.yml, make sure to select your branch in the "Use workflow from" option.

The job also runs every night on a schedule.