[BENCHMARKS] fixing evo2 finetune config by dorotat-nv · Pull Request #1097 · NVIDIA/bionemo-framework

dorotat-nv · 2025-09-02T13:56:53Z

Description

Adding job type to wandb experiment for evo2 to improve grouping of experiments in wandb UI
Standardising how data is handled for esm2 and geneformer

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest
INCLUDE_SLOW_TESTS - Execute tests labelled as slow in pytest for extensive testing

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
/ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Summary by CodeRabbit

New Features
- Added node-local, in-memory data staging for ESM2 and Geneformer pretraining benchmarks with automatic rank synchronization to reduce I/O contention and speed startup.
- Benchmarks now transparently read datasets from the staged path.
- Introduced runtime diagnostics and timing around data staging for better visibility.
Chores
- Updated experiment tracking configuration by adding job-type metadata and standardizing CLI flag ordering for consistency.

coderabbitai · 2025-09-02T13:57:01Z

Walkthrough

Adds per-node in-memory data staging and synchronization to two perf benchmark YAMLs, updating data paths to staged locations. Adjusts WandB CLI arguments in a partial-conv benchmark by reordering flags and introducing a wandb job type parameter.

Changes

Cohort / File(s)	Summary
Perf benchmarks: node-local data staging and sync `ci/benchmarks/perf/esm2_pretrain.yaml`, `ci/benchmarks/perf/geneformer_pretrain.yaml`	Add rank-0 copy of dataset to `/dev/shm/data_path_<nodename>`, create COPY_FLAG, and poll barrier for all ranks. Update training/validation data args to use `$NEW_DATA_PATH`. Add diagnostics (df, echo, time cp). Use `SLURMD_NODENAME` and `SLURM_LOCALID` to scope operations.
Partial-conv WandB CLI adjustments `ci/benchmarks/partial-conv/evo2_finetuning.yaml`	Reorder WandB flags and add `--wandb-job-type=${pipeline_label}`; project and group remain unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant R0 as Rank 0 (per node)
  participant Rn as Other Ranks (same node)
  participant FS as Node-local /dev/shm
  participant DS as Shared Dataset (${data_path})

  Note over R0, Rn: Start job on a node
  R0->>R0: Compute NEW_DATA_PATH=/dev/shm/data_path_<nodename>
  R0->>FS: time cp -r DS -> NEW_DATA_PATH
  R0->>FS: Create COPY_FLAG at NEW_DATA_PATH/.copy_done
  Rn->>FS: Poll for COPY_FLAG existence
  FS-->>Rn: COPY_FLAG detected
  Note over R0, Rn: All ranks proceed
  R0->>R0: Launch training using paths under NEW_DATA_PATH
  Rn->>Rn: Launch training using paths under NEW_DATA_PATH

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The description is missing the detailed breakdown of changes for each benchmark configuration and retains the template placeholder comment, and it does not include the required usage example or completed pre-submit checklist items.	Please expand the “Description” section to detail the specific modifications in each YAML file, remove or replace the template placeholder, add an example usage snippet, and complete the pre-submit checklist before merging.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title “[BENCHMARKS] fixing evo2 finetune config” clearly and concisely highlights the primary change to the evo2 finetuning configuration, avoiding unnecessary detail while remaining specific to the key fix.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

Poem

I stash my carrots in /dev/shm, so swift, so bright, so near,
A flag pops up—copy’s done!—all bunnies lend an ear.
We hop in sync, our ranks align, the data paths now clear,
And with a wand(b) we tag the run, job-type crystal clear.
Benchmarks bloom—thump-thump—this hare’s frontier!

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch dorotat/hotfix-jobtype-evo2finetune

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

dorotat-nv · 2025-09-02T13:59:00Z

/ok to test a79d880

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)
98-98: Provide a safe default for wandb job type (optional).

Prevents empty job-type when the label isn’t set.

Apply:
-    --wandb-job-type=${pipeline_label} \
+    --wandb-job-type=${pipeline_label:-ci} \
ci/benchmarks/perf/geneformer_pretrain.yaml (1)
33-36: Optional: use rsync for faster, resilient node-local staging.

rsync gives progress and can skip unchanged files on retries.

Apply:
-      time cp -r ${data_path}/ $NEW_DATA_PATH;
+      mkdir -p "$NEW_DATA_PATH"
+      time rsync -a --delete "${data_path}/" "$NEW_DATA_PATH/"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a29272f and a79d880.

📒 Files selected for processing (2)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: build-bionemo-image
GitHub Check: Analyze (rust)

🔇 Additional comments (3)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

96-98: W&B job type addition — LGTM

The new --wandb-job-type=${pipeline_label} flag is appropriate and keeps metadata consistent with project/group.

96-98: Confirm CI template variables are defined or injected
pipeline_label, wandb_project_name, and target are used in ci/benchmarks/partial-conv/evo2_finetuning.yaml (lines 96–98) but aren’t declared in its script_args; ensure your CI job or parent template provides those variables or add them explicitly.
ci/benchmarks/perf/geneformer_pretrain.yaml (1)
42-44: Verify dataset root paths before training
The parity check failed because $data_path and $SLURMD_NODENAME appear unset, so no directories were listed. Please re-run the comparison with the actual source and destination paths explicitly set. For example:
#!/usr/bin/env bash
set -e
echo "SRC (${data_path}):"; ls -1 "${data_path:?}" | sort | head
echo "DST (${NEW_DATA_PATH}):"; ls -1 "${NEW_DATA_PATH:?}" | sort | head
Ensure both data_path and NEW_DATA_PATH are exported correctly (and SLURMD_NODENAME if it’s part of the path) so you can confirm that the top-level directory structure matches what the trainer expects.

ci/benchmarks/perf/geneformer_pretrain.yaml

codecov-commenter · 2025-09-02T15:47:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.61%. Comparing base (a29272f) to head (a79d880).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1097      +/-   ##
==========================================
- Coverage   80.62%   80.61%   -0.01%     
==========================================
  Files         157      157              
  Lines       11079    11079              
==========================================
- Hits         8932     8931       -1     
- Misses       2147     2148       +1

see 1 file with indirect coverage changes

…e-evo2finetune

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

26-27: Good exclusion from run identifiers; consider clearer naming.

Keeping pckg_url and file_name_wheel out of the ID is right. Minor nit: file_name_wheel currently holds a package spec (not a wheel filename). Consider package_spec to avoid confusion.

59-60: Reproducibility and URL coupling.

Pin the package version (e.g., subquadratic-ops==x.y.z) for deterministic runs.

Include the URL scheme directly in pckg_url (e.g., https://…) so the install step doesn’t own that concern.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79d880 and a085e6e.

📒 Files selected for processing (1)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (4 hunks)

🔇 Additional comments (2)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

110-112: Ensure required variables are defined.
wandb_project_name and pipeline_label are used here but aren’t defined in any config or .env—confirm they’re set in your CI/CD environment or passed by the calling pipeline.

105-105: Confirm flag and package support upstream.

train.py defines --use-subquadratic_ops in its argument parser and applies args.use_subquadratic_ops to config_modifiers_init (lines 584–588, 771–773).

Ensure the subquadratic-ops wheel installs a subquadratic_ops module and that import subquadratic_ops (and any required symbols) succeeds at runtime.

ci/benchmarks/partial-conv/evo2_finetuning.yaml

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

ci/benchmarks/perf/esm2_pretrain.yaml (1)
84-84: Bug: variable not expanded in tp argument

Brace style is wrong; {tp} will not expand. Use ${tp}.
-    --tensor-model-parallel-size={tp} \
+    --tensor-model-parallel-size=${tp} \

🧹 Nitpick comments (2)

ci/benchmarks/perf/esm2_pretrain.yaml (2)
44-55: Optional: prefer rsync and clean up after job

rsync can be faster/safer on large trees; also consider cleaning the per-node cache and flag on exit (rank 0).
+  # Optional replacement for cp:
+  # if rsync -a --info=stats2,progress2 "${data_path}/" "$NEW_DATA_PATH/"; then touch "$COPY_FLAG"; else exit 1; fi
+
+  # Optional cleanup (only local rank 0)
+  trap 'if [ "${SLURM_LOCALID:-0}" = "0" ]; then rm -rf "$NEW_DATA_PATH" "$COPY_FLAG"; fi' EXIT
44-55: Optional: check /dev/shm free space before copying

Prevent spurious OOM/ENOSPC by prechecking size vs available space; fall back to shared path if insufficient.
+  # Optional space check (bytes). Falls back to shared path if not enough space.
+  src_bytes=$(du -sb "${data_path}" | awk '{print $1}')
+  shm_free=$(df -B1 /dev/shm | awk 'NR==2{print $4}')
+  if [ "$shm_free" -lt "$src_bytes" ]; then
+    echo "WARN: /dev/shm has insufficient space ($shm_free < $src_bytes); using shared ${data_path}" >&2
+    NEW_DATA_PATH="${data_path}"
+    touch "$COPY_FLAG"
+  fi

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a085e6e and 8f222f6.

📒 Files selected for processing (1)

ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Analyze (rust)

🔇 Additional comments (2)

ci/benchmarks/perf/esm2_pretrain.yaml (2)

57-60: LGTM: dataset path substitution

Switching to NEW_DATA_PATH for train/valid paths is consistent with the staging logic.

Please confirm those four filenames exist under data_path so the copy preserves them 1:1.

80-80: LGTM: WandB job type

Adding job type improves grouping. Ensure pipeline_label is always set in this config matrix to avoid empty tags.

ci/benchmarks/perf/esm2_pretrain.yaml

dorotat-nv · 2025-09-08T14:18:15Z

/ok to test 2985bfd

dorotat-nv · 2025-09-09T11:08:00Z

/ok to test 4672198

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

ci/benchmarks/perf/esm2_pretrain.yaml (1)
84-84: Bug: literal {tp} passed to CLI.

Use ${tp} so TP is applied.
-    --tensor-model-parallel-size={tp} \
+    --tensor-model-parallel-size=${tp} \

♻️ Duplicate comments (5)

ci/benchmarks/perf/geneformer_pretrain.yaml (2)

30-31: Fix invalid bash variable expansion.

Use ${SLURMD_NODENAME}, not ${{SLURMD_NODENAME}}.

-  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
-  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
+  NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";

38-41: Bound the wait loop to avoid indefinite hangs.

Add a timeout so non-root ranks don't block forever if the copy fails.

-  while [ ! -f $COPY_FLAG ]; do
-      sleep 1
-  done
+  start_time=$(date +%s); timeout="${COPY_TIMEOUT_SEC:-1800}"
+  while [ ! -f "$COPY_FLAG" ]; do
+      sleep 1
+      now=$(date +%s)
+      if (( now - start_time > timeout )); then
+          echo "Timed out waiting for data staging on node ${SLURMD_NODENAME}" >&2
+          exit 1
+      fi
+  done

ci/benchmarks/perf/esm2_pretrain.yaml (3)

44-49: Scope per job to avoid stale flags/collisions.

Include job id in paths; clean pre-existing artifacts.

-  COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
-  NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";
+  JOB_ID="${SLURM_JOB_ID:-${SLURM_JOBID:-$$}}"
+  COPY_FLAG="/tmp/copy_done_${JOB_ID}_${SLURMD_NODENAME}"
+  NEW_DATA_PATH="/dev/shm/${JOB_ID}_data_${SLURMD_NODENAME}"
+  rm -f "$COPY_FLAG"; rm -rf "$NEW_DATA_PATH"

52-55: Bound the wait loop to avoid indefinite hangs.

Add a timeout to the barrier.

-  while [ ! -f $COPY_FLAG ]; do
-      sleep 1
-  done
+  TIMEOUT="${COPY_TIMEOUT_SEC:-1800}"; waited=0
+  while [ ! -f "$COPY_FLAG" ]; do
+      sleep 1; waited=$((waited+1))
+      if [ "$waited" -ge "$TIMEOUT" ]; then
+          echo "ERROR: timed out waiting for $COPY_FLAG" >&2
+          exit 1
+      fi
+  done

47-51: Guard the copy and only touch the flag on success; quote paths.

Prevent false-ready signals and partial datasets.

-      df -h;
-      echo $NEW_DATA_PATH;
-      time cp -r ${data_path}/ $NEW_DATA_PATH;
-      touch $COPY_FLAG
+      df -h
+      echo "$NEW_DATA_PATH"
+      mkdir -p "$NEW_DATA_PATH"
+      if time cp -a "${data_path}/." "$NEW_DATA_PATH/"; then
+        touch "$COPY_FLAG"
+      else
+        echo "ERROR: copy to $NEW_DATA_PATH failed" >&2
+        exit 1
+      fi

🧹 Nitpick comments (1)

ci/benchmarks/perf/geneformer_pretrain.yaml (1)
29-29: Harden script execution.

Enable strict mode to fail early on errors and unset vars.
-script: |-
+script: |-
+  set -euo pipefail

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d162d5 and 4672198.

📒 Files selected for processing (3)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)
ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Analyze (rust)

🔇 Additional comments (1)

ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)

96-99: WandB job type flag—confirm CLI name and placement.

Validate that the training entrypoint accepts --wandb-job-type (kebab-case) and that moving WandB flags after --early-stop-on-step is supported.

If needed, I can scan the repo for the argparse/Hydra schema to confirm accepted flag names.

ci/benchmarks/perf/esm2_pretrain.yaml

ci/benchmarks/perf/geneformer_pretrain.yaml

hotfixes

a79d880

dorotat-nv requested a review from pstjohn as a code owner September 2, 2025 13:56

dorotat-nv added the ciflow:skip Skip all CI tests for this PR label Sep 2, 2025

coderabbitai bot reviewed Sep 2, 2025

View reviewed changes

ci/benchmarks/perf/geneformer_pretrain.yaml Show resolved Hide resolved

ci/benchmarks/perf/geneformer_pretrain.yaml Show resolved Hide resolved

dorotat-nv added 2 commits September 8, 2025 14:12

Merge remote-tracking branch 'origin/main' into dorotat/hotfix-jobtyp…

38b92e4

…e-evo2finetune

adding subquadratic to evo2 finetune

a085e6e

dorotat-nv changed the title ~~[BENCHMARKS] fixing labels~~ [BENCHMARKS] fixing evo2 finetune config Sep 8, 2025

coderabbitai bot reviewed Sep 8, 2025

View reviewed changes

ci/benchmarks/partial-conv/evo2_finetuning.yaml Outdated Show resolved Hide resolved

updating data path for esm2

8f222f6

coderabbitai bot reviewed Sep 8, 2025

View reviewed changes

ci/benchmarks/perf/esm2_pretrain.yaml Show resolved Hide resolved

ci/benchmarks/perf/esm2_pretrain.yaml Show resolved Hide resolved

moving subqyadratic update to new pr

2985bfd

dorotat-nv added the SKIP_CI label Sep 8, 2025

dorotat-nv requested a review from jwilber September 8, 2025 14:18

dorotat-nv enabled auto-merge September 8, 2025 14:29

pstjohn approved these changes Sep 8, 2025

View reviewed changes

dorotat-nv added this pull request to the merge queue Sep 8, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 8, 2025

Merge branch 'main' into dorotat/hotfix-jobtype-evo2finetune

4672198

dorotat-nv enabled auto-merge September 9, 2025 11:07

dorotat-nv added this pull request to the merge queue Sep 9, 2025

coderabbitai bot reviewed Sep 9, 2025

View reviewed changes

ci/benchmarks/perf/esm2_pretrain.yaml Show resolved Hide resolved

ci/benchmarks/perf/geneformer_pretrain.yaml Show resolved Hide resolved

Merged via the queue into main with commit 223b211 Sep 9, 2025
19 checks passed

dorotat-nv deleted the dorotat/hotfix-jobtype-evo2finetune branch September 9, 2025 12:58

Conversation

dorotat-nv commented Sep 2, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of changes

CI Pipeline Configuration

Authorizing CI Runs

Usage

Pre-submit Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks (2 passed, 1 warning)

Poem

Uh oh!

dorotat-nv commented Sep 2, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dorotat-nv commented Sep 8, 2025

Uh oh!

Uh oh!

dorotat-nv commented Sep 9, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dorotat-nv commented Sep 2, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 2, 2025 •

edited

Loading

codecov-commenter commented Sep 2, 2025 •

edited

Loading