Skip to content

[BENCHMARKS] fixing evo2 finetune config#1097

Merged
dorotat-nv merged 6 commits intomainfrom
dorotat/hotfix-jobtype-evo2finetune
Sep 9, 2025
Merged

[BENCHMARKS] fixing evo2 finetune config#1097
dorotat-nv merged 6 commits intomainfrom
dorotat/hotfix-jobtype-evo2finetune

Conversation

@dorotat-nv
Copy link
Copy Markdown
Collaborator

@dorotat-nv dorotat-nv commented Sep 2, 2025

Description

  1. Adding job type to wandb experiment for evo2 to improve grouping of experiments in wandb UI
  2. Standardising how data is handled for esm2 and geneformer

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Added node-local, in-memory data staging for ESM2 and Geneformer pretraining benchmarks with automatic rank synchronization to reduce I/O contention and speed startup.
    • Benchmarks now transparently read datasets from the staged path.
    • Introduced runtime diagnostics and timing around data staging for better visibility.
  • Chores

    • Updated experiment tracking configuration by adding job-type metadata and standardizing CLI flag ordering for consistency.

@dorotat-nv dorotat-nv requested a review from pstjohn as a code owner September 2, 2025 13:56
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Adds per-node in-memory data staging and synchronization to two perf benchmark YAMLs, updating data paths to staged locations. Adjusts WandB CLI arguments in a partial-conv benchmark by reordering flags and introducing a wandb job type parameter.

Changes

Cohort / File(s) Summary
Perf benchmarks: node-local data staging and sync
ci/benchmarks/perf/esm2_pretrain.yaml, ci/benchmarks/perf/geneformer_pretrain.yaml
Add rank-0 copy of dataset to /dev/shm/data_path_<nodename>, create COPY_FLAG, and poll barrier for all ranks. Update training/validation data args to use $NEW_DATA_PATH. Add diagnostics (df, echo, time cp). Use SLURMD_NODENAME and SLURM_LOCALID to scope operations.
Partial-conv WandB CLI adjustments
ci/benchmarks/partial-conv/evo2_finetuning.yaml
Reorder WandB flags and add --wandb-job-type=${pipeline_label}; project and group remain unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant R0 as Rank 0 (per node)
  participant Rn as Other Ranks (same node)
  participant FS as Node-local /dev/shm
  participant DS as Shared Dataset (${data_path})

  Note over R0, Rn: Start job on a node
  R0->>R0: Compute NEW_DATA_PATH=/dev/shm/data_path_<nodename>
  R0->>FS: time cp -r DS -> NEW_DATA_PATH
  R0->>FS: Create COPY_FLAG at NEW_DATA_PATH/.copy_done
  Rn->>FS: Poll for COPY_FLAG existence
  FS-->>Rn: COPY_FLAG detected
  Note over R0, Rn: All ranks proceed
  R0->>R0: Launch training using paths under NEW_DATA_PATH
  Rn->>Rn: Launch training using paths under NEW_DATA_PATH
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The description is missing the detailed breakdown of changes for each benchmark configuration and retains the template placeholder comment, and it does not include the required usage example or completed pre-submit checklist items. Please expand the “Description” section to detail the specific modifications in each YAML file, remove or replace the template placeholder, add an example usage snippet, and complete the pre-submit checklist before merging.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title “[BENCHMARKS] fixing evo2 finetune config” clearly and concisely highlights the primary change to the evo2 finetuning configuration, avoiding unnecessary detail while remaining specific to the key fix.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Poem

I stash my carrots in /dev/shm, so swift, so bright, so near,
A flag pops up—copy’s done!—all bunnies lend an ear.
We hop in sync, our ranks align, the data paths now clear,
And with a wand(b) we tag the run, job-type crystal clear.
Benchmarks bloom—thump-thump—this hare’s frontier!

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dorotat/hotfix-jobtype-evo2finetune

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dorotat-nv dorotat-nv added the ciflow:skip Skip all CI tests for this PR label Sep 2, 2025
@dorotat-nv
Copy link
Copy Markdown
Collaborator Author

/ok to test a79d880

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)

98-98: Provide a safe default for wandb job type (optional).

Prevents empty job-type when the label isn’t set.

Apply:

-    --wandb-job-type=${pipeline_label} \
+    --wandb-job-type=${pipeline_label:-ci} \
ci/benchmarks/perf/geneformer_pretrain.yaml (1)

33-36: Optional: use rsync for faster, resilient node-local staging.

rsync gives progress and can skip unchanged files on retries.

Apply:

-      time cp -r ${data_path}/ $NEW_DATA_PATH;
+      mkdir -p "$NEW_DATA_PATH"
+      time rsync -a --delete "${data_path}/" "$NEW_DATA_PATH/"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a29272f and a79d880.

📒 Files selected for processing (2)
  • ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
  • ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-bionemo-image
  • GitHub Check: Analyze (rust)
🔇 Additional comments (3)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

96-98: W&B job type addition — LGTM

The new --wandb-job-type=${pipeline_label} flag is appropriate and keeps metadata consistent with project/group.


96-98: Confirm CI template variables are defined or injected
pipeline_label, wandb_project_name, and target are used in ci/benchmarks/partial-conv/evo2_finetuning.yaml (lines 96–98) but aren’t declared in its script_args; ensure your CI job or parent template provides those variables or add them explicitly.

ci/benchmarks/perf/geneformer_pretrain.yaml (1)

42-44: Verify dataset root paths before training
The parity check failed because $data_path and $SLURMD_NODENAME appear unset, so no directories were listed. Please re-run the comparison with the actual source and destination paths explicitly set. For example:

#!/usr/bin/env bash
set -e
echo "SRC (${data_path}):"; ls -1 "${data_path:?}" | sort | head
echo "DST (${NEW_DATA_PATH}):"; ls -1 "${NEW_DATA_PATH:?}" | sort | head

Ensure both data_path and NEW_DATA_PATH are exported correctly (and SLURMD_NODENAME if it’s part of the path) so you can confirm that the top-level directory structure matches what the trainer expects.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Sep 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.61%. Comparing base (a29272f) to head (a79d880).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1097      +/-   ##
==========================================
- Coverage   80.62%   80.61%   -0.01%     
==========================================
  Files         157      157              
  Lines       11079    11079              
==========================================
- Hits         8932     8931       -1     
- Misses       2147     2148       +1     

see 1 file with indirect coverage changes

@dorotat-nv dorotat-nv changed the title [BENCHMARKS] fixing labels [BENCHMARKS] fixing evo2 finetune config Sep 8, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

26-27: Good exclusion from run identifiers; consider clearer naming.

Keeping pckg_url and file_name_wheel out of the ID is right. Minor nit: file_name_wheel currently holds a package spec (not a wheel filename). Consider package_spec to avoid confusion.


59-60: Reproducibility and URL coupling.

  • Pin the package version (e.g., subquadratic-ops==x.y.z) for deterministic runs.
  • Include the URL scheme directly in pckg_url (e.g., https://…) so the install step doesn’t own that concern.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79d880 and a085e6e.

📒 Files selected for processing (1)
  • ci/benchmarks/partial-conv/evo2_finetuning.yaml (4 hunks)
🔇 Additional comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)

110-112: Ensure required variables are defined.
wandb_project_name and pipeline_label are used here but aren’t defined in any config or .env—confirm they’re set in your CI/CD environment or passed by the calling pipeline.


105-105: Confirm flag and package support upstream.

  • train.py defines --use-subquadratic_ops in its argument parser and applies args.use_subquadratic_ops to config_modifiers_init (lines 584–588, 771–773).
  • Ensure the subquadratic-ops wheel installs a subquadratic_ops module and that import subquadratic_ops (and any required symbols) succeeds at runtime.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ci/benchmarks/perf/esm2_pretrain.yaml (1)

84-84: Bug: variable not expanded in tp argument

Brace style is wrong; {tp} will not expand. Use ${tp}.

-    --tensor-model-parallel-size={tp} \
+    --tensor-model-parallel-size=${tp} \
🧹 Nitpick comments (2)
ci/benchmarks/perf/esm2_pretrain.yaml (2)

44-55: Optional: prefer rsync and clean up after job

rsync can be faster/safer on large trees; also consider cleaning the per-node cache and flag on exit (rank 0).

+  # Optional replacement for cp:
+  # if rsync -a --info=stats2,progress2 "${data_path}/" "$NEW_DATA_PATH/"; then touch "$COPY_FLAG"; else exit 1; fi
+
+  # Optional cleanup (only local rank 0)
+  trap 'if [ "${SLURM_LOCALID:-0}" = "0" ]; then rm -rf "$NEW_DATA_PATH" "$COPY_FLAG"; fi' EXIT

44-55: Optional: check /dev/shm free space before copying

Prevent spurious OOM/ENOSPC by prechecking size vs available space; fall back to shared path if insufficient.

+  # Optional space check (bytes). Falls back to shared path if not enough space.
+  src_bytes=$(du -sb "${data_path}" | awk '{print $1}')
+  shm_free=$(df -B1 /dev/shm | awk 'NR==2{print $4}')
+  if [ "$shm_free" -lt "$src_bytes" ]; then
+    echo "WARN: /dev/shm has insufficient space ($shm_free < $src_bytes); using shared ${data_path}" >&2
+    NEW_DATA_PATH="${data_path}"
+    touch "$COPY_FLAG"
+  fi
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a085e6e and 8f222f6.

📒 Files selected for processing (1)
  • ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (2)
ci/benchmarks/perf/esm2_pretrain.yaml (2)

57-60: LGTM: dataset path substitution

Switching to NEW_DATA_PATH for train/valid paths is consistent with the staging logic.

Please confirm those four filenames exist under data_path so the copy preserves them 1:1.


80-80: LGTM: WandB job type

Adding job type improves grouping. Ensure pipeline_label is always set in this config matrix to avoid empty tags.

@dorotat-nv
Copy link
Copy Markdown
Collaborator Author

/ok to test 2985bfd

@dorotat-nv dorotat-nv requested a review from jwilber September 8, 2025 14:18
@dorotat-nv dorotat-nv enabled auto-merge September 8, 2025 14:29
@dorotat-nv dorotat-nv added this pull request to the merge queue Sep 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 8, 2025
@dorotat-nv dorotat-nv enabled auto-merge September 9, 2025 11:07
@dorotat-nv
Copy link
Copy Markdown
Collaborator Author

/ok to test 4672198

@dorotat-nv dorotat-nv added this pull request to the merge queue Sep 9, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ci/benchmarks/perf/esm2_pretrain.yaml (1)

84-84: Bug: literal {tp} passed to CLI.

Use ${tp} so TP is applied.

-    --tensor-model-parallel-size={tp} \
+    --tensor-model-parallel-size=${tp} \
♻️ Duplicate comments (5)
ci/benchmarks/perf/geneformer_pretrain.yaml (2)

30-31: Fix invalid bash variable expansion.

Use ${SLURMD_NODENAME}, not ${{SLURMD_NODENAME}}.

-  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
-  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
+  NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";

38-41: Bound the wait loop to avoid indefinite hangs.

Add a timeout so non-root ranks don't block forever if the copy fails.

-  while [ ! -f $COPY_FLAG ]; do
-      sleep 1
-  done
+  start_time=$(date +%s); timeout="${COPY_TIMEOUT_SEC:-1800}"
+  while [ ! -f "$COPY_FLAG" ]; do
+      sleep 1
+      now=$(date +%s)
+      if (( now - start_time > timeout )); then
+          echo "Timed out waiting for data staging on node ${SLURMD_NODENAME}" >&2
+          exit 1
+      fi
+  done
ci/benchmarks/perf/esm2_pretrain.yaml (3)

44-49: Scope per job to avoid stale flags/collisions.

Include job id in paths; clean pre-existing artifacts.

-  COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
-  NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";
+  JOB_ID="${SLURM_JOB_ID:-${SLURM_JOBID:-$$}}"
+  COPY_FLAG="/tmp/copy_done_${JOB_ID}_${SLURMD_NODENAME}"
+  NEW_DATA_PATH="/dev/shm/${JOB_ID}_data_${SLURMD_NODENAME}"
+  rm -f "$COPY_FLAG"; rm -rf "$NEW_DATA_PATH"

52-55: Bound the wait loop to avoid indefinite hangs.

Add a timeout to the barrier.

-  while [ ! -f $COPY_FLAG ]; do
-      sleep 1
-  done
+  TIMEOUT="${COPY_TIMEOUT_SEC:-1800}"; waited=0
+  while [ ! -f "$COPY_FLAG" ]; do
+      sleep 1; waited=$((waited+1))
+      if [ "$waited" -ge "$TIMEOUT" ]; then
+          echo "ERROR: timed out waiting for $COPY_FLAG" >&2
+          exit 1
+      fi
+  done

47-51: Guard the copy and only touch the flag on success; quote paths.

Prevent false-ready signals and partial datasets.

-      df -h;
-      echo $NEW_DATA_PATH;
-      time cp -r ${data_path}/ $NEW_DATA_PATH;
-      touch $COPY_FLAG
+      df -h
+      echo "$NEW_DATA_PATH"
+      mkdir -p "$NEW_DATA_PATH"
+      if time cp -a "${data_path}/." "$NEW_DATA_PATH/"; then
+        touch "$COPY_FLAG"
+      else
+        echo "ERROR: copy to $NEW_DATA_PATH failed" >&2
+        exit 1
+      fi
🧹 Nitpick comments (1)
ci/benchmarks/perf/geneformer_pretrain.yaml (1)

29-29: Harden script execution.

Enable strict mode to fail early on errors and unset vars.

-script: |-
+script: |-
+  set -euo pipefail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d162d5 and 4672198.

📒 Files selected for processing (3)
  • ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
  • ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)
  • ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (1)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)

96-99: WandB job type flag—confirm CLI name and placement.

Validate that the training entrypoint accepts --wandb-job-type (kebab-case) and that moving WandB flags after --early-stop-on-step is supported.

If needed, I can scan the repo for the argparse/Hydra schema to confirm accepted flag names.

Merged via the queue into main with commit 223b211 Sep 9, 2025
19 checks passed
@dorotat-nv dorotat-nv deleted the dorotat/hotfix-jobtype-evo2finetune branch September 9, 2025 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow:skip Skip all CI tests for this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants