[BENCHMARKS] fixing evo2 finetune config #1097
Conversation
Walkthrough

Adds per-node in-memory data staging and synchronization to two perf benchmark YAMLs, updating data paths to the staged locations. Also adjusts WandB CLI arguments in a partial-conv benchmark by reordering flags and introducing a wandb job type parameter.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant R0 as Rank 0 (per node)
  participant Rn as Other Ranks (same node)
  participant FS as Node-local /dev/shm
  participant DS as Shared Dataset (${data_path})
  Note over R0, Rn: Start job on a node
  R0->>R0: Compute NEW_DATA_PATH=/dev/shm/data_path_<nodename>
  R0->>FS: time cp -r DS -> NEW_DATA_PATH
  R0->>FS: Create COPY_FLAG at NEW_DATA_PATH/.copy_done
  Rn->>FS: Poll for COPY_FLAG existence
  FS-->>Rn: COPY_FLAG detected
  Note over R0, Rn: All ranks proceed
  R0->>R0: Launch training using paths under NEW_DATA_PATH
  Rn->>Rn: Launch training using paths under NEW_DATA_PATH
```
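The flow in the diagram can be sketched as a small POSIX shell fragment. This is a minimal illustration only, assuming SLURM exports `SLURMD_NODENAME` and `SLURM_LOCALID`; the helper name `stage_node_data` is hypothetical and not taken from the benchmark YAMLs.

```shell
#!/bin/sh
# Minimal sketch of the per-node staging barrier. Rank 0 on each node copies
# the shared dataset into node-local storage and raises a flag file; the other
# local ranks poll for the flag before proceeding.
stage_node_data() {
  src="$1"; dest_root="$2"
  node="${SLURMD_NODENAME:-localnode}"
  new_data_path="${dest_root}/data_path_${node}"
  copy_flag="${dest_root}/copy_done_${node}"
  if [ "${SLURM_LOCALID:-0}" = "0" ]; then
    # Rank 0 stages the dataset, then signals completion.
    mkdir -p "$new_data_path"
    cp -r "${src}/." "$new_data_path/" && touch "$copy_flag"
  else
    # Other local ranks block until rank 0 signals completion.
    while [ ! -f "$copy_flag" ]; do sleep 1; done
  fi
  echo "$new_data_path"
}
```

Training commands on every rank then point at the echoed staged path instead of the shared `${data_path}`.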
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: 2 passed, 1 warning
/ok to test a79d880
Actionable comments posted: 2
🧹 Nitpick comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)
98-98: Provide a safe default for wandb job type (optional). Prevents an empty job type when the label isn't set.

Apply:

```diff
- --wandb-job-type=${pipeline_label} \
+ --wandb-job-type=${pipeline_label:-ci} \
```

ci/benchmarks/perf/geneformer_pretrain.yaml (1)
33-36: Optional: use rsync for faster, resilient node-local staging. rsync gives progress and can skip unchanged files on retries.

Apply:

```diff
- time cp -r ${data_path}/ $NEW_DATA_PATH;
+ mkdir -p "$NEW_DATA_PATH"
+ time rsync -a --delete "${data_path}/" "$NEW_DATA_PATH/"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
- ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-bionemo-image
- GitHub Check: Analyze (rust)
🔇 Additional comments (3)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)
96-98: W&B job type addition — LGTM

The new --wandb-job-type=${pipeline_label} flag is appropriate and keeps metadata consistent with project/group.
96-98: Confirm CI template variables are defined or injected

pipeline_label, wandb_project_name, and target are used in ci/benchmarks/partial-conv/evo2_finetuning.yaml (lines 96-98) but aren't declared in its script_args; ensure your CI job or parent template provides those variables, or add them explicitly.

ci/benchmarks/perf/geneformer_pretrain.yaml (1)
42-44: Verify dataset root paths before training

The parity check failed because $data_path and $SLURMD_NODENAME appear unset, so no directories were listed. Please re-run the comparison with the actual source and destination paths explicitly set. For example:

```bash
#!/usr/bin/env bash
set -e
echo "SRC (${data_path}):"; ls -1 "${data_path:?}" | sort | head
echo "DST (${NEW_DATA_PATH}):"; ls -1 "${NEW_DATA_PATH:?}" | sort | head
```

Ensure both data_path and NEW_DATA_PATH are exported correctly (and SLURMD_NODENAME if it's part of the path) so you can confirm that the top-level directory structure matches what the trainer expects.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```text
@@            Coverage Diff             @@
##             main    #1097      +/-   ##
==========================================
- Coverage   80.62%   80.61%   -0.01%
==========================================
  Files         157      157
  Lines       11079    11079
==========================================
- Hits         8932     8931       -1
- Misses       2147     2148       +1
```
Actionable comments posted: 1
🧹 Nitpick comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)
26-27: Good exclusion from run identifiers; consider clearer naming.

Keeping pckg_url and file_name_wheel out of the ID is right. Minor nit: file_name_wheel currently holds a package spec (not a wheel filename). Consider package_spec to avoid confusion.
59-60: Reproducibility and URL coupling.

- Pin the package version (e.g., subquadratic-ops==x.y.z) for deterministic runs.
- Include the URL scheme directly in pckg_url (e.g., https://…) so the install step doesn't own that concern.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- ci/benchmarks/partial-conv/evo2_finetuning.yaml (4 hunks)
🔇 Additional comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)
110-112: Ensure required variables are defined.

wandb_project_name and pipeline_label are used here but aren't defined in any config or .env; confirm they're set in your CI/CD environment or passed by the calling pipeline.
105-105: Confirm flag and package support upstream.

- train.py defines --use-subquadratic_ops in its argument parser and applies args.use_subquadratic_ops to config_modifiers_init (lines 584-588, 771-773).
- Ensure the subquadratic-ops wheel installs a subquadratic_ops module and that import subquadratic_ops (and any required symbols) succeeds at runtime.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ci/benchmarks/perf/esm2_pretrain.yaml (1)
84-84: Bug: variable not expanded in tp argument

Brace style is wrong; {tp} will not expand. Use ${tp}.

```diff
- --tensor-model-parallel-size={tp} \
+ --tensor-model-parallel-size=${tp} \
```
🧹 Nitpick comments (2)
ci/benchmarks/perf/esm2_pretrain.yaml (2)
44-55: Optional: prefer rsync and clean up after job

rsync can be faster/safer on large trees; also consider cleaning the per-node cache and flag on exit (rank 0).

```diff
+ # Optional replacement for cp:
+ # if rsync -a --info=stats2,progress2 "${data_path}/" "$NEW_DATA_PATH/"; then touch "$COPY_FLAG"; else exit 1; fi
+
+ # Optional cleanup (only local rank 0)
+ trap 'if [ "${SLURM_LOCALID:-0}" = "0" ]; then rm -rf "$NEW_DATA_PATH" "$COPY_FLAG"; fi' EXIT
```
44-55: Optional: check /dev/shm free space before copying

Prevent spurious OOM/ENOSPC by prechecking dataset size against available space; fall back to the shared path if insufficient.

```diff
+ # Optional space check (bytes). Falls back to shared path if not enough space.
+ src_bytes=$(du -sb "${data_path}" | awk '{print $1}')
+ shm_free=$(df -B1 /dev/shm | awk 'NR==2{print $4}')
+ if [ "$shm_free" -lt "$src_bytes" ]; then
+   echo "WARN: /dev/shm has insufficient space ($shm_free < $src_bytes); using shared ${data_path}" >&2
+   NEW_DATA_PATH="${data_path}"
+   touch "$COPY_FLAG"
+ fi
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (2)
ci/benchmarks/perf/esm2_pretrain.yaml (2)
57-60: LGTM: dataset path substitution

Switching to NEW_DATA_PATH for train/valid paths is consistent with the staging logic. Please confirm those four filenames exist under data_path so the copy preserves them 1:1.
80-80: LGTM: WandB job typeAdding job type improves grouping. Ensure pipeline_label is always set in this config matrix to avoid empty tags.
/ok to test 2985bfd

/ok to test 4672198
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ci/benchmarks/perf/esm2_pretrain.yaml (1)
84-84: Bug: literal {tp} passed to CLI.

Use ${tp} so TP is applied.

```diff
- --tensor-model-parallel-size={tp} \
+ --tensor-model-parallel-size=${tp} \
```
♻️ Duplicate comments (5)
ci/benchmarks/perf/geneformer_pretrain.yaml (2)
30-31: Fix invalid bash variable expansion.

Use ${SLURMD_NODENAME}, not ${{SLURMD_NODENAME}}.

```diff
- COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
- NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+ COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
+ NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";
```
38-41: Bound the wait loop to avoid indefinite hangs.

Add a timeout so non-root ranks don't block forever if the copy fails.

```diff
- while [ ! -f $COPY_FLAG ]; do
-   sleep 1
- done
+ start_time=$(date +%s); timeout="${COPY_TIMEOUT_SEC:-1800}"
+ while [ ! -f "$COPY_FLAG" ]; do
+   sleep 1
+   now=$(date +%s)
+   if (( now - start_time > timeout )); then
+     echo "Timed out waiting for data staging on node ${SLURMD_NODENAME}" >&2
+     exit 1
+   fi
+ done
```

ci/benchmarks/perf/esm2_pretrain.yaml (3)
44-49: Scope per job to avoid stale flags/collisions.

Include the job id in paths; clean pre-existing artifacts.

```diff
- COPY_FLAG="/tmp/copy_done_${SLURMD_NODENAME}";
- NEW_DATA_PATH="/dev/shm/data_path_${SLURMD_NODENAME}";
+ JOB_ID="${SLURM_JOB_ID:-${SLURM_JOBID:-$$}}"
+ COPY_FLAG="/tmp/copy_done_${JOB_ID}_${SLURMD_NODENAME}"
+ NEW_DATA_PATH="/dev/shm/${JOB_ID}_data_${SLURMD_NODENAME}"
+ rm -f "$COPY_FLAG"; rm -rf "$NEW_DATA_PATH"
```
52-55: Bound the wait loop to avoid indefinite hangs.

Add a timeout to the barrier.

```diff
- while [ ! -f $COPY_FLAG ]; do
-   sleep 1
- done
+ TIMEOUT="${COPY_TIMEOUT_SEC:-1800}"; waited=0
+ while [ ! -f "$COPY_FLAG" ]; do
+   sleep 1; waited=$((waited+1))
+   if [ "$waited" -ge "$TIMEOUT" ]; then
+     echo "ERROR: timed out waiting for $COPY_FLAG" >&2
+     exit 1
+   fi
+ done
```
47-51: Guard the copy and only touch the flag on success; quote paths.

Prevent false-ready signals and partial datasets.

```diff
- df -h;
- echo $NEW_DATA_PATH;
- time cp -r ${data_path}/ $NEW_DATA_PATH;
- touch $COPY_FLAG
+ df -h
+ echo "$NEW_DATA_PATH"
+ mkdir -p "$NEW_DATA_PATH"
+ if time cp -a "${data_path}/." "$NEW_DATA_PATH/"; then
+   touch "$COPY_FLAG"
+ else
+   echo "ERROR: copy to $NEW_DATA_PATH failed" >&2
+   exit 1
+ fi
```
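The review's hardening suggestions (job-scoped paths, a copy that only raises the flag on success, and a bounded wait) compose into a single sketch. The function name stage_dataset and the COPY_TIMEOUT_SEC knob are illustrative, not from the benchmark YAMLs.

```shell
#!/bin/sh
# Hypothetical composition of the hardened staging pattern discussed above.
stage_dataset() {
  src="$1"; root="$2"
  job="${SLURM_JOB_ID:-$$}"; node="${SLURMD_NODENAME:-localnode}"
  dest="${root}/${job}_data_${node}"
  flag="${root}/copy_done_${job}_${node}"
  if [ "${SLURM_LOCALID:-0}" = "0" ]; then
    # Job-scoped names plus an explicit pre-clean avoid stale flags from a
    # previous run on the same node.
    rm -rf "$dest"; rm -f "$flag"
    mkdir -p "$dest"
    # The flag is created only when the copy succeeded, so waiting ranks
    # never see a partially staged dataset as ready.
    cp -a "${src}/." "$dest/" && touch "$flag" || return 1
  else
    waited=0; timeout="${COPY_TIMEOUT_SEC:-1800}"
    while [ ! -f "$flag" ]; do
      sleep 1; waited=$((waited + 1))
      [ "$waited" -ge "$timeout" ] && return 1
    done
  fi
  echo "$dest"
}
```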
🧹 Nitpick comments (1)
ci/benchmarks/perf/geneformer_pretrain.yaml (1)
29-29: Harden script execution.

Enable strict mode to fail early on errors and unset vars.

```diff
-script: |-
+script: |-
+  set -euo pipefail
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
- ci/benchmarks/perf/esm2_pretrain.yaml (1 hunks)
- ci/benchmarks/perf/geneformer_pretrain.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)
96-99: WandB job type flag: confirm CLI name and placement.

Validate that the training entrypoint accepts --wandb-job-type (kebab-case) and that moving the WandB flags after --early-stop-on-step is supported. If needed, I can scan the repo for the argparse/Hydra schema to confirm accepted flag names.
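For reference when checking the schema, a minimal argparse sketch (not the benchmark's actual parser; the default "ci" is an assumption) shows how a kebab-case flag maps to a snake_case attribute:

```python
import argparse

# argparse converts the kebab-case flag --wandb-job-type into the attribute
# args.wandb_job_type, which is the name to look for in the entrypoint.
parser = argparse.ArgumentParser()
parser.add_argument("--wandb-job-type", type=str, default="ci")

args = parser.parse_args(["--wandb-job-type", "perf"])
print(args.wandb_job_type)  # → perf
```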
Description
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
Commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An /ok to test comment on the pull request is required to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist
Summary by CodeRabbit
New Features
Chores