feat: Add Evo2 fine-tuning partial-conv benchmarking (#1028)
Conversation
Signed-off-by: nvmvle <mvle@nvidia.com>
Have we confirmed that JET has the required datasets here? If so, awesome! Then just update with my comments and it looks great.
…e train.py changes Signed-off-by: Jared Wilber <jwilber@nvidia.com>
/ok to test 488bf5e

Updated:
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1028      +/-   ##
==========================================
- Coverage   80.69%   80.59%   -0.11%
==========================================
  Files         156      157       +1
  Lines       11060    11079      +19
==========================================
+ Hits         8925     8929       +4
- Misses       2135     2150      +15
```
Signed-off-by: nvmvle <mvle@nvidia.com>

/ok to test afec57a

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/logging/callbacks.py
Walkthrough

Adds a new Evo2 finetuning benchmark config, updates a placeholder in the pretrain benchmark URL, introduces a GarbageCollectAtInferenceTime Lightning callback, and wires a new --garbage-collect-at-inference flag in the training script to optionally run CUDA/GC cleanup at validation start.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor CI as CI / User
    participant CLI as train.py
    participant Trainer as Lightning Trainer
    participant CB as GarbageCollectAtInferenceTime
    participant CUDA as CUDA & Python GC
    participant Val as Validation Loop
    CI->>CLI: invoke train_${model} [--garbage-collect-at-inference]
    CLI->>Trainer: init Trainer (callbacks [..., CB?])
    alt garbage-collect flag enabled
        Trainer->>CB: on_validation_start()
        CB->>CUDA: empty_cache(), sync, set_device, sync, gc.collect()
        CUDA-->>CB: cleanup complete
    else flag disabled
        Trainer-->>Val: proceed to validation without extra cleanup
    end
    Trainer->>Val: start validation
    Val-->>Trainer: return metrics
```
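The cleanup flow in the diagram can be sketched as a minimal callback. This is an illustrative reconstruction from the walkthrough, not the PR's actual implementation: the class and hook names follow the diagram, `gc.collect()` is run first per the review suggestion below, and `torch` is treated as optional so the sketch stays self-contained.

```python
import gc

try:
    import torch
except ImportError:  # allow the sketch to run without a GPU stack installed
    torch = None


class GarbageCollectAtInferenceTime:
    """Free Python and CUDA memory right before each validation run."""

    def on_validation_start(self, trainer, pl_module) -> None:
        """Clean up memory before validation to prevent CUDA initialization errors."""
        gc.collect()  # release Python-held tensors first so empty_cache() can reclaim them
        if torch is not None and torch.cuda.is_available():
            try:
                torch.cuda.empty_cache()   # return cached blocks to the CUDA allocator
                torch.cuda.synchronize()   # make sure pending kernels have finished
            except Exception as e:
                print(f"Warning: CUDA cleanup failed: {e}")
```

In the real code this would subclass `lightning.pytorch.Callback` and be appended to the Trainer's callback list when the flag is set.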
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (4)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/callbacks.py (1)
25-36: Tighten CUDA/GC cleanup ordering and logging.
- Run gc.collect() before empty_cache() so Python frees tensors first.
- Dropping set_device(current_device) is safe/redundant here.
- Use a logger instead of print, and optionally log only on rank 0 to avoid spam.
```diff
-    def on_validation_start(self, trainer, pl_module) -> None:
+    def on_validation_start(self, trainer, pl_module) -> None:
         """Clean up CUDA memory before validation to prevent initialization errors."""
         if torch.cuda.is_available():
             try:
-                torch.cuda.empty_cache()
-                torch.cuda.synchronize()
-                current_device = torch.cuda.current_device()
-                torch.cuda.set_device(current_device)
-                torch.cuda.synchronize()
-                gc.collect()
+                gc.collect()
+                torch.cuda.empty_cache()
+                torch.cuda.synchronize()
             except Exception as e:
-                print(f"Warning: CUDA cleanup failed: {e}")
+                logger = getattr(pl_module, "log", None)
+                msg = f"CUDA cleanup failed: {e}"
+                if logger is not None:
+                    pl_module.log("gc_warning", msg, on_step=True, on_epoch=False, prog_bar=False, logger=True)
+                else:
+                    import logging; logging.getLogger(__name__).warning(msg)
```

Additions outside the selected range:

```python
# at top-level imports
import logging

logger = logging.getLogger(__name__)
```

sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py (1)
510-515: Flag name and help are clear; consider interplay with existing GC callback.

You already support nl_callbacks.GarbageCollectionCallback via --gc-interval. Document that both can be used together (GPU vs CPU GC) and which one to prefer in typical FP8 runs.
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)
47-47: Align max_steps and early-stop to avoid confusion.

early-stop-on-step overrides max_steps in train.py. With max_steps=10 but stop_steps=200, the run will target 200 steps. If you intend a short smoke test, set stop_steps to 10 (or drop max_steps).
```diff
-  max_steps: 10
+  max_steps: 10
@@
-    --early-stop-on-step=${stop_steps} \
+    --early-stop-on-step=10 \
```

Also applies to: 97-97
55-56: Note: precision key is metadata-only.

precision: fp8 isn't consumed by the script; fp8 is toggled via --fp8/--fp8-wgrad in train.py. If you need FP8 here, add the flags; otherwise keep precision solely for grouping.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
- ci/benchmarks/partial-conv/evo2_pretrain.yaml (1 hunks)
- sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py (3 hunks)
- sub-packages/bionemo-evo2/src/bionemo/evo2/utils/callbacks.py (1 hunks)
- sub-packages/bionemo-evo2/src/bionemo/evo2/utils/logging/callbacks.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py (1)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/callbacks.py (1)
GarbageCollectAtInferenceTime(22-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (5)
ci/benchmarks/partial-conv/evo2_pretrain.yaml (1)
18-18: Verify JET token interpolation inside script_args value.

Switching to ${{JET_GITLAB_TOKEN}} assumes JET expands variables within script_args before shell execution. Please confirm this resolves to a bare token at runtime; otherwise pip will receive the literal braces and auth will fail. If uncertain, prefer env expansion directly in the script (e.g., set artefacts_url from $JET_GITLAB_TOKEN in-shell) to avoid templating ambiguities and reduce secret exposure in logs.
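The in-shell expansion suggested above can be sketched as follows; the token value and host URL here are placeholders, not the real JET configuration:

```shell
# Placeholder token and example host for illustration only;
# in CI the token would come from JET's environment, not an export.
export JET_GITLAB_TOKEN="example-token"
artefacts_url="https://oauth2:${JET_GITLAB_TOKEN}@gitlab.example.com/artefacts"
echo "$artefacts_url"
```

Because the shell performs the substitution, pip receives a bare token rather than literal `${{...}}` braces.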
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/logging/callbacks.py (1)
46-46: LGTM on spacing.

The added blank line after the docstring improves readability; no functional changes.
sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py (2)
51-51: LGTM: callback import is scoped and specific.

Importing GarbageCollectAtInferenceTime here keeps the training entrypoint self-contained.
655-656: LGTM: conditional registration of cleanup callback.

Hooking it right after TEVCallback keeps ordering predictable and isolated from LoRA transforms.
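The wiring the two comments above describe can be sketched as follows. The flag and class names come from the review; the surrounding structure (string stand-ins for callback objects) is assumed for illustration:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--garbage-collect-at-inference",
    action="store_true",
    help="Run CUDA/GC cleanup at validation start.",
)

# Simulate invoking the script with the flag enabled.
args = parser.parse_args(["--garbage-collect-at-inference"])

# Stand-in callback list; in train.py these are real callback instances.
callbacks = ["TEVCallback"]
if args.garbage_collect_at_inference:
    callbacks.append("GarbageCollectAtInferenceTime")
print(callbacks)
```

argparse converts the dashed flag to the `garbage_collect_at_inference` attribute, matching the attribute name the review cites.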
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)
58-65: Matrix overrides look correct; verify empty string expansion.

Confirm that lora_enabled: "" renders to nothing (no stray spaces) while "--lora-finetune" is injected for the second product. If templating preserves an extra space, place the flag at the end of the line or gate with a conditional.
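A hedged sketch of the kind of products overlay the comment refers to; the key names are assumed from the review and the real file may be structured differently:

```yaml
# Hypothetical excerpt: one product without LoRA, one with the flag injected.
products:
  - variant: base
    lora_enabled: ""            # must render to nothing, with no stray space
  - variant: lora
    lora_enabled: "--lora-finetune"
```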
/ok to test 7a828cf

/ok to test 9ec1f62
Actionable comments posted: 6
♻️ Duplicate comments (1)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1)
88-88: Quote glob to prevent shell expansion.

SDH* will be expanded by the shell.
```diff
-    --hybrid-override-pattern SDH* \
+    --hybrid-override-pattern 'SDH*' \
```
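The failure mode the comment warns about can be demonstrated in a few lines of shell; the temporary directory and file names are illustrative:

```shell
# If files matching the pattern exist, an unquoted glob is expanded
# before the program ever sees it.
d=$(mktemp -d) && cd "$d"
touch SDH1 SDH2

set -- SDH*           # unquoted: the shell expands the glob to matching files
echo "unquoted argc: $#"

set -- 'SDH*'         # quoted: passed through literally as one argument
echo "quoted arg: $1"
```

With two matching files present, the unquoted form yields two arguments while the quoted form preserves the literal `SDH*` pattern.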
🧹 Nitpick comments (2)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (2)
30-30: Unused arg: workspace.

Not referenced in the script. Either use it (e.g., for result-dir) or drop to avoid drift.
31-41: Path assumptions: confirm JET availability.

Ensure /data/evo2/{preprocessed_data,checkpoints/nemo2_evo2_1b_8k,training_data_config.yaml} exist on the JET runners used by this scope.
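One way to verify this is a preflight check on the runner. This is a hypothetical sketch using the paths named in the comment; on a machine without these mounts every path will be reported missing:

```shell
# Report presence of each dataset path the benchmark config assumes.
for p in /data/evo2/preprocessed_data \
         /data/evo2/checkpoints/nemo2_evo2_1b_8k \
         /data/evo2/training_data_config.yaml; do
  [ -e "$p" ] && echo "found: $p" || echo "missing: $p"
done
```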
📜 Review details
📒 Files selected for processing (1)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (1 hunks)
🔇 Additional comments (3)
ci/benchmarks/partial-conv/evo2_finetuning.yaml (3)
3-26: LGTM on key_segments defaults.

Reasonable exclusions to keep run IDs concise.
57-66: Products overlay looks fine.

Variant and LoRA flag override pattern is clear.
98-98: Ignore --garbage-collect-at-inference concern.

The flag is defined in sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py (parser.add_argument at line 511) and applied via args.garbage_collect_at_inference (line 655), so the YAML entry is valid as-is.

Likely an incorrect or invalid review comment.
/ok to test 807f311

/ok to test d3b59bd
Description
This PR adds a benchmarking configuration for Evo2 fine-tuning with partial convolution support. The changes include:
- a new Evo2 fine-tuning benchmark config (ci/benchmarks/partial-conv/evo2_finetuning.yaml)
- an updated placeholder in the pretrain benchmark URL
- a new GarbageCollectAtInferenceTime Lightning callback
- a new --garbage-collect-at-inference flag in train.py to optionally run CUDA/GC cleanup at validation start
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An authorized user must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
Pre-submit Checklist
- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
Signed-off-by: My Le <mvle@nvidia.com>
Summary by CodeRabbit

New Features
- Added a GarbageCollectAtInferenceTime callback and a --garbage-collect-at-inference training flag to optionally run CUDA/GC cleanup at validation start.

Chores
- Added an Evo2 fine-tuning partial-conv benchmark config and updated a placeholder in the pretrain benchmark URL.