
Jwilber/debug tflops esm2 #1267

Open

jwilber wants to merge 8 commits into main from jwilber/debug-tflops-esm2

Conversation

@jwilber (Collaborator) commented Oct 21, 2025

Debug tflops issue by logging seq length

Summary by CodeRabbit

  • New Features

    • Added performance metrics to monitor average sequence lengths (padded and unpadded) during training runs.
  • Chores

    • Standardized micro-batch size configuration across multiple model profiles for consistent performance tuning.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
@copy-pr-bot bot commented Oct 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Oct 21, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.
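For reference, the setting named in the note above would look like this in the repository's CodeRabbit configuration file (a sketch based solely on the reviews.review_status key mentioned; other keys and exact file layout are not shown in this thread):

```yaml
# .coderabbit.yaml — suppress the "Review skipped" status message
reviews:
  review_status: false
```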

Walkthrough

Two changes enhance performance monitoring and configuration consistency: added sequence length metrics (padded and unpadded) to performance logging, and standardized micro_batch_size configuration across four model product variants in the ESM2 native TE 15B configuration.

Changes

Cohort / File(s) — Summary

Performance metrics tracking
bionemo-recipes/recipes/esm2_native_te/perf_logger.py
Added two new metrics (train/avg_sequence_length_unpadded and train/avg_sequence_length_padded) to the performance logger. Updated the log_step method to compute and record average sequence lengths per batch alongside the existing timing and throughput metrics.

Configuration standardization
ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml
Added micro_batch_size: 8 to four product configurations (TE bshd perf FSDP2, TE thd perf FSDP2, TE bshd perf MFSDP, TE thd perf MFSDP) to align per-product micro-batch sizing with the shared default.
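The metric arithmetic described above can be sketched in plain Python. This is a hypothetical helper, not the actual perf_logger.py code — the real logger accumulates these values with MeanMetric objects — but the computation it performs per batch is the same: divide token counts by batch size.

```python
def avg_sequence_lengths(attention_mask):
    """Compute per-batch average sequence lengths.

    attention_mask: list of per-sample 0/1 lists, all padded to equal length.
    Returns (avg_unpadded, avg_padded), matching the two new metrics
    train/avg_sequence_length_unpadded and train/avg_sequence_length_padded.
    """
    batch_size = len(attention_mask)
    # Padded token count: every slot in the batch, including padding.
    num_tokens = sum(len(row) for row in attention_mask)
    # Unpadded token count: only positions where the mask is 1.
    num_unpadded_tokens = sum(sum(row) for row in attention_mask)
    return (
        num_unpadded_tokens / batch_size,
        num_tokens / batch_size,
    )


mask = [
    [1, 1, 1, 0],  # real length 3, padded to 4
    [1, 1, 0, 0],  # real length 2, padded to 4
]
print(avg_sequence_lengths(mask))  # → (2.5, 4.0)
```

A large gap between the two averages indicates heavy padding, which inflates padded-token throughput and can explain a misleading TFLOPS number — the motivation this PR states for logging both.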

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through logs so bright,
Counting sequences left and right,
Padded, unpadded, batch by batch,
Now configurations match and match!
Performance metrics, neat and clean,
The finest training yet we've seen! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Description Check — ⚠️ Warning
Explanation: The pull request description is significantly incomplete relative to the required template. The author provided only a single sentence ("Debug tflops issue by logging seq length") without any of the required sections: a detailed description, usage examples with code snippets, a type-of-change classification, CI pipeline configuration labels, or pre-submit checklist confirmations. While the brief description touches on the PR's purpose, it fails to provide the structured information and verification steps that the template mandates.
Resolution: Complete the pull request description by following the template structure. Add a detailed description explaining why sequence length logging was added, provide a usage example showing how the new metrics are accessed, select the appropriate type of change (likely "New feature"), configure any required CI labels, and confirm all pre-submit checklist items. This ensures reviewers have the context and information needed to properly evaluate the changes.

Title Check — ❓ Inconclusive
Explanation: The PR title "Jwilber/debug tflops esm2" relates to the pull request's objective of debugging a TFLOPS issue, but it uses vague, non-descriptive terminology that doesn't clearly communicate the actual changes. The title includes a branch-name prefix ("Jwilber/"), which is atypical for PR titles, and "debug tflops" lacks specificity about what implementation details changed (e.g., adding sequence length metrics or configuration updates). While the title references a real aspect of the changeset, it doesn't give a teammate scanning the history enough information to quickly understand what was modified.
✅ Passed checks (1 passed)

Docstring Coverage — ✅ Passed
Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)

99-103: Sequence length calculations are correct; minor formatting issue.

The logic correctly computes average sequence lengths by dividing token counts by batch size. However, line 103 contains trailing whitespace.

Apply this diff to remove trailing whitespace:

 avg_seq_length_unpadded = num_unpadded_tokens / batch_size
 avg_seq_length_padded = num_tokens / batch_size
-   
 
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9215f90 and 2ad6917.

📒 Files selected for processing (2)
  • bionemo-recipes/recipes/esm2_native_te/perf_logger.py (3 hunks)
  • ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (2)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (2)

64-65: LGTM! New sequence length metrics properly declared.

The new metrics for tracking average sequence lengths (padded and unpadded) are correctly declared using MeanMetric() and follow the existing naming convention.


114-116: LGTM! Metrics are correctly updated.

The new sequence length metrics are properly updated following the same pattern as other metrics in this method.

task_cmd: train_fsdp2
parallelism_strategy: fsdp2
thd_enabled: false
micro_batch_size: 8
@coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify whether explicit micro_batch_size settings are necessary.

These four product configurations now explicitly set micro_batch_size: 8, which matches the shared default at line 64. Depending on how the configuration framework merges defaults, these explicit settings may be redundant.

Also applies to: 97-97, 105-105, 113-113


🏁 Script executed:

#!/bin/bash
# Description: Check if other recipe configs in the codebase use explicit micro_batch_size 
# in products when it matches the default, to understand the pattern.

# Search for product configurations with micro_batch_size
rg -n -A2 -B2 'products:' ci/lepton/model_convergence/configs/recipes/ | rg -C3 'micro_batch_size'

Length of output: 0


I need to examine the YAML file structure and understand how configuration defaults are handled. Let me run more targeted searches.


🏁 Script executed:

#!/bin/bash
# First, let's check if the file exists and examine its structure around the key lines
echo "=== Checking YAML file structure ==="
if [ -f "ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml" ]; then
  echo "File found. Examining lines 60-120:"
  sed -n '60,120p' ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml
else
  echo "File not found at expected path. Searching for similar files:"
  fd -e yaml "esm2_native_te_15b" 
fi

Length of output: 2144


🏁 Script executed:

#!/bin/bash
# Search for other recipe yaml files to understand the pattern
echo "=== Finding all recipe YAML files ==="
fd -e yaml . ci/lepton/model_convergence/configs/recipes/ 2>/dev/null | head -20

Length of output: 453


🏁 Script executed:

#!/bin/bash
# Look for configuration merging or default handling logic
echo "=== Searching for config merging logic ==="
rg -n "merge|default|micro_batch_size" ci/lepton/model_convergence/ --type yaml | head -30

Length of output: 2697


I need to understand how this configuration framework handles defaults and overrides. Let me search for the configuration loading code.


🏁 Script executed:

#!/bin/bash
# Search for Python files that load or process these YAML configurations
echo "=== Searching for config loading/processing code ==="
rg -n "micro_batch_size|products" ci/lepton/model_convergence/ --type py -B2 -A2 | head -50

Length of output: 52


🏁 Script executed:

#!/bin/bash
# Look for Hydra configuration or configuration merging logic
echo "=== Searching for config framework setup ==="
rg -n "hydra|OmegaConf|load_config|merge" ci/lepton/ --type py | head -30

Length of output: 1650


🏁 Script executed:

#!/bin/bash
# Check if there's documentation or comments explaining product config behavior
echo "=== Checking for config documentation ==="
grep -r "product" ci/lepton/model_convergence/configs/ --include="*.yaml" --include="*.md" -B1 -A1 | head -40

Length of output: 3765


These explicit micro_batch_size settings are indeed redundant and can be safely removed.

The configuration framework uses a simple dict merge where product values override base. Since all four products (lines 89, 97, 105, 113) explicitly set micro_batch_size: 8, which matches the global default at line 64, these settings have no functional effect—removing them would preserve identical behavior. The fifth product (line 123) with micro_batch_size: 4 correctly differs and should remain.

While similar patterns exist in other recipe configs (suggesting possible consistency convention), the explicit values serve no functional purpose given how the merge logic operates.
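The merge semantics the review relies on can be illustrated with a minimal Python sketch. This assumes the behavior described above — a flat dict update where product keys override base keys; the key names besides micro_batch_size are illustrative, not taken from the actual config:

```python
# Shared defaults for the recipe (micro_batch_size: 8 mirrors line 64 of the YAML).
base = {"micro_batch_size": 8, "parallelism_strategy": "fsdp2"}

# A product block that explicitly repeats the default value...
product_explicit = {"thd_enabled": False, "micro_batch_size": 8}
# ...and one that omits it and inherits the default.
product_implicit = {"thd_enabled": False}

# Simple dict merge: product values override base values.
merged_explicit = {**base, **product_explicit}
merged_implicit = {**base, **product_implicit}

print(merged_explicit == merged_implicit)  # → True: the explicit entry is redundant
```

A product that sets a *different* value (e.g. micro_batch_size: 4, as the fifth product does) would change the merged result and therefore must stay.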

🤖 Prompt for AI Agents
ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml lines 89-89:
the explicit micro_batch_size: 8 in these product blocks is redundant because
the global default already sets micro_batch_size: 8 and product values only
override base, so remove the explicit micro_batch_size: 8 entries from the four
product definitions (lines 89, 97, 105, 113) while leaving the differing
micro_batch_size: 4 at line 123 intact; ensure indentation and YAML structure
remain valid after removal so the merge behavior and semantics are unchanged.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>