
Jwilber/debug tflops esm2 #1267

Open

jwilber wants to merge 8 commits into main from jwilber/debug-tflops-esm2

Conversation

@jwilber (Collaborator) commented Oct 21, 2025

Debug tflops issue by logging seq length

Summary by CodeRabbit

  • New Features

    • Added performance metrics to monitor average sequence lengths (padded and unpadded) during training runs.
  • Chores

    • Standardized micro-batch size configuration across multiple model profiles for consistent performance tuning.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
@copy-pr-bot bot commented Oct 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Oct 21, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.
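For reference, the setting named in the note above would look like this in the repository's CodeRabbit configuration file (a sketch based solely on the reviews.review_status key mentioned; other keys and exact file layout are not shown in this thread):

```yaml
# .coderabbit.yaml — suppress the "Review skipped" status message
reviews:
  review_status: false
```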

Walkthrough

Two changes enhance performance monitoring and configuration consistency: added sequence length metrics (padded and unpadded) to performance logging, and standardized micro_batch_size configuration across four model product variants in the ESM2 native TE 15B configuration.

Changes

Cohort / File(s) — Summary

Performance metrics tracking
bionemo-recipes/recipes/esm2_native_te/perf_logger.py
Added two new metrics (train/avg_sequence_length_unpadded and train/avg_sequence_length_padded) to the performance logger. Updated the log_step method to compute and record average sequence lengths per batch alongside the existing timing and throughput metrics.

Configuration standardization
ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml
Added micro_batch_size: 8 to four product configurations (TE bshd perf FSDP2, TE thd perf FSDP2, TE bshd perf MFSDP, TE thd perf MFSDP) to align per-product micro-batch sizing with the shared default.
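The metric arithmetic described above can be sketched in plain Python. This is a hypothetical helper, not the actual perf_logger.py code — the real logger accumulates these values with MeanMetric objects — but the computation it performs per batch is the same: divide token counts by batch size.

```python
def avg_sequence_lengths(attention_mask):
    """Compute per-batch average sequence lengths.

    attention_mask: list of per-sample 0/1 lists, all padded to equal length.
    Returns (avg_unpadded, avg_padded), matching the two new metrics
    train/avg_sequence_length_unpadded and train/avg_sequence_length_padded.
    """
    batch_size = len(attention_mask)
    # Padded token count: every slot in the batch, including padding.
    num_tokens = sum(len(row) for row in attention_mask)
    # Unpadded token count: only positions where the mask is 1.
    num_unpadded_tokens = sum(sum(row) for row in attention_mask)
    return (
        num_unpadded_tokens / batch_size,
        num_tokens / batch_size,
    )


mask = [
    [1, 1, 1, 0],  # real length 3, padded to 4
    [1, 1, 0, 0],  # real length 2, padded to 4
]
print(avg_sequence_lengths(mask))  # → (2.5, 4.0)
```

A large gap between the two averages indicates heavy padding, which inflates padded-token throughput and can explain a misleading TFLOPS number — the motivation this PR states for logging both.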

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through logs so bright,
Counting sequences left and right,
Padded, unpadded, batch by batch,
Now configurations match and match!
Performance metrics, neat and clean,
The finest training yet we've seen! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Description Check — ⚠️ Warning
Explanation: The pull request description is significantly incomplete relative to the required template. The author provided only a single sentence ("Debug tflops issue by logging seq length") without any of the required sections: a detailed description, usage examples with code snippets, a type-of-change classification, CI pipeline configuration labels, or pre-submit checklist confirmations. While the brief description touches on the PR's purpose, it fails to provide the structured information and verification steps that the template mandates.
Resolution: Complete the pull request description by following the template structure. Add a detailed description explaining why sequence length logging was added, provide a usage example showing how the new metrics are accessed, select the appropriate type of change (likely "New feature"), configure any required CI labels, and confirm all pre-submit checklist items. This ensures reviewers have the context and information needed to properly evaluate the changes.

Title Check — ❓ Inconclusive
Explanation: The PR title "Jwilber/debug tflops esm2" relates to the pull request's objective of debugging a TFLOPS issue, but it uses vague, non-descriptive terminology that doesn't clearly communicate the actual changes. The title includes a branch-name prefix ("Jwilber/"), which is atypical for PR titles, and "debug tflops" lacks specificity about what implementation details changed (e.g., adding sequence length metrics or configuration updates). While the title references a real aspect of the changeset, it doesn't give a teammate scanning the history enough information to quickly understand what was modified.
✅ Passed checks (1 passed)

Docstring Coverage — ✅ Passed
Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)

99-103: Sequence length calculations are correct; minor formatting issue.

The logic correctly computes average sequence lengths by dividing token counts by batch size. However, line 103 contains trailing whitespace.

Apply this diff to remove trailing whitespace:

 avg_seq_length_unpadded = num_unpadded_tokens / batch_size
 avg_seq_length_padded = num_tokens / batch_size
-   
 
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9215f90 and 2ad6917.

📒 Files selected for processing (2)
  • bionemo-recipes/recipes/esm2_native_te/perf_logger.py (3 hunks)
  • ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (2)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (2)

64-65: LGTM! New sequence length metrics properly declared.

The new metrics for tracking average sequence lengths (padded and unpadded) are correctly declared using MeanMetric() and follow the existing naming convention.


114-116: LGTM! Metrics are correctly updated.

The new sequence length metrics are properly updated following the same pattern as other metrics in this method.

task_cmd: train_fsdp2
parallelism_strategy: fsdp2
thd_enabled: false
micro_batch_size: 8
@coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify whether explicit micro_batch_size settings are necessary.

These four product configurations now explicitly set micro_batch_size: 8, which matches the shared default at line 64. Depending on how the configuration framework merges defaults, these explicit settings may be redundant.

Also applies to: 97-97, 105-105, 113-113


🏁 Script executed:

#!/bin/bash
# Description: Check if other recipe configs in the codebase use explicit micro_batch_size 
# in products when it matches the default, to understand the pattern.

# Search for product configurations with micro_batch_size
rg -n -A2 -B2 'products:' ci/lepton/model_convergence/configs/recipes/ | rg -C3 'micro_batch_size'

Length of output: 0


I need to examine the YAML file structure and understand how configuration defaults are handled. Let me run more targeted searches.


🏁 Script executed:

#!/bin/bash
# First, let's check if the file exists and examine its structure around the key lines
echo "=== Checking YAML file structure ==="
if [ -f "ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml" ]; then
  echo "File found. Examining lines 60-120:"
  sed -n '60,120p' ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml
else
  echo "File not found at expected path. Searching for similar files:"
  fd -e yaml "esm2_native_te_15b" 
fi

Length of output: 2144


🏁 Script executed:

#!/bin/bash
# Search for other recipe yaml files to understand the pattern
echo "=== Finding all recipe YAML files ==="
fd -e yaml . ci/lepton/model_convergence/configs/recipes/ 2>/dev/null | head -20

Length of output: 453


🏁 Script executed:

#!/bin/bash
# Look for configuration merging or default handling logic
echo "=== Searching for config merging logic ==="
rg -n "merge|default|micro_batch_size" ci/lepton/model_convergence/ --type yaml | head -30

Length of output: 2697


I need to understand how this configuration framework handles defaults and overrides. Let me search for the configuration loading code.


🏁 Script executed:

#!/bin/bash
# Search for Python files that load or process these YAML configurations
echo "=== Searching for config loading/processing code ==="
rg -n "micro_batch_size|products" ci/lepton/model_convergence/ --type py -B2 -A2 | head -50

Length of output: 52


🏁 Script executed:

#!/bin/bash
# Look for Hydra configuration or configuration merging logic
echo "=== Searching for config framework setup ==="
rg -n "hydra|OmegaConf|load_config|merge" ci/lepton/ --type py | head -30

Length of output: 1650


🏁 Script executed:

#!/bin/bash
# Check if there's documentation or comments explaining product config behavior
echo "=== Checking for config documentation ==="
grep -r "product" ci/lepton/model_convergence/configs/ --include="*.yaml" --include="*.md" -B1 -A1 | head -40

Length of output: 3765


These explicit micro_batch_size settings are indeed redundant and can be safely removed.

The configuration framework uses a simple dict merge where product values override base. Since all four products (lines 89, 97, 105, 113) explicitly set micro_batch_size: 8, which matches the global default at line 64, these settings have no functional effect—removing them would preserve identical behavior. The fifth product (line 123) with micro_batch_size: 4 correctly differs and should remain.

While similar patterns exist in other recipe configs (suggesting possible consistency convention), the explicit values serve no functional purpose given how the merge logic operates.
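The merge semantics the review relies on can be illustrated with a minimal Python sketch. This assumes the behavior described above — a flat dict update where product keys override base keys; the key names besides micro_batch_size are illustrative, not taken from the actual config:

```python
# Shared defaults for the recipe (micro_batch_size: 8 mirrors line 64 of the YAML).
base = {"micro_batch_size": 8, "parallelism_strategy": "fsdp2"}

# A product block that explicitly repeats the default value...
product_explicit = {"thd_enabled": False, "micro_batch_size": 8}
# ...and one that omits it and inherits the default.
product_implicit = {"thd_enabled": False}

# Simple dict merge: product values override base values.
merged_explicit = {**base, **product_explicit}
merged_implicit = {**base, **product_implicit}

print(merged_explicit == merged_implicit)  # → True: the explicit entry is redundant
```

A product that sets a *different* value (e.g. micro_batch_size: 4, as the fifth product does) would change the merged result and therefore must stay.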

🤖 Prompt for AI Agents
ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml lines 89-89:
the explicit micro_batch_size: 8 in these product blocks is redundant because
the global default already sets micro_batch_size: 8 and product values only
override base, so remove the explicit micro_batch_size: 8 entries from the four
product definitions (lines 89, 97, 105, 113) while leaving the differing
micro_batch_size: 4 at line 123 intact; ensure indentation and YAML structure
remain valid after removal so the merge behavior and semantics are unchanged.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>