Conversation
Signed-off-by: Jared Wilber <jwilber@nvidia.com>
Walkthrough: Two changes enhance performance monitoring and configuration consistency: sequence length metrics (padded and unpadded) were added to performance logging, and the micro_batch_size configuration was standardized across four model product variants in the ESM2 native TE 15B configuration.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)
99-103: Sequence length calculations are correct; minor formatting issue. The logic correctly computes average sequence lengths by dividing token counts by batch size. However, line 103 contains trailing whitespace.

Apply this diff to remove trailing whitespace:

```diff
 avg_seq_length_unpadded = num_unpadded_tokens / batch_size
 avg_seq_length_padded = num_tokens / batch_size
-
```
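The quantity being logged can be illustrated with a small standalone sketch. The function name and arguments here are hypothetical; the recipe derives the same averages from its own token counts and batch size:

```python
def avg_seq_lengths(batch_token_counts, padded_len):
    """Average padded and unpadded sequence lengths for one batch.

    `batch_token_counts` holds the number of real (non-pad) tokens per
    sequence; every sequence in the batch is padded to `padded_len`.
    """
    batch_size = len(batch_token_counts)
    num_tokens = batch_size * padded_len           # includes padding
    num_unpadded_tokens = sum(batch_token_counts)  # real tokens only
    avg_seq_length_padded = num_tokens / batch_size
    avg_seq_length_unpadded = num_unpadded_tokens / batch_size
    return avg_seq_length_padded, avg_seq_length_unpadded

# Batch of 2 sequences padded to length 4, holding 3 and 2 real tokens.
padded, unpadded = avg_seq_lengths([3, 2], padded_len=4)
print(padded, unpadded)  # 4.0 2.5
```

The gap between the two values is what makes this useful for debugging TFLOPS numbers: a large padded/unpadded ratio means much of the measured compute is spent on padding.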
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- bionemo-recipes/recipes/esm2_native_te/perf_logger.py (3 hunks)
- ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (2)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (2)
64-65: LGTM! New sequence length metrics properly declared. The new metrics for tracking average sequence lengths (padded and unpadded) are correctly declared using `MeanMetric()` and follow the existing naming convention.
114-116: LGTM! Metrics are correctly updated. The new sequence length metrics are properly updated following the same pattern as other metrics in this method.
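The declare-then-update pattern the review approves can be sketched with a simplified pure-Python stand-in. `RunningMean` below is a hypothetical minimal analogue of torchmetrics' `MeanMetric`, not the recipe's actual class:

```python
class RunningMean:
    """Minimal running-mean accumulator: update() once per step,
    compute() at log time, reset() between logging intervals."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def compute(self) -> float:
        return self.total / self.count

    def reset(self) -> None:
        self.total, self.count = 0.0, 0

# Usage mirroring the perf logger: accumulate per-step averages,
# then report the mean over the logging window.
metric = RunningMean()
for step_value in (4.0, 2.5, 3.5):
    metric.update(step_value)
window_avg = metric.compute()  # mean of the three step values
metric.reset()
```

Averaging over a window rather than logging raw per-step values smooths out batch-to-batch variation in sequence length.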
```yaml
task_cmd: train_fsdp2
parallelism_strategy: fsdp2
thd_enabled: false
micro_batch_size: 8
```
🧩 Analysis chain
Verify whether explicit micro_batch_size settings are necessary.
These four product configurations now explicitly set micro_batch_size: 8, which matches the shared default at line 64. Depending on how the configuration framework merges defaults, these explicit settings may be redundant.
Also applies to: 97-97, 105-105, 113-113
🏁 Script executed:
```shell
#!/bin/bash
# Description: Check if other recipe configs in the codebase use explicit micro_batch_size
# in products when it matches the default, to understand the pattern.

# Search for product configurations with micro_batch_size
rg -n -A2 -B2 'products:' ci/lepton/model_convergence/configs/recipes/ | rg -C3 'micro_batch_size'
```

Length of output: 0
I need to examine the YAML file structure and understand how configuration defaults are handled. Let me run more targeted searches.
🏁 Script executed:
```shell
#!/bin/bash
# First, check if the file exists and examine its structure around the key lines
echo "=== Checking YAML file structure ==="
if [ -f "ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml" ]; then
  echo "File found. Examining lines 60-120:"
  sed -n '60,120p' ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml
else
  echo "File not found at expected path. Searching for similar files:"
  fd -e yaml "esm2_native_te_15b"
fi
```

Length of output: 2144
🏁 Script executed:
```shell
#!/bin/bash
# Search for other recipe YAML files to understand the pattern
echo "=== Finding all recipe YAML files ==="
fd -e yaml . ci/lepton/model_convergence/configs/recipes/ 2>/dev/null | head -20
```

Length of output: 453
🏁 Script executed:
```shell
#!/bin/bash
# Look for configuration merging or default handling logic
echo "=== Searching for config merging logic ==="
rg -n "merge|default|micro_batch_size" ci/lepton/model_convergence/ --type yaml | head -30
```

Length of output: 2697
I need to understand how this configuration framework handles defaults and overrides. Let me search for the configuration loading code.
🏁 Script executed:
```shell
#!/bin/bash
# Search for Python files that load or process these YAML configurations
echo "=== Searching for config loading/processing code ==="
rg -n "micro_batch_size|products" ci/lepton/model_convergence/ --type py -B2 -A2 | head -50
```

Length of output: 52
🏁 Script executed:
```shell
#!/bin/bash
# Look for Hydra configuration or configuration merging logic
echo "=== Searching for config framework setup ==="
rg -n "hydra|OmegaConf|load_config|merge" ci/lepton/ --type py | head -30
```

Length of output: 1650
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's documentation or comments explaining product config behavior
echo "=== Checking for config documentation ==="
grep -r "product" ci/lepton/model_convergence/configs/ --include="*.yaml" --include="*.md" -B1 -A1 | head -40
```

Length of output: 3765
These explicit micro_batch_size settings are indeed redundant and can be safely removed.
The configuration framework uses a simple dict merge where product values override base. Since all four products (lines 89, 97, 105, 113) explicitly set micro_batch_size: 8, which matches the global default at line 64, these settings have no functional effect—removing them would preserve identical behavior. The fifth product (line 123) with micro_batch_size: 4 correctly differs and should remain.
While similar patterns exist in other recipe configs (suggesting possible consistency convention), the explicit values serve no functional purpose given how the merge logic operates.
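Why the restated value is a no-op can be seen from a minimal sketch of the dict-merge behavior described above. The `merge` helper here is a hypothetical stand-in for the framework's actual merge logic:

```python
def merge(base_cfg, product_cfg):
    """Product values override the shared base config (simple dict merge)."""
    merged = dict(base_cfg)
    merged.update(product_cfg)  # product keys win on conflict
    return merged

base = {"micro_batch_size": 8, "thd_enabled": False}

# Restating the default changes nothing; a different value is a real override.
redundant = merge(base, {"micro_batch_size": 8})
effective = merge(base, {"micro_batch_size": 4})
print(redundant == base)              # True
print(effective["micro_batch_size"])  # 4
```

Under this semantics, removing the four redundant entries is purely cosmetic, while the `micro_batch_size: 4` product genuinely changes the merged result.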
🤖 Prompt for AI Agents
ci/lepton/model_convergence/configs/recipes/esm2_native_te_15b.yaml, lines 89, 97, 105, 113:
The explicit micro_batch_size: 8 in these product blocks is redundant because the global default already sets micro_batch_size: 8 and product values only override the base. Remove the explicit micro_batch_size: 8 entries from the four product definitions (lines 89, 97, 105, 113) while leaving the differing micro_batch_size: 4 at line 123 intact; ensure indentation and YAML structure remain valid after removal so the merge behavior and semantics are unchanged.
Debug tflops issue by logging seq length
Summary by CodeRabbit
- New Features: sequence length metrics (padded and unpadded) added to performance logging.
- Chores: micro_batch_size configuration standardized across the ESM2 native TE 15B product variants.