@sbhavani (Contributor) commented Nov 4, 2025

This PR fixes the same issue as #1755. It is a copy of that PR, opened so that CI can run against it. The original was previously in "Final Review".

Problem

The straggler detector's Etpt (estimated throughput) metric grew linearly with iteration count, reaching unrealistic values such as 147,000+ TF/s (implying an impossible 147,000%+ MFU).
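To illustrate why a counter that is never reset produces a linearly growing value, here is a minimal sketch. It assumes a fixed, made-up FLOP count per iteration and a fixed step time. `estimated_tflops`, `FLOPS_PER_ITER`, and `STEP_TIME_S` are hypothetical names chosen for illustration; this is not the detector's actual code or its exact formula.

```python
# Hypothetical numbers, for illustration only.
FLOPS_PER_ITER = 2.0e12   # FLOPs performed in one training iteration
STEP_TIME_S = 1.0         # wall-clock seconds per iteration


def estimated_tflops(flops: float, elapsed_s: float) -> float:
    """Throughput in TFLOP/s for a given FLOP count and elapsed time."""
    return flops / elapsed_s / 1e12


cumulative = 0.0
buggy, correct = [], []
for _ in range(4):
    cumulative += FLOPS_PER_ITER
    # Bug pattern: cumulative FLOPs divided by a single interval's time.
    buggy.append(estimated_tflops(cumulative, STEP_TIME_S))
    # Intended: only the FLOPs performed during this interval.
    correct.append(estimated_tflops(FLOPS_PER_ITER, STEP_TIME_S))

print(buggy)    # grows with the iteration count
print(correct)  # stays constant
```

With the cumulative counter, the reported figure scales with the number of iterations elapsed, which matches the symptom described above.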

Root Cause

In post_training_step_callbacks(), the code set the FLOPs counter to 0.0, but the reset was local to the function and never propagated back to the main training loop. The caller therefore kept using the ever-growing accumulator.
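The pitfall can be reproduced in a few lines. This is a simplified sketch, not the actual function: `post_step_buggy` and `total_flops` are hypothetical stand-ins. It relies only on the fact that assigning to a parameter inside a Python function rebinds the local name and leaves the caller's variable unchanged.

```python
def post_step_buggy(num_flops: float) -> None:
    # Rebinding the parameter only changes the local name;
    # the caller's variable is untouched.
    num_flops = 0.0


total_flops = 0.0
for _ in range(3):
    total_flops += 10.0           # FLOPs accumulated during this step
    post_step_buggy(total_flops)  # the "reset" is lost on return

print(total_flops)  # 30.0
```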

Solution

  • Modified post_training_step_callbacks() to return the updated FLOPs counter
  • Updated the call site to capture and use the returned (reset) counter
  • Added clear comments explaining the reset behavior

This ensures the straggler detector gets accurate per-interval FLOPs measurements instead of cumulative values.
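The pattern of the fix, returning the reset counter and reassigning it at the call site, can be sketched as follows. The function name, signature, and the `reported` list are hypothetical simplifications; the real post_training_step_callbacks() takes other arguments and its reset condition is not shown here.

```python
def post_step_fixed(num_flops: float, reported: list) -> float:
    """Hand the interval's FLOPs to a consumer, then return the reset counter."""
    reported.append(num_flops)  # stand-in for the straggler detector report
    return 0.0                  # the caller must capture this value


reported = []
counter = 0.0
for _ in range(3):
    counter += 10.0                               # FLOPs for this step
    counter = post_step_fixed(counter, reported)  # capture the reset

print(reported)  # [10.0, 10.0, 10.0]
print(counter)   # 0.0
```

Returning the value keeps the function free of shared mutable state, at the cost of requiring every call site to reassign the result.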

Co-authored-by: Li Ruixiao [email protected]

yueshen2016 and others added 30 commits October 21, 2025 06:35
Signed-off-by: oliver könig <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Siddharth Singh <[email protected]>
ko3n1g and others added 24 commits October 31, 2025 14:11
Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Li Ruixiao <[email protected]>
Signed-off-by: Santosh Bhavani <[email protected]>
@sbhavani requested review from a team as code owners November 4, 2025 00:31
copy-pr-bot bot commented Nov 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sbhavani added the "bug" (Something isn't working) and "Expert Review" (Apply this label to indicate that your PR is ready for expert review.) labels Nov 4, 2025