Fix runaway Etpt in straggler detector by resetting FLOPs accumulator #2119

sbhavani · 2025-11-04T00:31:19Z

This PR fixes the same issue as #1755; it is a copy opened so that CI can run against it. The original PR was previously in "Final Review".

Problem

The straggler detector's Etpt (estimated throughput) metric grew linearly with iteration count, reaching unrealistic values such as 147,000+ TF/s (implying an impossible MFU of 147,000%+).

Root Cause

In post_training_step_callbacks(), the code set the FLOPs counter to 0.0, but the reset was local to the function and was never propagated back to the main training loop. The caller kept using the still-growing accumulator.
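The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the Megatron-LM code: `buggy_callback`, `run_buggy`, and the constants are hypothetical stand-ins, and the per-step FLOPs, step time, and log interval are assumed values chosen only to make the arithmetic easy to follow.

```python
# Hypothetical, simplified stand-ins; not the actual Megatron-LM functions.
FLOPS_PER_STEP = 1.0e15  # assumed FLOPs performed per training step
STEP_SECONDS = 1.0       # assumed wall-clock time per step
LOG_INTERVAL = 10        # assumed number of steps between reports


def buggy_callback(flops_acc, iteration, reported):
    """Report throughput, then reset the accumulator only locally."""
    if iteration % LOG_INTERVAL == 0:
        interval_seconds = LOG_INTERVAL * STEP_SECONDS
        reported.append(flops_acc / interval_seconds / 1e12)  # TF/s
        # Rebinds the local name only; the caller's variable is unchanged.
        flops_acc = 0.0


def run_buggy(num_steps):
    """Training-loop stand-in that never sees the callback's reset."""
    reported = []
    flops_acc = 0.0
    for iteration in range(1, num_steps + 1):
        flops_acc += FLOPS_PER_STEP
        buggy_callback(flops_acc, iteration, reported)
    return reported


buggy_tflops = run_buggy(30)
print(buggy_tflops)
```

Under these assumptions the true rate is constant, yet each report divides the cumulative FLOPs by a single interval's duration, so the reported value grows by the same amount at every report (about 1000, 2000, then 3000 TF/s here).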

Solution

- Modified post_training_step_callbacks() to return the updated FLOPs counter
- Updated the call site to capture and use the returned (reset) counter
- Added clear comments explaining the reset behavior

This ensures the straggler detector gets accurate per-interval FLOPs measurements instead of cumulative values.
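The return-and-capture pattern from the Solution list can be sketched as follows. As before, this is an illustration under assumptions rather than the PR's diff: `fixed_callback`, `run_fixed`, and the constants are hypothetical names and values, and the real post_training_step_callbacks() takes other arguments that are omitted here.

```python
# Hypothetical, simplified stand-ins; not the actual Megatron-LM functions.
FLOPS_PER_STEP = 1.0e15  # assumed FLOPs performed per training step
STEP_SECONDS = 1.0       # assumed wall-clock time per step
LOG_INTERVAL = 10        # assumed number of steps between reports


def fixed_callback(flops_acc, iteration, reported):
    """Report throughput and return the (possibly reset) accumulator."""
    if iteration % LOG_INTERVAL == 0:
        interval_seconds = LOG_INTERVAL * STEP_SECONDS
        reported.append(flops_acc / interval_seconds / 1e12)  # TF/s
        flops_acc = 0.0  # reset so the next report covers one interval only
    return flops_acc


def run_fixed(num_steps):
    """Training-loop stand-in that captures the returned counter."""
    reported = []
    flops_acc = 0.0
    for iteration in range(1, num_steps + 1):
        flops_acc += FLOPS_PER_STEP
        # Capture the return value so the reset reaches the loop's variable.
        flops_acc = fixed_callback(flops_acc, iteration, reported)
    return reported


fixed_tflops = run_fixed(30)
print(fixed_tflops)
```

With the returned value captured, every report covers exactly one interval, so the reported rate stays flat (about 1000 TF/s per report under these assumed constants) instead of growing with the iteration count.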

Co-authored-by: Li Ruixiao <[email protected]>

The straggler detector's Etpt (estimated throughput) metric was growing linearly with iteration count, reaching unrealistic values like 147,000+ TF/s.

Root Cause: In post_training_step_callbacks(), the code was setting the FLOPs counter to 0.0 but this reset was local to the function and not persisted back to the main training loop. The caller continued using the growing accumulator.

Solution:
- Modified post_training_step_callbacks() to return the updated FLOPs counter
- Updated the call site to capture and use the returned (reset) counter
- Added clear comments explaining the reset behavior

This ensures the straggler detector gets accurate per-interval FLOPs measurements instead of cumulative values.

Co-authored-by: Li Ruixiao <[email protected]>
Signed-off-by: Santosh Bhavani <[email protected]>

copy-pr-bot · 2025-11-04T00:31:23Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yueshen2016 and others added 30 commits October 21, 2025 06:35

ADLR/megatron-lm!4169 - [OMNIML-2921] GPT-OSS Modelopt support

a2d8c80

Co-authored-by: Chen-Han Yu <[email protected]>

ADLR/megatron-lm!4298 - ci: Refactor testsytem - Removal of JET Artifacts

adc69db

build: Upgrade jet-client

5814a00

Signed-off-by: oliver könig <[email protected]>

build: Upgrade JET

bacc707

Signed-off-by: oliver könig <[email protected]>

ADLR/megatron-lm!4308 - build: Bump TE

754dfa2

ADLR/megatron-lm!4286 - Add cpu offloading interface

3106714

ADLR/megatron-lm!4170 - chore: delete utils_object_storage.py

edcbc17

ADLR/megatron-lm!4316 - ci: Temporarily block external contributions

9cb2518

ADLR/megatron-lm!4272 - Track and cleanup NSys NVTX context

c0a595d

ADLR/megatron-lm!3955 - Megatron-FSDP Expert Parallel (DeepSeek-v3) Support

268fda0

Co-authored-by: jianbinc <[email protected]> Co-authored-by: xuwenc <[email protected]>

ci: Add copyright checker for GitHub CI

a3a1f06

Signed-off-by: oliver könig <[email protected]>

ci: Fix copyright checker (NVIDIA#1889)

f82223f

Signed-off-by: oliver könig <[email protected]>

ci: Run on dev

6c57be9

Signed-off-by: oliver könig <[email protected]>

ci: Fix linter

e7106d2

Signed-off-by: oliver könig <[email protected]>

ci: Fix copyright checker (NVIDIA#1893)

4ddd50d

Signed-off-by: oliver könig <[email protected]>

ci: Linting on main

2a01637

Signed-off-by: oliver könig <[email protected]>

ci(fix): HAS_RUN_TESTS_LABEL

218b0e0

Signed-off-by: oliver könig <[email protected]>

ci: Fix linting

e0b3d5b

Signed-off-by: oliver könig <[email protected]>

ci(fix): Do not run linting on push

3364dba

Signed-off-by: oliver könig <[email protected]>

chore: Add codeowners (NVIDIA#1897)

8325951

Signed-off-by: oliver könig <[email protected]>

chore: Update codeowners

2e38079

Signed-off-by: oliver könig <[email protected]>

ci(fix): No copyright on push

4d14c57

Signed-off-by: oliver könig <[email protected]>

ci: Extend queue-manager for dev branch (NVIDIA#1906)

a350a6e

Signed-off-by: oliver könig <[email protected]>

ci: Move test optimizer into its own bucket (NVIDIA#1909)

adf4247

Signed-off-by: oliver könig <[email protected]>

ci: Use matrix for approval-bot

1edc4d6

Signed-off-by: oliver könig <[email protected]>

ci: Update function name

04e640b

Signed-off-by: oliver könig <[email protected]>

ci: Adjust approval-bot for copy-pr-bot

c7f154f

Signed-off-by: oliver könig <[email protected]>

ci: Parametrize workflow

019084e

Signed-off-by: oliver könig <[email protected]>

ci: Parametrize workflow

aff784e

Signed-off-by: oliver könig <[email protected]>

ko3n1g and others added 24 commits October 31, 2025 14:11

ci(hotfix): Remove performance for ckpt-resume

818e072

Signed-off-by: oliver könig <[email protected]>

Allow inference test throughput to vary by 10% (NVIDIA#2070)

f248fcb

ci(hotfix): Inference test pipeline

e715d2f

Signed-off-by: oliver könig <[email protected]>

chore: Fix autoformatter (NVIDIA#2073)

aad8761

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): Remove iteration-time from t5

e3ae351

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): disable inference test

87cbe76

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): Disable inference test

d0d00b3

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082)

88e3a8a

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): Enable merge-group for approval bot

53305bc

Signed-off-by: oliver könig <[email protected]>

chore: Update local tooling (NVIDIA#2066)

7c16ca0

Signed-off-by: oliver könig <[email protected]>

Add extra RL files (NVIDIA#2077)

dc7a0ca

Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: oliver könig <[email protected]>

Prevent summary jobs from running in forks (NVIDIA#2083)

5cfad7b

Co-authored-by: oliver könig <[email protected]>

ci: Fix test scope (NVIDIA#2091)

ba21b69

Signed-off-by: oliver könig <[email protected]>

ci(hotfix): Remove publish workflows

7ca2890

Signed-off-by: oliver könig <[email protected]>

Refactor the attention metadata into separate classes (NVIDIA#2001)

a652e2c

Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: oliver könig <[email protected]>

Guard against incorrectly using MoE prefill graphs (NVIDIA#2030)

65cd27c

Co-authored-by: oliver könig <[email protected]>

Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)"

d3f1af4

This reverts commit a652e2c.

Run mr-slim tests in lightweight-mode (NVIDIA#2106)

5671e3a

Signed-off-by: Charlie Truong <[email protected]>

Inference | Lazy compile UVM allocator. (NVIDIA#1977)

7487c53

Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]>

chore: Reenable trustees (NVIDIA#2108)

1307f87

Signed-off-by: oliver könig <[email protected]>

Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)"

282b74c

This reverts commit 7487c53.

ci(fix): Changeset of copyright checker (NVIDIA#2110)

2cab46f

Signed-off-by: oliver könig <[email protected]>

Ko3n1g/chore/update release settings (NVIDIA#2097)

d4194b7

Signed-off-by: oliver könig <[email protected]>

sbhavani requested review from a team as code owners November 4, 2025 00:31

sbhavani added the bug and Expert Review labels Nov 4, 2025
