Skip to content

Fix FLOPs inconsistency and add using_sparse_model flag#21743

Open
Devesh-Maheshwari wants to merge 10 commits into
Lightning-AI:masterfrom
Devesh-Maheshwari:fix-flop-inconsistency
Open

Fix FLOPs inconsistency and add using_sparse_model flag#21743
Devesh-Maheshwari wants to merge 10 commits into
Lightning-AI:masterfrom
Devesh-Maheshwari:fix-flop-inconsistency

Conversation

@Devesh-Maheshwari
Copy link
Copy Markdown

What does this PR do?

Fixes #21677.

The FLOPs values in _CUDA_FLOPS for Hopper GPUs were inconsistent: h100 sxm and h100 pcie used dense per-GPU peaks, while h100 nvl and h200 sxm1 used the headline "with-sparsity" values from NVIDIA's datasheet. This caused a 2× difference in MFU when switching between GPU types.

Changes

Data fix (standardize Hopper on dense):

  • h100 nvl — replaced sparse values (which were duplicates of h100 sxm's sparse numbers) with the correct H100 NVL dense per-GPU peaks: 30 / 60 / 417.5 / 835.5 / 835.5 / 1670.5 TFLOPS.
  • h200 sxm1 — tensor-core entries now stored as dense halves (494.5 / 989.5 / 989.5 / 1979 TFLOPS) instead of sparse headline values.
  • h100 sxm — rounded entries to exact datasheet ÷ 2 (sub-1% drift only).
  • Sources verified against H100 and H200 datasheets.

API addition (per agreed design in issue thread):

  • Throughput.__init__ now accepts using_sparse_model: Optional[bool] = None and sparse_cuda_acceleration_factor: float = 2.0.
  • None (default): no scaling, emits rank_zero_warn so users explicitly choose. True: scales available_flops by the factor. False: dense default, no warning.
  • Forwarded automatically through ThroughputMonitor via **kwargs.

Tests

Four new tests in tests/tests_fabric/utilities/test_throughput.py:

  • test_throughput_sparse_model_scaling — constructor scales correctly for True / False / None-peak / invalid factor.
  • test_throughput_sparse_model_warning — warning fires only when unset + peak known; silenced by explicit True/False or available_flops=None.
  • test_throughput_sparse_model_mfu — end-to-end: MFU halves when sparse flag enabled with default 2× factor.
  • test_throughput_monitor_sparse_modelThroughputMonitor forwards both kwargs into Throughput.

All existing tests in the file still pass (37 passing + 5 pre-existing xfails unrelated to this change).

Screenshot 2026-05-26 at 12 59 06 PM

@Devesh-Maheshwari Devesh-Maheshwari changed the title git push --set-upstream origin fix-flop-inconsistency Fix Hopper FLOPs inconsistency and add using_sparse_model flag May 26, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87%. Comparing base (932b7e3) to head (19b8e9c).
⚠️ Report is 4 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21743   +/-   ##
=======================================
  Coverage      87%      87%           
=======================================
  Files         270      270           
  Lines       23973    23978    +5     
=======================================
+ Hits        20748    20753    +5     
  Misses       3225     3225           

Copy link
Copy Markdown
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Devesh-Maheshwari, thanks for the great work. Added couple of reviews, please have a look. Thanks :)

Comment thread src/lightning/fabric/utilities/throughput.py Outdated
Comment thread src/lightning/fabric/utilities/throughput.py Outdated
Comment thread src/lightning/fabric/utilities/throughput.py Outdated
Comment thread src/lightning/fabric/utilities/throughput.py Outdated
Comment thread src/lightning/fabric/utilities/throughput.py Outdated
Co-authored-by: Deependu <deependujha21@gmail.com>
Copy link
Copy Markdown
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the great work. :)

@Devesh-Maheshwari Devesh-Maheshwari changed the title Fix Hopper FLOPs inconsistency and add using_sparse_model flag Fix FLOPs inconsistency and add using_sparse_model flag May 27, 2026
Copy link
Copy Markdown
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update changelog file too.

@deependujha
Copy link
Copy Markdown
Collaborator

Thanks @Devesh-Maheshwari, LGTM. :)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses inconsistent GPU peak FLOPs values used for MFU calculations (notably for Hopper GPUs) and introduces an explicit API to opt into sparsity-accelerated peak FLOPs when computing MFU in Fabric’s throughput utilities.

Changes:

  • Standardizes Hopper entries in _CUDA_FLOPS to use dense peak FLOPs (and normalizes FLOPs constants to consistent scientific notation).
  • Adds using_sparse_model: Optional[bool] and sparse_cuda_acceleration_factor to Throughput.__init__, with optional scaling of available_flops when sparsity is enabled.
  • Adds tests for sparse scaling/warnings/MFU behavior and updates the Fabric changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/lightning/fabric/utilities/throughput.py Adds sparsity-aware FLOPs scaling/warning to Throughput and corrects/normalizes _CUDA_FLOPS entries (especially Hopper).
tests/tests_fabric/utilities/test_throughput.py Adds tests covering sparse scaling, warning behavior, MFU impact, and ThroughputMonitor kwarg forwarding.
src/lightning/fabric/CHANGELOG.md Documents the new Throughput parameters and _CUDA_FLOPS changes (needs wording adjustment).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/tests_fabric/utilities/test_throughput.py
Comment thread src/lightning/fabric/utilities/throughput.py
Comment thread src/lightning/fabric/CHANGELOG.md Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FLOPs numbers are inconsistent

4 participants