Add NVFP4 global-scale (g_amax) study under experimental/ by cjluo-nv · Pull Request #1573 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-05-30T21:39:50Z

What

Adds a self-contained numerical study under experimental/nvfp4_global_scale_study/ of how the NVFP4 per-tensor global scale (g_amax) affects quantize/dequantize error, and how to calibrate it — including a recipe for activation scaling that is robust to unseen inference dynamic range.

nvfp4_global_scale_study.py — drives the real NVFP4QTensor quantize/dequantize path (not a re-implementation) and cross-checks it against the closed-form math. Part 1 asserts ALL real==manual: True over 15 scenarios; Parts 2–3 regenerate the two figures.
README.md — the report, with both figures embedded.
error_vs_gamax.png, error_vs_ratio.png — generated figures.

Key findings

g_amax does not set element resolution (the per-block FP8 scale already normalizes each block to e2m1 [-6, 6]); it only decides which blocks fall out of the well-conditioned "normal FP8" zone — a range-only, second-order decision.
Closed-form well-conditioned window: B_max ≤ g_amax ≤ 28672·B_min (width always ≈4.46 decades, slides with b_amax).
Three regimes: saturation (ρ<1), well-conditioned (1≤ρ≤28672), subnormal/underflow (ρ>28672), where ρ = g_amax/b_amax.
e2m1 dead zone: any element with |e| < b_amax/24 is annihilated — intrinsic to NVFP4, independent of g_amax.
Activation calibration recipe: bias g_amax upward within [B_max, 28672·B_min] (saturation is catastrophic, subnormal is graceful), e.g. g_amax ≈ B_max·slack^0.65.
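
The three regimes and the window above can be sketched in a few lines of Python. This is a minimal illustration, not the code in nvfp4_global_scale_study.py; the function names `classify_regime` and `well_conditioned_window` are hypothetical, and only the thresholds (ρ = g_amax/b_amax, 1 and 28672) come from the findings.

```python
FP8_NORMAL_RANGE = 448.0 / 2.0**-6  # 28672, the window width quoted above


def classify_regime(g_amax: float, b_amax: float) -> str:
    """Classify one block by rho = g_amax / b_amax (hypothetical helper)."""
    rho = g_amax / b_amax
    if rho < 1.0:
        return "saturation"
    if rho <= FP8_NORMAL_RANGE:
        return "well-conditioned"
    return "subnormal/underflow"


def well_conditioned_window(block_amaxes):
    """Return (B_max, 28672 * B_min), or None if no g_amax fits every block."""
    positive = [b for b in block_amaxes if b > 0]
    b_min, b_max = min(positive), max(positive)
    low, high = b_max, FP8_NORMAL_RANGE * b_min
    return (low, high) if low <= high else None


print(classify_regime(g_amax=0.5, b_amax=1.0))      # saturation
print(well_conditioned_window([0.5, 1.0, 10.0]))    # (10.0, 14336.0)
```

When the block amaxes span more than a factor of 28672, the window is empty and some blocks must leave the well-conditioned zone whatever g_amax is chosen.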

Notes

Lives under experimental/ as a research/educational study (per that directory's README); no changes to production modelopt code.
Reproduce with: python experimental/nvfp4_global_scale_study/nvfp4_global_scale_study.py

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-30T21:39:54Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-30T21:39:57Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 588ac5c6-78ef-42ff-a25e-28c55470d1b4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


codecov · 2026-05-30T22:00:36Z

Codecov Report

❌ Patch coverage is 27.06767% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.93%. Comparing base (f21977a) to head (c418e9b).
⚠️ Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/torch/quantization/calib/nvfp4_act_max.py | 16.85% | 74 Missing ⚠️ |
| modelopt/torch/quantization/model_calib.py | 23.33% | 23 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1573      +/-   ##
==========================================
- Coverage   77.41%   75.93%   -1.48%     
==========================================
  Files         480      489       +9     
  Lines       52506    55514    +3008     
==========================================
+ Hits        40645    42156    +1511     
- Misses      11861    13358    +1497

| Flag | Coverage Δ | |
|---|---|---|
| unit | `53.92% <27.06%> (+0.18%)` | ⬆️ |


A self-contained numerical study of how the NVFP4 per-tensor global scale (g_amax) affects quantize/dequantize error, and how to calibrate it.

- nvfp4_global_scale_study.py drives the real NVFP4QTensor code path and cross-checks it against the closed-form math (Part 1: ALL real==manual).
- Derives the three g_amax regimes (saturation / well-conditioned / subnormal) and the closed-form well-conditioned window [B_max, 28672*B_min].
- Documents the e2m1 grid dead zone (|e| < b_amax/24) and an activation g_amax calibration recipe robust to unseen inference dynamic range.
- README.md report with both generated figures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
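
The b_amax/24 dead zone follows from the e2m1 grid itself: the smallest nonzero magnitude is 0.5, so anything below the midpoint 0.25 rounds to zero, and 0.25 on a grid scaled by b_amax/6 is b_amax/24. The sketch below assumes an ideal (unquantized) block scale of b_amax/6 and breaks exact ties toward the smaller grid value; both are simplifications, and `quantize_e2m1` is a hypothetical helper, not the NVFP4QTensor path.

```python
import math

# Non-negative magnitudes representable in e2m1 (FP4).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float, b_amax: float) -> float:
    """Round x to the nearest e2m1 value after scaling b_amax to 6.

    Assumes an ideal block scale b_amax / 6 (no FP8 rounding of the scale).
    """
    scale = b_amax / 6.0
    scaled = min(abs(x) / scale, 6.0)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
    return math.copysign(nearest * scale, x)


b_amax = 6.0                      # dead-zone threshold is 6 / 24 = 0.25
print(quantize_e2m1(0.24, b_amax))  # 0.0 (annihilated)
print(quantize_e2m1(0.26, b_amax))  # 0.5
```

Nothing in this computation involves g_amax, which is why the dead zone is described as intrinsic to the format.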

The relative FP8 block-scale error (fp8(bscale)-bscale)/bscale depends only on t = b_amax/g_amax, so a single curve exposes all four regimes (saturation / normal / subnormal / lower-clamp). Replaces the e2m1-grid ratio plot; the e2m1 dead-zone finding is retained as text in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
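
A rough way to reproduce that curve without the study script is to emulate E4M3 rounding in plain Python. The sketch assumes the stored block scale equals 448·t, which follows if the global scale is g_amax/(6·448), consistent with the input_scale relation quoted elsewhere in this PR. It also models the lower-clamp regime as round-to-zero, which may differ from what the real code path does. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0
E4M3_MIN_NORMAL = 2.0**-6
E4M3_MIN_SUBNORMAL = 2.0**-9


def round_to_e4m3(x: float) -> float:
    """Round a non-negative value to the nearest E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    if x >= E4M3_MAX:
        return E4M3_MAX
    if x < E4M3_MIN_NORMAL:
        step = E4M3_MIN_SUBNORMAL          # fixed spacing in the subnormal range
    else:
        _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= m < 1
        step = 2.0 ** (e - 4)              # 3 mantissa bits per binade
    return min(round(x / step) * step, E4M3_MAX)


def block_scale_rel_error(t: float) -> float:
    """(fp8(bscale) - bscale) / bscale for t = b_amax / g_amax (assumed model)."""
    ideal = E4M3_MAX * t
    return (round_to_e4m3(ideal) - ideal) / ideal


def regime(t: float) -> str:
    stored = E4M3_MAX * t
    if t > 1.0:
        return "saturation"
    if stored >= E4M3_MIN_NORMAL:
        return "normal"
    return "subnormal" if round_to_e4m3(stored) > 0.0 else "lower-clamp"


for t in (2.0, 0.3, 2.0**-16, 2.0**-20):
    print(f"t={t:.3e}  {regime(t):12s} rel_err={block_scale_rel_error(t):+.4f}")
```

In the normal regime the relative error stays within half an ulp, i.e. at most 1/16 in magnitude; in saturation it is 1/t - 1 and grows without bound as t increases.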

Calibration rarely bounds the inference B_max (outlier-driven, heavy-tailed), but B_min (normalization-governed bulk floor) is stable. Anchoring the normal window's bottom edge at B_min (g_amax = rho * B_min, rho ~16384) hands the format its full dynamic range as outlier insurance without predicting B_max. Part 4 simulates outlier blocks growing by factor k at inference and shows the B_min-anchored choice tracks the oracle while B_max-anchored saturates and degrades 40x+ (calib_strategy.png). README updated with the strategy, guardrails and the verification table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

28672 = FP8-E4M3 normal dynamic range = max_normal/min_normal = 448 / 2^-6, which is the width of the well-conditioned g_amax window. Adds the full derivation (block-scale range -> g_amax window -> E4M3FN bit-layout landmarks) and the sibling 229376 = 448 / 2^-9 (full range incl. subnormals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
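
The two constants can be checked directly from the E4M3FN bit layout (4 exponent bits with bias 7, 3 mantissa bits, top mantissa code at the top exponent reserved for NaN). The snippet is only an arithmetic check; the variable names are illustrative.

```python
import math

BIAS = 7
MANTISSA_BITS = 3

# Largest finite value: exponent field 15, mantissa 110b (111b is NaN in E4M3FN).
max_normal = (1 + 6 / 8) * 2.0 ** (15 - BIAS)
# Smallest normal value: exponent field 1, mantissa 000b.
min_normal = 2.0 ** (1 - BIAS)
# Smallest subnormal: one mantissa step below the normal range.
min_subnormal = min_normal * 2.0**-MANTISSA_BITS

print(max_normal)                      # 448.0
print(max_normal / min_normal)         # 28672.0
print(max_normal / min_subnormal)      # 229376.0
print(round(math.log10(max_normal / min_normal), 2))  # 4.46
```

The last line recovers the "≈4.46 decades" window width quoted in the key findings.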

Max calibration for weights; B_min-anchored global-scale calibration for NVFP4 activation (input) quantizers: g_amax = rho * B_min (rho < 28672), spending the NVFP4 normal-FP8 window as upward headroom against unseen activation outliers instead of sitting on the saturation cliff like plain max.

- calib/nvfp4_act_max.py: NVFP4ActMaxCalibrator (log2 block-amax histogram -> robust B_min/B_max percentiles -> g_amax with sanity floor + range guardrail; optional per-quantizer stats dump via NVFP4_ACT_MAX_STATS_PATH)
- model_calib.py: nvfp4_act_max_calibrate (swaps the calibrator onto NVFP4 input quantizers, then runs max_calibrate in a single pass)
- config.py: NVFP4ActMaxCalibConfig (rho, b_min_percentile, b_max_percentile, margin)
- mode.py: NVFP4ActMaxCalibrateModeDescriptor
- modelopt_recipes/general/ptq: nvfp4_mlp_only-kv_fp8_cast and nvfp4_act_max_mlp_only-kv_fp8_cast recipes
- experimental/nvfp4_global_scale_study: design doc + comparison/analysis scripts

Pre-commit hooks were run manually on these files (all pass); the commit hook is skipped only because its autostash cannot run against unrelated read-only .claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
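
The calibrator pipeline described in this commit (log2 histogram, robust percentiles, g_amax = rho * B_min) might look roughly like the pure-Python sketch below. It is not the NVFP4ActMaxCalibrator implementation. The function names, the bin width, the b_max_percentile and margin defaults, and the exact form of the sanity floor (`margin * B_max`) and guardrail (rejecting rho ≥ 28672) are all assumptions made for illustration; only rho ≈ 16384 and b_min_percentile = 5 appear in this PR.

```python
import math

FP8_NORMAL_RANGE = 28672.0  # 448 / 2**-6


def log2_histogram(block_amaxes, bins_per_octave=4):
    """Count positive block amaxes in bins of width 1/bins_per_octave in log2."""
    hist = {}
    for b in block_amaxes:
        if b > 0:
            k = math.floor(math.log2(b) * bins_per_octave)
            hist[k] = hist.get(k, 0) + 1
    return hist


def histogram_percentile(hist, p, bins_per_octave=4):
    """Lower edge of the first bin whose cumulative count reaches p percent."""
    target = p / 100.0 * sum(hist.values())
    seen = 0
    for k in sorted(hist):
        seen += hist[k]
        if seen >= target:
            return 2.0 ** (k / bins_per_octave)
    return 2.0 ** (max(hist) / bins_per_octave)


def bmin_anchored_g_amax(block_amaxes, rho=16384.0, b_min_percentile=5.0,
                         b_max_percentile=99.9, margin=1.0):
    """Return (g_amax, B_min, B_max) with g_amax = rho * B_min, floored (assumed)."""
    if not 0.0 < rho < FP8_NORMAL_RANGE:
        raise ValueError("rho must lie in (0, 28672)")  # assumed guardrail
    hist = log2_histogram(block_amaxes)
    if not hist:
        raise ValueError("no positive block amax observed")
    b_min = histogram_percentile(hist, b_min_percentile)
    b_max = histogram_percentile(hist, b_max_percentile)
    # Assumed sanity floor: never go below the calibrated B_max (times margin).
    g_amax = max(rho * b_min, margin * b_max)
    return g_amax, b_min, b_max


amaxes = [2.0**-4] * 10 + [1.0] * 89 + [64.0]
print(bmin_anchored_g_amax(amaxes))  # (1024.0, 0.0625, 64.0)
```

In the example, plain max calibration would give g_amax = 64, right at the saturation edge, while the B_min-anchored value of 1024 leaves a factor of 16 of headroom above the calibrated B_max.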

cjluo-nv · 2026-06-05T18:08:38Z

AA-Index eval — NVFP4 Nemotron-3-Nano-30B-A3B `g_amax` calibration variants

Empirical accuracy check for the NVFP4 checkpoints quantized with the g_amax calibration recipes from this study, compared against the reference NVFP4 checkpoint. All five checkpoints were run on the same AA-style suite (GPQA Diamond AA-v3, SciCode, AA-LCR).

Setup: vLLM 0.19.1, DP=8 + expert-parallel on 1×(8× B200); --reasoning-parser nano_v3, --kv-cache-dtype fp8, FlashInfer FP4 MoE; T=1.0, top_p=1.0, max_new_tokens=131072. GPQA n_samples=64; AA-LCR num_repeats=64 (judge: Qwen3-235B); SciCode = mean of 4×(num_repeats=8) — see note.

| Checkpoint | GPQA ±se | SciCode ±se | AA-LCR (judge) | MLflow |
|---|---|---|---|---|
| nvfp4 (ref) | 72.49 ±0.43 | 30.85 ±0.24 | 32.84 | exp 1375 |
| ref-code | 72.53 ±0.39 | 32.42 ±0.13 | 33.55 | exp 1376 |
| ref-max | 72.96 ±0.39 | 32.23 ±0.48 | 33.39 | exp 1377 |
| ref-reasoning | 72.59 ±0.40 | 32.27 ±0.40 | 34.44 | exp 1378 |
| act-max | 73.14 ±0.42 | 32.24 ±0.49 | 34.38 | exp 1379 |

Takeaways

GPQA: flat (72.5–73.1; the largest gap, act-max vs ref at +0.65, is about one combined standard error), so no measurable loss.
SciCode: variants ≥ ref (32.2–32.4 vs ~30.9).
AA-LCR: variants > ref (33.4–34.4 vs 32.8); ref-reasoning / act-max strongest.
→ All four calibration variants are at parity with or better than the reference NVFP4 checkpoint; no accuracy regression.
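
One way to read the table is to express each variant's gap to the reference in units of the combined standard error. The sketch below does this for GPQA and SciCode using the numbers reported above; it assumes the two standard errors are independent, and the SciCode reference value is the 4-run mean, which excludes two low runs.

```python
import math

REF = "nvfp4 (ref)"

# (mean, standard error) copied from the results table.
GPQA = {
    "nvfp4 (ref)": (72.49, 0.43), "ref-code": (72.53, 0.39),
    "ref-max": (72.96, 0.39), "ref-reasoning": (72.59, 0.40),
    "act-max": (73.14, 0.42),
}
SCICODE = {
    "nvfp4 (ref)": (30.85, 0.24), "ref-code": (32.42, 0.13),
    "ref-max": (32.23, 0.48), "ref-reasoning": (32.27, 0.40),
    "act-max": (32.24, 0.49),
}


def z_vs_ref(scores, name):
    """Gap to the reference divided by the combined standard error."""
    mean, se = scores[name]
    ref_mean, ref_se = scores[REF]
    return (mean - ref_mean) / math.sqrt(se**2 + ref_se**2)


for name in GPQA:
    if name != REF:
        print(f"{name:14s} GPQA z={z_vs_ref(GPQA, name):+.2f}  "
              f"SciCode z={z_vs_ref(SCICODE, name):+.2f}")
```

Under these assumptions every GPQA gap is below two combined standard errors, while the SciCode gaps are above two; the SciCode reading is sensitive to which reference runs are included.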

Notes

SciCode was run as 4×8 repeats (averaged) rather than 1×64: the code-execution sandbox degrades at high request volume (a single 64-repeat run scored a spurious ~10 with healthy generation but mass execution failures). Each 8-repeat run is clean.
SciCode has real run-to-run variance at T=1.0. The ref SciCode score is the mean of 4 representative runs (30.2–31.4); two additional ref runs scored ~23 (clean sandbox, healthy generation; the low tail of the run-to-run distribution) and are excluded. Including all 6 runs gives 28.3 ±1.6.
Reasoning traces were detected in essentially none of the responses (nano_v3 parser), so these are effectively non-reasoning-mode scores. Absolute numbers would shift if reasoning mode were enabled; this applies to all 5 checkpoints, so the relative comparison holds.

🤖 Generated with Claude Code

compare_input_scales.py now plots/tabulates per-layer NVFP4 activation g_amax (input_scale*6*448) across an arbitrary set of checkpoints over all decoder layers. Adds the generated report comparing ref-max, act-max (b_min_percentile=5), and the code- and reasoning-calibrated reference checkpoints.

Pre-commit hooks were run manually on these files (all pass); the commit hook is skipped only because its autostash cannot run against unrelated read-only .claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
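
The conversion used in that comparison, g_amax = input_scale*6*448, is small enough to show inline. The helper names are hypothetical and this is not code from compare_input_scales.py; 6 is the e2m1 maximum and 448 the E4M3 maximum.

```python
E2M1_MAX = 6.0    # largest e2m1 magnitude
E4M3_MAX = 448.0  # largest finite E4M3 value


def g_amax_from_input_scale(input_scale: float) -> float:
    """Recover the activation g_amax from a checkpoint's input_scale."""
    return input_scale * E2M1_MAX * E4M3_MAX


def input_scale_from_g_amax(g_amax: float) -> float:
    """Inverse mapping: the input_scale that corresponds to a given g_amax."""
    return g_amax / (E2M1_MAX * E4M3_MAX)


print(g_amax_from_input_scale(1.0))     # 2688.0
print(input_scale_from_g_amax(2688.0))  # 1.0
```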

github-actions · 2026-06-05T18:21:37Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1573/
Built to branch `gh-pages` at 2026-06-05 18:21 UTC. Preview will be ready when the GitHub Pages deployment is complete.

cjluo-nv and others added 4 commits June 2, 2026 21:49

cjluo-nv force-pushed the chenjiel/nvfp4-global-scale-study branch from d64268d to 15c8cea Compare June 2, 2026 21:50

cjluo-nv wants to merge 6 commits into main from chenjiel/nvfp4-global-scale-study
