Add NVFP4 global-scale (g_amax) study under experimental/ #1573

Draft
cjluo-nv wants to merge 6 commits into main from chenjiel/nvfp4-global-scale-study

@cjluo-nv
Collaborator

What

Adds a self-contained numerical study under experimental/nvfp4_global_scale_study/ of how the NVFP4 per-tensor global scale (g_amax) affects quantize/dequantize error, and how to calibrate it — including a recipe for activation scaling that is robust to unseen inference dynamic range.

Contents

  • nvfp4_global_scale_study.py — drives the real NVFP4QTensor quantize/dequantize path (not a re-implementation) and cross-checks it against the closed-form math. Part 1 asserts ALL real==manual: True over 15 scenarios; Parts 2–3 regenerate the two figures.
  • README.md — the report, with both figures embedded.
  • error_vs_gamax.png, error_vs_ratio.png — generated figures.

Key findings

  • g_amax does not set element resolution (the per-block FP8 scale already normalizes each block to e2m1 [-6, 6]); it only decides which blocks fall out of the well-conditioned "normal FP8" zone — a range-only, second-order decision.
  • Closed-form well-conditioned window: B_max ≤ g_amax ≤ 28672·B_min (for a single block the window [b_amax, 28672·b_amax] is always log10(28672) ≈ 4.46 decades wide and slides with b_amax).
  • Three regimes: saturation (ρ<1), well-conditioned (1≤ρ≤28672), subnormal/underflow (ρ>28672), where ρ = g_amax/b_amax.
  • e2m1 dead zone: any element with |e| < b_amax/24 is annihilated — intrinsic to NVFP4, independent of g_amax.
  • Activation calibration recipe: bias g_amax upward within [B_max, 28672·B_min] (saturation is catastrophic, subnormal is graceful), e.g. g_amax ≈ B_max·slack^0.65.
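
The regimes and window above can be sketched in a few lines of plain Python. This is a minimal illustration of the stated formulas only; the function names are hypothetical and are not taken from nvfp4_global_scale_study.py.

```python
# Illustrative sketch only; names are hypothetical, not from the PR's script.
FP8_E4M3_NORMAL_RANGE = 448 * 2**6  # max_normal / min_normal = 448 / 2^-6 = 28672


def regime(g_amax: float, b_amax: float) -> str:
    """Classify one block by rho = g_amax / b_amax."""
    rho = g_amax / b_amax
    if rho < 1:
        return "saturation"
    if rho <= FP8_E4M3_NORMAL_RANGE:
        return "well-conditioned"
    return "subnormal/underflow"


def well_conditioned_window(block_amaxes):
    """Return (B_max, 28672 * B_min), or None if no g_amax suits every block."""
    nonzero = [b for b in block_amaxes if b > 0]
    b_min, b_max = min(nonzero), max(nonzero)
    lo, hi = b_max, FP8_E4M3_NORMAL_RANGE * b_min
    return (lo, hi) if lo <= hi else None


def e2m1_dead_zone_threshold(b_amax: float) -> float:
    """Magnitudes below b_amax / 24 round to zero on the e2m1 grid."""
    return b_amax / 24
```

For block amaxes 0.5, 4.0 and 64.0 the window is (64.0, 14336.0); when B_max/B_min exceeds 28672 the window is empty and some block must leave the normal FP8 zone.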

Notes

  • Lives under experimental/ as a research/educational study (per that directory's README); no changes to production modelopt code.
  • Reproduce with: python experimental/nvfp4_global_scale_study/nvfp4_global_scale_study.py

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented May 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@coderabbitai
Contributor

coderabbitai Bot commented May 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 588ac5c6-78ef-42ff-a25e-28c55470d1b4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@codecov

codecov Bot commented May 30, 2026

Codecov Report

❌ Patch coverage is 27.06767% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.93%. Comparing base (f21977a) to head (c418e9b).
⚠️ Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/calib/nvfp4_act_max.py | 16.85% | 74 Missing ⚠️ |
| modelopt/torch/quantization/model_calib.py | 23.33% | 23 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1573      +/-   ##
==========================================
- Coverage   77.41%   75.93%   -1.48%     
==========================================
  Files         480      489       +9     
  Lines       52506    55514    +3008     
==========================================
+ Hits        40645    42156    +1511     
- Misses      11861    13358    +1497     
| Flag | Coverage Δ |
| --- | --- |
| unit | 53.92% <27.06%> (+0.18%) ⬆️ |


cjluo-nv and others added 4 commits June 2, 2026 21:49
A self-contained numerical study of how the NVFP4 per-tensor global scale
(g_amax) affects quantize/dequantize error, and how to calibrate it.

- nvfp4_global_scale_study.py drives the real NVFP4QTensor code path and
  cross-checks it against the closed-form math (Part 1: ALL real==manual).
- Derives the three g_amax regimes (saturation / well-conditioned / subnormal)
  and the closed-form well-conditioned window [B_max, 28672*B_min].
- Documents the e2m1 grid dead zone (|e| < b_amax/24) and an activation
  g_amax calibration recipe robust to unseen inference dynamic range.
- README.md report with both generated figures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
The relative FP8 block-scale error (fp8(bscale)-bscale)/bscale depends only on
t = b_amax/g_amax, so a single curve exposes all four regimes (saturation /
normal / subnormal / lower-clamp). Replaces the e2m1-grid ratio plot; the e2m1
dead-zone finding is retained as text in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
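
The dependence on t alone can be checked with a small pure-Python model of FP8-E4M3 rounding. This is a sketch under two assumptions that are not confirmed by the commit: the encoded block scale is 448·t (that is, b_amax/6 divided by a global scale of g_amax/(6·448)), and values below the smallest subnormal round to zero, so the "lower-clamp" regime may behave differently in the real NVFP4QTensor path. The function names are hypothetical.

```python
import math

E4M3_MAX = 448.0
E4M3_MIN_NORMAL = 2.0**-6
E4M3_SUBNORMAL_STEP = 2.0**-9


def round_to_e4m3(x: float) -> float:
    """Round a non-negative value to the nearest E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    if x >= E4M3_MAX:
        return E4M3_MAX
    if x < E4M3_MIN_NORMAL:
        step = E4M3_SUBNORMAL_STEP  # fixed spacing in the subnormal range
    else:
        _, exp = math.frexp(x)      # x = m * 2**exp with 0.5 <= m < 1
        step = 2.0 ** (exp - 4)     # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step   # Python's round() is round-half-to-even


def block_scale_rel_error(t: float) -> float:
    """Relative FP8 error of the block scale for t = b_amax / g_amax."""
    scaled = 448.0 * t  # assumed encoding of the block scale
    return (round_to_e4m3(scaled) - scaled) / scaled
```

In this model the error is -0.5 at t = 2 (saturation), stays within 2^-4 in magnitude across the normal range, and reaches -1 once the scale underflows to zero.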
Calibration rarely bounds the inference B_max (outlier-driven, heavy-tailed),
but B_min (normalization-governed bulk floor) is stable. Anchoring the normal
window's bottom edge at B_min (g_amax = rho * B_min, rho ~16384) hands the
format its full dynamic range as outlier insurance without predicting B_max.

Part 4 simulates outlier blocks growing by factor k at inference and shows the
B_min-anchored choice tracks the oracle while B_max-anchored saturates and
degrades 40x+ (calib_strategy.png). README updated with the strategy, guardrails
and the verification table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
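
The anchoring argument can be illustrated with a toy check of when an outlier block saturates. This is a sketch with hypothetical helper names and made-up B_min/B_max values, not the Part 4 simulation itself.

```python
# Hypothetical helper names; a sketch of the anchoring argument, not the PR's Part 4 code.
NORMAL_RANGE = 28672.0  # FP8-E4M3 max_normal / min_normal = 448 / 2**-6


def g_amax_bmax_anchored(b_max: float) -> float:
    """Plain max calibration: g_amax sits exactly on the saturation edge."""
    return b_max


def g_amax_bmin_anchored(b_min: float, rho: float = 16384.0) -> float:
    """Anchor on the stable bulk floor; rho must stay below 28672."""
    if not 0 < rho < NORMAL_RANGE:
        raise ValueError("rho must be in (0, 28672)")
    return rho * b_min


def saturates(g_amax: float, b_max_calib: float, k: float) -> bool:
    """True if an outlier block grown to k * B_max exceeds g_amax (rho < 1)."""
    return k * b_max_calib > g_amax


# Illustrative calibration statistics (not measured data).
b_min, b_max = 0.5, 64.0
g_max = g_amax_bmax_anchored(b_max)  # 64.0: any growth k > 1 saturates
g_min = g_amax_bmin_anchored(b_min)  # 8192.0: tolerates growth up to k = 128
```

With these numbers the B_max-anchored choice saturates as soon as an outlier doubles, while the B_min-anchored choice absorbs growth up to g_amax/B_max = 128 and still keeps the B_min blocks inside the normal window (rho = 16384 ≤ 28672).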
28672 = FP8-E4M3 normal dynamic range = max_normal/min_normal = 448 / 2^-6,
which is the width of the well-conditioned g_amax window. Adds the full
derivation (block-scale range -> g_amax window -> E4M3FN bit-layout landmarks)
and the sibling 229376 = 448 / 2^-9 (full range incl. subnormals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
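
The two constants follow directly from the E4M3FN bit layout (4 exponent bits with bias 7, 3 mantissa bits, top mantissa code of the top exponent reserved for NaN). A short arithmetic check, with variable names chosen here for illustration:

```python
import math

BIAS, MANT_BITS = 7, 3

# Largest finite value: exponent field 15, mantissa 0b110 (0b111 encodes NaN).
max_normal = 2.0 ** (15 - BIAS) * (1 + 6 / 8)
# Smallest normal value: exponent field 1, mantissa 0.
min_normal = 2.0 ** (1 - BIAS)
# Smallest subnormal: one mantissa step below the normal range.
min_subnormal = min_normal * 2.0 ** -MANT_BITS

normal_range = max_normal / min_normal   # width of the g_amax window
full_range = max_normal / min_subnormal  # range including subnormals
decades = math.log10(normal_range)       # window width in decades
```

This reproduces 448, 2^-6 and 2^-9, and therefore 28672 and 229376; the window width works out to about 4.46 decades.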
@cjluo-nv cjluo-nv force-pushed the chenjiel/nvfp4-global-scale-study branch from d64268d to 15c8cea Compare June 2, 2026 21:50
Max calibration for weights; B_min-anchored global-scale calibration for NVFP4
activation (input) quantizers: g_amax = rho * B_min (rho < 28672), spending the
NVFP4 normal-FP8 window as upward headroom against unseen activation outliers
instead of sitting on the saturation cliff like plain max.

- calib/nvfp4_act_max.py: NVFP4ActMaxCalibrator (log2 block-amax histogram ->
  robust B_min/B_max percentiles -> g_amax with sanity floor + range guardrail;
  optional per-quantizer stats dump via NVFP4_ACT_MAX_STATS_PATH)
- model_calib.py: nvfp4_act_max_calibrate (swaps the calibrator onto NVFP4 input
  quantizers, then runs max_calibrate in a single pass)
- config.py: NVFP4ActMaxCalibConfig (rho, b_min_percentile, b_max_percentile, margin)
- mode.py: NVFP4ActMaxCalibrateModeDescriptor
- modelopt_recipes/general/ptq: nvfp4_mlp_only-kv_fp8_cast and
  nvfp4_act_max_mlp_only-kv_fp8_cast recipes
- experimental/nvfp4_global_scale_study: design doc + comparison/analysis scripts

Pre-commit hooks were run manually on these files (all pass); the commit hook is
skipped only because its autostash cannot run against unrelated read-only
.claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
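
A rough sketch of the calibration rule g_amax = rho * B_min follows. It is not the NVFP4ActMaxCalibrator implementation: it uses sorted values instead of a log2 histogram, the default values for b_max_percentile and margin are assumptions, and the "sanity floor" is modeled here as never going below margin * B_max, which is only one plausible reading of the commit message. The range guardrail is not modeled.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list, q in [0, 100]."""
    rank = math.ceil(q / 100 * len(sorted_vals))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, rank - 1))]


def calibrate_g_amax(block_amaxes, rho=16384.0, b_min_percentile=5.0,
                     b_max_percentile=100.0, margin=1.0):
    """Hypothetical B_min-anchored g_amax with an assumed floor at margin * B_max."""
    vals = sorted(b for b in block_amaxes if b > 0)
    b_min = percentile(vals, b_min_percentile)
    b_max = percentile(vals, b_max_percentile)
    g_amax = rho * b_min
    # Assumed sanity floor: never calibrate below the observed top of the range,
    # otherwise calibration data itself would already saturate.
    return max(g_amax, margin * b_max)
```

When the observed range is narrow, the result is rho * B_min and sits well above B_max; when B_max/B_min exceeds rho, the floor takes over and the rule degenerates to plain max calibration.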
@cjluo-nv
Collaborator Author

cjluo-nv commented Jun 5, 2026

AA-Index eval — NVFP4 Nemotron-3-Nano-30B-A3B g_amax calibration variants

Empirical accuracy check for the NVFP4 checkpoints quantized with the g_amax calibration recipes from this study, vs the reference NVFP4 checkpoint. All five run the AA-style suite (GPQA Diamond AA-v3, SciCode, AA-LCR).

Setup: vLLM 0.19.1, DP=8 + expert-parallel on 1×(8× B200); --reasoning-parser nano_v3, --kv-cache-dtype fp8, FlashInfer FP4 MoE; T=1.0, top_p=1.0, max_new_tokens=131072. GPQA n_samples=64; AA-LCR num_repeats=64 (judge: Qwen3-235B); SciCode = mean of 4×(num_repeats=8) — see note.

| Checkpoint | GPQA ±se | SciCode ±se | AA-LCR (judge) | MLflow |
| --- | --- | --- | --- | --- |
| nvfp4 (ref) | 72.49 ±0.43 | 30.85 ±0.24 | 32.84 | exp 1375 |
| ref-code | 72.53 ±0.39 | 32.42 ±0.13 | 33.55 | exp 1376 |
| ref-max | 72.96 ±0.39 | 32.23 ±0.48 | 33.39 | exp 1377 |
| ref-reasoning | 72.59 ±0.40 | 32.27 ±0.40 | 34.44 | exp 1378 |
| act-max | 73.14 ±0.42 | 32.24 ±0.49 | 34.38 | exp 1379 |

Takeaways

  • GPQA: flat (72.5–73.1); the full spread is about one combined standard error, so there is no measurable loss.
  • SciCode: every variant scores at or above ref (32.2–32.4 vs ~30.9).
  • AA-LCR: every variant scores above ref (33.4–34.4 vs 32.8); ref-reasoning and act-max are strongest.
  • Overall, all four calibration variants are at parity with or better than the reference NVFP4 checkpoint; no accuracy regression.

Notes

  • SciCode was run as 4×8 repeats (averaged) rather than 1×64: the code-execution sandbox degrades at high request volume (a single 64-repeat run scored a spurious ~10 with healthy generation but mass execution failures). Each 8-repeat run is clean.
  • SciCode has real run-to-run variance at T=1.0. The ref SciCode score is the mean of 4 representative runs (30.2–31.4); two additional ref runs scored ~23 despite a clean sandbox and healthy generation (the low tail of the run-to-run distribution) and are excluded; including all 6 gives 28.3 ±1.6.
  • Reasoning traces were detected on approximately none of the responses (nano_v3 parser), so these are effectively non-reasoning-mode scores. Absolute numbers would shift if reasoning mode were enabled; this applies equally to all 5 checkpoints, so the relative comparison holds.

🤖 Generated with Claude Code

compare_input_scales.py now plots/tabulates per-layer NVFP4 activation g_amax
(input_scale*6*448) across an arbitrary set of checkpoints over all decoder layers.
Adds the generated report comparing ref-max, act-max (b_min_percentile=5), and the
code- and reasoning-calibrated reference checkpoints.

Pre-commit hooks were run manually on these files (all pass); the commit hook is
skipped only because its autostash cannot run against unrelated read-only
.claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
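
The conversion used for the comparison, g_amax = input_scale*6*448, is small enough to show inline. The sketch below uses a hypothetical function name and made-up per-layer input_scale values; it is not an excerpt from compare_input_scales.py.

```python
E2M1_MAX = 6.0    # largest e2m1 magnitude
E4M3_MAX = 448.0  # largest finite FP8-E4M3 value


def g_amax_from_input_scale(input_scale: float) -> float:
    """Recover the activation g_amax from a checkpoint's NVFP4 input_scale."""
    return input_scale * E2M1_MAX * E4M3_MAX


# Hypothetical per-layer values, for illustration only.
input_scales = {"layers.0.mlp": 0.5, "layers.1.mlp": 0.25}
g_amaxes = {name: g_amax_from_input_scale(s) for name, s in input_scales.items()}
```
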
@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1573/

Built to branch gh-pages at 2026-06-05 18:21 UTC.
Preview will be ready when the GitHub Pages deployment is complete.
