Add NVFP4 global-scale (g_amax) study under experimental/ #1573
cjluo-nv wants to merge 6 commits into
Codecov Report
❌ Patch coverage is

| Coverage diff | main | #1573 | +/- |
|---|---|---|---|
| Coverage | 77.41% | 75.93% | -1.48% |
| Files | 480 | 489 | +9 |
| Lines | 52506 | 55514 | +3008 |
| Hits | 40645 | 42156 | +1511 |
| Misses | 11861 | 13358 | +1497 |
A self-contained numerical study of how the NVFP4 per-tensor global scale (g_amax) affects quantize/dequantize error, and how to calibrate it.
- nvfp4_global_scale_study.py drives the real NVFP4QTensor code path and cross-checks it against the closed-form math (Part 1: ALL real==manual).
- Derives the three g_amax regimes (saturation / well-conditioned / subnormal) and the closed-form well-conditioned window [B_max, 28672*B_min].
- Documents the e2m1 grid dead zone (|e| < b_amax/24) and an activation g_amax calibration recipe robust to unseen inference dynamic range.
- README.md report with both generated figures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
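The dead zone mentioned above can be reproduced with a minimal sketch. It is not the study script: `quantize_block_e2m1` is a hypothetical helper, and it assumes an ideal block scale of exactly b_amax / 6 (FP8 rounding of the scale is ignored) so that only the e2m1 grid effect shows.

```python
# Magnitudes representable by one e2m1 (FP4) element.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_e2m1(block):
    """Quantize then dequantize one block on the e2m1 grid.

    Assumption: the block scale is exactly b_amax / 6 (no FP8 rounding).
    """
    b_amax = max(abs(x) for x in block)
    if b_amax == 0.0:
        return [0.0 for _ in block]
    scale = b_amax / 6.0
    out = []
    for x in block:
        mag = abs(x) / scale
        # Nearest grid point; the smallest nonzero one is 0.5, so anything
        # below 0.25 in scaled units (|x| < b_amax / 24) collapses to 0.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(q * scale if x >= 0 else -q * scale)
    return out


block = [24.0, 0.9, 1.1, -0.5]        # b_amax = 24, dead-zone edge = 1.0
print(quantize_block_e2m1(block))     # [24.0, 0.0, 2.0, -0.0]
```

Elements 0.9 and -0.5 sit below b_amax/24 = 1.0 and come back as zero, while 1.1 survives as 2.0. No g_amax appears anywhere in the sketch, which is the sense in which the dead zone is independent of it.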
The relative FP8 block-scale error (fp8(bscale)-bscale)/bscale depends only on t = b_amax/g_amax, so a single curve exposes all four regimes (saturation / normal / subnormal / lower-clamp). Replaces the e2m1-grid ratio plot; the e2m1 dead-zone finding is retained as text in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
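A rough pure-Python sketch of that curve follows. The helpers `round_to_e4m3` and `block_scale_rel_error` are hypothetical and only emulate E4M3FN rounding; they do not call the NVFP4QTensor path. It assumes the stored block scale is (b_amax/6) / (g_amax/(6·448)) = 448·t, and it does not model the lower clamp (very small scales simply round to zero here).

```python
import math

E4M3_MAX = 448.0       # largest finite FP8-E4M3FN value
E4M3_MIN_EXP = -6      # exponent of the smallest normal, 2**-6


def round_to_e4m3(x):
    """Round a non-negative float to the nearest E4M3FN value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    # floor(log2(x)); subnormals share the exponent of the smallest normal.
    exp = max(math.frexp(x)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)            # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def block_scale_rel_error(t):
    """(fp8(bscale) - bscale) / bscale as a function of t = b_amax / g_amax."""
    bscale = E4M3_MAX * t              # assumed form of the stored block scale
    return (round_to_e4m3(bscale) - bscale) / bscale


print(block_scale_rel_error(2.0))      # -0.5: saturation, scale clipped to 448
print(block_scale_rel_error(1 / 64))   # 0.0: bscale = 7 is exactly representable
```

Under these assumptions the error stays within about ±1/16 for t between 1/28672 and 1, is strongly negative for t > 1, and becomes coarse once 448·t drops below 2^-6.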
Calibration rarely bounds the inference B_max (outlier-driven, heavy-tailed), but B_min (normalization-governed bulk floor) is stable. Anchoring the normal window's bottom edge at B_min (g_amax = rho * B_min, rho ~16384) hands the format its full dynamic range as outlier insurance without predicting B_max. Part 4 simulates outlier blocks growing by factor k at inference and shows the B_min-anchored choice tracks the oracle while B_max-anchored saturates and degrades 40x+ (calib_strategy.png). README updated with the strategy, guardrails and the verification table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
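The difference between the two anchors can be illustrated with a toy example. This is not Part 4 of the study: the block amaxes are made-up numbers, `saturated_fraction` is a hypothetical helper, and "saturated" is taken to mean b_amax > g_amax.

```python
def saturated_fraction(block_amaxes, g_amax):
    """Fraction of blocks whose amax exceeds g_amax (block scale would clip)."""
    return sum(b > g_amax for b in block_amaxes) / len(block_amaxes)


# Hypothetical calibration data: a stable bulk plus a few outlier blocks.
calib = [0.01] * 90 + [1.0] * 10
b_min, b_max = min(calib), max(calib)

rho = 16384                       # value quoted in the commit message
g_bmax_anchored = b_max           # plain max calibration
g_bmin_anchored = rho * b_min     # B_min-anchored choice, here 163.84

k = 40                            # outlier blocks grow by k at inference
infer = [0.01] * 90 + [1.0 * k] * 10

print(saturated_fraction(infer, g_bmax_anchored))   # 0.1
print(saturated_fraction(infer, g_bmin_anchored))   # 0.0
```

With rho below 28672 the bulk blocks stay inside the normal FP8 window (163.84 ≤ 28672 · 0.01), so the headroom comes without pushing them into the subnormal regime.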
28672 = FP8-E4M3 normal dynamic range = max_normal/min_normal = 448 / 2^-6, which is the width of the well-conditioned g_amax window. Adds the full derivation (block-scale range -> g_amax window -> E4M3FN bit-layout landmarks) and the sibling 229376 = 448 / 2^-9 (full range incl. subnormals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
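The two constants can be checked with a few lines of arithmetic. The variable names are illustrative; the bit-layout facts (4 exponent bits with bias 7, 3 mantissa bits, top mantissa code reserved for NaN) are those of the E4M3FN encoding.

```python
import math

# FP8-E4M3FN landmarks.
max_normal = 1.75 * 2.0 ** 8      # exponent field 1111, mantissa 110 -> 448
min_normal = 2.0 ** -6            # exponent field 0001, mantissa 000
min_subnormal = 2.0 ** -9         # exponent field 0000, mantissa 001

normal_range = max_normal / min_normal       # width of the normal window
full_range = max_normal / min_subnormal      # including subnormals

assert max_normal == 448.0
assert normal_range == 28672.0
assert full_range == 229376.0

# Width of the normal window in decades.
print(round(math.log10(normal_range), 2))    # 4.46
```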
Force-pushed from d64268d to 15c8cea.
Max calibration for weights; B_min-anchored global-scale calibration for NVFP4 activation (input) quantizers: g_amax = rho * B_min (rho < 28672), spending the NVFP4 normal-FP8 window as upward headroom against unseen activation outliers instead of sitting on the saturation cliff like plain max.
- calib/nvfp4_act_max.py: NVFP4ActMaxCalibrator (log2 block-amax histogram -> robust B_min/B_max percentiles -> g_amax with sanity floor + range guardrail; optional per-quantizer stats dump via NVFP4_ACT_MAX_STATS_PATH)
- model_calib.py: nvfp4_act_max_calibrate (swaps the calibrator onto NVFP4 input quantizers, then runs max_calibrate in a single pass)
- config.py: NVFP4ActMaxCalibConfig (rho, b_min_percentile, b_max_percentile, margin)
- mode.py: NVFP4ActMaxCalibrateModeDescriptor
- modelopt_recipes/general/ptq: nvfp4_mlp_only-kv_fp8_cast and nvfp4_act_max_mlp_only-kv_fp8_cast recipes
- experimental/nvfp4_global_scale_study: design doc + comparison/analysis scripts

Pre-commit hooks were run manually on these files (all pass); the commit hook is skipped only because its autostash cannot run against unrelated read-only .claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
AA-Index eval — NVFP4 Nemotron-3-Nano-30B-A3B
| Checkpoint | GPQA ±se | SciCode ±se | AA-LCR (judge) | MLflow |
|---|---|---|---|---|
| nvfp4 (ref) | 72.49 ±0.43 | 30.85 ±0.24 | 32.84 | exp 1375 |
| ref-code | 72.53 ±0.39 | 32.42 ±0.13 | 33.55 | exp 1376 |
| ref-max | 72.96 ±0.39 | 32.23 ±0.48 | 33.39 | exp 1377 |
| ref-reasoning | 72.59 ±0.40 | 32.27 ±0.40 | 34.44 | exp 1378 |
| act-max | 73.14 ±0.42 | 32.24 ±0.49 | 34.38 | exp 1379 |
Takeaways
- GPQA: flat (72.5–73.1, within ±se), i.e. lossless.
- SciCode: variants ≥ ref (32.2–32.4 vs ~30.9).
- AA-LCR: variants > ref (33.4–34.4 vs 32.8); ref-reasoning and act-max are strongest.
- → All four calibration variants are at parity with or better than the reference NVFP4 checkpoint; no accuracy regression.
Notes
- SciCode was run as 4×8 repeats (averaged) rather than 1×64: the code-execution sandbox degrades at high request volume (a single 64-repeat run scored a spurious ~10 with healthy generation but mass execution failures). Each 8-repeat run is clean.
- SciCode has real run-to-run variance at T=1.0. The ref SciCode score is the mean of 4 representative runs (30.2–31.4); two additional ref runs scored ~23 (clean sandbox, healthy generation; low-variance tail) and are excluded; including all 6 gives 28.3 ±1.6.
- Reasoning traces were detected on ≈0 of responses (nano_v3 parser) → these are effectively non-reasoning-mode scores; absolute numbers would shift if reasoning mode is enabled (consistent across all 5, so the relative comparison holds).
🤖 Generated with Claude Code
compare_input_scales.py now plots/tabulates per-layer NVFP4 activation g_amax (input_scale*6*448) across an arbitrary set of checkpoints over all decoder layers. Adds the generated report comparing ref-max, act-max (b_min_percentile=5), and the code- and reasoning-calibrated reference checkpoints.

Pre-commit hooks were run manually on these files (all pass); the commit hook is skipped only because its autostash cannot run against unrelated read-only .claude/skills changes in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
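The conversion quoted above, g_amax = input_scale*6*448, is simple enough to state as a pair of helpers. This is an illustrative sketch, not code from compare_input_scales.py; the function names are hypothetical.

```python
E2M1_MAX = 6.0      # largest e2m1 (FP4) magnitude
E4M3_MAX = 448.0    # largest finite FP8-E4M3FN value


def g_amax_from_input_scale(input_scale):
    """Recover the activation g_amax from a stored per-tensor input_scale."""
    return input_scale * E2M1_MAX * E4M3_MAX


def input_scale_from_g_amax(g_amax):
    """Inverse mapping: per-tensor scale implied by a chosen g_amax."""
    return g_amax / (E2M1_MAX * E4M3_MAX)


print(g_amax_from_input_scale(0.5))    # 1344.0
```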
What
Adds a self-contained numerical study under experimental/nvfp4_global_scale_study/ of how the NVFP4 per-tensor global scale (g_amax) affects quantize/dequantize error, and how to calibrate it, including a recipe for activation scaling that is robust to unseen inference dynamic range.
Contents
- nvfp4_global_scale_study.py: drives the real NVFP4QTensor quantize/dequantize path (not a re-implementation) and cross-checks it against the closed-form math. Part 1 asserts ALL real==manual: True over 15 scenarios; Parts 2–3 regenerate the two figures.
- README.md: the report, with both figures embedded.
- error_vs_gamax.png, error_vs_ratio.png: generated figures.
Key findings
- g_amax does not set element resolution (the per-block FP8 scale already normalizes each block to e2m1 [-6, 6]); it only decides which blocks fall out of the well-conditioned "normal FP8" zone, a range-only, second-order decision.
- Well-conditioned window: B_max ≤ g_amax ≤ 28672·B_min (width always ≈4.46 decades, slides with b_amax).
- Three regimes: saturation (ρ<1), well-conditioned (1≤ρ≤28672), subnormal/underflow (ρ>28672), where ρ = g_amax/b_amax.
- e2m1 grid dead zone: any element with |e| < b_amax/24 is annihilated; this is intrinsic to NVFP4, independent of g_amax.
- Calibration: bias g_amax upward within [B_max, 28672·B_min] (saturation is catastrophic, subnormal is graceful), e.g. g_amax ≈ B_max·slack^0.65.
Notes
- Lives under experimental/ as a research/educational study (per that directory's README); no changes to production modelopt code.
- Run: python experimental/nvfp4_global_scale_study/nvfp4_global_scale_study.py
🤖 Generated with Claude Code
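The regime boundaries and the window from the key findings can be written down directly. The sketch below is illustrative only: `regime` and `well_conditioned_window` are hypothetical helpers, not functions from the study script, and all-zero blocks are ignored by assumption.

```python
NORMAL_RANGE = 28672    # 448 / 2**-6, FP8-E4M3 normal dynamic range


def regime(g_amax, b_amax):
    """Classify one block by rho = g_amax / b_amax."""
    rho = g_amax / b_amax
    if rho < 1:
        return "saturation"
    if rho <= NORMAL_RANGE:
        return "well-conditioned"
    return "subnormal/underflow"


def well_conditioned_window(block_amaxes):
    """Return (B_max, 28672 * B_min), or None if no single g_amax fits all blocks."""
    nonzero = [b for b in block_amaxes if b > 0]   # assumption: skip all-zero blocks
    lo, hi = max(nonzero), NORMAL_RANGE * min(nonzero)
    return (lo, hi) if lo <= hi else None


print(regime(0.5, 1.0))                      # saturation
print(regime(100.0, 1.0))                    # well-conditioned
print(regime(30000.0, 1.0))                  # subnormal/underflow
print(well_conditioned_window([1e-6, 1.0]))  # None: B_max / B_min exceeds 28672
```

When the window is empty, some blocks must leave the normal zone whatever g_amax is chosen; the calibration note above then favors letting small blocks go subnormal over letting large ones saturate.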