Skip to content

feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs#1571

Open
Fridah-nv wants to merge 8 commits into
mainfrom
fridah/layerwise-config
Open

feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs#1571
Fridah-nv wants to merge 8 commits into
mainfrom
fridah/layerwise-config

Conversation

@Fridah-nv
Copy link
Copy Markdown
Contributor

@Fridah-nv Fridah-nv commented May 29, 2026

What does this PR do?

Type of change: new feature

Groups all layerwise-calibration options under a nested LayerwiseConfig and adds three new behavior knobs to it. All changes are backward compatible.

1. Nested layerwise config

QuantizeAlgorithmConfig.layerwise changes from bool to a Pydantic submodel:

class LayerwiseConfig(ModeloptBaseConfig):
    enable: bool = False
    get_qdq_activations_from_prev_layer: bool = False
    checkpoint_dir: str | None = None
    save_every: int = 1
    save_quantizers_only: bool = False

Backward compatibility:

  • layerwise: True/False still accepted (emits DeprecationWarning).
  • Flat layerwise_checkpoint_dir silently migrated into layerwise.checkpoint_dir.
  • Legacy use_sequential alias preserved (and resolved during flat-key migration so it can't be dropped).
  • Conflicting flat+nested checkpoint_dir values raise.
  • All 7 shipped PTQ recipes (modelopt_recipes/general/ptq/*.yaml, huggingface/qwen3_5*/ptq/*.yaml) migrated to the
    canonical nested shape — no semantic change.

2. get_qdq_activations_from_prev_layer — correct GPTQ vs max-calib semantics

Controls what layer N's calibration sees:

  • True (GPTQ default): activations carry the quantize-dequantize error of layers 0..N-1 — GPTQ's Hessian-compensation
    goal.
  • False (max/mse/local_hessian default): full-precision activations, matching the non-layerwise pass exactly.

The False branch wraps the next-layer input capture forward with the existing set_quantizer_by_cfg_context deny-all idiom
({"quantizer_name": "*", "enable": False}).

GPTQ's per-algorithm default is enforced via @model_validator(mode="after") that reads LayerwiseConfig.model_fields_set
works for every input shape (empty constructor, bool, partial dict, full dict) and lets explicit user values override.

3. save_every — gate the large activation-cache writes

save_every: int = 1 (ge=1). With N > 1, the per-layer next_inputs.pt (cached activation tensors, the largest checkpoint
artifact for most models) is only written for the boundary layer of each N-layer window. Per-layer
weight/quantizer/output_meta files are still written every layer (resume needs them to replay skip layers correctly).
Interrupting mid-window re-calibrates that window on resume.

UPDATE: drop this in current PR and moved to #1640

4. save_quantizers_only — algorithm-aware weight-blob skipping

save_quantizers_only: bool = False. When True, skip weights.pt entirely and persist just the per-quantizer state_dict
slice (carries _amax) to a new quantizer_buffers.pt. On resume, full_restore reloads only the quantizer slice and
trusts that algorithm semantics didn't mutate layer.weight.

Safety is enforced by a whitelist: _supports_save_quantizers_only: ClassVar[bool] = False on QuantizeAlgorithmConfig,
overridden to True only on MaxCalibConfig, MseCalibConfig, LocalHessianCalibConfig (audited — these only touch
_amax). Weight-mutating algorithms (GPTQ folds Hessian updates, AWQ/SmoothQuant fold pre-quant scales) reject the flag at
config-construction time so in-place weight updates can't be silently lost on resume.

Usage

import modelopt.torch.quantization as mtq

# GPTQ — `get_qdq_activations_from_prev_layer` defaults to True (Hessian semantics).
# save_every reduces activation-cache I/O.
mtq.quantize(
    model,
    {
        "quant_cfg": [...],
        "algorithm": {
            "method": "gptq",
            "layerwise": {
                "enable": True,
                "checkpoint_dir": "/path/to/ckpts",
                "save_every": 4,
            },
        },
    },
    forward_loop=forward_loop,
)

# Max-calibration — `get_qdq_activations_from_prev_layer` defaults to False (FP from prior layers).
# save_quantizers_only skips the weights blob since max only updates _amax.
mtq.quantize(
    model,
    {
        "quant_cfg": [...],
        "algorithm": {
            "method": "max",
            "layerwise": {
                "enable": True,
                "checkpoint_dir": "/path/to/ckpts",
                "save_quantizers_only": True,
            },
        },
    },
    forward_loop=forward_loop,
)

Testing

New / updated unit tests in tests/unit/torch/quantization/:

  • test_config_validation.pyTestLayerwiseNestedConfig covers nested-form acceptance, bool-form
    DeprecationWarning, flat layerwise_checkpoint_dir migration, conflicting flat+nested checkpoint_dir, use_sequential
    alias survival under migration, per-algorithm qdq defaults (parametrized Max/GPTQ), save_every ge=1 validation, and the
    save_quantizers_only whitelist — parametrized rejection on [GPTQ, AWQLite, SmoothQuant] and acceptance on [Max, Mse, LocalHessian].

  • test_layerwise_calibrate.py

    • test_layerwise_no_qdq_matches_sequential_amax — behavioral equivalence: layerwise + qdq=False produces the same
      per-quantizer _amax as the non-layerwise (sequential) max-calibration flow (verified via torch.testing.assert_close).
    • test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries — window-save layout (all layer dirs present,
      next_inputs.pt only at boundaries).
    • test_layerwise_save_quantizers_only_resume_matches_one_shot_amax — end-to-end resume: full run → manifest rewound →
      fresh model resumes → final _amax matches the one-shot baseline; also pins the on-disk shape (no weights.pt,
      quantizer_buffers.pt present).

End-to-end correctness verification

Ran 4 PTQ jobs on Qwen3-8B with NVFP4 W4A16 quant_cfg and --calib_size 16, one GPU
each on 4 GPUs

Run Algorithm Layerwise config Purpose
A GPTQ enable=true, get_qdq_activations_from_prev_layer=true, save_every=5 New nested form + the newsave_every knob
B MSE enable=true, get_qdq_activations_from_prev_layer=false, save_quantizers_only=true New nested form + the new save_quantizers_only knob
C GPTQ Legacy flat form: layerwise: true, layerwise_checkpoint_dir: ... Backward-compat baseline for A (exercises the migration validator)
D MSE enable=false Non-layerwise baseline for B

Across two pairwise comparisons (A vs C ; B vs D), all 905 tensors in the exported HF checkpoints are bit-identical with hf_quant_config.json matching exactly — confirming the new layerwise knobs preserve correctness and the flat-form backward-compat path is intact.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines
and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best
Practices
(e.g.
avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ — layerwise: True/False still accepted with a DeprecationWarning; flat
    layerwise_checkpoint_dir silently migrated; use_sequential alias preserved. The two new knobs default to no-op behavior
    (save_every=1, save_quantizers_only=False).
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
    — no new dependencies.
  • Did you write any new necessary tests?: ✅ — see Testing section.
  • Did you update Changelog?: ✅
    ready for review.
  • Did you get Claude approval on this PR?: ✅ — /claude review consulted iteratively; review findings (GPTQ default
    survival, save_quantizers_only whitelist scope, docstring accuracy, dead layer param) addressed in-PR.

Additional Information

Notes on design choices that came out of internal review:

  • GPTQ's qdq=True default uses a model_validator(mode="after") + model_fields_set check rather than a default_factory
    — a default_factory is only fired when the field is absent, so any user-supplied dict (the natural way to enable
    layerwise) would silently lose the GPTQ default.
  • save_quantizers_only is enforced as a whitelist (_supports_save_quantizers_only) rather than a per-algorithm blacklist,
    which keeps future weight-mutating algorithms safe by default.
  • set_quantizer_by_cfg_context is reused for the qdq-disable scope instead of a bespoke helper, keeping model_calib.py
    aligned with the existing "deny-all" idiom documented at conversion.py:240.
  • Pre-validation recipe helpers in examples/llm_ptq/example_utils.py (needs_checkpoint_path_update /
    resolve_checkpoint_dir) accept both flat and nested shapes since they run before Pydantic validation;
    resolve_checkpoint_dir now also returns the resolved path so the caller doesn't re-derive it.

Summary by CodeRabbit

  • New Features
    • Layerwise calibration now uses a nested config (enable, get_qdq_activations_from_prev_layer, checkpoint_dir, save_every, save_quantizers_only); checkpointing supports quantizer-only saves and windowed commits with resume support.
  • Behavior
    • GPTQ calibration defaults get_qdq_activations_from_prev_layer=True when unspecified; save_every controls when next-inputs/manifests are written and must be positive.
  • Deprecations
    • Legacy flat-style layerwise keys are still accepted but emit DeprecationWarning and are auto-migrated (conflicts detected).
  • Examples/UX
    • Tools auto-detect legacy vs nested checkpoint layouts and report the resolved path.
  • Tests
    • Expanded coverage for nested configs, validation, checkpoint/resume semantics, and windowed saves.

@Fridah-nv Fridah-nv requested review from a team as code owners May 29, 2026 23:02
@Fridah-nv Fridah-nv requested a review from cjluo-nv May 29, 2026 23:02
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Ready to act? Review this PR in Change Stack to turn feedback into patch suggestions you can inspect and refine.

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR restructures layerwise quantization calibration config from flat boolean and separate checkpoint-dir into a nested LayerwiseConfig with options enable, get_qdq_activations_from_prev_layer, checkpoint_dir, save_every, and save_quantizers_only. It adds migration/validation, wires the options through calibration and checkpointing, updates examples/recipes, and expands tests.

Changes

Layerwise Configuration Restructuring

Layer / File(s) Summary
LayerwiseConfig schema, validation, and legacy migration
modelopt/torch/quantization/config.py, tests/unit/torch/quantization/test_config_validation.py
Introduces LayerwiseConfig and normalization/migration of legacy layerwise boolean/alias and layerwise_checkpoint_dir into the nested shape (emits DeprecationWarning). Enforces validation (checkpoint_dir requires enable=True) and adds GPTQ defaulting for get_qdq_activations_from_prev_layer. Changelog updated.
Calibration pipeline consumption
modelopt/torch/quantization/mode.py, modelopt/torch/quantization/model_calib.py
Entry points parse nested layerwise options (enable, checkpoint_dir, get_qdq_activations_from_prev_layer, save_every), forward them into layerwise_calibrate, adjust inter-layer capture timing (qdq_from_prev), and initialize _CheckpointState with save_every. Error/doc wording updated to layerwise.enable.
Checkpoint persistence with windowing
modelopt/torch/quantization/utils/layerwise_calib.py, tests/unit/torch/quantization/test_layerwise_calibrate.py
Manifest now stores save_every. Adds _save_layer_files for per-layer artifacts; _CheckpointState accepts save_every, validates resumes against manifest num_layers and save_every, and only commits next_inputs.pt/manifest at window boundaries or final layer. Restore applies quantizer state before weights.
User-facing integration: examples, HF script, and recipes
examples/llm_ptq/example_utils.py, examples/llm_ptq/hf_ptq.py, modelopt_recipes/general/ptq/*.yaml, modelopt_recipes/huggingface/*/*.yaml
_layerwise_checkpoint_dir_location() detects flat vs nested checkpoint-dir shapes; resolve_checkpoint_dir() returns (quant_cfg, resolved_path) and writes the resolved path back in the original recipe shape. HF PTQ script checks layerwise.enable and prints the resolved dir. Multiple recipes updated to nested layerwise: { enable: ... }.
Tests: validation and layerwise calibration behavior
tests/unit/torch/quantization/test_config_validation.py, tests/unit/torch/quantization/test_layerwise_calibrate.py
Adds focused tests for use_sequential deprecation alias, nested layerwise acceptance and migration, conflicting flat vs nested checkpoint_dir inputs, GPTQ defaults/overrides, qdq-from-prev behavior, save_every windowing and resume/crash semantics, and save_every mismatch detection.
Changelog documentation
CHANGELOG.rst
Documents introduction of nested LayerwiseConfig with new checkpoint and QDQ activation options, and deprecation of legacy boolean and flat checkpoint_dir fields.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Config as QuantizeAlgorithmConfig
  participant Pipeline as CalibPipeline
  participant Checkpoint as _CheckpointState

  User->>Config: provide legacy bool or nested layerwise config
  Config->>Config: coerce legacy boolean/alias → nested LayerwiseConfig (emit DeprecationWarning)
  Config->>Config: migrate layerwise_checkpoint_dir → layerwise.checkpoint_dir (if present)
  Config->>Pipeline: return normalized LayerwiseConfig

  Pipeline->>Pipeline: extract enable, checkpoint_dir, save_every, get_qdq_activations_from_prev_layer
  alt layerwise.enable == True
    Pipeline->>Checkpoint: _CheckpointState.from_folder(checkpoint_dir, num_layers, save_every)
    loop per layer
      Pipeline->>Pipeline: calibrate layer
      alt get_qdq_activations_from_prev_layer == False
        Pipeline->>Pipeline: capture next_inputs before calib_func (disable quantizers on current layer)
      else get_qdq_activations_from_prev_layer == True
        Pipeline->>Pipeline: capture next_inputs after calib_func (allow QDQ effects)
      end
      alt layer_idx % save_every == 0 OR final layer
        Pipeline->>Checkpoint: save(layer_idx, model, layers, next_inputs?)
        Checkpoint->>Checkpoint: write qstate, (weights.pt OR quantizer_buffers.pt), output_meta
        Checkpoint->>Checkpoint: commit manifest.json and next_inputs.pt
      else
        Pipeline->>Checkpoint: write per-layer files without committing manifest
      end
    end
  else
    Pipeline->>Pipeline: run sequential calibration
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • cjluo-nv
  • Edwardf0t1
  • kinjalpatel27
  • meenchen

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error Line 671-672 of layerwise_calib.py has torch.load(..., weights_only=False) without an inline comment justifying why it is safe, violating SECURITY.md. Add inline comment before weights torch.load(): # weights_only=False is safe: file is internally generated by _save_layer
Docstring Coverage ⚠️ Warning Docstring coverage is 54.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main changes: introducing nested LayerwiseConfig, adding the QDQ-from-prev-layer flag, and new checkpoint I/O controls.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fridah/layerwise-config

Comment @coderabbitai help to get the list of available commands and usage tips.

@Fridah-nv Fridah-nv requested review from realAsma and sugunav14 May 29, 2026 23:03
@Fridah-nv
Copy link
Copy Markdown
Contributor Author

/claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1571/

Built to branch gh-pages at 2026-06-05 23:40 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

The nested LayerwiseConfig refactor, get_qdq_activations_from_prev_layer flag, and save_quantizers_only whitelist look well-designed and the back-compat shims (bool coercion + flat→nested migration with the use_sequential rescue) are thorough. Tests around config validation and the qdq=False-matches-sequential-amax behavioral check are good.

However, save_every > 1 looks broken: by the time _CheckpointState.save() is invoked for the boundary layer, the earlier layers in the window have already been replaced by _SkipLayer instances in transformer_layers (via _set_layer_states(layer_idx+1) swapping layer_idx-1 to a dummy during cache_outputs_for_next_layer_calib). Reading layers[i].state_dict() / quantizer_state(layers[i]) / layers[i]._layerwise_calib.output_meta on a _SkipLayer returns empty/None state because _original is attached via object.__setattr__ and is not a registered submodule. The new test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries only checks file existence, so the silent corruption isn't caught — there is no resume test for save_every > 1. See inline comment for details and the suggested fix (snapshot to memory immediately after each layer is calibrated, then flush at boundaries; or save real layers eagerly and only window-gate next_inputs.pt).

Minor: changelog says "TODO before marking ready for review" in the PR body but the changelog entry is already added — please confirm the description is up to date before merge.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Comment thread modelopt/torch/quantization/utils/layerwise_calib.py
Comment thread modelopt/torch/quantization/config.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 153-160: Add an explicit __all__ declaration near the top of the
module that lists the new public API symbols (e.g. "LayerwiseConfig" and any
other public config classes, functions or constants defined in this file and the
other block referenced around 636-691); place it after the imports and ensure it
enumerates every symbol intended for export so star-imports are safe and stable,
and update corresponding package __init__.py re-exports if needed to re-export
these names.
- Around line 706-708: _coerce_layerwise_input currently returns
value.model_dump() for LayerwiseConfig which expands defaults and loses the
original model_fields_set causing GPTQ's _gptq_qdq_default to miss injecting
defaults; change _coerce_layerwise_input to return the LayerwiseConfig instance
(value) unchanged when isinstance(value, LayerwiseConfig) so downstream parsing
preserves model_fields_set (e.g., self.layerwise.model_fields_set) and
GPTQCalibConfig._gptq_qdq_default can correctly detect which fields the user
actually set.

In `@modelopt/torch/quantization/utils/layerwise_calib.py`:
- Around line 729-761: The loop that saves windowed checkpoints is capturing
_SkipLayer placeholders rather than the underlying calibrated layer because
cache_outputs_for_next_layer_calib() can replace entries in layers with
_SkipLayer before save; update the save loop in the function that iterates
layers[i] to detect instances of _SkipLayer and use layer = layer._original (or
otherwise snapshot the original layer list before cache outputs mutate it) so
_save_layer receives the real calibrated layer, its quantizer state
(quantizer_state(layer)), and correct output_meta rather than the placeholder.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e32cbbac-64bc-4625-80b8-93a1790c3d62

📥 Commits

Reviewing files that changed from the base of the PR and between 40a4dd3 and dbca1a5.

📒 Files selected for processing (16)
  • CHANGELOG.rst
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/layerwise_calib.py
  • modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
  • tests/unit/torch/quantization/test_config_validation.py
  • tests/unit/torch/quantization/test_layerwise_calibrate.py

Comment thread modelopt/torch/quantization/config.py
Comment thread modelopt/torch/quantization/config.py
Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review — 1 CRITICAL, 0 IMPORTANT, 2 SUGGESTIONs.

Most impactful finding (inline on modelopt/torch/quantization/utils/layerwise_calib.py):

_save_layer calls _write_manifest(...) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. So if the process is killed (or torch.save raises) between the first and last _save_layer call within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, _CheckpointState.setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — there's no fallback that re-calibrates the partial window. The PR description's claim that "interrupting mid-window re-calibrates that window on resume" is the intended invariant, but the implementation doesn't enforce it.

The fix is small: only advance the manifest after the entire window has been flushed (drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() with last_completed_layer = layer_idx). A regression test that monkeypatches _write_manifest to raise mid-window would lock this in. With save_every=1 (the default) the bug isn't reachable, so the existing tests don't catch it — this only bites users who opt into the new knob.

Other:

  • SUGGESTION: legacy flat layerwise_checkpoint_dir is migrated silently in _migrate_layerwise_checkpoint_dir. The bool-form layerwise emits a DeprecationWarning; consider symmetric treatment so the migration window for the flat key is signaled.
  • SUGGESTION: resolve_checkpoint_dir in examples/llm_ptq/example_utils.py only updates whichever shape it picks — if the user happens to specify both flat and nested with the same value, the rewrite produces a Pydantic conflict whose error message points at the resolver's output, not at the user's redundancy. Edge case but worth a guard.

Risk: Low–medium. The default path (save_every=1, no save_quantizers_only) is well-tested and the algorithm-correctness pieces (qdq-from-prev semantics, GPTQ default via model_fields_set, save_quantizers_only whitelist) look right and are covered by the new tests. The save_every>1 path has the resume-corruption bug above, which only opt-in users will hit.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py
Comment thread modelopt/torch/quantization/config.py
Comment thread examples/llm_ptq/example_utils.py
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: the previously flagged critical save_every > 1 corruption bug is still present in modelopt/torch/quantization/utils/layerwise_calib.py. The new test only checks file existence, not file contents, so the corruption remains silent. Other previous comments (manifest update during partial-window crash, _coerce_layerwise_input model_dump() losing model_fields_set) are also unaddressed. Marking comment rather than nudge because the critical bug from the previous review is unresolved.

Status of previous comments:

  • (critical, UNRESOLVED) save_every > 1 reads from _SkipLayer stand-ins → silent on-disk corruption. Still no resume test for save_every > 1.
  • (critical, UNRESOLVED) Manifest is advanced per-_save_layer call, so a mid-window crash leaves manifest=K with no next_inputs.pt for layer K, breaking resume permanently.
  • (major, UNRESOLVED) _coerce_layerwise_input calls .model_dump() on a LayerwiseConfig instance, which expands defaults and breaks _gptq_qdq_default's reliance on model_fields_set.
  • (minor) Cross-version save_every / save_quantizers_only mismatch on resume — still no manifest-side guard.
  • (minor) __all__ declaration in config.py — still missing.
  • (minor) Symmetric DeprecationWarning for flat layerwise_checkpoint_dir migration — still silent.
  • (minor) resolve_checkpoint_dir flat+nested redundancy — still no guard.

Resolved/no-longer-flagging:

  • Changelog entry is present and the "TODO before marking ready" note appears to have been dropped from the PR body.

Design-review pass: the PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the design protocol's "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable; correctness, not design, is the issue here.


Additional comments (outside the PR diff):

  • modelopt/torch/quantization/utils/layerwise_calib.py:522 — > Bot comment.

Critical (carried over, still unresolved): _save_layer calls _write_manifest(checkpoint_dir, idx, num_layers) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. If the process is killed (or torch.save raises) between the first _save_layer and the boundary _save_layer within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — the PR description's claim that "interrupting mid-window re-calibrates that window on resume" is not what the code does.

Fix: drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() (with last_completed_layer = layer_idx) so the manifest only advances after the boundary next_inputs.pt is on disk. A regression test that monkeypatches _write_manifest to raise mid-window would lock this in.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Comment thread modelopt/torch/quantization/config.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.92%. Comparing base (54ce4e0) to head (ab19be0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1571      +/-   ##
==========================================
+ Coverage   76.27%   76.92%   +0.65%     
==========================================
  Files         489      489              
  Lines       54415    54475      +60     
==========================================
+ Hits        41504    41904     +400     
+ Misses      12911    12571     -340     
Flag Coverage Δ
examples 42.79% <42.52%> (+0.78%) ⬆️
gpu 58.45% <96.55%> (-1.42%) ⬇️
regression 14.90% <35.63%> (-0.22%) ⬇️
unit 54.05% <98.85%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Fridah-nv
Copy link
Copy Markdown
Contributor Author

/claude review

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: all previously-flagged critical issues are resolved with regression tests.

  • save_every > 1 _SkipLayer corruption (critical, was unresolved) — Fixed. _CheckpointState.save() now reads from layers[layer_idx] (the just-calibrated layer) and is called every iteration before the next-layer capture forward swaps prior layers to _SkipLayer. Per-layer files (weights/quantizer state/output_meta) are written eagerly; only next_inputs.pt + manifest are deferred to window boundaries.
  • Mid-window crash advancing manifest (critical, was unresolved) — Fixed. _save_layer_files no longer writes the manifest; _CheckpointState.save() writes it once at boundary/final, so a torch.save crash mid-window leaves the manifest pointing at the previous boundary. Covered by test_layerwise_save_every_mid_window_crash_recovers_at_prev_boundary (monkeypatches torch.save to crash on layer 2 mid-window, asserts manifest stays at layer 1).
  • _coerce_layerwise_input .model_dump() expanding defaults (critical/major, was unresolved) — Fixed. Now uses model_dump(exclude_unset=True) so LayerwiseConfig instance inputs preserve model_fields_set, and GPTQ's _gptq_qdq_default correctly injects True for fields the user didn't explicitly set. Test parametrization extended to cover LayerwiseConfig(enable=True) + LayerwiseConfig(enable=True, get_qdq_activations_from_prev_layer=False).
  • Save_every>1 resume parity test (was missing) — Added. test_layerwise_checkpoint_resume_matches_one_shot_amax parametrized over (save_every=2, save_quantizers_only=False) and (save_every=1, save_quantizers_only=True) rewinds the manifest and asserts amax equivalence with a one-shot baseline — i.e. it would fail under the previous corruption.
  • Cross-version save_every/save_quantizers_only mismatch (minor) — Addressed. Manifest now persists both keys; from_folder raises on any mismatch. Covered by test_layerwise_checkpoint_mismatch_save_every_raises.
  • Symmetric DeprecationWarning for flat layerwise_checkpoint_dir (minor) — Addressed; warning emitted in _migrate_layerwise_checkpoint_dir.
  • resolve_checkpoint_dir flat+nested redundancy (minor) — Addressed; resolver writes back to both shapes when both present, keeping the dict internally consistent.
  • __all__ declaration (minor) — Author marked out of scope; acceptable.

Design-review pass: this PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable.

Complex PR: spans 8 directories (≥ 5). Looping in a human for approval.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Comment thread modelopt/torch/quantization/model_calib.py Outdated
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: all four previously-flagged critical issues are resolved with meaningful regression tests.

  • 💬 save_every > 1 _SkipLayer corruption (critical, was unresolved) — Fixed. _CheckpointState.save() now reads from layers[layer_idx] (the just-calibrated layer) before the next iteration's capture forward swaps prior layers to _SkipLayer. Per-layer files (weights/quantizer state/output_meta) are written eagerly; only next_inputs.pt + manifest are deferred to window boundaries. Covered by the parametrized test_layerwise_checkpoint_resume_matches_one_shot_amax(save_every=2) parity test against a one-shot baseline.
  • 💬 Mid-window crash advancing manifest (critical, was unresolved) — Fixed. _save_layer_files no longer writes the manifest; _CheckpointState.save() writes it once at boundary/final, so a torch.save crash mid-window leaves the manifest pointing at the previous boundary. Covered by test_layerwise_save_every_mid_window_crash_recovers_at_prev_boundary (monkeypatches torch.save to crash on layer 2 mid-window, asserts manifest stays at layer 1).
  • 💬 _coerce_layerwise_input .model_dump() expanding defaults (critical, was unresolved) — Fixed. Now uses model_dump(exclude_unset=True) so LayerwiseConfig instance inputs preserve model_fields_set, and GPTQ's _gptq_qdq_default correctly injects True for fields the user didn't explicitly set. Test parametrization extended to cover LayerwiseConfig(enable=True) and LayerwiseConfig(enable=True, get_qdq_activations_from_prev_layer=False).
  • 💬 qdq_from_prev=False weight-mutation leak (was raised by realAsma) — Fixed. Next-layer input capture moved to before calib_func when qdq_from_prev=False, so weight-mutating algorithms (GPTQ/AWQ/SmoothQuant folds) can't leak into downstream captures. Covered by test_layerwise_no_qdq_captures_inputs_before_calib_func_mutates_weights (destructive vs identity calib_func produce identical captured inputs).
  • 💬 Cross-version save_every / save_quantizers_only mismatch (minor) — Addressed. Manifest now persists both keys; from_folder raises on any mismatch. Covered by test_layerwise_checkpoint_mismatch_save_every_raises.
  • 💬 Symmetric DeprecationWarning for flat layerwise_checkpoint_dir (minor) — Addressed; warning emitted in _migrate_layerwise_checkpoint_dir.
  • 💬 resolve_checkpoint_dir flat+nested redundancy (minor) — Addressed; resolver writes the resolved path back to both shapes when both are present so the migration validator sees consistent values.
  • 💬 __all__ declaration in config.py (minor) — Author marked out of scope; acceptable.

Design-review pass: this PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable.

Looping in a human for final sign-off because (a) complexity gate fired (16 files, +853/-135, 8 directories), and (b) author's reported end-to-end verification (Qwen3-8B, 4 PTQ runs, 905 tensors bit-identical across legacy-vs-nested and nested-vs-baseline pairs) is a strong correctness signal but lives outside CI — a human reviewer with the runtime context should confirm the bit-equivalence claim is sufficient.

@realAsma
Copy link
Copy Markdown
Contributor

realAsma commented Jun 1, 2026

@Fridah-nv can you take a look at #1592?

I have a improvement for layerwise (Avoid write back with FSDP2 / CPU/Disk offload unless needed)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/layerwise_calib.py (1)

630-645: ⚠️ Potential issue | 🔴 Critical

Fix insecure resume deserialization in layerwise calibration (torch.load weights_only=False)

  • modelopt/torch/quantization/utils/layerwise_calib.py loads output_meta.pt, next_inputs.pt, quantizer_state.pt, and weights.pt via torch.load(..., weights_only=False) in _CheckpointState.setup_resume() and _CheckpointState.full_restore(). The inline “internally generated” justification isn’t a valid trust boundary because checkpoint_dir is caller-configurable and resume may read pre-seeded/tampered files, making this an unsafe pickle-backed code-execution path.
  • Change resume loading to avoid weights_only=False where feasible (e.g., use weights_only=True for weights.pt / quantizer_state.pt if compatible), and for output_meta.pt / next_inputs.pt switch to a pickle-free serialization or add an explicit provenance/trust gate tied to files written by this run.
  • Also remove or update the unused quantizer_buffers branch/docstring in _save_layer_files (it’s never passed from save() and full_restore() doesn’t load quantizer_buffers.pt).
🧹 Nitpick comments (1)
modelopt/torch/quantization/utils/layerwise_calib.py (1)

501-524: ⚡ Quick win

Drop the dead quantizer_buffers checkpoint path.

save() now always writes weights.pt, while full_restore() only reads weights.pt, so keeping quantizer_buffers in _save_layer_files() and in its docstring advertises a restore mode this file no longer implements. Removing the unused parameter/branch would keep the checkpoint format single-sourced and avoid future callers writing unreadable checkpoints.

♻️ Minimal cleanup
 def _save_layer_files(
     checkpoint_dir: str,
     idx: int,
-    weights: dict | None,
+    weights: dict,
     qstate: dict,
-    quantizer_buffers: dict | None,
     output_meta: tuple,
 ) -> None:
     """Write the per-layer files for layer *idx*.
 
-    Exactly one of ``weights`` (full layer state_dict) or ``quantizer_buffers``
-    (just the TensorQuantizer state_dict slice) is written; ``full_restore``
-    falls back to whichever is present. ``next_inputs.pt`` and ``manifest.json``
-    are deferred to window boundaries in :meth:`_CheckpointState.save`.
+    ``weights.pt``, ``quantizer_state.pt``, and ``output_meta.pt`` are written
+    every call. ``next_inputs.pt`` and ``manifest.json`` are deferred to window
+    boundaries in :meth:`_CheckpointState.save`.
     """
     d = _layer_dir(checkpoint_dir, idx)
     if os.path.isdir(d):
         shutil.rmtree(d)
     os.makedirs(d)
-    if weights is not None:
-        torch.save(weights, os.path.join(d, "weights.pt"))
-    elif quantizer_buffers is not None:
-        torch.save(quantizer_buffers, os.path.join(d, "quantizer_buffers.pt"))
+    torch.save(weights, os.path.join(d, "weights.pt"))
     torch.save(qstate, os.path.join(d, "quantizer_state.pt"))
     torch.save(output_meta, os.path.join(d, "output_meta.pt"))

As per coding guidelines, "Remove dead code including unused imports, unreachable branches, and obsolete helpers."

Also applies to: 713-719

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/utils/layerwise_calib.py` around lines 501 - 524,
The _save_layer_files function still accepts and documents the obsolete
quantizer_buffers path even though save()/full_restore() only use weights.pt;
remove the dead parameter and branch: delete the quantizer_buffers parameter
from _save_layer_files signature and all callers, remove the elif branch that
writes "quantizer_buffers.pt", and update the docstring to state only weights.pt
and quantizer_state.pt are written; also remove any duplicated/unused
quantizer_buffers handling in the later block around the other occurrence (lines
~713-719) so the checkpoint format remains single-sourced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@modelopt/torch/quantization/utils/layerwise_calib.py`:
- Around line 501-524: The _save_layer_files function still accepts and
documents the obsolete quantizer_buffers path even though save()/full_restore()
only use weights.pt; remove the dead parameter and branch: delete the
quantizer_buffers parameter from _save_layer_files signature and all callers,
remove the elif branch that writes "quantizer_buffers.pt", and update the
docstring to state only weights.pt and quantizer_state.pt are written; also
remove any duplicated/unused quantizer_buffers handling in the later block
around the other occurrence (lines ~713-719) so the checkpoint format remains
single-sourced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b626c7ad-d66e-4d09-9923-a284fb419068

📥 Commits

Reviewing files that changed from the base of the PR and between af1bb4f and 51bc967.

📒 Files selected for processing (7)
  • CHANGELOG.rst
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/layerwise_calib.py
  • tests/unit/torch/quantization/test_config_validation.py
  • tests/unit/torch/quantization/test_layerwise_calibrate.py
💤 Files with no reviewable changes (3)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/mode.py
  • tests/unit/torch/quantization/test_config_validation.py
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst

Comment thread modelopt/torch/quantization/model_calib.py
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review after extensive iteration. The four previously-flagged criticals all appear addressed with regression tests:

  • 💬 save_every > 1 _SkipLayer corruption — Fixed. _CheckpointState.save() now reads from layers[layer_idx] per-call, before the next iteration's capture forward swaps prior layers to _SkipLayer. Per-layer files written eagerly; only next_inputs.pt + manifest deferred to boundaries.
  • 💬 Mid-window crash advancing manifest — Fixed. _save_layer_files no longer writes the manifest; save() writes it once at boundary/final. Covered by test_layerwise_save_every_mid_window_crash_recovers_at_prev_boundary (monkeypatches torch.save to crash on layer 2 mid-window, asserts manifest stays at layer 1).
  • 💬 _coerce_layerwise_input .model_dump() expanding defaults — Fixed. Now uses model_dump(exclude_unset=True). Tests cover LayerwiseConfig(enable=True) preserving model_fields_set so GPTQ's _gptq_qdq_default injects correctly.
  • 💬 realAsma's qdq_from_prev=False weight-mutation leak — Fixed. Next-layer input capture moved to before calib_func when qdq_from_prev=False. Covered by test_layerwise_no_qdq_captures_inputs_before_calib_func_mutates_weights (destructive vs identity calib_func produce identical captured inputs).
  • 💬 Cross-version save_every mismatch, symmetric DeprecationWarning for flat key, resolve_checkpoint_dir redundancy guard — all addressed.
  • __all__ declaration in config.py — author marked out of scope; acceptable.

Looping in a human for final sign-off because:

  1. Complexity gate fired (16 files, +746/-130, 8 directories).
  2. The author's strongest correctness signal — bit-identical 905-tensor exports across (legacy-flat, nested) and (layerwise-MSE, baseline-MSE) Qwen3-8B runs — lives outside CI; a human reviewer with the runtime context should confirm that's sufficient before merge.
  3. Prior review iteration discovered 3 critical bugs that the unit tests didn't catch on the first pass; given the new opt-in surface (save_every, qdq_from_prev), one more set of human eyes is warranted.

Minor non-blocking observation: _save_layer_files still carries a dead quantizer_buffers parameter / branch (the save_quantizers_only feature was split out to PR #1640 per the PR body, but the helper signature kept the slot). Worth a follow-up cleanup but not a merge blocker.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py
for batch in calib_data:
m(batch)

mtq.quantize(model_seq, _int8_layerwise_config({"method": "max"}), forward_loop=fwd)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct?
why are we passing layserwise config here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Bot comment.

This is correct, but the helper name is confusing. _int8_layerwise_config just starts from the shipped INT8 config so the test uses the same real quant_cfg, then replaces cfg["algorithm"] with the dict passed in.

In this first call we pass only {"method": "max"}, so layerwise is not enabled and it exercises the normal sequential max-calibration baseline. The second call adds layerwise.enable=True plus get_qdq_activations_from_prev_layer=False, so the test compares that layerwise path against the sequential baseline under the same quantizer config.

We should rename the helper or update its docstring to make that clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _int8_layerwise_config name is mis-leading, it's just "an INT8 config with this algorithm block." updating the name and description.

)


def test_layerwise_no_qdq_matches_sequential_amax(monkeypatch):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fridah-nv can you double check this test?

Copy link
Copy Markdown
Contributor

@realAsma realAsma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fridah-nv and others added 8 commits June 5, 2026 23:31
…layer

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
The skip-layer-weight-checkpoint optimization (save_quantizers_only) is
moved out of this foundation PR; it lands complete in the stacked PR #1592
(as calib_mutates_weights). _CheckpointState now always saves the full layer
weights blob. save_every and the rest of the nested LayerwiseConfig are kept.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Follow-ups on dropping save_quantizers_only:
- Remove the now-dead quantizer_buffers checkpoint path in _save_layer_files
  (save() always writes weights.pt; full_restore() only reads it).
- Rename _int8_layerwise_config -> _int8_cfg_with_algorithm and clarify its
  docstring: it only overrides the algorithm block; layerwise runs only when
  the algorithm sets layerwise.enable (a plain {method: ...} is sequential).
- Keep both next-layer captures and ckpt.save under persistent_materialization
  so the capture forward reuses the resident weights instead of
  re-materializing per batch on offloaded/sharded models.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv force-pushed the fridah/layerwise-config branch from ac9930d to ab19be0 Compare June 5, 2026 23:36
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/layerwise_calib.py (1)

625-673: ⚠️ Potential issue | 🔴 Critical

Do not treat resume checkpoints as implicitly trusted before torch.load(weights_only=False)
_CheckpointState.from_folder() resumes based on detect_resume_point(), and _read_manifest() only checks that checkpoint_dir/manifest.json is valid JSON (no signing/integrity/ownership validation). If checkpoint_dir is pre-populated, setup_resume()/full_restore() call torch.load(..., weights_only=False) on output_meta.pt, next_inputs.pt, quantizer_state.pt, and weights.pt, making checkpoint_dir an RCE boundary; the current inline “internally generated” justification is not sufficient for this trust model.
Switch to safe loading (weights_only=True where possible) and/or require explicit trust opt-in and integrity validation for resume directories (e.g., only resume from directories created by this run or verify an authenticity marker/signature).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/utils/layerwise_calib.py` around lines 625 - 673,
The resume flow treats arbitrary checkpoint_dir content as trusted and uses
unsafe torch.load(..., weights_only=False); update the resume code paths
(_CheckpointState.from_folder()/setup_resume() and full_restore()) to (1) use
torch.load(..., weights_only=True) for files that contain tensors only
(output_meta.pt, next_inputs.pt, quantizer_state.pt, weights.pt) so
deserialization of arbitrary Python objects is avoided, and (2) add an explicit
opt-in trust check before allowing weights_only=False or any full
deserialization (e.g., a boolean flag on the CheckpointState or a
signature/marker verified by _read_manifest()) and refuse to resume unless the
directory was created by this process or the marker is valid; change the
torch.load calls in setup_resume and full_restore to weights_only=True where
compatible and gate any remaining weights_only=False loads behind the new
explicit trust check and integrity verification.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@modelopt/torch/quantization/utils/layerwise_calib.py`:
- Around line 625-673: The resume flow treats arbitrary checkpoint_dir content
as trusted and uses unsafe torch.load(..., weights_only=False); update the
resume code paths (_CheckpointState.from_folder()/setup_resume() and
full_restore()) to (1) use torch.load(..., weights_only=True) for files that
contain tensors only (output_meta.pt, next_inputs.pt, quantizer_state.pt,
weights.pt) so deserialization of arbitrary Python objects is avoided, and (2)
add an explicit opt-in trust check before allowing weights_only=False or any
full deserialization (e.g., a boolean flag on the CheckpointState or a
signature/marker verified by _read_manifest()) and refuse to resume unless the
directory was created by this process or the marker is valid; change the
torch.load calls in setup_resume and full_restore to weights_only=True where
compatible and gate any remaining weights_only=False loads behind the new
explicit trust check and integrity verification.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5863b507-feac-435e-a513-e6cf2036feae

📥 Commits

Reviewing files that changed from the base of the PR and between 51bc967 and ac9930d.

📒 Files selected for processing (3)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/layerwise_calib.py
  • tests/unit/torch/quantization/test_layerwise_calibrate.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • modelopt/torch/quantization/model_calib.py
  • tests/unit/torch/quantization/test_layerwise_calibrate.py

Fridah-nv added a commit that referenced this pull request Jun 5, 2026
save_every); this PR adds the two optimizations on top: calib_mutates_weights
(skip weight checkpoint + writeback for amax-only algorithms) and meta-device
skip-layer placeholders.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants