Add Alpamayo-1 example by rohansjoshi · Pull Request #1594 · NVIDIA/Model-Optimizer

rohansjoshi · 2026-06-01T22:49:53Z

What does this PR do?

Type of change: ? New example

Adds example for Alpamayo-1 quantization with ModelOpt (FP8, NVFP4, AutoQuant)

Usage

python quantize.py --ckpt nvidia/Alpamayo-R1-10B --output-dir ./alpamayo-r1-fp8 --quantize fp8

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Added Alpamayo 1 vision-language-action model quantization example supporting FP8, NVFP4, and mixed-precision optimization modes
- Introduced CLI quantization tool with calibration loop and checkpoint export capabilities for both fake-quantized and real-quantized formats
Documentation
- Added comprehensive guide documenting the Alpamayo quantization example, model details, and usage instructions

coderabbitai · 2026-06-01T22:50:03Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45de7a04-1b1b-4043-af40-c4a0955da987

📥 Commits

Reviewing files that changed from the base of the PR and between 7886c59 and bfcb034.

📒 Files selected for processing (3)

CHANGELOG.rst
examples/alpamayo/README.md
examples/alpamayo/quantize.py

✅ Files skipped from review due to trivial changes (2)

CHANGELOG.rst
examples/alpamayo/README.md

🚧 Files skipped from review as they are similar to previous changes (1)

examples/alpamayo/quantize.py

📝 Walkthrough

Walkthrough

Introduces a new Alpamayo-R1 quantization example with a CLI script, calibration pipeline, and both PTQ (FP8/NVFP4) and AutoQuantize modes. Includes a teacher-forced flow-matching forward pass for calibration, parquet clip loading, and optional real-quantization compression before HF checkpoint export.

Changes

Alpamayo Quantization Example

Layer / File(s)	Summary
Documentation `examples/alpamayo/README.md`, `CHANGELOG.rst`	Introduces README explaining the quantization example, fake vs real-quant modes, precision constraints (vision tower excluded), and AutoQuantize's per-layer NVFP4/FP8 search. Updates changelog with example availability.
Script setup and utilities `examples/alpamayo/quantize.py` (imports, constants, helpers)	Module imports, image pixel bounds (MIN/MAX), base Qwen3-VL processor name, `create_message` to build image+trajectory prompts, `get_processor` with pixel constraints, and `to_device` recursive device/dtype utility.
Teacher-forced forward patch `examples/alpamayo/quantize.py` (lines 126–257)	Implements `_teacher_forced_flow_loss_forward` to build VLM KV cache and compute v_pred/v_target for flow-matching objectives; `patch_teacher_forced_flow_loss_forward` attaches this forward to AlpamayoR1 at import time for use in both PTQ and AutoQuantize calibration.
Calibration data pipeline `examples/alpamayo/quantize.py` (lines 260–327)	`make_joint_calibration_forward_loop` iterates clips, loads frames/ego history, tokenizes the image+trajectory prompt, moves data to device, and runs VLM rollout under FP16 autocast. `read_clip_ids_from_parquet` reads the key column case-insensitively and de-duplicates while preserving order.
PTQ quantization (FP8/NVFP4) `examples/alpamayo/quantize.py` (lines 330–396)	`quantize_model` configures FP8 vs NVFP4 defaults, disables quantization for the vision tower, conditionally disables Linear layers with non-16-multiple dimensions when NVFP4 or real-quant is used, and invokes `mtq.quantize` with the calibration loop.
AutoQuantize search (mixed-precision) `examples/alpamayo/quantize.py` (lines 399–503)	`auto_quantize_model` constructs a re-iterable parquet loader, defines BF16 autocast forward/loss using the teacher-forced method and MSE between v_pred and v_target, disables vision tower and non-16-multiple Linear modules, and runs `mtq.auto_quantize` under an effective-bits budget.
CLI orchestration and checkpoint saving `examples/alpamayo/quantize.py` (lines 505–652)	`main()` parses command-line arguments, enables HF integration, loads model and processor, reads calibration clip IDs from parquet, selects PTQ or AutoQuantize flow, optionally runs `mtq.compress` for real-quant weights, saves model/processor/config via `save_pretrained`, writes `hf_quant_config.json`, and provides the `__main__` entry point.

Sequence Diagram

sequenceDiagram
  participant User as User/CLI
  participant Model as AlpamayoR1 Model
  participant Processor as Qwen3-VL Processor
  participant CalibLoader as Calibration Loader
  participant Quantizer as PTQ/AutoQuantize Engine
  participant Compressor as mtq.compress
  participant Checkpoint as HF Checkpoint

  User->>Model: Load model
  User->>Processor: Initialize with pixel bounds
  
  alt PTQ Path (FP8/NVFP4)
    User->>CalibLoader: Iterate parquet clips
    CalibLoader->>Processor: Tokenize image+trajectory prompt
    CalibLoader->>Model: Run VLM rollout (FP16 autocast)
    Quantizer->>Model: Apply quantization config (disable vision tower, conditionally disable Linear layers)
    Quantizer->>Model: Run mtq.quantize with calibration forward
  else AutoQuantize Path
    User->>CalibLoader: Create re-iterable loader over clips
    CalibLoader->>Processor: Tokenize clips
    CalibLoader->>Model: Compute teacher-forced flow loss (BF16 autocast)
    Quantizer->>Model: Run mtq.auto_quantize (MSE search under bits budget)
  end

  alt Real-Quant Mode
    User->>Compressor: Run mtq.compress (pack weights)
  end

  User->>Checkpoint: Save model/processor/config via save_pretrained
  User->>Checkpoint: Write hf_quant_config.json

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

cjluo-nv
meenchen

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 64.71% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add Alpamayo-1 example' is concise and directly summarizes the main change: adding a new example demonstrating Alpamayo-1 quantization.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All security anti-pattern checks passed: no unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, or new unvetted dependencies.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch rohjoshi/alpamayo-example

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)

examples/alpamayo/quantize.py (1)

655-681: ⚡ Quick win

Guard checkpoint writes to rank 0 for distributed-safe execution.

Current save/write operations are unconditional; in multi-rank runs, each rank may write the same files concurrently.

🛡️ Proposed fix

+    is_rank0 = (
+        not torch.distributed.is_available()
+        or not torch.distributed.is_initialized()
+        or torch.distributed.get_rank() == 0
+    )
@@
-    os.makedirs(args.output_dir, exist_ok=True)
-    print(f"Saving quantized checkpoint to {args.output_dir!r} ...")
+    if is_rank0:
+        os.makedirs(args.output_dir, exist_ok=True)
+        print(f"Saving quantized checkpoint to {args.output_dir!r} ...")
@@
-        with torch.inference_mode():
-            model.save_pretrained(args.output_dir)
-        processor.save_pretrained(args.output_dir)
-        model.config.save_pretrained(args.output_dir)
+        if is_rank0:
+            with torch.inference_mode():
+                model.save_pretrained(args.output_dir)
+            processor.save_pretrained(args.output_dir)
+            model.config.save_pretrained(args.output_dir)
@@
-        with torch.inference_mode():
-            model.save_pretrained(args.output_dir)
-
-        processor.save_pretrained(args.output_dir)
-        model.config.save_pretrained(args.output_dir)
-
-        quant_cfg = get_quant_config(model)
-        with open(os.path.join(args.output_dir, "hf_quant_config.json"), "w") as f:
-            json.dump(quant_cfg, f)
+        if is_rank0:
+            with torch.inference_mode():
+                model.save_pretrained(args.output_dir)
+            processor.save_pretrained(args.output_dir)
+            model.config.save_pretrained(args.output_dir)
+            quant_cfg = get_quant_config(model)
+            with open(os.path.join(args.output_dir, "hf_quant_config.json"), "w") as f:
+                json.dump(quant_cfg, f)
+
+    if torch.distributed.is_available() and torch.distributed.is_initialized():
+        torch.distributed.barrier()

As per coding guidelines "Develop with distributed processing in mind ... guarding shared side effects such as file writes ... against race conditions between ranks."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/alpamayo/quantize.py` around lines 655 - 681, The saves under both
the real_quant and else branches (mtq.compress + model.save_pretrained,
processor.save_pretrained, model.config.save_pretrained, and writing
hf_quant_config.json from get_quant_config) must be guarded so only the main
process/rank 0 performs file writes in distributed runs; detect distributed mode
via torch.distributed.is_initialized() and torch.distributed.get_rank() (or the
project’s existing is_main_process helper) and wrap all save_pretrained calls
and the json file write in a conditional that only runs on rank 0 to avoid
concurrent writes to args.output_dir.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/alpamayo/quantize.py`:
- Around line 346-355: The function read_clip_ids_from_parquet currently indexes
cols_lower["key"] which can KeyError; update it to search cols_lower for common
column names (e.g., "key", "clip_id", "clipid", "id", "filename") and use the
first match if present, otherwise fall back to using df.index.astype(str). After
obtaining the list, convert to strings and deduplicate while preserving first
occurrence order (use a seen set/ordered iteration); reference
read_clip_ids_from_parquet and the cols_lower variable to locate the change.
- Around line 27-55: Add an explicit __all__ list at module top to define the
public API; create a single-line assignment __all__ = [...] after the imports
that enumerates the module's public helpers and re-exports (e.g.,
"load_physical_aiavdataset", "AlpamayoR1", "to_special_token",
"create_forward_loop", "get_dataset_dataloader", "export_hf_checkpoint",
"get_quant_config", "mtq", "patch_pretrained_methods", "logger" — adjust to the
actual names this module exposes) and keep it updated when symbols change so
star-import behavior is explicit.
- Around line 500-512: The prints in the hot loop (inside forward_step's
tensor-stat print and loss_func) call .item() on tensors every iteration,
causing unwanted CUDA synchronizations; wrap these debug prints so they only run
when args.debug (and optionally only on the main rank) is true, and compute
.item() values only inside that conditional; update the debug logging in
forward_step (the v_pred/v_target print) and in loss_func (loss print) to be
gated by args.debug and a rank check (e.g., only on rank 0 or when not
distributed) so normal runs avoid per-batch .item() calls and GPU syncs.

---

Nitpick comments:
In `@examples/alpamayo/quantize.py`:
- Around line 655-681: The saves under both the real_quant and else branches
(mtq.compress + model.save_pretrained, processor.save_pretrained,
model.config.save_pretrained, and writing hf_quant_config.json from
get_quant_config) must be guarded so only the main process/rank 0 performs file
writes in distributed runs; detect distributed mode via
torch.distributed.is_initialized() and torch.distributed.get_rank() (or the
project’s existing is_main_process helper) and wrap all save_pretrained calls
and the json file write in a conditional that only runs on rank 0 to avoid
concurrent writes to args.output_dir.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81978cc3-c66c-4b24-8be8-fe303abf7d0d

📥 Commits

Reviewing files that changed from the base of the PR and between 905259f and 6cc978e.

⛔ Files ignored due to path filters (1)

examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet

📒 Files selected for processing (2)

examples/alpamayo/README.md
examples/alpamayo/quantize.py

codecov · 2026-06-01T23:03:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.78%. Comparing base (01dec93) to head (bfcb034).

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1594       +/-   ##
===========================================
+ Coverage   62.63%   77.78%   +15.15%     
===========================================
  Files         481      482        +1     
  Lines       52893    52961       +68     
===========================================
+ Hits        33127    41196     +8069     
+ Misses      19766    11765     -8001

Flag	Coverage Δ
examples	`43.12% <ø> (+7.76%)`	⬆️
unit	`53.92% <ø> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

cjluo-nv

Bot review — DM the bot to share feedback.

New ~700-line example. A few real concerns:

Likely bug — backslash-escaped special tokens. In create_message, the f-strings use <\|traj_history_start\|>, <\|traj_history\|>, <\|traj_history_end\|>, <\|cot_start\|>. In Python \| is not a recognized escape sequence, so the literal string contains a backslash before each pipe. Qwen-family special tokens are <|...|> without backslashes — passing the backslash form to apply_chat_template will not match the tokenizer's added special tokens and these will tokenize as ordinary text, silently corrupting calibration. Looks like a copy-paste from a markdown-escaped source. (Bonus: these will start emitting SyntaxWarning: invalid escape sequence on newer Python.)
Reaches into private API instead of using the public one. enable_huggingface_checkpointing_patch() re-implements mto.enable_huggingface_checkpointing() by importing the private _LIBRARY_CLASSES_FOR_PATCHING, _PATCHED_CLASSES, and patch_pretrained_methods from modelopt.torch.opt.plugins.huggingface, just to skip the _from_config patch. If the only goal is to skip _from_config, please make that an option on the public API (or document why this divergence is needed) instead of duplicating the loop with private symbols. Also, the file does this at import time as a top-level side effect, which is surprising.
Monkey-patching AlpamayoR1.teacher_forced_flow_loss_forward at module import time is acknowledged in a docstring but it's a strong coupling: the body is a verbatim copy of an internal training-fork method, and any drift in nvlabs/alpamayo (e.g. action_in_proj signature, expert_non_causal_attention, action_space API) will silently produce wrong calibration. At minimum: assert that the public AlpamayoR1 exposes the methods/attrs this body relies on, and pin a known-good Alpamayo SHA in the README setup section instead of git clone of main.
read_clip_ids_from_parquet doesn't match its docstring. The docstring says it "tries common column names; falls back to index if needed", but the implementation hard-codes cols_lower["key"] and will KeyError on any parquet without a literal key column. Either implement the documented fallback or simplify the docstring.
Binary parquet checked into the repo. A .parquet calibration set is being added to examples/alpamayo/. The repo doesn't appear to use git-LFS (.gitattributes is missing). Even if this file is small, distributing calibration data through the example tree is unusual for this repo — most examples download from HF on demand. Please confirm the file size is small and that there are no licensing/redistribution restrictions on the AV calibration clips list, or fetch it at runtime instead.
No tests / no CHANGELOG entry. The PR template's test and changelog checkboxes are still unfilled. New examples in this repo (vlm_ptq, etc.) typically don't get unit tests, but at least a CHANGELOG entry under "New Features" is expected.
Minor: args.parquet default is the bare filename, joined to script_dir; if the user passes an absolute path it works (pathlib handles that), but a plain relative path in the user's CWD will be silently resolved against the script's directory, which is surprising. Consider documenting this or using Path(args.parquet) as-is when absolute/exists, else fall back to script_dir / args.parquet.

github-actions · 2026-06-02T21:09:10Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1594/
Built to branch `gh-pages` at 2026-06-05 00:30 UTC. Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)

examples/alpamayo/quantize.py (1)
257-257: 💤 Low value

Consider moving the patch call into main().

Executing patch_teacher_forced_flow_loss_forward() at module import is a side effect that runs even if users only import a helper function. Moving this call into main() keeps imports side-effect-free and aligns with the principle applied to the HF checkpointing setup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/alpamayo/quantize.py` at line 257, The top-level call to
patch_teacher_forced_flow_loss_forward() causes a side effect at import; remove
that call from module scope and invoke patch_teacher_forced_flow_loss_forward()
from inside main() (e.g., at the start of the main() function before any
training or model code runs) so imports remain side-effect-free; reference the
existing patch_teacher_forced_flow_loss_forward function and ensure the patch is
applied once (idempotently) before any code that relies on the patched behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/opt/plugins/huggingface.py`:
- Around line 171-176: Validate the skip_methods input at the function boundary:
compute the set of valid method names by iterating _LIBRARY_CLASSES_FOR_PATCHING
and collecting the first element of each entry in the patch_methods lists, then
if any name in skip_methods is not in that valid set raise an error (e.g.
ValueError) instead of silently ignoring it; keep the existing logic that
converts skip_methods to a set and then continue to call
patch_pretrained_methods(cls, [m for m in patch_methods if m[0] not in
skip_methods]) for each class, while referencing skip_methods,
_LIBRARY_CLASSES_FOR_PATCHING, patch_pretrained_methods, and _PATCHED_CLASSES to
locate and update the code.

---

Nitpick comments:
In `@examples/alpamayo/quantize.py`:
- Line 257: The top-level call to patch_teacher_forced_flow_loss_forward()
causes a side effect at import; remove that call from module scope and invoke
patch_teacher_forced_flow_loss_forward() from inside main() (e.g., at the start
of the main() function before any training or model code runs) so imports remain
side-effect-free; reference the existing patch_teacher_forced_flow_loss_forward
function and ensure the patch is applied once (idempotently) before any code
that relies on the patched behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eecdd970-10bf-41e4-a1ca-8f334dd1dacd

📥 Commits

Reviewing files that changed from the base of the PR and between 6cc978e and 0a9c5ca.

⛔ Files ignored due to path filters (1)

examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet

📒 Files selected for processing (4)

CHANGELOG.rst
examples/alpamayo/README.md
examples/alpamayo/quantize.py
modelopt/torch/opt/plugins/huggingface.py

✅ Files skipped from review due to trivial changes (2)

examples/alpamayo/README.md
CHANGELOG.rst

cjluo-nv

Bot review — DM the bot to share feedback.

Several critical points from the previous round are addressed (backslash special tokens fixed, private-API duplication replaced with a public skip_methods arg, README now pins an Alpamayo SHA, CHANGELOG entry added, parquet docstring rewritten to match the code), but a few correctness/UX items raised earlier are still open in this commit:

💬 forward_step / loss_func debug print(...item()...) calls (raised by CodeRabbit and not addressed) — they still run unconditionally on every AutoQuantize iteration and force a per-batch CPU↔GPU sync. args.debug is even hardcoded to True at L573, so there's no way to turn them off without editing the script.
💬 auto_quantize_model disabled_layers (raised by cjluo-nv) still appends bare module names (_name) while quantize_model uses f"{_name}.*". auto_quantize's disabled_layers are fnmatch-matched against quantizer names like vlm.foo.bar.weight_quantizer, so the bare-name entries here will not match — the not-multiple-of-16 Linears that were the motivation for this exclusion are still being included in the AutoQuantize search. This is a real bug, not a style nit.
💬 Module-level patch_teacher_forced_flow_loss_forward() at import (cjluo-nv): a hasattr-on-the-class guard was added, but the request was for hasattr checks on the attributes the patch body depends on (action_in_proj, expert_non_causal_attention, fuse_traj_tokens, action_space.traj_to_action, vlm.model.rope_deltas, …) so upstream drift fails loudly instead of silently producing wrong calibration. Still not added, and the call is still a top-level side effect.
New (PR-internal) issue: enable_huggingface_checkpointing(skip_methods=...) silently ignores unknown method names — a typo in {"_from_config"} would turn this escape hatch into a no-op without warning. Worth either validating against the registered method names or at least warning.
No tests / no example_tests.yml entry for the new example, and a binary .parquet calibration set is still committed in-tree (no .gitattributes / LFS in this repo). The other VLM examples fetch calibration data on demand; please confirm size + redistribution rights, or fetch at runtime.

meenchen · 2026-06-04T05:23:03Z

+Alpamayo using ModelOpt. Quantization calibration runs on a small dataset of 16
+AV clips (`0417_16rows_train_set_for_calibration_25.10.parquet`).


Is this a public dataset? I would prefer not to keep this in our repo.

The parquet file contains clip IDs (which can be publicly shared), the actual driving clips are on HF in the publicly available data nvidia/PhysicalAI-Autonomous-Vehicles

meenchen · 2026-06-04T05:50:10Z

+MSE between the action expert's predicted velocity field `v_pred` and the
+target `v_target = x_1 - x_0` from a teacher-forced forward pass on the
+calibration clips. Layers the loss is sensitive to keep more bits (FP8); the
+rest go to NVFP4.


Is there any inference runtime available for the quantized checkpoint?

Not that I'm aware of, there are just some PyTorch eval scripts lying around in different repos. I have validated accuracy of the FP8, NVFP4, and AutoQuant (6.5 bpe) checkpoints.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

♻️ Duplicate comments (1)

examples/alpamayo/quantize.py (1)

457-472: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the per-batch tensor stats from the AutoQuantize hot path.

These print(...item()...) calls run on every search step. On CUDA they force repeated CPU-GPU syncs and add a lot of stdout overhead. Either remove them from the steady-state loop or gate them behind an explicit debug + rank-0 path.

⚡ Proposed fix

     def forward_step(runtime_model, data):
         with torch.autocast("cuda", dtype=torch.bfloat16):
-            out = runtime_model.teacher_forced_flow_loss_forward(data=data)
-        v_pred, v_target = out["v_pred"], out["v_target"]
-        print(
-            f"[autoquant-fwd] v_pred: finite={torch.isfinite(v_pred).all().item()} "
-            f"min={v_pred.min().item():.4g} max={v_pred.max().item():.4g} "
-            f"abs_mean={v_pred.abs().mean().item():.4g} | "
-            f"v_target: finite={torch.isfinite(v_target).all().item()} "
-            f"min={v_target.min().item():.4g} max={v_target.max().item():.4g}"
-        )
-        return out
+            return runtime_model.teacher_forced_flow_loss_forward(data=data)
 
     def loss_func(output, batch):
-        loss = torch.nn.functional.mse_loss(output["v_pred"], output["v_target"])
-        print(f"[autoquant-loss] loss={loss.item():.6g} finite={torch.isfinite(loss).item()}")
-        return loss
+        return torch.nn.functional.mse_loss(output["v_pred"], output["v_target"])

As per coding guidelines "Keep tensor work on the GPU and avoid unnecessary CPU-GPU syncs; use PyTorch tensor ops by default and only extract Python scalars when the CPU needs the value" and "Develop with distributed processing in mind: use print_rank_0 or warn_rank_0 when possible and guard shared side effects against race conditions between ranks".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/alpamayo/quantize.py` around lines 457 - 472, The per-batch print
statements in forward_step and loss_func (inside forward_step and loss_func)
force CPU-GPU syncs and should be removed from the steady-state hot path; either
delete these prints or wrap them behind an explicit debug flag and a rank-0
guard (e.g., if debug and is_rank_0()) so they run only when requested, and
avoid calling .item() on GPU tensors inside the loop (use tensor ops or move a
small sampled/logging batch to CPU outside the hot path). Ensure any retained
logging uses a rank-0-safe helper (print_rank_0/warn_rank_0) and runs rarely
(not every step).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/alpamayo/quantize.py`:
- Around line 355-363: The fallback calibration path hardcodes "cuda:0" when
calling get_dataset_dataloader which causes device-mismatch if the model lives
elsewhere; change it to derive the device from the model (or use the same device
parameter passed into quantize_model) instead of the literal string. Locate the
branch that checks calibration_forward_loop is None and replace the device
argument for get_dataset_dataloader with the model's device (e.g. infer via
model.device or next(model.parameters()).device) or a provided device variable
so calib_dataloader ends up on the same device as the model used by
quantize_model.

---

Duplicate comments:
In `@examples/alpamayo/quantize.py`:
- Around line 457-472: The per-batch print statements in forward_step and
loss_func (inside forward_step and loss_func) force CPU-GPU syncs and should be
removed from the steady-state hot path; either delete these prints or wrap them
behind an explicit debug flag and a rank-0 guard (e.g., if debug and
is_rank_0()) so they run only when requested, and avoid calling .item() on GPU
tensors inside the loop (use tensor ops or move a small sampled/logging batch to
CPU outside the hot path). Ensure any retained logging uses a rank-0-safe helper
(print_rank_0/warn_rank_0) and runs rarely (not every step).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d5962f0-996f-4a23-9264-bc78c251cdb8

📥 Commits

Reviewing files that changed from the base of the PR and between 0a9c5ca and 5a83ff3.

⛔ Files ignored due to path filters (1)

examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet

📒 Files selected for processing (4)

CHANGELOG.rst
examples/alpamayo/README.md
examples/alpamayo/quantize.py
modelopt/torch/opt/plugins/huggingface.py

✅ Files skipped from review due to trivial changes (2)

examples/alpamayo/README.md
CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/opt/plugins/huggingface.py

coderabbitai · 2026-06-04T21:25:11Z

+    if calibration_forward_loop is None:
+        if tokenizer is None:
+            raise ValueError("tokenizer must be provided when calibration_forward_loop is None")
+        calib_dataloader = get_dataset_dataloader(
+            tokenizer=tokenizer,
+            batch_size=32,
+            num_samples=512,
+            device="cuda:0",
+        )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't hardcode cuda:0 in the fallback calibration path.

This helper accepts an arbitrary model, but this branch always builds calibration batches on cuda:0. Reusing quantize_model() with a model on another device will hit a device mismatch before calibration starts.

🐛 Proposed fix

if calibration_forward_loop is None: if tokenizer is None: raise ValueError("tokenizer must be provided when calibration_forward_loop is None") + calib_device = str(next(model.parameters()).device) calib_dataloader = get_dataset_dataloader( tokenizer=tokenizer, batch_size=32, num_samples=512, - device="cuda:0", + device=calib_device, ) calibrate_loop = create_forward_loop(dataloader=calib_dataloader)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/alpamayo/quantize.py` around lines 355 - 363, The fallback calibration path hardcodes "cuda:0" when calling get_dataset_dataloader which causes device-mismatch if the model lives elsewhere; change it to derive the device from the model (or use the same device parameter passed into quantize_model) instead of the literal string. Locate the branch that checks calibration_forward_loop is None and replace the device argument for get_dataset_dataloader with the model's device (e.g. infer via model.device or next(model.parameters()).device) or a provided device variable so calib_dataloader ends up on the same device as the model used by quantize_model.

cjluo-nv

Bot review — DM the bot to share feedback.

Most critical issues from earlier rounds are now addressed: the backslash-escape bug in special tokens is fixed, the private-API duplication has been replaced by a real public skip_methods parameter on enable_huggingface_checkpointing, the README pins an Alpamayo SHA (4cda35d), a CHANGELOG entry was added, and the read_clip_ids_from_parquet docstring was rewritten to match the code. A few items still warrant a human reviewer's sign-off:

💬 Author marked the per-batch .item() debug prints in forward_step / loss_func (lines ~462–475) as "Invalid" without rationale. These still run unconditionally and force CPU↔GPU syncs every AutoQuantize step. Impact is bounded (16 clips × num_score_steps), but the rebuttal isn't substantiated — worth a quick human glance to confirm this is intended.
💬 Author marked the auto_quantize_model disabled_layers bare-name comment as "Invalid". I checked _AutoQuantizeBaseSearcher.insert_hparams_after_merge_rules — disabled_layers are fnmatched against name from model.named_modules(), i.e. module names like vlm.foo.bar, not quantizer names. So the bare names here are actually correct and the author's "Invalid" is right; the prior reviewer's note was mistaken. Just flagging so it doesn't get re-litigated.
💬 Module-level monkey-patch of AlpamayoR1.teacher_forced_flow_loss_forward (L257): author argues the pinned SHA in the README is sufficient defense against drift. The body still reaches into action_in_proj, expert_non_causal_attention, fuse_traj_tokens, action_space.traj_to_action, vlm.model.rope_deltas, etc. with no hasattr guards. Reasonable position given the SHA pin, but worth a human sanity check.
💬 meenchen asked whether the in-tree 0417_16rows_train_set_for_calibration_25.10.parquet is a public dataset and expressed a preference not to keep it in the repo. Author responded that the clip IDs are publicly shareable (clips themselves are on HF). meenchen hasn't explicitly accepted that resolution; this is the most important remaining sign-off item — the file is binary, no .gitattributes/LFS in the repo, and other VLM examples fetch calibration data on demand rather than committing it.
enable_huggingface_checkpointing(skip_methods=...) silently ignores unknown method names (a typo turns the escape hatch into a no-op). Author replied "only used for Alpamayo, won't be a no-op in practice", but this is now a public API surface and a one-line ValueError (or warning) on unknown names would be cheap insurance. Non-blocking.
No tests / no example_tests.yml entry. Consistent with how other VLM examples are structured in this repo, so probably fine.

Licensing: new Python files carry the standard NVIDIA Apache-2.0 header; no third-party code copied; no LICENSE/NOTICE changes. The parquet question above is a redistribution / repo-policy concern rather than a license concern.

cjluo-nv · 2026-06-04T23:19:38Z

On the `skip_methods` argument

I'd push back on adding skip_methods to enable_huggingface_checkpointing as the fix here. It works, but it's a fairly leaky/blunt escape hatch:

It's process-global. skip_methods={"_from_config"} disables _from_config restore for every class for the whole run, just to work around one model's constructor. Any legitimate AutoModel.from_config(...)-based restore elsewhere silently stops working.
It leaks internals into the public API. Callers now have to know that the private _from_config method is the thing to skip.
It interacts badly with the _PATCHED_CLASSES cache. Once enabled, a second enable_huggingface_checkpointing(skip_methods=...) call is a silent no-op (classes already in _PATCHED_CLASSES), so the skip set is effectively "first call wins" — a subtle footgun.

The underlying issue: AlpamayoR1.__init__ builds its action expert via AutoModel.from_config(...), which re-enters the patched _from_config restore machinery during the parent's __init__. That nested re-entry is what conflicts.

Two alternatives I'd suggest considering instead:

1. No library change — reorder in the example. The base checkpoint has no modelopt_state.pth, so the restore hook does nothing useful on the initial load; only save_pretrained needs patching here. So:

model = AlpamayoR1.from_pretrained(args.ckpt, ...)   # load clean, no patches active
mto.enable_huggingface_checkpointing()               # enable AFTER load
# quantize -> save_pretrained writes modelopt_state.pth

Does quantize.py ever rely on restore-during-load? Reloading the quantized checkpoint looks like a separate deployment step. If load-time restore isn't needed in this script, skip_methods can be dropped entirely.

2. If restore-on-load is needed — fix the re-entrancy in the library instead. Add a thread-local "restore in progress" guard in _patch_model_init_for_modelopt so only the outermost from_pretrained/_from_config triggers a restore, and nested sub-module construction skips it. That fixes the problem for any composite/VLM model that builds sub-models via AutoModel.from_config — not just Alpamayo — without exposing a per-caller knob or requiring knowledge of internal method names.

Happy to help draft the re-entrancy guard if option 2 sounds right.

Minor, unrelated: this PR commits a binary .parquet calibration file into the repo — worth double-checking whether that should be hosted/downloaded rather than checked into git, for repo-bloat reasons.

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

rohansjoshi requested a review from a team as a code owner June 1, 2026 22:49

rohansjoshi requested a review from jingyu-ml June 1, 2026 22:49

rohansjoshi requested a review from cjluo-nv June 1, 2026 22:50

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread examples/alpamayo/quantize.py Outdated

Comment thread examples/alpamayo/quantize.py

Comment thread examples/alpamayo/quantize.py

cjluo-nv reviewed Jun 1, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 2, 2026

View reviewed changes

rohansjoshi force-pushed the rohjoshi/alpamayo-example branch from 6cc978e to 0a9c5ca Compare June 2, 2026 21:04

rohansjoshi requested a review from a team as a code owner June 2, 2026 21:04

rohansjoshi requested a review from meenchen June 2, 2026 21:04

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread modelopt/torch/opt/plugins/huggingface.py Outdated

cjluo-nv reviewed Jun 4, 2026

View reviewed changes

Comment thread examples/alpamayo/quantize.py

Comment thread examples/alpamayo/quantize.py

Comment thread examples/alpamayo/quantize.py

Comment thread modelopt/torch/opt/plugins/huggingface.py

Comment thread examples/alpamayo/quantize.py

Comment thread examples/alpamayo/quantize.py

meenchen reviewed Jun 4, 2026

View reviewed changes

rohansjoshi force-pushed the rohjoshi/alpamayo-example branch from 0a9c5ca to 5a83ff3 Compare June 4, 2026 21:17

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

cjluo-nv reviewed Jun 4, 2026

View reviewed changes

rohansjoshi added 3 commits June 5, 2026 00:21

First commit

43503cc

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

Fixed real quant path

470baa1

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

Added README and calibration clips

3514a2e

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

rohansjoshi force-pushed the rohjoshi/alpamayo-example branch from 5a83ff3 to 7886c59 Compare June 5, 2026 00:27

Addressed review comments

bfcb034

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

rohansjoshi force-pushed the rohjoshi/alpamayo-example branch from 7886c59 to bfcb034 Compare June 5, 2026 00:52

		Alpamayo using ModelOpt. Quantization calibration runs on a small dataset of 16
		AV clips (`0417_16rows_train_set_for_calibration_25.10.parquet`).

Conversation

rohansjoshi commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-06-05 00:30 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

meenchen Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

rohansjoshi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

meenchen Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

rohansjoshi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

cjluo-nv commented Jun 4, 2026

rohansjoshi commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

codecov Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-06-05 00:30 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

On the `skip_methods` argument