Commit 361f7e3

Authored by danielkorzekwa, kevalmorabia97, LianaMikael, j-rausch, and claude
Merge puzzletron compression algorithm (#1121)
### What does this PR do?

Implement the Puzzletron compression algorithm based on the Puzzle paper (https://arxiv.org/abs/2411.19146).

<details>
<summary>The list of reviewed and merged MRs that resulted in the feature/puzzletron branch</summary>

Merging dkorzekwa/any_model to feature/puzzletron:

- [Add anymodel directories to feature/puzzletron by danielkorzekwa · Pull Request #974 · NVIDIA/Model-Optimizer](#974) - merged
- [Draft: anymodel activation scoring by danielkorzekwa · Pull Request #989 · NVIDIA/Model-Optimizer](#989) - merged
- [Draft: Merge anymodel pruning by danielkorzekwa · Pull Request #990 · NVIDIA/Model-Optimizer](#990) - merged
- [Draft: Merging anymodel:build_library_and_stats by danielkorzekwa · Pull Request #993 · NVIDIA/Model-Optimizer](#993) - merged
- [Dkorzekwa/any model calc one block scores by danielkorzekwa · Pull Request #994 · NVIDIA/Model-Optimizer](#994) - merged
- [Draft: merge any_model: mip_and_realize_models by danielkorzekwa · Pull Request #995 · NVIDIA/Model-Optimizer](#995) - merged
- [Dkorzekwa/any model other models by danielkorzekwa · Pull Request #1007 · NVIDIA/Model-Optimizer](#1007) - merged
- PR to 1007: #1039 - merged
- [Dkorzekwa/anymodel gptoss by danielkorzekwa · Pull Request #1020 · NVIDIA/Model-Optimizer](#1020) - merged
- [Merge any_model tutorial by danielkorzekwa · Pull Request #1035 · NVIDIA/Model-Optimizer](#1035) - merged
- [Merge mbridge distillation for any_model by danielkorzekwa · Pull Request #1036 · NVIDIA/Model-Optimizer](#1036) - merged
- [MR branch for the remaining difference between dkorzekwa/any_model an… by danielkorzekwa · Pull Request #1047 · NVIDIA/Model-Optimizer](#1047) - merged
- [Dkorzekwa/decilm hf code cleanup by danielkorzekwa · Pull Request #1071 · NVIDIA/Model-Optimizer](#1071) - merged
- [Dkorzekwa/decilm hf code cleanup 2 by danielkorzekwa · Pull Request #1073 · NVIDIA/Model-Optimizer](#1073) - merged
- [Dkorzekwa/anymodel subblock stats by danielkorzekwa · Pull Request #1085 · NVIDIA/Model-Optimizer](#1085) - merged
- [Dkorzekwa/anymodel subblock stats nodecilm by danielkorzekwa · Pull Request #1102 · NVIDIA/Model-Optimizer](#1102) - merged
- [Dkorzekwa/decilm cleanup post subblockstats by danielkorzekwa · Pull Request #1103 · NVIDIA/Model-Optimizer](#1103) - merged
- [code clean up by danielkorzekwa · Pull Request #1110 · NVIDIA/Model-Optimizer](#1110) - merged

Merging into main:

- [Activation hooks redesign (reuse hooks component across both minitron and puzzletron) by danielkorzekwa · Pull Request #1022 · NVIDIA/Model-Optimizer](#1022) - merged
- [Dkorzekwa/puzzletron use importance hooks from prune by danielkorzekwa · Pull Request #1115 · NVIDIA/Model-Optimizer](#1115) - merged

</details>

### Usage

Puzzletron tutorial: https://github.com/NVIDIA/Model-Optimizer/tree/feature/puzzletron/examples/puzzletron

### Testing

The main e2e test for compressing 9 models with Puzzletron: https://github.com/NVIDIA/Model-Optimizer/blob/feature/puzzletron/tests/gpu/torch/puzzletron/test_puzzletron.py

2-gpu nightly tests:

- https://github.com/NVIDIA/Model-Optimizer/actions/runs/24468209205/job/71501061203
- https://github.com/NVIDIA/Model-Optimizer/actions/runs/24470214159/job/71508152952

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

## Summary by CodeRabbit

* **New Features**
  * Added Puzzletron: end-to-end heterogeneous pruning & NAS workflow with AnyModel support, example pipelines, deployment and evaluation utilities, and tools for converting/pruning and exporting compressed checkpoints.
* **Documentation**
  * Comprehensive Puzzletron tutorials, model-specific guides, evaluator instructions, example configs, and changelog entry.
* **Chores**
  * CI/workflow updates (extras installation, longer GPU test timeout), pre-commit hook exclusion updated, and CODEOWNERS entries added.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Liana Mikaelyan <lmikaelyan@nvidia.com>
Signed-off-by: Liana Mikaelyan <45925959+LianaMikael@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>
Signed-off-by: jrausch <jrausch@nvidia.com>
Signed-off-by: root <root@pool0-00848.cm.cluster>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Liana Mikaelyan <lmikaelyan@nvidia.com>
Co-authored-by: Liana Mikaelyan <45925959+LianaMikael@users.noreply.github.com>
Co-authored-by: J Rausch <38429553+j-rausch@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent dec2952 commit 361f7e3

235 files changed

Lines changed: 24885 additions & 166 deletions

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
 modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
+modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
 modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
+/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners

.github/workflows/_example_tests_runner.yml

Lines changed: 2 additions & 1 deletion
@@ -48,6 +48,7 @@ jobs:
       - name: Install dependencies
         run: |
           # use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
+          pip uninstall -y nvidia-modelopt
           python -m pip install ".${{ inputs.pip_install_extras }}"

           if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@ jobs:
           COVERAGE_FILE: ${{ github.workspace }}/.coverage
         run: |
           echo "Running tests for: ${{ inputs.example }}"
-          pytest tests/examples/${{ inputs.example }} --cov
+          python -m pytest tests/examples/${{ inputs.example }} --cov
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v5
         with:

.github/workflows/example_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -132,7 +132,7 @@ jobs:
       docker_image: "nvcr.io/nvidia/nemo:26.02"
       example: ${{ matrix.example }}
       timeout_minutes: 30
-      pip_install_extras: "[hf,dev-test]"
+      pip_install_extras: "[hf,puzzletron,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-1

   nemo-non-pr:
@@ -144,7 +144,7 @@ jobs:
       docker_image: "nvcr.io/nvidia/nemo:26.02"
       example: ${{ matrix.example }}
       timeout_minutes: 30
-      pip_install_extras: "[hf,dev-test]"
+      pip_install_extras: "[hf,puzzletron,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-2

   ##### ONNX/TensorRT Example Tests #####

.github/workflows/gpu_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ jobs:
       matrix:
         include:
           - example: gpu
-            timeout: 45
+            timeout: 60
             container_image: pytorch:26.01-py3
           # tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
           - example: gpu-regression

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ repos:
             modelopt/onnx/quantization/ort_patching.py|
             modelopt/torch/_deploy/utils/onnx_utils.py|
             modelopt/torch/export/transformer_engine.py|
+            modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
             modelopt/torch/quantization/export_onnx.py|
             modelopt/torch/quantization/plugins/attention.py|
             modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|
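
The added entry extends what looks like a `|`-separated exclude pattern. pre-commit treats `exclude` as a Python regular expression that is searched against each file path. The lines that open and close the pattern are not shown in this hunk, so the verbose-mode `(?x)^( ... )$` wrapper below is an assumption, and only three of the entries are reproduced. A minimal sketch of how such a pattern decides which files a hook skips:

```python
import re

# Assumption: the entries from the hunk sit inside a verbose-mode alternation.
# In verbose mode, whitespace and newlines inside the pattern are ignored.
EXCLUDE = re.compile(
    r"""(?x)^(
        modelopt/torch/export/transformer_engine.py|
        modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
        modelopt/torch/quantization/export_onnx.py
    )$"""
)


def is_excluded(path: str) -> bool:
    """Return True if the hook would skip *path* under this pattern."""
    return EXCLUDE.search(path) is not None


excluded = is_excluded(
    "modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py"
)
not_excluded = is_excluded("modelopt/torch/puzzletron/__init__.py")
```

Because the pattern is anchored with `^` and `$`, only the exact file listed is skipped; other files under `modelopt/torch/puzzletron` are still checked.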

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ Changelog
 **New Features**

 - Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
+- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
 - Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
 - Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.

docs/source/conf.py

Lines changed: 9 additions & 0 deletions
@@ -31,6 +31,7 @@
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))

+import contextlib
 import os
 import sys

@@ -44,6 +45,14 @@
 sys.path.insert(0, os.path.abspath("../../"))
 sys.path.append(os.path.abspath("./_ext"))

+# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
+# autodoc_mock_imports. Mocking triton/tensorrt_llm at the Sphinx level can break
+# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
+# to fail inside autosummary. Importing here — while the real packages are still on
+# sys.path — avoids that problem entirely.
+with contextlib.suppress(Exception):
+    import modelopt.torch  # noqa: F401
+
 # -- Project information -----------------------------------------------------

 project = "Model Optimizer"  # pylint: disable=C0103
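
The comment added to `conf.py` relies on two standard Python behaviours: a successfully imported module stays cached in `sys.modules`, and `contextlib.suppress` lets a failed import pass silently. A minimal sketch of both, using a hypothetical `preimport` helper and a stand-in module name (neither appears in the commit):

```python
import contextlib
import importlib
import sys
import types


def preimport(name: str) -> bool:
    """Try to import *name*; report whether it ended up in sys.modules."""
    with contextlib.suppress(Exception):
        importlib.import_module(name)
    return name in sys.modules


# Simulate an installed package by registering a stand-in module (hypothetical name).
stub = types.ModuleType("hypothetical_pkg")
sys.modules["hypothetical_pkg"] = stub

cached_ok = preimport("hypothetical_pkg")
missing_ok = preimport("hypothetical_pkg_that_does_not_exist")

# A later import returns the cached object instead of re-running the import.
same_object = importlib.import_module("hypothetical_pkg") is stub
```

If the guarded import fails, nothing is cached and the docs build proceeds as it did before the change, which is presumably why the commit suppresses the broad `Exception` rather than only `ImportError`.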

examples/llm_eval/README.md

Lines changed: 16 additions & 0 deletions
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
     --batch_size 4
 ```

+### Heterogeneous Pruned Checkpoints (Puzzletron)
+
+Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:
+
+```sh
+python lm_eval_hf.py --model hf \
+    --model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
+    --tasks mmlu \
+    --num_fewshot 5 \
+    --batch_size 4
+```
+
+For a quick smoke test, add `--limit 10`.
+
+> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).
+
 ### Quantized (simulated)

 - For simulated quantization with any of the default quantization formats:
examples/llm_eval/lm_eval_hf.py

Lines changed: 51 additions & 2 deletions
@@ -36,11 +36,19 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import contextlib
 import warnings

 import datasets
+import lm_eval
 from lm_eval import utils
 from lm_eval.__main__ import cli_evaluate, parse_eval_args, setup_parser
+
+if not lm_eval.__version__.startswith("0.4.8"):
+    warnings.warn(
+        f"lm_eval_hf.py is tested with lm-eval 0.4.8; found {lm_eval.__version__}. "
+        "Later versions may have incompatible API changes."
+    )
 from lm_eval.api.model import T
 from lm_eval.models.huggingface import HFLM
 from quantization_utils import quantize_model
@@ -50,9 +58,29 @@
 from modelopt.torch.quantization.utils import is_quantized
 from modelopt.torch.sparsity.attention_sparsity.conversion import is_attn_sparsified

+try:
+    import modelopt.torch.puzzletron as mtpz
+
+    _ANYMODEL_AVAILABLE = True
+except ImportError:
+    _ANYMODEL_AVAILABLE = False
+
+
+def _anymodel_patcher_context(pretrained, trust_remote_code=False):
+    """Return a deci_x_patcher context if *pretrained* is a Puzzletron checkpoint, else a no-op."""
+    if not _ANYMODEL_AVAILABLE or not pretrained:
+        return contextlib.nullcontext()
+    try:
+        descriptor = mtpz.anymodel.resolve_descriptor_from_pretrained(
+            pretrained, trust_remote_code=trust_remote_code
+        )
+    except (ValueError, AttributeError):
+        return contextlib.nullcontext()
+    return mtpz.anymodel.deci_x_patcher(model_descriptor=descriptor)
+

 def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
-    """Overrides the HFLM.create_from_arg_obj"""
+    """Override HFLM.create_from_arg_obj to add quantization, sparsity, and Puzzletron support."""

     quant_cfg = arg_dict.pop("quant_cfg", None)
     auto_quantize_bits = arg_dict.pop("auto_quantize_bits", None)
@@ -72,7 +100,10 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
     # Enable automatic save/load of modelopt state huggingface checkpointing
     mto.enable_huggingface_checkpointing()

-    model_obj = cls(**arg_dict, **additional_config)
+    with _anymodel_patcher_context(
+        arg_dict.get("pretrained"), arg_dict.get("trust_remote_code", False)
+    ):
+        model_obj = cls(**arg_dict, **additional_config)
     model_obj.tokenizer.padding_side = "left"
     if is_quantized(model_obj.model):
         # return if model is already quantized
@@ -109,10 +140,28 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
     return model_obj


+def create_from_arg_string(
+    cls: type[T], arg_string: str, additional_config: dict | None = None
+) -> T:
+    """Override HFLM.create_from_arg_string to support Puzzletron checkpoints."""
+    args = utils.simple_parse_args_string(arg_string)
+    additional_config = {} if additional_config is None else additional_config
+    args2 = {k: v for k, v in additional_config.items() if v is not None}
+
+    mto.enable_huggingface_checkpointing()
+
+    with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
+        model_obj = cls(**args, **args2)
+
+    return model_obj
+
+
 HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)
+HFLM.create_from_arg_string = classmethod(create_from_arg_string)


 def setup_parser_with_modelopt_args():
+    """Extend the lm-eval argument parser with ModelOpt quantization and sparsity options."""
     parser = setup_parser()
     parser.add_argument(
         "--quant_cfg",
examples/megatron_bridge/README.md
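The `_anymodel_patcher_context` helper in `lm_eval_hf.py` follows a common pattern: return a real context manager when a checkpoint is recognised, and `contextlib.nullcontext()` otherwise, so the caller can always write a single `with` block. A minimal self-contained sketch of that pattern follows. The resolver, the `-puzzle` suffix rule, and the recording patcher are hypothetical stand-ins rather than the Puzzletron API, and the availability flag from the commit is left out for brevity:

```python
import contextlib


@contextlib.contextmanager
def recording_patcher(events, descriptor):
    """Stand-in for a real patcher: records when the patch is active."""
    events.append(f"enter:{descriptor}")
    try:
        yield
    finally:
        events.append(f"exit:{descriptor}")


def resolve_descriptor(path):
    """Hypothetical resolver: only paths ending in '-puzzle' are recognised."""
    if not path.endswith("-puzzle"):
        raise ValueError(f"not a compressed checkpoint: {path}")
    return "demo"


def patcher_context(path, events):
    """Return a patching context for recognised checkpoints, else a no-op."""
    if not path:
        return contextlib.nullcontext()
    try:
        descriptor = resolve_descriptor(path)
    except (ValueError, AttributeError):
        return contextlib.nullcontext()
    return recording_patcher(events, descriptor)


events = []
with patcher_context("ckpt-puzzle", events):
    events.append("load")  # runs while the patch is active
with patcher_context("plain-ckpt", events):
    events.append("load-plain")  # runs under the no-op context
```

Keeping the fallback inside the helper means the model-loading call sites stay identical for standard and compressed checkpoints, which matches how both `create_from_arg_obj` and `create_from_arg_string` use it in the diff.
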

Lines changed: 24 additions & 4 deletions
@@ -46,6 +46,9 @@ Note that the default dataset for pruning and quantization is [`nemotron-post-tr
 hf auth login --token <your token>
 ```

+> [!WARNING]
+> Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers.
+
 ## Pruning

 This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).
@@ -92,7 +95,7 @@ This section shows how to distill a student model from a teacher model in the Me

 This can be used stand-alone or after [Pruning](#pruning) / [Post-Training Quantization](#post-training-quantization) to recover accuracy of the model by distilling from the original model (teacher).

-The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
+The [distill.py](distill.py) script supports both standard HuggingFace checkpoints and [Puzzletron AnyModel](../puzzletron/README.md) checkpoints as student/teacher inputs. Just pass the checkpoint path via `--student_hf_path` / `--teacher_hf_path`. The distilled model is saved to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.

 ### Data Preparation

@@ -158,9 +161,22 @@ torchrun --nproc_per_node 8 distill.py \

 To run the distillation script on a Slurm cluster for multi-node training, you just need use `python` instead of `torchrun` and set the number of nodes using `#SBATCH --nodes=<num_nodes>` clause in your Slurm script.

-### Convert Megatron checkpoint to Hugging Face format
+### Converting to Hugging Face format (optional)
+
+The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:
+
+**Inline** -- add `--hf_export_path` and `--student_hf_model` to the `distill.py` command to automatically convert the final checkpoint after distillation:
+
+```bash
+torchrun --nnodes 1 --nproc_per_node 8 distill.py \
+    ... \
+    --hf_export_path /path/to/save/distilled_hf_ckpt \
+    --student_hf_model Qwen/Qwen3-4B
+```
+
+`--student_hf_model` should match the base architecture of the student (used as a template for export). For non-Puzzletron (i.e. standard) models, it should be same as `--student_hf_path`.

-To convert the Megatron checkpoint from last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from `prune_minitron.py` script) and the distilled megatron checkpoint dir (`<distill_output_dir>/checkpoints/iter_<iter_number>`) to run the following command:
+**Separate conversion** -- convert any saved iteration using the Megatron-Bridge conversion script:

 ```bash
 uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
@@ -169,7 +185,11 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
     --hf-path <path_to_save_distilled_hf_ckpt>
 ```

-For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
+For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
+
+### Distillation Results
+
+See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.

 ## Post-Training Quantization
