Commit f0eaa19

Enable active-param and memory-based Minitron pruning constraints (#1377)
### What does this PR do?

**Type of change:** New feature, new tests, documentation.

OMNIML-4108: Extends the Minitron NAS pruner to support pruning by **active parameter count** (`active_params`) and **memory footprint** (`memory_mb`) in addition to the existing total parameter count (`params`) constraint. Also adds standalone utilities for analytical model stats.

#### Changes

**New pruning constraint keys**

- `active_params`: prune to a target number of active (routed) params — useful for MoE models where total ≫ active; when present, `active_params` is the **primary sort/display metric** for candidates (priority: `active_params` > `params` > `memory_mb`)
- `memory_mb`: prune to fit a memory budget (BF16 weights + KV-cache + Mamba state at a given sequence length and batch size)
- Constraints can be combined (AND logic), e.g. `{"params": 6e9, "memory_mb": 12288}`

**New standalone utilities** (`modelopt.torch.nas.plugins.megatron_model_stats`)

- `mcore_param_count`: analytically computes total and active parameter counts for GPT and Mamba/hybrid MCore models
- `mcore_memory_footprint_mb`: estimates memory in MB (weights + KV-cache + Mamba state)
- `print_mcore_model_stats`: rich-formatted model stats panel

**Rich-formatted pruning logs** — search space, top-k candidate tables, and best subnet panel printed on rank 0.

**`prune_score_func` format update** — now `mmlu_<N>pct_bs<bs>` (e.g. `mmlu_10pct_bs32`) to explicitly control the batch size for MMLU evaluation; the old `mmlu_<N>pct` format is removed.

**Infrastructure**

- NeMo container bumped to `nvcr.io/nvidia/nemo:26.04` in CI and docs
- Added `examples/megatron_bridge/requirements.txt` with `transformers<5.0` (required for saving some Nemotron-3-Nano models)

### Usage

```python
# Prune to 3B active params (MoE-aware) — active_params is the primary sort metric
mtp.prune(model, mode=[("mcore_minitron", ss_config)], constraints={"active_params": 3e9}, config=pruning_config)

# Prune to fit a 12 GB memory budget
mtp.prune(model, mode=[("mcore_minitron", ss_config)], constraints={"memory_mb": 12288}, config=pruning_config)
```

### Testing

Pruned Nemotron-3-Nano-30B-A3B (31.6B total, 3.6B active) --> 3.0B active. Takes <1 hr on 8x H100 (more details in #1376).

```bash
torchrun --nproc_per_node 8 examples/megatron_bridge/prune_minitron.py \
    --pp_size 8 \
    --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust_remote_code \
    --prune_target_params 28e9 \
    --prune_target_active_params 3e9 \
    --hparams_to_skip num_attention_heads \
    --seq_length 8192 \
    --output_hf_path pruned/Nemotron-3-Nano-30B-A3B-Pruned-28B-A3B-top20-max15depth-max30width-mmlu_10pct_bs32 \
    --top_k 20 \
    --max_depth_pruning 0.15 \
    --max_width_pruning 0.30 \
    --prune_score_func mmlu_10pct_bs32 \
    --num_layers_in_first_pipeline_stage 5 \
    --num_layers_in_last_pipeline_stage 5
```

```
Original Model Stats
  Total Parameters                               31.58B
  Active Parameters                              3.58B
  Memory (BF16, seq_length=8192, batch_size=1)   weights: 60230.1 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 60301.9 MB

Top 20 Candidates with Scores (export_config columns shown; all candidates have identical keys)
 #   num_layers  hidden_size  mamba_num_heads  mamba_head_dim  num_moe_experts  moe_ffn_hidden_size  moe_shared_expert_intermediate_size  active_params  params  score
 1   46          2560         56               64              120              1792                 3072                                 3.00B          27.06B  0.3399
 2   48          2560         56               56              112              1792                 3072                                 3.00B          25.37B  0.4650
 3   46          2560         64               56              112              1792                 3072                                 3.00B          25.37B  0.2343
 4   52          2688         56               48              96               1536                 3072                                 3.00B          20.09B  0.2552
 5   52          2688         48               56              104              1536                 3072                                 3.00B          21.61B  0.2601
 6   52          2560         48               64              96               1536                 3712                                 3.00B          19.28B  0.3762
 7   52          2304         64               64              104              1856                 3072                                 3.00B          22.28B  0.4783
 8   52          2560         48               48              96               1792                 3328                                 3.00B          21.99B  0.2420
 9   50          2560         48               48              112              1792                 3712                                 3.00B          25.37B  0.2399
 10  50          2560         48               48              112              1856                 3328                                 3.00B          26.17B  0.2601
 11  46          2560         56               64              112              1792                 3072                                 3.00B          25.37B  0.2503
 12  48          2560         56               56              104              1792                 3072                                 3.00B          23.68B  0.4329
 13  46          2688         64               64              128              1536                 2816                                 3.00B          26.17B  0.2587
 14  46          2560         64               56              104              1792                 3072                                 3.00B          23.68B  0.2336
 15  52          2688         48               56              96               1536                 3072                                 3.00B          20.09B  0.2559
 16  52          2304         64               64              96               1856                 3072                                 3.00B          20.70B  0.4608
 17  50          2560         48               48              104              1792                 3712                                 3.00B          23.68B  0.2455
 18  50          2560         48               48              104              1856                 3328                                 3.00B          24.42B  0.2503
 19  48          2560         48               48              120              1856                 3712                                 3.00B          27.92B  0.2587
 20  46          2560         56               64              104              1792                 3072                                 3.00B          23.68B  0.2469

Best Subnet
  export_config  {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104,
                  'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072}
  active_params  3.00B
  params         22.28B
  score          0.4783

Pruned Model Stats
  Total Parameters                               22.28B
  Active Parameters                              3.00B
  Memory (BF16, seq_length=8192, batch_size=1)   weights: 42489.7 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 42561.6 MB
```

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
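A quick note on the standalone utilities listed under Changes: the sketch below shows how they might be called outside of a pruning run. The import path comes from this PR, but the keyword arguments and return shapes (`seq_length`/`batch_size`, what `mcore_param_count` returns) are my assumptions, not confirmed by the diff.

```python
# Hedged usage sketch of the new standalone stats utilities (signatures assumed).
from modelopt.torch.nas.plugins.megatron_model_stats import (
    mcore_memory_footprint_mb,
    mcore_param_count,
    print_mcore_model_stats,
)

# `model` is assumed to be a Megatron-Core GPT or Mamba/hybrid model instance.
param_counts = mcore_param_count(model)  # analytical total/active parameter counts, no forward pass
mem_mb = mcore_memory_footprint_mb(model, seq_length=8192, batch_size=1)  # BF16 weights + KV-cache + Mamba state
print_mcore_model_stats(model)  # prints the rich-formatted stats panel shown under Testing
```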
1 parent 84fe91b commit f0eaa19

13 files changed

Lines changed: 1697 additions & 282 deletions


.github/workflows/example_tests.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -86,7 +86,7 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/nemo:26.02"
+      docker_image: "nvcr.io/nvidia/nemo:26.04"
       example: megatron_bridge
       timeout_minutes: 30
       pip_install_extras: "[hf,puzzletron,dev-test]"
```

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -18,6 +18,7 @@ Changelog
 **New Features**
 
 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
+- Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
```

examples/megatron_bridge/README.md

Lines changed: 43 additions & 5 deletions
````diff
@@ -16,7 +16,7 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br
 
 ## Pre-Requisites
 
-Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`) which has all the dependencies installed.
+Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.04`) which has all the dependencies installed.
 
 To get the ModelOpt examples scripts, mount your Model-Optimizer repo to the container as follows:
 
@@ -26,7 +26,7 @@ if [ ! -d "${MODELOPT_DIR}" ]; then
   git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
 fi
 
-export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
+export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.04
 docker run \
   --gpus all \
   --shm-size=16GB \
@@ -49,11 +49,28 @@ hf auth login --token <your token>
 > [!WARNING]
 > Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers. You may also refer to this [doc](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docker/common/README.md#installing-packages-inside-the-container) on how to correctly install packages in the NeMo containers without breaking existing torch installation.
 
+Also install additional dependencies from the [requirements.txt](./requirements.txt) file.
+
+```bash
+python -m pip install -r requirements.txt
+```
+
 ## Pruning
 
 This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).
 
-Example usage to prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2) while skipping pruning of `num_attention_heads` using following defaults:
+The script supports three NAS-based pruning targets and one manual export mode:
+
+| Mode | Flag | Description |
+| :---: | :---: | :--- |
+| NAS | `--prune_target_params` | Prune to a target total parameter count |
+| NAS | `--prune_target_active_params` | Prune to a target active parameter count (useful for MoE models). For non-MoE models, this is equivalent to `--prune_target_params`. |
+| NAS | `--prune_target_memory_mb` | Prune to a target memory footprint in MB (weights + KV-cache) for a given batch size and sequence length assuming BF16 precision |
+| Manual | `--prune_export_config` | Prune directly to a specified architecture config (no NAS). Useful if you want to take top K candidates and do a short knowledge distillation before selecting the best model. |
+
+Multiple NAS targets can be combined — e.g. `--prune_target_params 6e9 --prune_target_memory_mb 12288` finds the best model with under 6B params and under 12GB memory footprint at (default) batch size 1 and sequence length 4096 assuming BF16 precision.
+
+**Prune by total parameter count** — prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2) while skipping pruning of `num_attention_heads` using following defaults:
 1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration,
 at-most 20% depth (`num_layers`) and 40% width is pruned per prunable hparam (`hidden_size`, `ffn_hidden_size`, ...),
 top-10 candidates are evaluated for MMLU score (5% sampled data) to select the best model.
@@ -67,8 +84,29 @@ torchrun --nproc_per_node 2 prune_minitron.py \
   --output_hf_path /tmp/Qwen3-8B-Pruned-6B
 ```
 
-Example usage for manually pruning to a specific architecture using following defaults:
-1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration.
+**Prune by active parameter count** — useful for MoE models where most experts are inactive per token (e.g. prune Nemotron-3-Nano-30B-A3B-BF16 (3.6B active params) to 3B active params):
+
+```bash
+torchrun --nproc_per_node 2 prune_minitron.py \
+  --pp_size 2 \
+  --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+  --prune_target_active_params 3e9 \
+  --output_hf_path /tmp/Nemotron-3-Nano-30B-A3B-BF16-Pruned-3B-Active
+```
+
+**Prune by memory footprint** — prune to fit a target GPU memory budget (weights + KV-cache at the given sequence length and batch size, assuming BF16):
+
+```bash
+torchrun --nproc_per_node 2 prune_minitron.py \
+  --pp_size 2 \
+  --hf_model_name_or_path Qwen/Qwen3-8B \
+  --prune_target_memory_mb 12288 \
+  --seq_length 4096 \
+  --calib_mbs 1 \
+  --output_hf_path /tmp/Qwen3-8B-Pruned-12GB
+```
+
+**Manual pruning** — prune directly to a specified architecture (no NAS, no score evaluation):
 
 ```bash
 torchrun --nproc_per_node 2 prune_minitron.py \
````
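For intuition on what `--prune_target_memory_mb` budgets: BF16 weights dominate at 2 bytes per parameter. The following is my own back-of-envelope arithmetic under assumed Qwen3-8B-like shapes, not the pruner's exact accounting (which also tracks Mamba state for hybrid models):

```python
# Back-of-envelope BF16 memory estimate (weights + KV-cache), mirroring the
# README description above. The library's exact formula may differ.
def approx_memory_mb(
    n_params: float,    # total parameter count (weights dominate)
    num_layers: int,    # attention layers contributing KV-cache
    num_kv_heads: int,  # KV heads (grouped-query attention)
    head_dim: int,
    seq_length: int,
    batch_size: int,
) -> float:
    bytes_per_el = 2  # BF16
    weight_bytes = n_params * bytes_per_el
    # K and V caches: one [batch, seq, num_kv_heads, head_dim] tensor each, per layer
    kv_bytes = 2 * num_layers * batch_size * seq_length * num_kv_heads * head_dim * bytes_per_el
    return (weight_bytes + kv_bytes) / 2**20

# Assumed Qwen3-8B-like shape (36 layers, 8 KV heads, head_dim 128) pruned to ~6B
# params at the README defaults: ~11444 MB weights + ~576 MB KV-cache ≈ 12020 MB,
# which just fits the 12288 MB budget from the example above.
print(f"{approx_memory_mb(6e9, 36, 8, 128, 4096, 1):.1f} MB")
```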

examples/megatron_bridge/prune_minitron.py

Lines changed: 107 additions & 47 deletions
```diff
@@ -14,6 +14,11 @@
 # limitations under the License.
 """Example script for pruning a GPT / Mamba model using Minitron algorithm on a Megatron-Bridge model (load from HF).
 
+Supports three NAS-based pruning targets (can be combined):
+    --prune_target_params          Total parameter count (e.g. 6e9 for 6B total params)
+    --prune_target_active_params   Active parameter count for MoE models (e.g. 3e9 for 3B active params)
+    --prune_target_memory_mb       Memory footprint in MB (uses --seq_length for KV-cache estimate, assumes BF16)
+
 Example usage to prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2)
 while skipping pruning of num_attention_heads using following defaults:
 1024 samples from nemotron-post-training-dataset-v2 for calibration,
@@ -47,7 +52,7 @@
 import modelopt.torch.opt as mto
 import modelopt.torch.prune as mtp
 import modelopt.torch.utils.distributed as dist
-from modelopt.torch.utils import get_supported_datasets, num2hrb, print_rank_0, warn_rank_0
+from modelopt.torch.utils import get_supported_datasets, print_rank_0, warn_rank_0
 from modelopt.torch.utils.plugins.mbridge import (
     get_hf_mbridge_calibration_loop,
     load_mbridge_model_from_hf,
@@ -105,7 +110,6 @@ def get_args() -> argparse.Namespace:
     )
     parser.add_argument("--calib_gbs", type=int, default=1, help="Calibration global batch size")
     parser.add_argument("--seq_length", type=int, default=4096)
-
     # Pruning parameters
     parser.add_argument(
         "--prune_intermediate_ckpt",
@@ -117,42 +121,70 @@
         ),
     )
 
-    target_group = parser.add_mutually_exclusive_group(required=True)
-    target_group.add_argument(
+    parser.add_argument(
         "--prune_export_config",
         type=str,
         help=(
             'Target pruned config as JSON e.g., \'{"hidden_size": 512, "ffn_hidden_size": 2048}\'. '
             f"Supported hyperparameters: {mtp.mcore_minitron.SUPPORTED_HPARAMS}. "
-            "Cannot be used with --prune_target_params."
+            "Cannot be combined with NAS-based targets."
         ),
     )
-    target_group.add_argument(
+    parser.add_argument(
         "--prune_target_params",
         type=float,
         help=(
-            "Target parameter count for pruning e.g., 6e9 for pruning to 6B params (total params, not active params). "
-            "Uses Neural Architecture Search (NAS) to find the best pruned model that maximizes the --prune_score_func."
-            "Cannot be used with --prune_export_config."
+            "Target total parameter count e.g., 6e9 for 6B params. "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_active_params and/or --prune_target_memory_mb."
+        ),
+    )
+    parser.add_argument(
+        "--prune_target_active_params",
+        type=float,
+        help=(
+            "Target active parameter count e.g., 3e9 for 3B active params (useful for MoE models). "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_params and/or --prune_target_memory_mb."
+        ),
+    )
+    parser.add_argument(
+        "--prune_target_memory_mb",
+        type=float,
+        help=(
+            "Target memory footprint in MB (weights + KV-cache estimated via seq_length and "
+            "--inference_batch_size; assumes BF16). "
+            "Uses NAS to find the best pruned model that maximizes --prune_score_func. "
+            "Can be combined with --prune_target_params and/or --prune_target_active_params."
+        ),
+    )
+    parser.add_argument(
+        "--inference_batch_size",
+        type=int,
+        default=None,
+        help=(
+            "Batch size used only for KV-cache sizing in --prune_target_memory_mb. "
+            "Defaults to --calib_mbs when not set. "
+            "Use this to target an inference batch size that differs from the calibration micro-batch size."
         ),
     )
 
     parser.add_argument(
         "--prune_score_func",
         type=str,
-        default="mmlu_10pct",
+        default="mmlu_10pct_bs1",
         help=(
-            "Score function to use for NAS-based pruning (--prune_target_params). Only supports MMLU at the moment. "
-            "Format: mmlu_<N>pct where <N> is the percentage of MMLU data to sample per subject "
-            "(e.g. mmlu_10pct for 10%, mmlu_100pct for full eval)."
+            "Score function to use for NAS-based pruning. Only supports MMLU at the moment. "
+            "Format: mmlu_<N>pct_<bs> where <N> is the percentage of MMLU data to sample per subject and <bs> is "
+            "batch size for fast evaluation (default is mmlu_10pct_bs1)."
        ),
     )
     parser.add_argument(
         "--ss_channel_divisor",
         type=int,
         default=None,
         help=(
-            "hidden_size / ffn_hidden_size divisor for NAS-based pruning (--prune_target_params). "
+            "hidden_size / ffn_hidden_size divisor for NAS-based pruning. "
             "Leave as None to use default divisors."
         ),
     )
@@ -162,14 +194,14 @@ def get_args() -> argparse.Namespace:
         default=0.4,
         help=(
             f"Maximum width pruning percentage ({mtp.mcore_minitron.SUPPORTED_HPARAMS - {'num_layers'}}) "
-            "for NAS-based pruning (--prune_target_params)"
+            "for NAS-based pruning"
         ),
     )
     parser.add_argument(
         "--max_depth_pruning",
         type=float,
         default=0.2,
-        help="Maximum depth pruning percentage ('num_layers') for NAS-based pruning (--prune_target_params)",
+        help="Maximum depth pruning percentage ('num_layers') for NAS-based pruning",
     )
     parser.add_argument(
         "--hparams_to_skip",
@@ -178,7 +210,7 @@ def get_args() -> argparse.Namespace:
         default=[],
         choices=mtp.mcore_minitron.SUPPORTED_HPARAMS,
         help=(
-            "Space-separated list of hparams to skip for NAS-based pruning (--prune_target_params) "
+            "Space-separated list of hparams to skip for NAS-based pruning "
             "e.g. dont prune 'num_attention_heads'"
         ),
     )
@@ -187,13 +219,27 @@
         type=int,
         default=10,
         help=(
-            "Number of top candidates to consider for NAS-based pruning (--prune_target_params). "
+            "Number of top candidates to consider for NAS-based pruning. "
             "Higher values will take longer to prune but may find a better model."
         ),
     )
 
     args = parser.parse_args()
 
+    # Validate pruning target arguments
+    _nas_targets = [
+        args.prune_target_params,
+        args.prune_target_active_params,
+        args.prune_target_memory_mb,
+    ]
+    if args.prune_export_config and any(t is not None for t in _nas_targets):
+        parser.error("--prune_export_config cannot be combined with NAS-based targets.")
+    if not args.prune_export_config and not any(t is not None for t in _nas_targets):
+        parser.error(
+            "At least one of --prune_export_config, --prune_target_params,"
+            " --prune_target_active_params, or --prune_target_memory_mb is required."
+        )
+
     # Post-process arguments
     if args.prune_intermediate_ckpt is None:
         if args.output_megatron_path:
@@ -250,11 +296,6 @@ def main(args: argparse.Namespace):
         init_model_parallel=True,
         moe_grouped_gemm=False,
     )
-    print_rank_0(f"\nPruning model (showing PP rank0): {unwrapped_model}")
-    print_rank_0(
-        f"Original model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
-    )
-
     forward_loop = get_hf_mbridge_calibration_loop(
         model=model,
         provider=provider,
@@ -271,10 +312,20 @@
         "forward_loop": forward_loop,
         "checkpoint": args.prune_intermediate_ckpt,
     }
-    if args.prune_target_params is not None:
-        # Restrict search space to a smaller set of candidates
-        # Allow more choices for MoE FFN as they are generally smaller
-        # NOTE: You can reduce the divisors and increase config['top_k'] to potentially find a better model.
+    if args.prune_export_config is not None:
+        # Less restrictive search space for manual pruning
+        ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
+            hidden_size_divisor=64,
+            ffn_hidden_size_divisor=64,
+            mamba_head_dim_divisor=8,
+            num_moe_experts_divisor=8,
+            num_layers_divisor=1,
+        )
+        pruning_constraints = {"export_config": args.prune_export_config}
+    else:
+        # NAS-based pruning: restrict search space to a smaller set of candidates.
+        # Allow more choices for MoE FFN as they are generally smaller.
+        # NOTE: Reduce divisors and increase config['top_k'] to potentially find a better model.
         hidden_size_divisor = args.ss_channel_divisor if args.ss_channel_divisor else 256
         ffn_hidden_size_divisor = (
             args.ss_channel_divisor
@@ -290,40 +341,53 @@
         )
         print_rank_0(f"Using search space config: {ss_config}")
 
-        pruning_constraints = {"params": args.prune_target_params}
+        pruning_constraints = {}
+        if args.prune_target_params is not None:
+            pruning_constraints["params"] = args.prune_target_params
+        if args.prune_target_active_params is not None:
+            pruning_constraints["active_params"] = args.prune_target_active_params
+        if args.prune_target_memory_mb is not None:
+            pruning_constraints["memory_mb"] = args.prune_target_memory_mb
+
         print_rank_0(
             f"Using NAS-based automatic pruning with score function: {args.prune_score_func}. "
             "You can change this to be any other metric you want to maximize (e.g. negative validation loss)."
         )
 
-        match = re.fullmatch(r"mmlu_(\d+)pct", args.prune_score_func)
-        if not match:
+        match = re.fullmatch(r"mmlu_(\d+)pct_bs(\d+)", args.prune_score_func)
+        legacy_match = re.fullmatch(r"mmlu_(\d+)pct", args.prune_score_func)
+        if match:
+            mmlu_frac = float(match.group(1)) / 100.0
+            batch_size = int(match.group(2))
+        elif legacy_match:
+            warn_rank_0(
+                f"Score function '{args.prune_score_func}' uses the deprecated format "
+                "'mmlu_<N>pct'. Use 'mmlu_<N>pct_bs<bs>' to specify the evaluation batch size. "
+                "Falling back to batch_size=1."
+            )
+            mmlu_frac = float(legacy_match.group(1)) / 100.0
+            batch_size = 1
+        else:
             raise ValueError(
-                f"Invalid score function: {args.prune_score_func}. Expected format: mmlu_<N>pct (e.g. mmlu_10pct)"
+                f"Invalid score function: {args.prune_score_func}. "
+                "Expected format: mmlu_<N>pct_bs<bs> (e.g. mmlu_10pct_bs1)"
             )
-        mmlu_frac = float(match.group(1)) / 100.0
 
         def score_func(m):
             return megatron_mmlu(
-                m, tokenizer, few_shots=0, fraction=mmlu_frac, batch_size=args.calib_mbs
+                m, tokenizer, few_shots=0, fraction=mmlu_frac, batch_size=batch_size
            )
 
         pruning_config["score_func"] = score_func
         pruning_config["max_width_pruning"] = args.max_width_pruning
         pruning_config["max_depth_pruning"] = args.max_depth_pruning
         pruning_config["hparams_to_skip"] = args.hparams_to_skip
         pruning_config["top_k"] = args.top_k
-    elif args.prune_export_config is not None:
-        # Less restrictive search space for manual pruning
-        ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
-            hidden_size_divisor=64,
-            ffn_hidden_size_divisor=64,
-            mamba_head_dim_divisor=8,
-            num_moe_experts_divisor=8,
-            num_layers_divisor=1,
+        # memory_mb constraint requires batch_size and seq_length
+        pruning_config["batch_size"] = (
+            args.inference_batch_size if args.inference_batch_size is not None else args.calib_mbs
         )
-
-        pruning_constraints = {"export_config": args.prune_export_config}
+        pruning_config["seq_length"] = args.seq_length
     print_rank_0(f"Pruning constraints: {pruning_constraints}")
 
     unwrapped_model, pruning_scores = mtp.prune(  # in-place pruning
@@ -343,10 +407,6 @@ def score_func(m):
         else "hybrid_layer_pattern"
     )
     setattr(provider, hybrid_key, getattr(unwrapped_model, hybrid_key))
-    print_rank_0(f"\nPruned model (showing PP rank0): {unwrapped_model}")
-    print_rank_0(
-        f"Pruned model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
-    )
 
     if args.output_megatron_path is not None:
         print_rank_0(
```
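The help text above notes the score function "can be any other metric you want to maximize (e.g. negative validation loss)". Below is a minimal sketch of that swap; `evaluate_validation_loss` is a hypothetical user-supplied helper, and the only interface assumed from the diff is that `pruning_config["score_func"]` accepts a callable on the candidate model:

```python
# Sketch: a custom score function in place of MMLU. Assumes only what the diff
# shows: pruning_config["score_func"] takes a callable that receives the
# candidate model and returns a score to MAXIMIZE.
def evaluate_validation_loss(m) -> float:
    """Hypothetical user-provided eval: mean loss of candidate subnet `m` on held-out data."""
    raise NotImplementedError

def score_func(m):
    return -evaluate_validation_loss(m)  # negate loss so lower loss => higher score

pruning_config["score_func"] = score_func
```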
