Commit 3f41819

Address minor coderabbit comments
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 06eaf74 commit 3f41819

3 files changed

Lines changed: 29 additions & 26 deletions

examples/megatron_bridge/README.md

Lines changed: 3 additions & 0 deletions
@@ -46,6 +46,9 @@ Note that the default dataset for pruning and quantization is [`nemotron-post-tr
 hf auth login --token <your token>
 ```
 
+> [!WARNING]
+> Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers.
+
 ## Pruning
 
 This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).

examples/megatron_bridge/distill.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@
 import modelopt.torch.utils.distributed as dist
 from modelopt.torch.utils import print_rank_0
 
-with contextlib.suppress(ImportError):
+with contextlib.suppress(ModuleNotFoundError):
     import modelopt.torch.puzzletron.plugins.mbridge # noqa: F401
 
 SEED = 1234
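
The narrower exception in this hunk matters because `ModuleNotFoundError` is a subclass of `ImportError`. The sketch below is an illustration only, not code from the commit: it uses two hypothetical, deliberately nonexistent names to show that suppressing `ImportError` hides every import failure, while suppressing `ModuleNotFoundError` still lets a plain `ImportError` (for example, a name missing from an installed module) surface.

```python
import contextlib


def import_missing_module() -> None:
    # Hypothetical module name, assumed not to be installed anywhere.
    import hypothetical_missing_plugin_for_demo  # noqa: F401


def import_broken_name() -> None:
    # `os` exists, but this hypothetical name does not: plain ImportError.
    from os import hypothetical_missing_name_for_demo  # noqa: F401


def outcome(do_import, suppressed) -> str:
    """Run `do_import` under contextlib.suppress and report what happened."""
    try:
        with contextlib.suppress(suppressed):
            do_import()
            return "imported"
        # Reached only when the suppressed exception was swallowed.
        return "suppressed"
    except ImportError as err:
        return f"raised {type(err).__name__}"


if __name__ == "__main__":
    for exc in (ImportError, ModuleNotFoundError):
        print(exc.__name__, "| missing module:", outcome(import_missing_module, exc))
        print(exc.__name__, "| broken name:   ", outcome(import_broken_name, exc))
```

Note that a `ModuleNotFoundError` raised by a missing dependency inside the optional plugin would still be suppressed; only other kinds of import failure become visible.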

examples/puzzletron/README.md

Lines changed: 25 additions & 25 deletions
@@ -19,40 +19,34 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/
 
 The recommended way to run puzzletron is inside an NVIDIA NeMo container (e.g. `nvcr.io/nvidia/nemo:26.02`). NeMo containers ship a pre-installed `nvidia-modelopt` that does not include the puzzletron extras so you need to replace it with an editable install from this repo.
 
+> [!WARNING]
+> Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers.
+
+> [!NOTE]
+> NeMo containers ship `nvidia-lm-eval` which may conflict with `lm-eval` that is used for evaluation, hence we uninstall and replace it with `lm-eval` from the repo.
+
 Once inside the container with the repo available, install dependencies from the repo root:
 
 ```bash
 python -m pip uninstall nvidia-lm-eval -y 2>/dev/null
-python -m pip install -e ".[hf,puzzletron]"
+python -m pip install -e ".[hf,puzzletron,dev-test]"
 python -m pip install -r examples/puzzletron/requirements.txt
 ```
 
 To verify the install, you can run the GPU tests as a smoke check:
 
 ```bash
-python -m pytest -s -v tests/gpu/torch/puzzletron/test_puzzletron.py -o addopts="" -k "mistral"
-```
-
-### Bare-metal / other containers
-
-If you are not using a NeMo container, install Model-Optimizer in editable mode with the corresponding dependencies (run from the repo root):
-
-```bash
-pip install -e .[hf,puzzletron]
-pip install -r examples/puzzletron/requirements.txt
+python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -k "Qwen3-8B"
 ```
 
-> **Note:** NeMo containers may ship `nvidia-lm-eval` which may conflict with `lm-eval` that is used for evaluation.
-> If so, run `pip uninstall nvidia-lm-eval -y` before installing requirements.
-
 ### Hardware
 
 - For this example we are using 2x NVIDIA H100 80GB HBM3 to show multi-GPU steps. You can use also use a single GPU.
 
 - To make use of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), you need to accept the terms and conditions for the corresponding model and the dataset in the Huggingface Hub. Log in to the Huggingface Hub and enter your HF token.
 
 ```bash
-hf auth login
+hf auth login --token <your token>
 ```
 
 ## Compress the Model
@@ -62,7 +56,9 @@ hf auth login
 dataset split: "code", "math", "stem", "chat", excluding reasoning samples (2.62GB)
 
 ```bash
-python -m modelopt.torch.puzzletron.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
+python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
+--dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 \
+--output_dir path/to/Nemotron-Post-Training-Dataset-v2
 ```
 
 2. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.
@@ -160,7 +156,9 @@ This assumes pruning, replacement library building, NAS scoring, and subblock st
 For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.
 
 ```bash
-torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
+torchrun --nproc_per_node 2 examples/puzzletron/main.py \
+--config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
+--mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
 ```
 
 This will generate the following network architecture (see `log.txt`):
@@ -241,7 +239,9 @@ The **MIP sweep mode** lets you explore multiple memory compression rates in a s
 2. Run the sweep:
 
 ```bash
-torchrun --nproc_per_node 2 examples/puzzletron/main.py --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
+torchrun --nproc_per_node 2 examples/puzzletron/main.py \
+--config examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml \
+--mip-only 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
 ```
 
 3. View results: The CSV file contains compression rates, memory usage, and accuracy metrics for each configuration.
@@ -258,11 +258,11 @@ Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-e
 
 ```bash
 python examples/llm_eval/lm_eval_hf.py \
---model hf \
---model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
---tasks mmlu \
---num_fewshot 5 \
---batch_size 4
+--model hf \
+--model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
+--tasks mmlu \
+--num_fewshot 5 \
+--batch_size 4
 ```
 
 For a quick smoke test, add `--limit 10`.
@@ -286,13 +286,13 @@ sed -i 's+subblocks_safetensors/++g' model.safetensors.index.json
 - Benchmark latency
 
 ```bash
-vllm bench latency --model path/to/model --load-format safetensors --trust-remote-code
+vllm bench latency --model path/to/model --load-format safetensors
 ```
 
 - Benchmark throughput
 
 ```bash
-vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --load-format safetensors --trust-remote-code
+vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --load-format safetensors
 ```
 
 ## Knowledge Distillation
