This example uses SparseML and Compressed-Tensors to create a 2:4 sparse and quantized Llama2-7b model.
The model is calibrated and trained with the ultrachat200k dataset.
At least 85GB of GPU memory is required to run this example.

Follow the steps below one by one in a code notebook, or run the full example script
as `python examples/llama7b_sparse_quantized/llama7b_sparse_w4a16.py`.

## Step 1: Select a model, dataset, and recipe
In this step, we select which model to use as a baseline for sparsification, a dataset to
use for calibration and training, and a recipe.
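
The selection code ends with `recipe = "2:4_w4a16_recipe.yaml"`; the snippet below is a minimal sketch of this step, in which the model stub, dataset id, and split definitions are illustrative assumptions rather than the example's exact values.

```python
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

# Hypothetical baseline model; a local directory or a Hugging Face Hub id
# accepted by SparseAutoModelForCausalLM can be substituted here.
model_stub = "zoo:llama2-7b-ultrachat200k_llama2_pretrain-base"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical dataset id and splits used for calibration and training.
dataset = "ultrachat-200k"
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}

# Recipe file shipped with this example.
recipe = "2:4_w4a16_recipe.yaml"
```
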
## Step 2: Run sparsification using `apply`
The `apply` function applies the given recipe to our model and dataset.
The hardcoded kwargs may be altered based on each model's needs. This code snippet should
be run in the same Python instance as step 1.
After running, the sparsified model will be saved to `output_llama7b_2:4_w4a16_channel`.
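
Below is a minimal sketch of the `apply` call, assuming `apply` is imported from `sparseml.transformers`; the keyword arguments shown are illustrative assumptions and should be tuned per model as noted above.

```python
from sparseml.transformers import apply

# `apply` saves the sparsified model to this directory, as noted above.
output_dir = "output_llama7b_2:4_w4a16_channel"

# These kwargs are assumptions for illustration; alter them per model's needs.
apply(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=512,
    num_calibration_samples=512,
)
```
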
## Step 3: Compression

The resulting model will be uncompressed. To save a final compressed copy of the model,
run the following in the same Python instance as the previous steps.

```python
import os
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

compressed_output_dir = "output_llama7b_2:4_w4a16_channel_compressed"

# The quantization stage of the recipe leaves its uncompressed output in the
# `stage_quantization` subdirectory of `output_dir` from step 2.
uncompressed_path = os.path.join(output_dir, "stage_quantization")
model = SparseAutoModelForCausalLM.from_pretrained(uncompressed_path, torch_dtype=torch.bfloat16)

# Save a compressed copy of the model.
model.save_pretrained(compressed_output_dir, save_compressed=True)
```
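
Setting `save_compressed=True` stores the checkpoint in the Compressed-Tensors format mentioned at the top of this example, which should take up considerably less disk space than the uncompressed copy.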