
Commit 574ae57

Merge remote-tracking branch 'origin' into kylesayrs/remove-double-init
2 parents: a77bd0b + 29ddedb


100 files changed: +1771 additions, -632 deletions


.github/workflows/test-check-transformers.yaml

Lines changed: 33 additions & 1 deletion
@@ -15,9 +15,41 @@ env:
   CLEARML_API_SECRET_KEY: ${{ secrets.CLEARML_API_SECRET_KEY }}

 jobs:
+  detect-changes:
+    runs-on: ubuntu-latest
+
+    outputs:
+      changes-present: ${{ steps.changed-files.outputs.any_modified }}
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Get changed files
+        id: changed-files
+        uses: tj-actions/changed-files@v45
+        with:
+          files: |
+            **
+            !examples/**
+            !tests/e2e/**
+            !tests/lmeval/**
+            !tests/examples/**
+            !**/*.md
+            !.github/**
+            .github/workflows/test-check-transformers.yaml
+
+      - name: Log relevant output
+        run: |
+          echo "changes-present: ${{ steps.changed-files.outputs.any_modified }}"
+          echo "all modified files: ${{ steps.changed-files.outputs.all_modified_files }}"
+        shell: bash
+
   transformers-tests:
+    needs: [detect-changes]
     runs-on: gcp-k8s-vllm-l4-solo
-    if: contains(github.event.pull_request.labels.*.name, 'ready') || github.event_name == 'push'
+    if: (contains(github.event.pull_request.labels.*.name, 'ready') || github.event_name == 'push') && needs.detect-changes.outputs.changes-present == 'true'
     steps:
       - uses: actions/setup-python@v5
         with:

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -800,5 +800,6 @@ integrations/pytorch/pytorch_vision*
 nm_temp_test_logs/*
 sparse_logs/*
 wandb/
+timings/
 output_finetune/
 env_log.json

README.md

Lines changed: 2 additions & 3 deletions
@@ -56,10 +56,9 @@ Note that the model can be swapped for a local or remote HF-compatible checkpoin
 Quantization is applied by selecting an algorithm and calling the `oneshot` API.

 ```python
-from llmcompressor.modifiers.quantization import GPTQModifier
 from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-from llmcompressor.transformers import oneshot
-from transformers import AutoModelForCausalLM
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor import oneshot

 # Select quantization algorithm. In this case, we:
 # * apply SmoothQuant to make the activations easier to quantize
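Most of the hunks below make the same one-line change: `oneshot` is now imported from the top-level `llmcompressor` package instead of `llmcompressor.transformers`. A minimal sketch of how the updated README imports fit together; the model ID, dataset name, and modifier settings are illustrative placeholders, not values from this commit:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Placeholder recipe: SmoothQuant to smooth activations, then GPTQ weight quantization.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder HF checkpoint
    dataset="open_platypus",                     # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```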

examples/big_models_with_accelerate/cpu_offloading_fp8.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
-from llmcompressor.transformers import oneshot

 MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
 OUTPUT_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"

examples/big_models_with_accelerate/mult_gpus_int8_device_map.py

Lines changed: 1 addition & 1 deletion
@@ -2,9 +2,9 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
 from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

 MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"

examples/big_models_with_accelerate/multi_gpu_int8.py

Lines changed: 1 addition & 1 deletion
@@ -1,8 +1,8 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot

 MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
 SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic"

examples/multimodal_audio/whisper_example.py

Lines changed: 1 addition & 1 deletion
@@ -2,8 +2,8 @@
 from datasets import load_dataset
 from transformers import WhisperProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

 # Select model and load it.

examples/multimodal_vision/idefics3_example.py

Lines changed: 1 addition & 1 deletion
@@ -4,8 +4,8 @@
 from PIL import Image
 from transformers import AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableIdefics3ForConditionalGeneration

 # Load model.

examples/multimodal_vision/llava_example.py

Lines changed: 1 addition & 1 deletion
@@ -3,8 +3,8 @@
 from PIL import Image
 from transformers import AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration

 # Load model.

examples/multimodal_vision/mllama_example.py

Lines changed: 1 addition & 1 deletion
@@ -3,8 +3,8 @@
 from PIL import Image
 from transformers import AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableMllamaForConditionalGeneration

 # Load model.

examples/multimodal_vision/phi3_vision_example.py

Lines changed: 1 addition & 1 deletion
@@ -5,8 +5,8 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot

 # Load model.
 model_id = "microsoft/Phi-3-vision-128k-instruct"

examples/multimodal_vision/pixtral_example.py

Lines changed: 1 addition & 1 deletion
@@ -3,8 +3,8 @@
 from PIL import Image
 from transformers import AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration

 # Load model.

examples/multimodal_vision/qwen2_vl_example.py

Lines changed: 1 addition & 1 deletion
@@ -6,8 +6,8 @@
 from qwen_vl_utils import process_vision_info
 from transformers import AutoProcessor

+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
 from llmcompressor.transformers.tracing import TraceableQwen2VLForConditionalGeneration

 # Load model.
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+import base64
+from io import BytesIO
+
+import torch
+from datasets import load_dataset
+from qwen_vl_utils import process_vision_info
+from transformers import AutoProcessor
+
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.transformers import oneshot
+from llmcompressor.transformers.tracing import (
+    TraceableQwen2_5_VLForConditionalGeneration,
+)
+
+# Load model.
+model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
+model = TraceableQwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_id,
+    device_map="auto",
+    torch_dtype="auto",
+)
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+# Oneshot arguments
+DATASET_ID = "lmms-lab/flickr30k"
+DATASET_SPLIT = {"calibration": "test[:512]"}
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load dataset and preprocess.
+ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ds = ds.shuffle(seed=42)
+
+
+# Apply chat template and tokenize inputs.
+def preprocess_and_tokenize(example):
+    # preprocess
+    buffered = BytesIO()
+    example["image"].save(buffered, format="PNG")
+    encoded_image = base64.b64encode(buffered.getvalue())
+    encoded_image_text = encoded_image.decode("utf-8")
+    base64_qwen = f"data:image;base64,{encoded_image_text}"
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "image": base64_qwen},
+                {"type": "text", "text": "What does the image show?"},
+            ],
+        }
+    ]
+    text = processor.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    image_inputs, video_inputs = process_vision_info(messages)
+
+    # tokenize
+    return processor(
+        text=[text],
+        images=image_inputs,
+        videos=video_inputs,
+        padding=False,
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+    )
+
+
+ds = ds.map(preprocess_and_tokenize, remove_columns=ds["calibration"].column_names)
+
+
+# Define a oneshot data collator for multimodal inputs.
+def data_collator(batch):
+    assert len(batch) == 1
+    return {key: torch.tensor(value) for key, value in batch[0].items()}
+
+
+# Recipe
+recipe = [
+    GPTQModifier(
+        targets="Linear",
+        scheme="W4A16",
+        sequential_targets=["Qwen2_5_VLDecoderLayer"],
+        ignore=["lm_head", "re:visual.*"],
+    ),
+]
+
+# Perform oneshot
+oneshot(
+    model=model,
+    tokenizer=model_id,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    trust_remote_code_model=True,
+    data_collator=data_collator,
+)
+
+# Confirm generations of the quantized model look sane.
+print("========== SAMPLE GENERATION ==============")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": "http://images.cocodataset.org/train2017/000000231895.jpg",
+            },
+            {"type": "text", "text": "Please describe the animal in this image\n"},
+        ],
+    }
+]
+prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[prompt],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=False,
+    max_length=MAX_SEQUENCE_LENGTH,
+    truncation=True,
+    return_tensors="pt",
+).to("cuda")
+output = model.generate(**inputs, max_new_tokens=100)
+print(processor.decode(output[0], skip_special_tokens=True))
+print("==========================================")
+
+
+# Save to disk compressed.
+SAVE_DIR = model_id.split("/")[1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+processor.save_pretrained(SAVE_DIR)

examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py

Lines changed: 6 additions & 4 deletions
@@ -33,6 +33,7 @@
 bf16 = False # using full precision for training
 lr_scheduler_type = "cosine"
 warmup_ratio = 0.1
+preprocessing_num_workers = 8

 # this will run the recipe stage by stage:
 # oneshot sparsification -> finetuning -> oneshot quantization
@@ -52,10 +53,11 @@
     learning_rate=learning_rate,
     lr_scheduler_type=lr_scheduler_type,
     warmup_ratio=warmup_ratio,
+    preprocessing_num_workers=preprocessing_num_workers,
 )
 logger.info(
-    "Note: llcompressor does not currently support running ",
-    "compressed models in the marlin-24 format. The model ",
-    "produced from this example can be run on vLLM with ",
-    "dtype=torch.float16",
+    "llmcompressor does not currently support running compressed models in the marlin24 format." # noqa
+)
+logger.info(
+    "The model produced from this example can be run on vLLM with dtype=torch.float16"
 )

examples/quantization_kv_cache/README.md

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ Configure and apply the FP8 quantization for weights, activations, and KV cache.
 Notice the new `kv_cache_scheme` section:

 ```python
-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot

 recipe = """
 quant_stage:
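For context, a trimmed-down sketch of how a KV-cache recipe is applied through the relocated `oneshot` entrypoint; the scheme fields and the model/dataset names below are assumptions patterned on the FP8 KV-cache examples in this commit, not the README's full recipe:

```python
from llmcompressor import oneshot

# Assumed minimal recipe: quantize only the KV cache to FP8 (per-tensor, static).
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    dataset="ultrachat_200k",                     # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```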

examples/quantization_kv_cache/gemma2_fp8_kv_example.py

Lines changed: 5 additions & 1 deletion
@@ -1,7 +1,7 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer

-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot

 # Select model and load it.
 MODEL_ID = "google/gemma-2-9b-it"
@@ -86,6 +86,10 @@ def process_and_tokenize(example):
     "Please use vLLM for inference with the quantized kv_cache.",
 )
 # Confirm generations of the quantized model look sane.
+
+# NOTE: transformers 4.49.0 results in a generation error with gemma2.
+# Consider either downgrading your transformers version to a previous version
+# or use vLLM for sample generation.
 print("\n\n")
 print("========== SAMPLE GENERATION ==============")
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
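The NOTE added above points to vLLM as a workaround for the transformers 4.49.0 generation error. A hedged sketch of that route; the save directory name is assumed to follow the example's MODEL_ID-derived pattern and is not part of this commit:

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint written by the example's save_pretrained call.
llm = LLM(model="gemma-2-9b-it-FP8-KV")  # assumed output directory name
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```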

examples/quantization_kv_cache/llama3_fp8_kv_example.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 from loguru import logger
 from transformers import AutoModelForCausalLM, AutoTokenizer

-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot

 # Select model and load it.
 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

examples/quantization_kv_cache/phi3.5_fp8_kv_example.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer

-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot

 # Select model and load it.
 # Phi-3.5 is a special case for KV cache quantization because it has

examples/quantization_w4a16/README.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ In our case, we will apply the default GPTQ recipe for `int4` (which uses static
 > See the `Recipes` documentation for more information on making complex recipes

 ```python
-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier

 # Configure the quantization algorithm to run.
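As a sketch of the W4A16 flow this README describes, using the relocated import; the model and dataset names are placeholders rather than values from this commit:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Default GPTQ int4 recipe: quantize Linear weights, skip the LM head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    dataset="open_platypus",                      # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```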

examples/quantization_w8a8_fp8/README.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which
 Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

 ```python
-from llmcompressor.transformers import oneshot
+from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier

 # Configure the simple PTQ quantization
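And a corresponding sketch of the data-free FP8 dynamic flow with the updated import; the model ID is a placeholder:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8_DYNAMIC: weights are quantized statically and activations dynamically per token,
# so no calibration dataset is passed to oneshot.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    recipe=recipe,
)
```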
