add example of w8a8fp8 for qwen3.5 #2631
base: main
examples/quantization_w8a8_fp8/qwen3_5_example.py (new file, +45 lines):

```python
import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NOTE: This example requires transformers >= v5

MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Load model.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# No need to include MTP layers, as they are not loaded
# through Qwen3_5MoeForConditionalGeneration.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
```
Comment on lines +22 to +30

Contributor: [comment truncated] The suggested `ignore` list:

```python
ignore=[
    "re:.*lm_head",
    "re:.*mlp.gate$",
    "re:.*embed_tokens$",
    "re:.*shared_expert_gate$",
],
```
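The `re:` prefix in these entries marks a Python regular expression matched against module names. As a standalone sketch of how such patterns filter layers (the module names below are hypothetical, and anchoring with `fullmatch` is an assumption, not necessarily how the library matches):

```python
import re

# Suggested ignore patterns, with the "re:" prefix stripped.
ignore = [
    r".*lm_head",
    r".*mlp.gate$",
    r".*embed_tokens$",
    r".*shared_expert_gate$",
]

def is_ignored(name: str) -> bool:
    # Assumption: a pattern must describe the whole module name.
    return any(re.fullmatch(p, name) for p in ignore)

# Hypothetical module names of the kind produced by model.named_modules():
print(is_ignored("lm_head"))                                 # True
print(is_ignored("model.layers.0.mlp.gate"))                 # True
print(is_ignored("model.layers.0.mlp.experts.0.gate_proj"))  # False
```

Note how the `$`-anchored patterns keep expert projection layers like `gate_proj` quantized while excluding the MoE router `gate` modules themselves.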
```python
)

# Apply quantization.
oneshot(model=model, recipe=recipe)
```
Comment on lines +34 to +36

Contributor: [suggested change truncated]

Comment on lines +34 to +36

Contributor: 🧩 Analysis chain (read-only verification scripts elided): the scripts traced the `recipe` argument from `oneshot` through `parse_args` in `src/llmcompressor/args/utils.py`, `RecipeArguments` in `src/llmcompressor/args/recipe_arguments.py`, and `Recipe.create_instance` in `src/llmcompressor/recipe/recipe.py`, confirming that `Modifier` objects such as `QuantizationModifier` are accepted at runtime even though `RecipeArguments.recipe` is annotated as `Optional[str]`. Update the type hints or documentation to support `Modifier` inputs.
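A minimal sketch of the widened annotation this comment asks for; the `Modifier` stand-in and the `normalize_recipe` helper below are hypothetical illustrations, not llmcompressor's actual definitions:

```python
from dataclasses import dataclass
from typing import List, Optional, Union

class Modifier:
    """Hypothetical stand-in for llmcompressor's Modifier base class."""

@dataclass
class RecipeArguments:
    # Widened from Optional[str] to document the inputs the runtime accepts:
    # a recipe path/string, a single modifier, or a list of modifiers.
    recipe: Optional[Union[str, Modifier, List[Modifier]]] = None

def normalize_recipe(recipe):
    """Coerce any accepted input into a path string or a list of modifiers."""
    if recipe is None or isinstance(recipe, str):
        return recipe
    if isinstance(recipe, Modifier):
        return [recipe]
    return list(recipe)

print(type(normalize_recipe(Modifier())))  # a single modifier becomes a list
```

Documenting the union in the dataclass (rather than silently accepting objects through a `str`-typed field) keeps IDEs and type checkers in agreement with the runtime behavior.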
```python

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
Contributor: The imports `torch` and `load_dataset` are not used in this script. Since `oneshot` can accept a dataset name as a string, `load_dataset` is unnecessary. These should be removed to keep the example clean.
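In practice a linter (e.g. flake8's F401 check) catches unused imports like these automatically. As a self-contained sketch of the idea, a stdlib `ast` pass over a snippet resembling the example's header (the source string below is illustrative, not the PR's exact file) finds names that are imported but never referenced:

```python
import ast

source = '''
import torch
from datasets import load_dataset
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("some/model")
'''

tree = ast.parse(source)

# Collect the names each import statement binds at module scope.
imported = set()
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        imported.update(a.asname or a.name.split(".")[0] for a in node.names)
    elif isinstance(node, ast.ImportFrom):
        imported.update(a.asname or a.name for a in node.names)

# Collect every bare name actually referenced in the module body.
used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}

unused = sorted(imported - used)
print(unused)  # → ['load_dataset', 'torch']
```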