Description
I'm trying to load the quantized model RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8, but I'm hitting a dtype compatibility issue during model initialization. The model appears to have been quantized with llmcompressor using the W8A8 quantization scheme.
Note: I need to load this model without vLLM because I may need to add custom hooks for my research, so I'm looking for a direct loading method using transformers/llmcompressor.
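For concreteness, this is the kind of per-module hook I'd like to attach once the model is loaded directly in transformers (a minimal sketch; the hook body and the target-name pattern are placeholders for my actual research code):

```python
from functools import partial

captured = {}

def capture_hook(name, module, inputs, output):
    # Record the module's output activation for later inspection.
    captured[name] = output.detach()

# Once the quantized model is loaded, I'd register hooks roughly like this
# (the model object and the name pattern are placeholders):
def attach_hooks(model, pattern="mlp.down_proj"):
    for name, module in model.named_modules():
        if name.endswith(pattern):
            module.register_forward_hook(partial(capture_hook, name))
```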
Error Message
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8

Full stack trace (abridged):

  File "/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 366, in _init_weights
    module.weight.data.normal_(mean=0.0, std=std)
  File "/torch/_refs/__init__.py", line 6214, in normal_
    return normal(mean, std, self.shape, out=self, generator=generator)
  ...
RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8
The error occurs during model weight initialization, when transformers calls normal_() on the int8 tensors. PyTorch's normal_() only works on floating-point tensors, but the quantized checkpoint stores int8 weights.
Specific failure point:
- File: modeling_qwen2_5_vl.py, line 366
- Function: _init_weights()
- Operation: module.weight.data.normal_(mean=0.0, std=std)
- Issue: trying to apply a normal-distribution initializer to int8 tensors
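The underlying PyTorch restriction is easy to reproduce in isolation; this minimal sketch (independent of transformers and of this checkpoint) shows that normal_() rejects int8 tensors but accepts floating-point ones:

```python
import torch

# normal_() is only defined for floating-point (and complex) tensors, so an
# int8 weight tensor triggers the same class of RuntimeError seen above
# (the exact wording differs between the eager and _refs code paths).
w_int8 = torch.zeros(16, 16, dtype=torch.int8)
try:
    w_int8.normal_(mean=0.0, std=0.02)
except RuntimeError as e:
    print("int8:", e)

# The same call succeeds once the tensor is floating point.
w_fp32 = torch.zeros(16, 16, dtype=torch.float32)
w_fp32.normal_(mean=0.0, std=0.02)
print("float32: ok")
```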
Model Information
Based on the model's config.json:
- Quantization method: compressed-tensors
- Format: int-quantized
- Scheme: W8A8 (8-bit weights and activations)
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Compression ratio: ~1.2x
- Ignored layers: all visual layers (visual.blocks.*, visual.merger.*) plus lm_head
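For reference, the values above were read straight out of the checkpoint's quantization_config block; a sketch of that check, assuming model_path points at the local snapshot directory and the usual compressed-tensors field names:

```python
import json
from pathlib import Path

model_path = "/path/to/Qwen2.5-VL-7B-Instruct-quantized.w8a8"  # placeholder local snapshot

cfg = json.loads((Path(model_path) / "config.json").read_text())
qcfg = cfg["quantization_config"]
print(qcfg["quant_method"])   # "compressed-tensors"
print(qcfg["format"])         # "int-quantized"
print(qcfg.get("ignore"))     # ignored modules, e.g. lm_head and visual.* patterns
```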
What I've Tried
1. llmcompressor methods:
# Method 1: TraceableQwen2_5_VLForConditionalGeneration
from llmcompressor.transformers.tracing import TraceableQwen2_5_VLForConditionalGeneration

model = TraceableQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

# Method 2: SparseAutoModelForCausalLM
from llmcompressor.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

2. Standard transformers methods:
# Method 3: Various dtype configurations
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # Also tried: torch.float16, "auto", None
    trust_remote_code=True,
    device_map="auto",
)

# Method 4: AutoModelForCausalLM
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype="auto"
)

All methods fail at the same weight-initialization step, so I wonder whether the model should be loaded with _fast_init=False or some other special parameter.
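Concretely, the variant I'm asking about would look something like the sketch below; this is untested, and _fast_init is a private argument, so I don't know whether it is the intended mechanism here:

```python
from transformers import Qwen2_5_VLForConditionalGeneration

# Untested idea: skip the fast-init path, since the failure happens inside
# _init_weights() on the already-quantized int8 tensors. _fast_init is a
# private from_pretrained argument and may be ignored or deprecated.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    _fast_init=False,
)
```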
Additional Observations
- Warning about ignored layers: The loader warns about missing visual layers, but this seems expected since they were ignored during quantization
- Model files exist: The quantized model directory contains the expected .safetensors files and configuration
- Original model works: The base Qwen/Qwen2.5-VL-7B-Instruct loads and works perfectly
Environment
- Python: 3.10
- PyTorch: 2.7.0+cu126
- Transformers: 4.52.4
- LLMCompressor: 0.6.0
- Compressed-tensors: 0.10.2
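The versions listed above were collected with importlib.metadata, shown here only so the report is easy to reproduce:

```python
from importlib.metadata import version

# Print the installed versions of the relevant packages.
for pkg in ("torch", "transformers", "llmcompressor", "compressed-tensors"):
    print(pkg, version(pkg))
```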
This model was likely created using llmcompressor's oneshot quantization:
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        sequential_targets=["Qwen2_5_VLDecoderLayer"],
        ignore=["lm_head", "re:visual.*"],
    ),
]

If this is more of an llmcompressor-specific model-loading issue rather than a transformers compatibility issue, please let me know and I'll file this issue in the llmcompressor repository instead.