🚀 The feature, motivation and pitch
In the past, we padded weights during int4 quantization when their shape was not a multiple of the group size, so that things just worked. Since we decided to remove the padding, int4 quantization is now simply skipped for incompatible shapes. Among other things, this means int4 quantization is no longer exercised in testing, because the stories model uses 288x288 linear layers, which are not a multiple of 256:
```
Time to load model: 0.19 seconds
Quantizing the model with: {'executor': {'accelerator': 'cuda'}, 'precision': {'dtype': 'bf16'}, 'linear:int4': {'groupsize': 256}}
Skipping quantizing weight with int4 weight only quantization because the shape of weight torch.Size([288, 288]) is not compatible with group_size 256
Skipping quantizing weight with int4 weight only quantization because the shape of weight torch.Size([288, 288]) is not compatible with group_size 256
Skipping quantizing weight with int4 weight only quantization because the shape of weight torch.Size([288, 288]) is not compatible with group_size 256
Skipping quantizing weight with int4 weight only quantization because the shape of weight torch.Size([288, 288]) is not compatible with group_size 256
```
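For reference, a minimal sketch of the kind of divisibility check that presumably produces the "Skipping" messages above; the actual condition inside torchao may differ in detail.

```python
# Assumed illustration of the compatibility check behind the log above;
# not the real torchao implementation.
group_size = 256
in_features = 288  # the stories model's linear layers are 288x288

if in_features % group_size != 0:
    print(
        f"Skipping int4 weight-only quantization: in_features={in_features} "
        f"is not a multiple of group_size={group_size}"
    )
```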
Some options:
- replace stories with another model that meets the requirement
- add other tests for int4 quantization in tc
Alternatives
Put padding back into int4 quantization.
Yes, padding is not ideal, but then again, silently skipping quantization is not ideal either. In my experience, just making things work increases utility for end users. If there is a real concern about performance (int4 quantization with padding may still beat no quantization at all!), pad and issue a warning to users.
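A minimal sketch of what that could look like, assuming the weight's in_features dimension is zero-padded up to the next multiple of the group size; `pad_to_group_size` and the warning text are hypothetical illustrations, not torchao's actual API.

```python
import warnings
import torch
import torch.nn.functional as F

def pad_to_group_size(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    """Zero-pad the last (in_features) dimension up to a multiple of group_size."""
    in_features = weight.shape[-1]
    remainder = in_features % group_size
    if remainder == 0:
        return weight
    pad = group_size - remainder
    warnings.warn(
        f"Padding weight of shape {tuple(weight.shape)} by {pad} columns so that "
        f"int4 quantization with group_size={group_size} can be applied; "
        "this may be slower than an exact fit but still faster than no quantization."
    )
    # Extra zero columns contribute nothing to the matmul result.
    return F.pad(weight, (0, pad))

# Example: a 288x288 stories-model linear with group_size=256 becomes 288x512.
w = torch.randn(288, 288, dtype=torch.bfloat16)
print(pad_to_group_size(w, 256).shape)  # torch.Size([288, 512])
```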
Additional context
No response
RFC (Optional)
No response