Is there an existing issue for this?
Current Behavior
self_attention.dense.weight in the int4 checkpoint has shape [4096, 2048], while the fp16 checkpoint has shape [4096, 4096]. This shape mismatch causes the vLLM server setup to fail.
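For context, the int4 checkpoint appears to pack two 4-bit values into each int8 element, so the stored weight's last dimension is half of the fp16 layout ([4096, 2048] vs. [4096, 4096]), whereas vLLM's weight loader seems to expect the unpacked fp16 shape. Below is a minimal sketch for inspecting the checkpoint shapes; the local checkpoint path is a placeholder, not part of the original report:

```python
# Hypothetical shape check: compare tensor shapes in a locally downloaded
# chatglm2-6b-int4 checkpoint against the fp16 layout vLLM expects.
# The checkpoint directory below is a placeholder.
import glob
import os

import torch

CKPT_DIR = "/path/to/chatglm2-6b-int4"  # placeholder path

for shard in sorted(glob.glob(os.path.join(CKPT_DIR, "*.bin"))):
    state_dict = torch.load(shard, map_location="cpu")
    for name, tensor in state_dict.items():
        if "self_attention.dense.weight" in name:
            # int4 checkpoints pack two 4-bit values per int8 element,
            # so the last dimension is half of the fp16 shape.
            print(name, tuple(tensor.shape), tensor.dtype)
```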
Expected Behavior
chatglm2-6b-int4 can be deployed with vLLM.
Steps To Reproduce
No exact command was captured; a rough sketch of the setup is below.
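The following is only an assumed reproduction, not the original invocation: the Hugging Face model id THUDM/chatglm2-6b-int4 and the use of vLLM's offline LLM API are assumptions.

```python
# Hypothetical reproduction: the model id and settings are assumptions,
# not the exact command from the original report.
from vllm import LLM, SamplingParams

# Loading the int4 checkpoint is where the shape mismatch on
# self_attention.dense.weight is expected to surface.
llm = LLM(model="THUDM/chatglm2-6b-int4", trust_remote_code=True)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```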
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?