v0.4.5
- Update caches for 48GB GPUs (Qwen2 VL/Llama3 8B)
- Add CPU-side packing
- Relax min size to 32
- Fix fp16 accumulation
- Add persistent SPLIT_K version
- Fix tl.contiguous hint
- Make m, n block sizes safe
- Add BitNet support in helper
- Add custom load_state_dict to allow weight serialization
- Update swizzle