This repository was archived by the owner on Oct 25, 2024. It is now read-only.
Intel® Extension for Transformers v1.1 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixes
- Documentation
- Validated Configurations
Highlights
- Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
- Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
- Enabled 4-bit LLM inference with a plain C++ implementation, outperforming llama.cpp
- Supported quantization for broad LLMs with the improved lm-evaluation-harness for multiple frameworks and data types
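To make the 4-bit highlight concrete, here is a minimal illustrative sketch of group-wise symmetric INT4 weight-only quantization, the general technique behind 4-bit LLM inference. This is a toy in pure Python, not the release's C++ kernels; the function names and group size are made up for illustration.

```python
# Illustrative sketch of group-wise symmetric INT4 weight-only
# quantization (NOT the library's actual C++ kernels).

def quantize_int4(weights, group_size=4):
    """Quantize a flat list of float weights to INT4, one FP scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric quantization: map [-amax, amax] onto the INT4 range [-8, 7].
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 7.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_int4(q, scales, group_size=4):
    """Recover approximate float weights from INT4 values and group scales."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]

w = [0.12, -0.40, 0.33, 0.05, 1.5, -0.7, 0.2, 0.9]
qw, s = quantize_int4(w)
w_hat = dequantize_int4(qw, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(qw)
print(round(max_err, 3))
```

Storing 4-bit integers plus a handful of per-group scales is what shrinks weight memory roughly 4x versus FP16, at the cost of the small reconstruction error shown above.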
Features
- Model Optimization
- Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
- Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
- Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
- Enable QAT for Stable Diffusion (commit 2e2efd)
- Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
- Transformers-accelerated Neural Engine
- Support PyTorch model as input of Neural Engine (commit e83a51, 3625db)
- Inference with the C++ graph: MPT-7B, LLAMA-7B, GPT-NeoX-20B (commit 970bfa), Falcon-7B (commit 762723)
- Inference with weight-only compression (commit d87132, 0065db, d30eff)
- Reduce memory usage of inference (commit 36f3e9, 2dc594, 3f6b47, 5f75df, 7860f9)
- Stable Diffusion on Windows (commit 52d5e6)
- MHA for BERT (commit 59af3af)
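For readers unfamiliar with the MHA items above, the core computation an MHA kernel fuses is scaled dot-product attention, softmax(QKᵀ/√d)·V. Below is an illustrative single-head version in pure Python; it is a sketch of the math, not the Neural Engine's fused implementation.

```python
# Illustrative single-head scaled dot-product attention
# (the math an MHA kernel fuses; NOT the Neural Engine's code).
import math

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for row-major lists of vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
r = attention(Q, K, V)
print(r)
```

A fused MHA kernel performs the matmul, softmax, and second matmul in one pass to avoid materializing the full attention matrix in memory.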
- Transformers-accelerated Libraries
- MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
- Support dynamic quantization MatMul and post-ops (commit 4cb9e4, cf0400, 9acfe1)
- INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87)
- Support dynamic quantization op (commit 6fcc15)
- Add AVX2 kernels for Windows (commit bc313c)
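Dynamic quantization, mentioned in several items above, derives the scale at runtime from each tensor's observed range rather than from offline calibration. The following is a hedged, self-contained sketch of that idea for asymmetric INT8 (a toy illustration, not the library's kernels).

```python
# Illustrative sketch of dynamic INT8 quantization: the scale and
# zero-point come from THIS tensor's runtime min/max, not from a
# calibration dataset (NOT the library's actual kernels).

def dynamic_quantize_int8(x):
    """Asymmetric INT8 quantization of a list of floats."""
    lo, hi = min(x), max(x)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must include zero
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale) - 128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point

x = [0.0, 0.6, -1.0, 2.0, 1.25]
q, scale, zp = dynamic_quantize_int8(x)
x_hat = [(v - zp) * scale for v in q]   # dequantize to check the error
print(q, round(scale, 5), zp)
```

Because the range is measured per invocation, dynamic quantization needs no calibration step, at the cost of computing min/max on the fly.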
Productivity
- Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot
- Enable Docker for Chatbot (commit 6b9522, 37b455)
- Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f)
- Update PyTorch and TensorFlow versions (commit f54817)
- Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
- Add summarization evaluation for PyTorch (commit 062e62)
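LoRA, used for the Chatbot fine-tuning above, trains a low-rank delta W + B·A (B: d_out×r, A: r×d_in) while freezing the full weight matrix W, so the trainable parameter count scales with the rank r rather than with d_out·d_in. A hedged back-of-the-envelope sketch of that saving (illustrative arithmetic, not the peft library's API):

```python
# Illustrative LoRA parameter-count arithmetic (NOT the peft library's API).

def lora_trainable_params(d_out, d_in, rank):
    """Compare full fine-tuning vs LoRA adapters for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # only the low-rank factors B and A
    return full, lora

# Hypothetical 4096x4096 projection with rank-8 adapters.
full, lora = lora_trainable_params(4096, 4096, rank=8)
print(full, lora, round(lora / full * 100, 2))  # percent of params trained
```

At rank 8 on a 4096×4096 matrix, LoRA trains well under 1% of the parameters that full fine-tuning would update, which is what makes PEFT practical on a single Xeon or Gaudi node.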
Examples
- Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
- ELECTRA FP32 & BF16 inference (commit e09c96)
- GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
- Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
- ONNX Whisper-large quantization (commit 038be0)
- 8-layer MiniLM inference (commit 0dd104)
- Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)
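The early-exit examples (TangoBERT, SWEET) share one idea: attach a classifier to intermediate layers and stop as soon as its prediction is confident enough, skipping the remaining layers. A minimal toy sketch of that control flow (hypothetical confidence threshold and toy logits, not either paper's actual implementation):

```python
# Illustrative early-exit control flow (toy model; NOT TangoBERT's or
# SWEET's actual implementation).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(per_layer_logits, threshold=0.9):
    """Return (prediction, exit_layer); exits at the first confident layer."""
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return probs.index(conf), layer
    return probs.index(conf), layer   # no early exit: use the last layer

# Toy example: confidence grows as deeper layers refine the prediction.
logits_by_layer = [[0.2, 0.3], [0.1, 1.0], [0.0, 4.0], [0.0, 6.0]]
pred, layer = early_exit(logits_by_layer, threshold=0.9)
print(pred, layer)
```

Easy inputs exit after a few layers while hard inputs run the full stack, which is where the average-latency savings come from.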
Bug Fixes
- Fix a Neural Engine error with GCC 13 (commit 37a4a3) and a GPU compilation error (commit 0f38eb)
- Fix quantization for transformers 4.30 (commit 256c1d)
- Fix a missing-metric error when running QAT on PyTorch models (commit c7e665)
Documentation
- Refine doc of NeuralChat (commit 2580f3)
- Update performance data of LLM and Stable Diffusion (commit 523fe5)
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.8, 3.9, 3.10
- Intel® Extension for TensorFlow 2.11.0, 2.12.0
- PyTorch 1.13.1+cpu, 2.0.0+cpu
- Intel® Extension for PyTorch 1.13.1+cpu, 2.0.0+cpu