This repository was archived by the owner on Oct 25, 2024. It is now read-only.
Intel® Extension for Transformers v1.1 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixes
- Documentation
- Validated Configurations
Highlights
- Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
- Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
- Enabled 4-bit LLM inference with a plain C++ implementation, outperforming llama.cpp
- Supported quantization for broad LLMs with the improved lm-evaluation-harness for multiple frameworks and data types
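To make the 4-bit highlight concrete, here is a minimal illustrative sketch of group-wise symmetric INT4 weight-only quantization, the general technique behind 4-bit LLM inference. This is a toy in pure Python, not the release's C++ kernels; the function names and group size are made up for illustration.

```python
# Illustrative sketch of group-wise symmetric INT4 weight-only
# quantization (NOT the library's actual C++ kernels).

def quantize_int4(weights, group_size=4):
    """Quantize a flat list of float weights to INT4, one FP scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric quantization: map [-amax, amax] onto the INT4 range [-8, 7].
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 7.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_int4(q, scales, group_size=4):
    """Recover approximate float weights from INT4 values and group scales."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]

w = [0.12, -0.40, 0.33, 0.05, 1.5, -0.7, 0.2, 0.9]
qw, s = quantize_int4(w)
w_hat = dequantize_int4(qw, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(qw)
print(round(max_err, 3))
```

Storing 4-bit integers plus a handful of per-group scales is what shrinks weight memory roughly 4x versus FP16, at the cost of the small reconstruction error shown above.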
Features
- Model Optimization
- Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
- Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
- Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
- Enable QAT for Stable Diffusion (commit 2e2efd)
- Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
- Transformers-accelerated Neural Engine
- Support PyTorch model as input of Neural Engine (commit e83a51, 3625db)
- Inference with the C++ graph: MPT-7B, LLAMA-7B, GPT-NeoX-20B (commit 970bfa), Falcon-7B (commit 762723)
- Inference with weight-only compression (commit d87132, 0065db, d30eff)
- Reduce memory usage of inference (commit 36f3e9, 2dc594, 3f6b47, 5f75df, 7860f9)
- Stable Diffusion on Windows (commit 52d5e6)
- MHA for BERT (commit 59af3af)
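For readers unfamiliar with the MHA items above, the core computation an MHA kernel fuses is scaled dot-product attention, softmax(QKᵀ/√d)·V. Below is an illustrative single-head version in pure Python; it is a sketch of the math, not the Neural Engine's fused implementation.

```python
# Illustrative single-head scaled dot-product attention
# (the math an MHA kernel fuses; NOT the Neural Engine's code).
import math

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for row-major lists of vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
r = attention(Q, K, V)
print(r)
```

A fused MHA kernel performs the matmul, softmax, and second matmul in one pass to avoid materializing the full attention matrix in memory.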
- Transformers-accelerated Libraries
- MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
- Support dynamic quantization MatMul and post-ops (commit 4cb9e4, cf0400, 9acfe1)
- INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87)
- Support dynamic quantization op (commit 6fcc15)
- Add AVX2 kernels for Windows (commit bc313c)
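Dynamic quantization, mentioned in several items above, derives the scale at runtime from each tensor's observed range rather than from offline calibration. The following is a hedged, self-contained sketch of that idea for asymmetric INT8 (a toy illustration, not the library's kernels).

```python
# Illustrative sketch of dynamic INT8 quantization: the scale and
# zero-point come from THIS tensor's runtime min/max, not from a
# calibration dataset (NOT the library's actual kernels).

def dynamic_quantize_int8(x):
    """Asymmetric INT8 quantization of a list of floats."""
    lo, hi = min(x), max(x)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must include zero
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale) - 128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point

x = [0.0, 0.6, -1.0, 2.0, 1.25]
q, scale, zp = dynamic_quantize_int8(x)
x_hat = [(v - zp) * scale for v in q]   # dequantize to check the error
print(q, round(scale, 5), zp)
```

Because the range is measured per invocation, dynamic quantization needs no calibration step, at the cost of computing min/max on the fly.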
Productivity
- Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot
- Enable Docker for Chatbot (commit 6b9522, 37b455)
- Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f)
- Update PyTorch and TensorFlow versions (commit f54817)
- Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
- Add summarization evaluation for PyTorch (commit 062e62)
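LoRA, used for the Chatbot fine-tuning above, trains a low-rank delta W + B·A (B: d_out×r, A: r×d_in) while freezing the full weight matrix W, so the trainable parameter count scales with the rank r rather than with d_out·d_in. A hedged back-of-the-envelope sketch of that saving (illustrative arithmetic, not the peft library's API):

```python
# Illustrative LoRA parameter-count arithmetic (NOT the peft library's API).

def lora_trainable_params(d_out, d_in, rank):
    """Compare full fine-tuning vs LoRA adapters for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # only the low-rank factors B and A
    return full, lora

# Hypothetical 4096x4096 projection with rank-8 adapters.
full, lora = lora_trainable_params(4096, 4096, rank=8)
print(full, lora, round(lora / full * 100, 2))  # percent of params trained
```

At rank 8 on a 4096×4096 matrix, LoRA trains well under 1% of the parameters that full fine-tuning would update, which is what makes PEFT practical on a single Xeon or Gaudi node.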
Examples
- Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
- ELECTRA FP32 & BF16 inference (commit e09c96)
- GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
- Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
- ONNX Whisper-large quantization (commit 038be0)
- 8-layer MiniLM inference (commit 0dd104)
- Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)
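The early-exit examples (TangoBERT, SWEET) share one idea: attach a classifier to intermediate layers and stop as soon as its prediction is confident enough, skipping the remaining layers. A minimal toy sketch of that control flow (hypothetical confidence threshold and toy logits, not either paper's actual implementation):

```python
# Illustrative early-exit control flow (toy model; NOT TangoBERT's or
# SWEET's actual implementation).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(per_layer_logits, threshold=0.9):
    """Return (prediction, exit_layer); exits at the first confident layer."""
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return probs.index(conf), layer
    return probs.index(conf), layer   # no early exit: use the last layer

# Toy example: confidence grows as deeper layers refine the prediction.
logits_by_layer = [[0.2, 0.3], [0.1, 1.0], [0.0, 4.0], [0.0, 6.0]]
pred, layer = early_exit(logits_by_layer, threshold=0.9)
print(pred, layer)
```

Easy inputs exit after a few layers while hard inputs run the full stack, which is where the average-latency savings come from.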
Bug Fixes
- Fix a Neural Engine error with GCC 13 (commit 37a4a3) and a GPU compilation error (commit 0f38eb)
- Fix quantization for transformers 4.30 (commit 256c1d)
- Fix a missing-metric error when running QAT on PyTorch models (commit c7e665)
Documentation
- Refine doc of NeuralChat (commit 2580f3)
- Update performance data of LLM and Stable Diffusion (commit 523fe5)
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.8, 3.9, 3.10
- Intel® Extension for TensorFlow 2.11.0, 2.12.0
- PyTorch 1.13.1+cpu, 2.0.0+cpu
- Intel® Extension for PyTorch 1.13.1+cpu, 2.0.0+cpu