ERNIEKit is an industrial-grade development toolkit for ERNIE 4.5. It provides training and compression capabilities, including Pre-Training, Supervised Fine-Tuning (SFT), Low-Rank Adaptation (LoRA), and Direct Preference Optimization (DPO), as well as Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). It also includes practical applications and tutorials for leveraging ERNIE models.
[2025-09] 🔥 Released ERNIEKit v1.1: ERNIEKit now supports SFT/LoRA for ERNIE-4.5-VL series.
[2025-06] 🔥 Released ERNIEKit v1.0: We're excited to announce ERNIEKit v1.0, the most powerful and efficient toolkit yet for developing with the latest ERNIE models!
- 🚀 Industrial-grade High-Performance Pre-Training: Optimized ERNIE 4.5 pre-training implementation featuring 3D hybrid parallelism and FP8 mixed-precision acceleration. Please refer to Pre-Training for more details.
- 🪙 Low-bit Quantization-aware Fine-tuning: To significantly lower the barrier and cost of fine-tuning and deploying the ERNIE 4.5 model, we introduce a novel FP8 Quantization-Aware Training (QAT) methodology that integrates low-precision training with optimizer offloading. As a result, the minimum resources for fine-tuning ERNIE-4.5-300B-A47B have been substantially reduced from 96 GPUs to only 16 GPUs, while maintaining the model's original performance. Crucially, unlike prevalent FP8 mixed-precision schemes that rely on online block-wise and tile-wise quantization, models produced by ERNIEKit's QAT solution support highly efficient offline tensor-wise FP8 quantization for inference, eliminating the computational overhead of dynamic quantization at inference time. For more information, please refer to FP8-QAT and WINT4/8-LoRA.
- 👁️ Visual Training & Debugging Interface: A Gradio-based WebUI for zero-code fine-tuning, alignment, and inference. Please refer to WebUI & CLI for more details.
- 🔌 Multiple Hardware Support: Supports training on NVIDIA GPUs, Kunlunxin XPUs, and Ascend NPUs.
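To illustrate the offline tensor-wise quantization idea mentioned in the QAT feature above, here is a toy NumPy simulation (not ERNIEKit code): a single scale is precomputed once from the trained weights, so inference pays no per-block dynamic-scaling cost. Uniform rounding stands in for true FP8 rounding here.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def tensorwise_quant(w):
    """Offline tensor-wise quantization: one scale for the whole tensor,
    precomputed from the weights, so inference needs no dynamic scaling.
    Rounding to a uniform grid is a simplification of true FP8 rounding."""
    scale = np.abs(w).max() / FP8_E4M3_MAX
    q = np.clip(np.round(w / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = tensorwise_quant(w)
max_err = np.abs(q * scale - w).max()  # bounded by half a quantization step
```

Block-wise schemes instead compute such a scale per tile of the tensor at runtime; the QAT recipe above trains the model so that one offline per-tensor scale suffices.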
| Dependency | Recommended Version |
|---|---|
| CUDA | ≥ 12.3 |
| CUDA Driver | ≥ 535.171 |
| nvcc | ≥ 12.3 |
| gcc | ≥ 12.2 |
| Python | 3.10 - 3.12 |
| GPU Architecture | Ampere/Hopper (80GB+HBM) |
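As a quick sanity check against the table above, a small helper (not part of ERNIEKit; the version bounds simply mirror the recommendations listed here) can compare dotted version strings:

```python
import sys

def meets_python_requirement(major: int, minor: int) -> bool:
    """True if the interpreter falls in the recommended 3.10 - 3.12 range."""
    return (3, 10) <= (major, minor) <= (3, 12)

def meets_min_version(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '12.9' >= '12.3'."""
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(version) >= parse(minimum)

if __name__ == "__main__":
    print("Python OK:", meets_python_requirement(*sys.version_info[:2]))
    print("CUDA 12.9 OK:", meets_min_version("12.9", "12.3"))
```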
Docker-Based Installation (Recommended)
To ensure environment consistency across different hardware configurations, we recommend using our pre-configured Docker images. These images include CUDA, cuDNN, and NCCL dependencies with PaddlePaddle v3.1 pre-installed:
# Choose based on your CUDA version requirements:
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.1.0-gpu-cuda12.9-cudnn9.9
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.1.0-gpu-cuda12.6-cudnn9.5
Source Code Installation
If not using Docker, ensure your environment meets the prerequisites in 2.1. ERNIEKit requires PaddlePaddle v3.1+. See official PaddlePaddle Installation Guide for details.
Verify installation with:
python -c "import paddle;paddle.utils.run_check()"
A successful installation shows:
PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
git clone https://github.com/PaddlePaddle/ERNIE
cd ERNIE
python -m pip install -r requirements/gpu/requirements.txt
python -m pip install -e .  # installing in editable mode is recommended
You can also build a Docker image yourself that includes all the dependencies listed in requirements.txt. Please refer to build docker for more details.
Please refer to FastDeploy installation.
ERNIEKit supports training for the following models. Before initiating training, please ensure that:
- Environment setup is completed
- Your hardware meets the minimum resource requirements
| Model | Multimodal Model | Post-Training Method | Seq Length | Min Resources | Recommended Config |
|---|---|---|---|---|---|
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA | 8K | 16x80G A/H GPUs | run_sft_lora_8k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA | 32K | 16x80G A/H GPUs | run_sft_lora_32k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 8K | 8x80G A/H GPUs | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 32K | 8x80G A/H GPUs | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 128K | 16x80G A/H GPUs | run_sft_wint8mix_lora_128k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT | 8K | 96x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT | 32K | 112x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT(FP8) | 8K | 16x80G H GPUs + 2TB CPU RAM | run_sft_fp8_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT(FP8) | 32K | 16x80G H GPUs + 2TB CPU RAM | run_sft_fp8_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT-LoRA(wint4/8) | 8K | 4x80G A/H GPUs | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT-LoRA(wint4/8) | 32K | 8x80G A/H GPUs | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO | 8K | 112x80G A/H GPUs | run_dpo_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO | 32K | 112x80G A/H GPUs | run_dpo_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO-LoRA | 8K | 16x80G A/H GPUs | run_dpo_lora_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO-LoRA | 32K | 16x80G A/H GPUs | run_dpo_lora_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 8K | 8x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 32K | 8x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 128K | 8x80G A/H GPUs | run_sft_128k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 8K | 4x80G A/H GPUs | run_sft_lora_8k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 32K | 4x80G A/H GPUs | run_sft_lora_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 128K | 4x80G A/H GPUs | run_sft_lora_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 8K | 8x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 32K | 8x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 128K | 8x80G A/H GPUs | run_sft_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT-LoRA(wint4/8) | 8K | 1x80G A/H GPUs | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT-LoRA(wint4/8) | 32K | 1x80G A/H GPUs | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 8K | 8x80G A/H GPUs | run_dpo_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 32K | 8x80G A/H GPUs | run_dpo_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 128K | 8x80G A/H GPUs | run_dpo_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO-LoRA | 8K | 1x80G A/H GPUs | run_dpo_lora_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO-LoRA | 32K | 1x80G A/H GPUs | run_dpo_lora_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 8K | 1x80G A/H GPU | run_sft_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 32K | 1x80G A/H GPU | run_sft_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 128K | 1x80G A/H GPU | run_sft_128k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT-LoRA(wint4/8) | 8K | 1x80G A/H GPU | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT-LoRA(wint4/8) | 32K | 1x80G A/H GPU | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 8K | 1x80G A/H GPU | run_dpo_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 32K | 1x80G A/H GPU | run_dpo_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 128K | 1x80G A/H GPU | run_dpo_128k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO-LoRA | 8K | 1x80G A/H GPU | run_dpo_lora_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO-LoRA | 32K | 1x80G A/H GPU | run_dpo_lora_32k.yaml |
ERNIEKit supports both alpaca and erniekit dataset formats. For detailed format specifications, refer to Dataset Guide.
We provide sample datasets in the erniekit format for quick start; please refer to Demo Datasets.
Subsequent sections will demonstrate workflows using these sample datasets.
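For orientation, one record in the widely used alpaca schema looks like the sketch below (the record content is illustrative; the erniekit format differs, so consult the Dataset Guide for both exact schemas):

```python
import json

# One training record in the alpaca schema: instruction + optional input + output.
record = {
    "instruction": "Translate the sentence to French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

# Datasets in this schema are commonly stored as a JSON list of such records.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```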
Supervised Fine-Tuning (SFT) adapts pre-trained language models using labeled datasets to enhance task-specific performance and instruction-following capabilities. This parameter-updating method:
- Requires high-quality annotated data
- Adjusts all model parameters
- Ideal for precision-critical specialized tasks
For configuration details: ⚙️ General Training Settings ⚙️ SFT Settings
Example 1: Full-Parameter Supervised Fine-tuning
The following example requires training on a single 80G A/H GPU machine.
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, SFT
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 32K Sequence Length, SFT
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_32k.yaml
Example 2: Parameter Efficient Fine-tuning
LoRA (Low-Rank Adaptation) leverages matrix low-rank decomposition techniques to achieve model fine-tuning by only adjusting a small number of new parameters. LoRA training reduces resource requirements while often delivering comparable or even superior performance to full-parameter fine-tuning on small datasets.
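As a minimal illustration of the idea (a NumPy sketch, not ERNIEKit's implementation), a LoRA-adapted linear layer adds a trainable low-rank update B·A on top of the frozen pretrained weight:

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16.0):
    """y = x @ W.T + (alpha/r) * x @ A.T @ B.T, with W frozen and only A, B trained."""
    r = A.shape[0]  # LoRA rank, r << min(W.shape)
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 64, 32, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trained down-projection
B = np.zeros((d_out, r))                   # trained up-projection, zero-initialized
x = rng.standard_normal((2, d_in))
y = lora_linear(x, W, A, B)  # equals x @ W.T while B is still zero
```

Because B is zero-initialized, training starts exactly at the pretrained model's behavior; after training, the update can be folded in offline as W + (alpha/r) * B @ A, which is conceptually what merging LoRA weights during export performs.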
Compared to standard SFT, enabling LoRA training simply requires adding fine_tuning: LoRA to the training configuration. For more training parameters, refer to LoRA configurations.
The following example requires training on a single 80GB A/H GPU card.
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, SFT-LoRA
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_lora_8k.yaml
Viewing Training Logs
If your script specifies the logging_dir argument, we save VisualDL visualization results to that directory. Otherwise, results are stored at the path specified by output_dir.
Start VisualDL with the following command to view training logs:
visualdl --logdir ${YOUR_LOG_DIR} --host ${HOST_IP} --port ${PORT}
Alignment Training is a crucial technique for ensuring the behavior of Large Language Models (LLMs) aligns with human intentions, values, or specific objectives. Its core goal is to address the issue of pretrained models being "powerful but uncontrollable," making model outputs safer, more reliable, and better aligned with human expectations.
Direct Preference Optimization (DPO) is a representative method for achieving human preference alignment. It directly fine-tunes model parameters on annotated preference data. Compared to RLHF, DPO offers higher training stability and lower computational overhead, establishing itself as a mainstream preference alignment approach.
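As a sketch of the objective (the standard DPO loss, not ERNIEKit-specific code): given summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, the loss pushes the policy's preference margin above the reference's.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
    Inputs are summed log-probabilities of each full response."""
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more strongly than the reference
# does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-13.0, beta=0.1)
```

At initialization the policy equals the reference, the margin is zero, and the loss starts at log(2); beta controls how sharply deviations from the reference are rewarded or penalized.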
For more training configurations, refer to Training configuration and DPO configuration.
Example 1: Full-Parameter Direct Preference Optimization
The following example requires training on a single 80G A/H GPU machine.
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, DPO
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_8k.yaml
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 32K Sequence Length, DPO
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_32k.yaml
Example 2: Direct Preference Optimization with LoRA
The following example requires training on a single 80G A/H GPU machine.
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, DPO-LoRA
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_lora_8k.yaml
After LoRA fine-tuning, merge the LoRA weights with the base model weights. In multi-machine training scenarios:
path_to_checkpoints/
├── added_tokens.json
├── config.json
├── model-00001-of-00xxx.safetensors
├── model-00002-of-00xxx.safetensors
├── ...
├── model-00xxx-of-00xxx.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.model
To merge LoRA parameters into the base model after training:
erniekit export examples/configs/ERNIE-4.5-0.3B/run_export.yaml lora=True
Trained ERNIEKit weights can be directly deployed using FastDeploy through integrated CLI tools. Below is an example for ERNIE-4.5-0.3B:
# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
erniekit server examples/configs/ERNIE-4.5-0.3B/run_chat.yaml
erniekit chat examples/configs/ERNIE-4.5-0.3B/run_chat.yaml