
English | 简体中文

AngelSlim

A more accessible, comprehensive, and efficient toolkit for large model compression.

📖 Documentation   |   🤗 Hugging Face   |   🤖 ModelScope   |   💬 WeChat |   🫨 Discord

📣Latest News

  • [25/11/03] We have released v0.2, adding quantization support for new models such as GLM-4.6 and Qwen3-VL, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
  • [25/09/30] We have released SpecExit, the reasoning early-exit algorithm: [Paper] | [Docs] | [vLLM Code]🔥🔥🔥
  • [25/09/26] We have released TEQUILA, the ternary quantization algorithm [Paper] | [Code]🔥🔥🔥
  • [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We also open-source Qwen3-32B-NVFP4 and Qwen3-235B-A22B-NVFP4 weights.
Previous News
  • [25/09/01] We now support FP8 quantization of the Hunyuan-MT-7B translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache acceleration for FLUX, and quantization for Seed-OSS.
  • [25/08/06] We now support quantization for Hunyuan 0.5B/1.8B/4B/7B and the multimodal Qwen2.5VL 3B/7B/32B/72B models with FP8/INT4 algorithms, and quantization for DeepSeek-R1/V3 and Kimi-K2 with FP8-Static and W4A8-FP8 algorithms. We also open-source Eagle3 model weights for the Hunyuan 1.8B/4B/7B series.
  • [25/07/04] We now support quantization for Hunyuan, Qwen2.5, Qwen3, DeepSeek-R1-Distill-Qwen and other models with INT8/FP8/INT4 algorithms. We also open-source Eagle3 model weights for the Qwen3 series.

🌟Key Features

  • Highly Integrated: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
  • Continuous Innovation: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
  • Performance-Driven: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.

💼Technical Overview

| Scenario | Quantization | Speculative Decoding | Other Techniques |
| --- | --- | --- | --- |
| Large Language Models (LLMs) | ✅ | ✅ | Sparse Attention (Under Development) |
| Vision Language Models (VLMs) | ✅ |  | Token Pruning (Under Development) |
| Diffusion Models | ✅ | - | Cache (DeepCache, TeaCache); Sparse Attention (Under Development) |
| Speech Models (TTS/ASR) | Under Development | Under Development | Token Pruning (Under Development) |

🛎️How to Use

1. Install AngelSlim

We recommend using pip to install the latest stable version of AngelSlim:

pip install angelslim

Alternatively, you can clone the repository and install from source:

git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install

For more detailed installation instructions, please refer to the Installation Documentation.

2. Quick Start

  • Quantization

    After installing AngelSlim, you can launch static FP8 quantization for the Qwen3-1.7B model with the following one-command script:

    python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml

    This example produces quantized model weights by performing PTQ calibration on a model loaded from HuggingFace.

    Code-based Start

    To perform dynamic FP8 quantization on Qwen3-1.7B:

    from angelslim.engine import Engine
    
    slim_engine = Engine()
    # Prepare model
    slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B",)
    # Initialize compressor
    slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
    # Compress model
    slim_engine.run()
    # Save compressed model
    slim_engine.save("./output")

    For more details, please refer to the Quick Start Documentation.

  • Speculative Decoding

    After installing AngelSlim, you can quickly start Eagle3 training with the following scripts:

    # Start the vLLM server
    bash scripts/speculative/run_vllm_server.sh
    # Generate training data
    bash scripts/speculative/generate_data_for_target_model.sh
    # Perform online training for the Eagle3 model
    bash scripts/speculative/train_eagle3_online.sh
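
    Before generating training data, it can help to verify that the vLLM server is reachable. A minimal sketch, assuming the server exposes the standard OpenAI-compatible API on localhost:8000 (the vLLM default; adjust base_url and prompt to your setup, and install the openai Python package first):

    # Sanity-check the vLLM OpenAI-compatible server (port 8000 is an assumption).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    served = client.models.list().data[0].id  # the target model registered by vLLM
    resp = client.chat.completions.create(
        model=served,
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=32,
    )
    print(served, "->", resp.choices[0].message.content)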

    For detailed training configurations and PyTorch performance benchmarks of Eagle3, please refer to the Quick Start Guide for Speculative Sampling.

  • Diffusion Model Quantization

    Use scripts/diffusion/run_diffusion.py for quantization and inference:

    # Online quantization and inference
    python scripts/diffusion/run_diffusion.py \
      --model-name-or-path black-forest-labs/FLUX.1-schnell \
      --quant-type fp8-per-tensor \
      --prompt "A cat holding a sign that says hello world" \
      --height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
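
    For reference, the flags above map roughly onto the following plain diffusers call for the unquantized FLUX.1-schnell baseline. This is a comparison sketch only; the AngelSlim script additionally applies the fp8-per-tensor quantization before inference:

    # Baseline FLUX.1-schnell inference with diffusers, mirroring the CLI flags above.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    ).to("cuda")
    image = pipe(
        "A cat holding a sign that says hello world",
        height=1024,
        width=1024,
        num_inference_steps=4,   # --steps 4
        guidance_scale=0.0,      # --guidance 0.0
        generator=torch.Generator("cuda").manual_seed(0),  # --seed 0
    ).images[0]
    image.save("flux_schnell.png")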

    For more quantization inference methods, please refer to the Diffusion Model Quantization Documentation.

3. Deployment and Testing

3.1 Offline Inference

To test offline inference with a quantized model loaded via transformers, run the following command:

python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"

Where MODEL_PATH is the path to the quantized model output. Please set deploy_backend: huggingface in the global configuration before quantizing the model, or manually rename the "ignored_layers" field in the config.json of the quantized model output directory to "ignore".
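
For reference, a minimal transformers-based check equivalent to the script above. This is a sketch that assumes the quantized output directory is a standard Hugging Face checkpoint (the ./output path is illustrative):

# Smoke-test the quantized checkpoint with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output"  # illustrative; use your MODEL_PATH
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))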

3.2 API Service Deployment

After specifying the quantized model path MODEL_PATH, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:

  • vLLM

    Use the following script to launch a vLLM server (recommended version: vllm>=0.8.5.post1). For MoE INT8 quantized models, vllm>=0.9.0 is required.

    bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096

    Where -d is the visible devices, -t is tensor parallel size, -p is pipeline parallel size, and -g is the GPU memory utilization.

  • SGLang

    Use the following script to launch an SGLang server (recommended version: sglang>=0.4.6.post1).

    bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8

3.3 Service Invocation

Invoke requests via OpenAI's API format:

bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."

Where -p is the input prompt.
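
The same request can also be issued from Python with the OpenAI client. A sketch assuming the service from section 3.2 is listening on localhost:8080; the sampling parameters mirror the script above, with backend-specific options passed via extra_body:

# Query the OpenAI-compatible service (port 8080 and served model name are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # the served (quantized) model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, my name is"},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(resp.choices[0].message.content)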

3.4 Performance Evaluation

Evaluate the performance of the quantized model using lm-evaluation-harness (recommended version: lm-eval>=0.4.8).

Run script details
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH

Where RESULT_PATH is the directory for saving test results, -b is the batch size, --tasks specifies the evaluation tasks, and -n is the number of few-shot examples.
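
The same evaluation can also be driven from Python via the lm-eval API. A minimal sketch, assuming the quantized checkpoint loads through the standard hf backend and using a subset of the tasks from the script above (the ./output path is illustrative):

# Programmatic lm-evaluation-harness run (lm-eval>=0.4.8).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./output",  # illustrative path to the quantized model
    tasks=["ceval-valid", "mmlu", "gsm8k"],
    num_fewshot=0,
    batch_size="auto",
)
print(results["results"])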

For more details, please refer to the Deployment Documentation.

📈 Benchmark

1. Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the Benchmark documentation.

1.1 Hunyuan Series Models

Benchmark results for the Hunyuan-Instruct series models with FP8, INT4-AWQ and INT4-GPTQ quantization algorithms on datasets including OlympiadBench, AIME 2024, DROP and GPQA-Diamond:

| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
| --- | --- | --- | --- | --- | --- |
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |

1.2 Qwen3 Series Models

Benchmark results for Qwen3 series models with FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including CEVAL, MMLU, GSM8K, and HUMANEVAL:

| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |

1.3 DeepSeek Series Models

Benchmark results for DeepSeek-R1-0528 with FP8-Block-Wise and W4A8-FP8 quantization algorithms on datasets including GPQA Diamond, AIME 2024, SimpleQA and LiveCodeBench:

| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |

Note
  • The above results are based on the average of 5 test runs deployed with TRT-LLM
  • The hyperparameters used during evaluation are as follows:
{
 "top_k": 20,
 "top_p": 0.6,
 "temperature": 0.7,
 "output_seq_len": 32768,
 "max_input_seq_len": 16384
}

1.4 Qwen-VL Series Models

Qwen3-VL Benchmark

Benchmark results for Qwen3-VL series models with BF16, FP8-Static and FP8-Dynamic quantization algorithms on datasets including MMMU_VAL, DocVQA_VAL and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
| --- | --- | --- | --- | --- |
| Qwen3-VL-32B-Instruct | BF16 | 60.11 | 96.08 | 94.64 |
| | FP8-Static | 61.22 | 96.00 | 94.64 |
| | FP8-Dynamic | 60.78 | 96.19 | 94.72 |
| Qwen3-VL-30B-A3B-Instruct | BF16 | 50.44 | 95.28 | 95.36 |
| | FP8-Dynamic | 50.67 | 95.25 | 95.20 |

Qwen2.5VL Benchmark

Benchmark results for Qwen2.5VL series models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ and INT4-AWQ quantization algorithms on datasets including MMMU_VAL, DocVQA_VAL and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
| --- | --- | --- | --- | --- |
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |

1.5 Other Models

Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like CEVAL, MMLU, and GSM8K using quantization strategies including FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ.

Benchmark Experiment Details

| Model | Quantization | CEVAL | MMLU | GSM8K |
| --- | --- | --- | --- | --- |
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |

2. Speculative Decoding

2.1 Qwen3 Series Models

Benchmark results for Qwen3 series models with the Eagle3 speculative decoding algorithm on datasets including MT-bench, HumanEval, GSM8K, and Alpaca:

| Temperature | Model | MT-bench (Speedup / τ) | HumanEval (Speedup / τ) | GSM8K (Speedup / τ) | Alpaca (Speedup / τ) | Mean (Speedup / τ) |
| --- | --- | --- | --- | --- | --- | --- |
| T=0 | Qwen3-1.7B | 2.05x / 2.81 | 2.07x / 2.93 | 2.11x / 2.98 | 1.93x / 2.69 | 2.04x / 2.85 |
| | Qwen3-4B | 2.21x / 3.01 | 2.36x / 3.24 | 2.42x / 3.13 | 2.32x / 2.75 | 2.33x / 3.03 |
| | Qwen3-8B | 2.63x / 3.65 | 2.76x / 3.85 | 2.82x / 3.90 | 2.62x / 3.48 | 2.70x / 3.72 |
| | Qwen3-14B | 2.23x / 3.30 | 2.53x / 3.74 | 2.56x / 3.79 | 2.16x / 3.13 | 2.37x / 3.49 |
| | Qwen3-32B | 2.39x / 2.78 | 2.37x / 2.81 | 2.47x / 2.92 | 2.42x / 2.53 | 2.41x / 2.76 |
| | Qwen3-30B-A3B | 2.84x / 3.63 | 2.27x / 3.09 | 2.64x / 3.42 | 2.83x / 3.56 | 2.64x / 3.42 |
| T=1 | Qwen3-1.7B | 1.74x / 2.53 | 1.86x / 2.70 | 1.82x / 2.69 | 1.72x / 2.46 | 1.93x / 2.60 |
| | Qwen3-4B | 1.93x / 2.60 | 2.00x / 2.84 | 2.11x / 2.82 | 2.34x / 2.50 | 1.75x / 2.69 |
| | Qwen3-8B | 1.98x / 2.75 | 2.25x / 3.11 | 2.31x / 3.15 | 2.10x / 2.76 | 2.90x / 2.94 |
| | Qwen3-14B | 1.71x / 2.61 | 1.95x / 2.87 | 2.04x / 3.08 | 1.68x / 2.55 | 2.90x / 2.78 |
| | Qwen3-32B | 1.62x / 1.91 | 1.71x / 2.05 | 1.78x / 2.10 | 1.80x / 1.95 | 1.62x / 2.00 |
| | Qwen3-30B-A3B | 1.91x / 2.46 | 2.00x / 2.64 | 1.90x / 2.53 | 1.80x / 2.32 | 1.90x / 2.48 |

2.2 Hunyuan Series Models

Benchmark results for Hunyuan series models with the Eagle3 speculative decoding algorithm on datasets including MT-bench, HumanEval, GSM8K, and Alpaca:

| Temperature | Model | MT-bench (Speedup / τ) | HumanEval (Speedup / τ) | GSM8K (Speedup / τ) | Alpaca (Speedup / τ) | Mean (Speedup / τ) |
| --- | --- | --- | --- | --- | --- | --- |
| T=0 | Hunyuan-1.8B-Instruct | 1.97x / 2.90 | 2.58x / 3.73 | 2.61x / 3.71 | 1.71x / 2.43 | 2.22x / 3.19 |
| | Hunyuan-4B-Instruct | 1.77x / 2.60 | 2.64x / 3.35 | 2.14x / 3.17 | 1.72x / 2.57 | 2.07x / 2.92 |
| | Hunyuan-7B-Instruct | 2.22x / 3.58 | 3.59x / 5.47 | 2.96x / 4.68 | 1.64x / 2.56 | 2.60x / 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x / 2.36 | 2.35x / 3.56 | 2.23x / 3.38 | 1.26x / 1.87 | 1.86x / 2.79 |
| | Hunyuan-4B-Instruct | 1.36x / 2.05 | 1.97x / 2.86 | 1.72x / 2.68 | 1.14x / 1.76 | 1.55x / 2.34 |
| | Hunyuan-7B-Instruct | 1.90x / 3.11 | 3.12x / 5.09 | 2.74x / 4.34 | 1.47x / 2.39 | 2.31x / 3.73 |

📝 License

The code for this project is open-sourced under the License for AngelSlim.

🔗 Citation

@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim},
}

💬 Technical Discussion

  • AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub Issues or join our WeChat discussion group.