title: Qwen2.5 Technical Report
source: arxiv
arxiv_id: 2412.15115
url: https://arxiv.org/abs/2412.15115
authors:
  - An Yang
  - Baosong Yang
  - Beichen Zhang
  - Binyuan Hui
  - Bo Zheng
  - Bowen Yu
  - Chengyuan Li
  - Dayiheng Liu
  - Fei Huang
  - Haoran Wei
  - Huan Lin
  - Jian Yang
  - Jianhong Tu
  - Jianwei Zhang
  - Jianxin Yang
  - Jiaxi Yang
  - Jingren Zhou
  - Junyang Lin
  - Kai Dang
  - Keming Lu
  - Keqin Bao
  - Kexin Yang
  - Le Yu
  - Mei Li
  - Mingfeng Xue
  - Pei Zhang
  - Qin Zhu
  - Rui Men
  - Runji Lin
  - Tianhao Li
  - Tianyi Tang
  - Tingyu Xia
  - Xingzhang Ren
  - Xuancheng Ren
  - Yang Fan
  - Yang Su
  - Yichang Zhang
  - Yu Wan
  - Yuqiong Liu
  - Zeyu Cui
  - Zhenru Zhang
  - Zihan Qiu
published: 2024-12-19
categories:
  - cs.CL
primary_category: cs.CL
fetched_at: 2026-05-29T00:00:00Z
topics:
  - language-models
  - efficient-architectures
  - alignment-rlhf
aliases:
  - Qwen2.5 Technical Report
tags:
  - topic/language-models
  - topic/efficient-architectures
  - topic/alignment-rlhf
  - level/frontier
  - medium/paper
  - task/language
  - technique/moe
  - technique/rlhf
  - technique/lora-peft
  - technique/quantization

Abstract

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

Why it matters

  • Qwen2.5 scales its high-quality pre-training data from 7 trillion to 18 trillion tokens, which the report credits with providing a strong foundation for common sense, expert knowledge, and reasoning.
  • Post-training uses over 1 million supervised fine-tuning samples plus multistage reinforcement learning, substantially improving instruction following, long-text generation, and structured data analysis.
  • The open-weight flagship, Qwen2.5-72B-Instruct, outperforms a number of open and proprietary models and performs competitively with Llama-3-405B-Instruct, which is around 5 times larger.
  • The MoE-based hosted variants (Qwen2.5-Turbo and Qwen2.5-Plus) perform competitively against GPT-4o-mini and GPT-4o respectively, with the report claiming superior cost-effectiveness.
  • Qwen2.5 models also serve as the foundation for specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.
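The scale figures in the bullets above can be checked with quick arithmetic. The following is a minimal sketch, not something from the paper: it assumes the parameter counts can be read directly from the model names (72B and 405B), and the constant and function names are illustrative only.

```python
# Figures quoted in the abstract: tokens in trillions, parameters in billions.
# Assumption: parameter counts are taken from the model names, not measured.
PRETRAIN_TOKENS_PREV_T = 7
PRETRAIN_TOKENS_NEW_T = 18
QWEN_FLAGSHIP_PARAMS_B = 72
LLAMA_PARAMS_B = 405


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


# Growth of the pre-training corpus between the previous and current series.
data_scale = ratio(PRETRAIN_TOKENS_NEW_T, PRETRAIN_TOKENS_PREV_T)

# Nominal size gap between the 405B comparison model and the 72B flagship.
size_gap = ratio(LLAMA_PARAMS_B, QWEN_FLAGSHIP_PARAMS_B)

print(f"pre-training data growth: {data_scale:.1f}x")
print(f"405B vs 72B parameter gap: {size_gap:.1f}x")
```

Under these assumptions the corpus grows by roughly 2.6x, and the nominal parameter gap comes out between 5x and 6x, consistent with the abstract's "around 5 times larger".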

Source: https://arxiv.org/abs/2412.15115. This entry is the paper's abstract + metadata; read the full paper at the link.