
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

This repository contains the official code for the paper *The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training*.

Authors: Jinbo Wang*, Mingze Wang*, Zhanpeng Zhou*, Junchi Yan, Weinan E, Lei Wu

Motivation

Can adaptive layerwise learning rates accelerate Transformer training?

We investigate this question and make the following contributions:

  • We uncover a clear sharpness disparity across the diverse building blocks of Transformers, which is blockwise rather than simply layerwise.
  • We propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block’s sharpness, accelerating LLM pre-training.

The Sharpness Disparity Principle

Transformers exhibit distinctive alloy-like characteristics, where diverse types of blocks combine synergistically to achieve superior performance. These blocks include self-attention (further split into query-key (QK) and value-output (VO) blocks), pointwise feedforward networks (FFN), normalization layers (Norm), and embedding layers (Emb).
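For concreteness, here is a minimal sketch of how parameters might be grouped into these block types by name. The substring patterns (e.g. `q_proj`, `mlp`, `norm`) are assumptions based on common LLaMA-style naming rather than this repository's exact conventions, and fused projections (such as nanoGPT's `c_attn`) would need extra handling.

```python
def classify_block(param_name: str) -> str:
    """Map a parameter name to one of the five block types.
    Illustrative only: the substrings assume LLaMA-style module names
    and will need adapting to other codebases (e.g. nanoGPT's fused c_attn)."""
    name = param_name.lower()
    if "q_proj" in name or "k_proj" in name:
        return "QK"
    if "v_proj" in name or "o_proj" in name:
        return "VO"
    if "mlp" in name or "ffn" in name:
        return "FFN"
    if "norm" in name or "ln_" in name:
        return "Norm"
    if "emb" in name or "lm_head" in name:
        return "Emb"
    return "Other"
```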

Key Finding

By measuring the average sharpness of each block type using a Hessian approximation, we uncover a clear sharpness disparity across these blocks (see the figure below, left panel):

$$ \mathcal{S}(\textsf{Emb}) \ll \mathcal{S}(\textsf{QK}) < \mathcal{S}(\textsf{FFN}) < \mathcal{S}(\textsf{VO}) \ll \mathcal{S}(\textsf{Norm}) $$

This disparity emerges early and persists throughout training.

Figure. (left) The average sharpness for the five typical block types in a pre-trained LLaMA model (0.25B); (right) the sharpness distribution across different blocks in a pre-trained GPT-2 (small) model.

Figure. In both experiments, the total number of training steps is 50k. The Sharpness Disparity Principle emerges during the initial phase (from iteration 0 to iteration 1k), which accounts for only about 2% of the total steps, and persists throughout the rest of training.
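The repository does not currently include the sharpness-measurement code (see Repository Goals below), so the following is only a generic sketch of how an average curvature value per block could be estimated with Hutchinson's trace estimator and Hessian-vector products; the function name `blockwise_sharpness` and the probe count are illustrative, not the paper's exact procedure.

```python
import torch

def blockwise_sharpness(loss, block_params, n_probes=8):
    """Estimate the mean diagonal-Hessian entry over one block of parameters
    via Hutchinson's estimator (a rough sketch, not the paper's measurement)."""
    # Keep the graph so the gradients can be differentiated a second time.
    grads = torch.autograd.grad(loss, block_params, create_graph=True)
    n_entries = sum(p.numel() for p in block_params)
    total = 0.0
    for _ in range(n_probes):
        # Rademacher (+/-1) probe vectors, one per parameter tensor.
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in block_params]
        # Hessian-vector products Hv via double backprop.
        hvps = torch.autograd.grad(grads, block_params,
                                   grad_outputs=vs, retain_graph=True)
        # E[v^T H v] = trace(H), so v^T H v / n_entries estimates the average
        # diagonal entry, i.e. a per-block sharpness proxy.
        total += sum((v * h).sum().item() for v, h in zip(vs, hvps))
    return total / (n_probes * n_entries)
```

Comparing this estimate across, say, all QK parameters versus all Norm parameters on the same minibatch gives the kind of blockwise disparity plotted above.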

The Blockwise Learning Rate Strategy

Methodology

We propose a Blockwise Learning Rate (LR) strategy that adjusts the learning rate at the block-type level within the Transformer architecture. Each block type is assigned a different LR multiplier based on its characteristic sharpness:

$$ \eta_{\textsf{Norm}} = \eta_{\textsf{base}},\quad \eta_{\textsf{VO}} = 4\eta_{\textsf{base}},\quad \eta_{\textsf{FFN}} = 6\eta_{\textsf{base}},\quad \eta_{\textsf{QK}} = 8\eta_{\textsf{base}},\quad \eta_{\textsf{Emb}} = 10\eta_{\textsf{base}} $$

These ratios are guided by the sharpness disparity and grounded in the theory of fast-slow dynamics, which suggests increasing the LRs along low-sharpness directions while preserving the LRs along high-sharpness directions to maintain stability. The specific multipliers above represent a tuned configuration; a minimal sketch of how they can be attached to an optimizer follows below.
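As a minimal illustration (not the repository's actual training script), the multipliers above can be attached to AdamW through per-group learning rates; this sketch reuses the hypothetical `classify_block` helper from earlier, and `eta_base` plays the role of $\eta_{\textsf{base}}$.

```python
import torch

# Blockwise LR multipliers from the formula above (Norm keeps the base LR).
MULTIPLIER = {"Norm": 1.0, "VO": 4.0, "FFN": 6.0, "QK": 8.0, "Emb": 10.0}

def build_blockwise_adamw(model, eta_base, weight_decay=0.1, betas=(0.9, 0.95)):
    """Group parameters by block type and scale each group's learning rate."""
    buckets = {}
    for name, p in model.named_parameters():
        if p.requires_grad:
            buckets.setdefault(classify_block(name), []).append(p)
    param_groups = [
        {"params": ps, "lr": eta_base * MULTIPLIER.get(block, 1.0)}
        for block, ps in buckets.items()
    ]
    return torch.optim.AdamW(param_groups, betas=betas, weight_decay=weight_decay)
```

Standard PyTorch LR schedulers scale every parameter group by the same factor, so the blockwise ratios are preserved over the course of training.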

Experiments

Main results

We consistently achieve lower terminal loss and a nearly 2x speedup across a wide range of experiments!

Figure. AdamW with Blockwise LR outperforms AdamW across different model types, model sizes, and datasets.

Figure. Adam-mini with Blockwise LR outperforms Adam-mini in pre-training tasks.

Figure. In pre-training tasks, (left) Lion with Blockwise LR outperforms Lion; (right) with the WSD LR scheduler, AdamW with Blockwise LR outperforms AdamW.

Ablation Study

Table. Ablation results for the effectiveness of Blockwise LR in pre-training LLaMA (0.25B) on OpenWebText.

Scaling law experiment


Figure. Scaling law on C4 (under the Chinchilla setting); scaling law on OpenWebText (with a fixed number of training tokens).

Citation

```bibtex
@inproceedings{
  wang2025the,
  title={The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training},
  author={Jinbo Wang and Mingze Wang and Zhanpeng Zhou and Junchi Yan and Weinan E and Lei Wu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=DZ6iFdVDrx}
}
```

Repository Goals

This implementation is not optimized for memory or compute performance: the main goal is to illustrate how Blockwise LR works, not to provide production-ready code. It currently does not include code for computing the sharpness statistics.

Like nanoGPT, this code is intended to remain stable (receiving only bug fixes) so that it can serve as a consistent reference implementation.

Special Thanks: Many thanks to Andrej Karpathy for creating nanoGPT, which serves as a foundational component of this project.
