Attention Drift

Code for the paper Attention Drift: What Speculative Decoding Models Learn.

Paper | Models

Overview

To run the experiments, use either a Conda environment on a machine with a dedicated GPU or Google Colab. For a minimal reproduction, use Llama 3.1 8B. Most experiments need at least 24 GB of VRAM, which corresponds to an L4 GPU on Colab.
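As a rough sanity check on the 24 GB figure, the sketch below estimates how much memory the verifier weights alone would take. It is a back-of-the-envelope calculation, not something from the repository: the parameter count of about 8e9 and the 16-bit storage are assumptions, and weight_memory_gib is a hypothetical helper. The remaining headroom has to cover the drafter, the KV cache, and activations.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: roughly 8e9 parameters stored in 16-bit precision.
weights = weight_memory_gib(8e9)
headroom = 24 - weights
print(f"weights: {weights:.1f} GiB, headroom on a 24 GiB card: {headroom:.1f} GiB")
```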

The source for the drafter and verifier is under the src/ directory. There are a number of snippets under src/snippets/ that I wrote carefully, step by step, to build efficient speculative decoding code from scratch and learn how it works. They are numbered from 0 to 13. We start with single-token verification in snippet 0 and build our way up to full tree decoding in snippet 12. Each snippet is runnable with a single command. For example, to check that the models load and run on your device, try:

python src/snippets/00_models_work_test.py
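For readers new to the topic, here is a minimal sketch of greedy draft verification, the basic idea that the snippets build on. It illustrates the general technique and is not code from the repository: verify_draft and toy_verifier are hypothetical names, and the toy verifier stands in for a real language model. A real implementation would typically score all drafted positions in a single verifier forward pass instead of calling the model once per token.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    verifier_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification of a chain of drafted tokens.

    Returns the tokens committed in this step: the longest prefix of
    `draft` that matches the verifier's greedy choices, followed by one
    token chosen by the verifier (a correction or a bonus token).
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        target = verifier_next_token(context)
        if token != target:
            # First mismatch: keep the verifier's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(token)
        context.append(token)
    # Every drafted token matched, so the verifier adds one bonus token.
    accepted.append(verifier_next_token(context))
    return accepted


def toy_verifier(context: Sequence[int]) -> int:
    # Hypothetical stand-in for a model: next token is (last token + 1) mod 10.
    return (context[-1] + 1) % 10


# The drafter guessed 3, 4, 9; the verifier agrees with 3 and 4, then corrects 9 to 5.
print(verify_draft([1, 2], [3, 4, 9], toy_verifier))  # [3, 4, 5]
```

Each step always commits at least one token, so decoding makes progress even when the first drafted token is rejected.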

Feel free to fork any snippet to run your own experiments. I also found that feeding the relevant snippet to an AI agent and asking it to write a specific experiment usually works pretty well.

  • Some exploratory work done in notebooks is located under the notebooks/ directory. For example, the preliminary work on attention sinks is in attention.ipynb, where we visualize the attention sink and attention drift phenomena in speculative decoding models.

  • Experiments intended to be reported in the paper are located under the experiments/ folder. Currently, these experiments are heavily AI-assisted and preliminary. Some may also be irrelevant.

  • Experiments that are incomplete, or AI-generated and not yet polished, are under scratchpad/.

As a general rule of thumb, commonly imported code lives under src/, and no experiment, notebook, or snippet depends on another. They are all designed as individually runnable and reproducible units.

Running

Google Colab

I recommend using the Google Colab extension for VS Code for easy file transfers (https://github.com/googlecolab/colab-vscode/wiki/User-Guide).

  1. Connect to a GPU instance (an L4 or better is recommended).
  2. Mount your Drive and copy all your sources to it.
  3. Run !/usr/bin/python3 -m pip install -e . in a notebook connected to Colab.
  4. Restart the kernel so that the imports can be resolved.
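Before launching an experiment, it can help to confirm which GPU the runtime actually sees and how much memory it has. The following is a minimal sketch, not part of the repository: describe_gpu is a hypothetical helper, and it assumes PyTorch is the framework installed by the environment. It falls back to a plain message when PyTorch or a CUDA device is missing.

```python
import importlib.util


def describe_gpu() -> str:
    """Return a one-line description of the first visible CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check above can fail gracefully

    if not torch.cuda.is_available():
        return "no CUDA device visible"
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 2**30:.1f} GiB"


print(describe_gpu())
```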

You can either paste individual snippets or examples directly into a notebook, or try the experimental terminal access feature of Google Colab.

On-premise GPU

If you have direct access to a node with a GPU, you can run the experiments directly after installing the environment with Conda:

  1. conda env create --prefix /tmp/$USER/spec-drift -f environment.yml
  2. conda activate /tmp/$USER/spec-drift
  3. python -m pip install -e .

On shared clusters, I recommend creating the environment under /tmp, because it usually resides on the node's local disk and has much lower latency, whereas GPFS can slow down your imports and experiments considerably.

Reproducing Paper Results

Experiments for reproducing the paper results are available under the experiments/ folder. See each experiment file for run instructions. Code for some minor experiments can also be found under the scratchpad/ folder. Also check notebooks/, as some notebooks contain the code for the visualizations used in the paper.

Trained Models

Drafter checkpoints used in experiments are available on HuggingFace: https://huggingface.co/collections/Dogacel/attention-drift

We also share two fine-tuned, production-ready models for gpt-oss-20b and gpt-oss-120b:

https://huggingface.co/collections/Dogacel/specdrift

These models are supported on the nightly version of SGLang. For more details on deployment, check the model card.

Citation

@misc{eldenk2026attention,
      title={Attention Drift: What Autoregressive Speculative Decoding Models Learn}, 
      author={Doğaç Eldenk and Payal Mohapatra and Yigitcan Comlek and Kaan Oktay and Hongyang Zhang and Stephen Xia},
      year={2026},
      eprint={2605.09992},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.09992}, 
}

Acknowledgments

We thank Fal.AI and Lambda Labs for compute grants that supported this research.
