Releases: vllm-project/speculators

Speculators v0.3.0

10 Dec 18:18
be6e86e


Speculators v0.3.0 Release Notes

This Speculators v0.3.0 release provides end-to-end training support for Eagle3 speculative decoding draft models.

Key new features include:

  • Offline training data generation support using vLLM
  • Single- and multi-layer draft model training for MoE and non-MoE models
  • End-to-end scripts to generate data, train your draft model, and validate performance in vLLM
  • Examples highlighting training for Llama3, Qwen3, and gpt-oss

Offline Training Data Generation Support

Offline training data generation is now supported through a new hidden-states generator using vLLM. The generator provides support for MoE and non-MoE models. Vision-language support will be added in a future release.
Generated data is saved as individual data_{index}.pt files, each containing input_ids, hidden_states, and loss_mask. Alongside the hidden states, a token_freq.pt file is generated with the token-frequency statistics used to build the target-to-draft and draft-to-target vocabulary mapping files required for training. Finally, a data_config.json captures metadata about the data generation process.
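
For illustration, here is a minimal sketch of inspecting one generated data point with PyTorch; only the file names and keys come from the description above, while the shapes noted in the comments are assumptions:

    import torch

    # Each data point is a dict saved with torch.save, with the keys
    # described above: input_ids, hidden_states, and loss_mask.
    sample = torch.load("data_0.pt")
    print(sample["input_ids"].shape)      # assumed (seq_len,)
    print(sample["hidden_states"].shape)  # assumed (seq_len, k * hidden_size) for k captured layers
    print(sample["loss_mask"].shape)      # assumed (seq_len,)

    # Aggregated token frequencies, used later to build the vocabulary mappings
    token_freq = torch.load("token_freq.pt")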

The hidden-states generator includes the following features:

  • Multiprocess executor for efficient batch inference
  • Tensor parallelism support
  • Automatic KV-cache and memory management

Dedicated scripts for running offline data generation are included in the repository.

Draft Model Training Support ✨

Full training support is now available for single- and multi-layer Eagle3 draft models for both Mixture of Experts (MoE) and non-MoE target models.

Training support includes:

  • Updated Eagle3 draft model definitions with all features required for efficient Eagle3 model training
  • Added train-time-testing logic for the Eagle3 algorithm, integrated into the Eagle3DraftModel forward method. The forward method now supports dynamic step counts and computes per-step loss and accuracy (see the first sketch below).
  • New document-masking support enabling fast, memory-efficient Eagle3 draft model training. This approach exploits the sparsity of train-time-test attention masks, yielding faster training and lower memory usage than a naive full attention matrix (see the second sketch below).
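
As a first sketch, the train-time-testing forward pass can be pictured as a loop that drafts for a configurable number of steps and scores each step. This is a rough illustration under assumed shapes and signatures, not the actual Eagle3DraftModel code:

    import torch
    import torch.nn.functional as F

    def train_time_test_forward(draft_model, input_ids, hidden, labels, loss_mask, num_steps=3):
        """Hypothetical sketch: draft for `num_steps` steps, scoring each one.

        hidden:    verifier hidden states feeding the draft model (assumed input)
        labels:    (batch, seq_len) target token IDs
        loss_mask: (batch, seq_len) bool mask selecting supervised positions
        """
        step_losses, step_accs = [], []
        for _ in range(num_steps):  # dynamic step count
            logits, hidden = draft_model(input_ids, hidden)  # assumed signature
            flat_logits = logits.reshape(-1, logits.size(-1))[loss_mask.reshape(-1)]
            flat_labels = labels.reshape(-1)[loss_mask.reshape(-1)]
            step_losses.append(F.cross_entropy(flat_logits, flat_labels))
            step_accs.append((flat_logits.argmax(-1) == flat_labels).float().mean())
        return torch.stack(step_losses).mean(), step_losses, step_accs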

A dedicated training script is included for this step.
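
As a second sketch, document masking exploits the fact that packed training sequences only need attention within each document, so the mask is block-diagonal (and causal) rather than a full attention matrix. A minimal way to build such a mask from per-token document IDs, assuming that input format:

    import torch

    def document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
        """Build a block-diagonal causal attention mask for packed documents.

        doc_ids: (seq_len,) integer document ID per token (assumed format).
        Returns a (seq_len, seq_len) bool mask where True = may attend.
        """
        same_doc = doc_ids[:, None] == doc_ids[None, :]
        causal = torch.tril(torch.ones(len(doc_ids), len(doc_ids), dtype=torch.bool))
        return same_doc & causal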

End-to-End Scripts and Examples

New E2E script for generating data and training speculative draft models

New scripts have been added to run each of the individual steps in the workflow:

  1. Generate training data offline
  2. Build the vocabulary mapping (sketched below)
  3. Train the draft model
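
To make step 2 concrete, here is a hypothetical sketch of building draft-to-target (d2t) and target-to-draft (t2d) index mappings from the token frequencies produced in step 1; the function name and mapping representation are assumptions, not the script's actual interface:

    import torch

    def build_vocab_mappings(token_freq: torch.Tensor, draft_vocab_size: int):
        """Keep the most frequent target tokens as the draft vocabulary."""
        top_tokens = torch.topk(token_freq, k=draft_vocab_size).indices
        d2t = top_tokens  # draft index -> target token ID
        t2d = torch.full((token_freq.numel(),), -1, dtype=torch.long)  # -1 = absent
        t2d[top_tokens] = torch.arange(draft_vocab_size)  # target token ID -> draft index
        return d2t, t2d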

A new end-to-end script has also been added that runs the full workflow above from a single configuration. It provides a simplified interface for setting up a complete training run that can be launched with one command; internally, it executes each step in turn and ensures data flows correctly from one step to the next.

Training examples have been added for Llama3, Qwen3, and gpt-oss:

  1. llama3_8b_sharegpt_5k.py
  2. gpt_oss_20b_ultrachat_5k.py
  3. qwen3_8b_sharegpt_ultrachat.py

Testing and Validation

New vLLM benchmarking framework

A new automated evaluation framework has been added that benchmarks Eagle3 speculator models using vLLM and GuideLLM.
Ready-made evaluation configurations are provided for the following models:

  • Llama-3.1-8B
  • Llama-3.3-70B
  • gpt-oss-20B
  • Qwen3-8B
  • Qwen3-32B

The framework can be reviewed in the examples/evaluate/eval-guidellm folder.

To run an evaluation:

./run_evaluation.sh -c configs/llama-3.1-8b-eagle3.env

This command automatically handles vLLM server startup, runs GuideLLM benchmarks, extracts acceptance-rate metrics from logs, and cleans up when complete.

The framework supports:

  • Multiple dataset types: HuggingFace datasets with colon syntax for selecting a specific file (e.g., org/dataset:file.jsonl), local files, and directories
  • Modular bash scripts following best practices, with proper error handling and process management
  • Configurable sampling parameters (temperature, top_p, top_k)
  • Detailed output metrics, including weighted per-position acceptance rates and conditional acceptance probabilities

Configuration values for an evaluation run are resolved in the following order of precedence (highest first), so any setting can be overridden at a higher level:

  1. CLI arguments
  2. Config file
  3. Framework defaults
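
As a minimal illustration of this precedence (a sketch, not the framework's code):

    def resolve(name, cli_args, config_file, defaults):
        """Return the first value found: CLI args, then config file, then defaults."""
        for source in (cli_args, config_file, defaults):
            if source.get(name) is not None:
                return source[name]
        raise KeyError(name)

    # resolve("temperature", {"temperature": 0.7}, {"temperature": 0.2}, {"temperature": 1.0}) -> 0.7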

Deprecations

The prototype training code previously provided under the research/ directory has been removed.

New Contributors

Full Changelog: v0.2.0...v0.3.0

Speculators v0.2.0

03 Nov 15:10
02212fa


Speculators v0.2.0 Release Notes

This Speculators v0.2.0 release introduces the following new features and enhancements:

  • Support for Draft Models with Multiple Decoder Layers: Previously, only draft models with a single decoder layer were supported. The Eagle3 converter now sets the num_hidden_layers from the config instead of always assuming one layer.
  • Added Support for eagle_aux_hidden_state_layer_ids Argument: This new argument lets users specify which hidden-state layers are fetched during inference. It enables converting Llama4 Maverick draft models to the Speculators format and running them in vLLM.

Updates and Deprecations:

  • Python 3.9 Support Removed: Support for Python 3.9 has been dropped; Python 3.10+ will be supported going forward.
  • Default Number of Speculative Tokens Changed: The default number of speculative tokens has been changed from 5 to 3 for all Eagle and Eagle3 models.
  • Override tie_weights() in Eagle3Speculator: This override prevents vocabulary corruption and supports Transformers 4.54.1.
  • Updated head_dim Calculation in Eagle3 Converter: The head_dim value is now taken from the config if provided; otherwise, it is calculated as hidden_size // num_heads (e.g., 4096 // 32 = 128).
  • Eagle3 Draft Models Retain Original Dtype: All Eagle3 draft models now keep their original dtype after being converted to the Speculators format. Previously, all converted draft models were cast to FP32.
  • Extended Logic for target_vocab_size: The converter defaults to the length of the "t2d" tensor; if that is unavailable, it recursively searches the verifier model's config for vocab_size (a sketch follows this list).
  • Full End-to-End vLLM Smoke Testing: Extended and added full end-to-end vLLM smoke testing for both converted and unconverted models.
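
A sketch of the recursive config search described above (illustrative only, not the converter's actual code):

    def find_vocab_size(config: dict):
        """Depth-first search of a possibly nested config dict for vocab_size."""
        if "vocab_size" in config:
            return config["vocab_size"]
        for value in config.values():
            if isinstance(value, dict):
                found = find_vocab_size(value)
                if found is not None:
                    return found
        return None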

Full Changelog

New Contributors

Full Changelog: v0.1.0...v0.2.0

Speculators v0.1.0 -- First Public Release

08 Aug 01:45
8a49095


Overview

This first public release publishes the complete initial codebase for Speculators — a unified library for building, evaluating, converting, and serving speculative decoding algorithms for LLMs. It delivers the core framework, CI/CD and developer workflow, model/config implementations (EAGLE v1/HASS/EAGLE‑3), converter CLIs from external research repos, a Hugging Face–compatible model format with vLLM serving support, and prototype training code.

What’s New (Highlights)

  • Unified, extensible framework for speculator models (build, evaluate, convert, store)
  • Hugging Face–compatible speculator format with serving support landed in vLLM
  • Models/configs for EAGLE v1 (HASS-style), HASS, and EAGLE‑3 (multi-layer types)
  • Checkpoint converter CLIs (Eagle, Eagle‑3) from external research repositories
  • Prototype training code and scripts (EAGLE‑1-style drafter, HASS) + requirements
  • Production readiness: CI/CD, tests, style, docs, examples, and benchmarks

Use Cases Enabled

  • Register and configure new speculator algorithms via a standardized configuration and registry system (a generic sketch follows this list)
  • Convert external checkpoints (EAGLE/EAGLE‑3/HASS variants) into the Speculators format with CLI tools
  • Serve Speculators models directly in vLLM for low‑latency inference
  • Evaluate and benchmark speculators (e.g., with GuideLLM), including quantized verifier swaps
  • Prototype‑train drafters using provided research code and scripts
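
Registry-based configuration in this style typically looks something like the following generic sketch; the decorator and class names are illustrative assumptions, not Speculators' actual API:

    from dataclasses import dataclass

    REGISTRY: dict[str, type] = {}

    def register(name: str):
        """Class decorator that records a speculator config under a name."""
        def wrap(cls):
            REGISTRY[name] = cls
            return cls
        return wrap

    @register("my_eagle_variant")
    @dataclass
    class MyEagleVariantConfig:
        num_hidden_layers: int = 1
        speculative_tokens: int = 3

    config_cls = REGISTRY["my_eagle_variant"]
    print(config_cls())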

Getting Started

  • Install (Python 3.9–3.13 on Linux or macOS):
    pip install git+https://github.com/neuralmagic/speculators.git
  • Serve with vLLM (requires v1 API):
    VLLM_USE_V1=1 vllm serve RedHatAI/Qwen3-8B-speculator.eagle3
  • Explore examples and research: examples/, research/eagle3/, research/hass/

Compatibility Notes

  • Python: 3.9–3.13
  • OS: Linux and macOS
  • Transformers pinned to avoid mypy regressions (PR #73)
  • vLLM v1 API required for serving (set VLLM_USE_V1=1)

Full Changelog (v0.1.0)

First public release of Speculators. This release publishes the complete initial codebase and enables the first set of core use cases for speculative decoding with LLMs.

Added

  • Base configuration and registry system with tests: Speculator, Token Proposal, and Model Speculator configs; EagleSpeculatorConfig for EAGLE v1/HASS; config serialization/loading (PRs #26, #27, #28, #29, #34, #36)
  • Eagle speculator model and support for multiple transformer layer types (PRs #37, #49)
  • Eagle‑3 speculator model and Qwen support (PRs #50, #55)
  • Checkpoint converter CLIs: Eagle and Eagle‑3; standardized converter interface (PRs #39, #53, #72)
  • vLLM serving documentation and Qwen benchmark assets (PRs #77, #78, #82, #83)
  • Examples directory and README for getting started (PR #81)
  • Branding assets (icons, logos, user‑flow diagrams) (PR #87)

Changed

  • Standardized converter CLI UX and flags (PR #72)
  • Documentation/readme formatting and content updates (PRs #70, #75, #83, #85)

Fixed

  • Missing embeddings in converted checkpoints/workflows (PR #65)
  • CLI flags and norm_before_residual toggle (PRs #57, #58)
  • Compatibility: pin transformers to resolve mypy/typing regressions (PR #73)

CI/CD and Tooling

  • GitHub Actions: migrated link checks to lychee and updated workflows (PRs #3, #45)
  • PR comment behavior refinements (PR #47)

Research and Training

  • Training code for EAGLE‑1‑style drafter with multi‑step training (PR #35)
  • HASS/EAGLE‑3 research updates, requirements, and DeepSpeed dependency (PRs #64, #67, #69)

Documentation

  • vLLM serving instructions, Qwen benchmark results, examples README, and research readmes (PRs #64, #70, #77, #78, #81, #83, #85)

New Contributors

Thanks also to continuing contributors: @markurtz, @rahul-tuli, @dsikka
