
Releases: vllm-project/vllm-spyre

v1.7.0

03 Feb 17:05
d10f934


v1.7.0 FP8 support with chunked prefill

This release:

  • 📚 Supports FP8 quantization with chunked prefill and prefix caching (see the sketch after this list)
  • 🐛 Fixes a bug with our graph comparison tests on spyre
  • ⬆️ Adds vllm 0.14.1 support
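
As a rough illustration of how these features combine, here is a minimal offline-inference sketch. The model path is a placeholder for an FP8-quantized checkpoint of your own (not a released artifact), the environment variables are the chunked prefill flags documented in the v1.3.0 notes further down, and enable_prefix_caching is vLLM's standard knob; treat this as a sketch rather than official usage.

import os

# Chunked prefill on Spyre is toggled via environment variables (see the
# v1.3.0 notes below); set them before the engine is created.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"

from vllm import LLM, SamplingParams

# Placeholder path: substitute your own FP8-quantized checkpoint.
llm = LLM(
    model="/models/my-fp8-checkpoint",
    enable_prefix_caching=True,  # prefix caching now works alongside FP8 + chunked prefill
)

params = SamplingParams(max_tokens=32, temperature=0.0)
print(llm.generate(["Hello from Spyre!"], params)[0].outputs[0].text)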

What's Changed

Full Changelog: v1.6.1...v1.7.0

v1.6.1

27 Jan 20:49
fc9be6f


v1.6.1 Dependency-only hotfix release

This release updates the vllm dependency range to correctly include v0.14.0, and removes pytest-mock, which was erroneously added as a runtime dependency.

This release contains no code changes from the previous v1.6.0 release.

What's Changed

Full Changelog: v1.6.0...v1.6.1

v1.6.0

26 Jan 23:59
0f8e68a


v1.6.0 - Experimental Multimodal Support

This release:

  • ⚗️ Adds very experimental multimodal support for granite vision models. No performance or accuracy guarantees and no documented support
  • ⬆️ Bumps vllm support to include v0.14.0
  • 🐛 Fixes a bug where structured output requests would crash the server

What's Changed

New Contributors

Full Changelog: v1.5.0...v1.6.0

v1.5.0

16 Jan 19:53
344a1a1


v1.5.0 - Granite 4 dense support

  • ✨ Adds support for granite 4 dense models
  • 🚀 Adds support for vllm 0.13.0
  • 🚀 Adds support for fms 1.6.0
  • 📝 Adds documentation about prefix caching and corrects confusing log messages

What's Changed

Full Changelog: v1.4.3...v1.5.0

v1.4.3

07 Jan 18:21
b35fa6e


Prefix caching bugfix

  • Fixes a small bug where an incorrect prefix block would be masked out during prefill in some cases

What's Changed

  • Unit tests for chunked prefill model runner with prefix caching by @maxdebayser in #615

Full Changelog: v1.4.2...v1.4.3

v1.4.2

06 Jan 18:08
1754d39


⚡⚡⚡ Prefix caching performance improvements ⚡⚡⚡

This release greatly improves the performance of prefix caching on vllm-spyre by:

  • Deduplicating KV cache blocks where matching prefixes cannot be used from cache due to compiler constraints
  • Eliminating idle iterations during cached prefills
  • Increasing the size of the KV cache on standard TP=4 deployments of granite-3.3-8b-instruct

🐛 This also fixes a couple of bugs that could crash the server at runtime when prefix caching was enabled

👀 This fixes the KV cache metrics to reflect the real state of the cache on the spyre cards

What's Changed

Full Changelog: v1.4.1...v1.4.2

v1.4.1

16 Dec 16:29
2232698


v1.4.1: vLLM compatibility release

  • Adds compatibility for vllm <= 0.12.0
  • Fixes a bug where vllm 0.10.2 would crash

What's Changed

Full Changelog: v1.4.0...v1.4.1

v1.4.0

10 Dec 15:09
9824186


v1.4.0 - Initial Prefix Caching Support

This release:

  • Adds an experimental implementation of prefix caching (see the sketch after this list)
  • Adds support for vllm v0.11.1 and updates the default locked version
  • Upgrades the AFTU dev dependency lock to v0.5.0 to support chunked prefill tests on spyre hardware
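
As a rough sketch of what prefix caching does in practice: two requests sharing a long prefix let the second prefill reuse the KV-cache blocks computed for the first. The model ID and repeated system prompt below are illustrative assumptions, and vLLM's standard enable_prefix_caching flag is assumed to be the switch here.

from vllm import LLM, SamplingParams

# Illustrative model ID; any model supported by vllm-spyre applies.
llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    enable_prefix_caching=True,  # experimental in this release
)

# Two prompts with a long common prefix: the cached KV blocks for the shared
# part can be reused when prefilling the second request.
shared = "You are a concise assistant. Answer in one sentence. " * 20
prompts = [
    shared + "What is chunked prefill?",
    shared + "What is prefix caching?",
]

params = SamplingParams(max_tokens=48, temperature=0.0)
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)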

What's Changed

Full Changelog: v1.3.0...v1.4.0

v1.3.0

09 Dec 16:22
30fcc38


This release adds support for chunked prefill for non-quantized models, which can be enabled with:

VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1
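
For example, a minimal in-process sketch using these flags might look like the following. The model ID is a placeholder, and setting the variables from Python before constructing the engine is assumed to be equivalent to exporting them in the shell before launching vllm serve.

import os

# The two flags from the line above, set before the engine is constructed.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model ID; chunked prefill in this release covers non-quantized models.
llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")
out = llm.generate(["Testing chunked prefill"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)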

What's Changed

Full Changelog: v1.2.3...v1.3.0

v1.2.3

12 Nov 17:46
9d049db


Includes a change required to support Torch >= 2.8

What's Changed

  • fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551

Full Changelog: v1.2.2...v1.2.3