Releases: vllm-project/vllm-spyre
v1.7.0
v1.7.0 FP8 support with chunked prefill
This release:
- 📚 Supports FP8 quantization with chunked prefill and prefix caching (see the sketch below)
- 🐛 Fixes a bug in our graph comparison tests on Spyre
- ⬆️ Adds vllm 0.14.1 support
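For orientation, here is a minimal sketch of the new FP8 path; the checkpoint id is a hypothetical placeholder, and only the two environment variables come from these notes:

```python
# Minimal sketch, not a documented configuration: run an FP8-quantized model
# with chunked prefill on Spyre. The checkpoint id below is hypothetical.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"               # continuous batching
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"  # chunked prefill (added in v1.3.0)

from vllm import LLM, SamplingParams

llm = LLM(model="my-org/granite-3.3-8b-instruct-FP8")  # hypothetical FP8 checkpoint
outputs = llm.generate(["Hello, Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```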
What's Changed
- ⬆️ Support vllm 0.14.1 by @joerunde in #663
- test: fix test_compare_graphs by @tjohnson31415 in #671
- ⚡ Implement fp8 with chunked prefill with static scaling by @joerunde in #661
Full Changelog: v1.6.1...v1.7.0
v1.6.1
v1.6.1 Dependency-only hotfix release
This release updates the vllm dependency range to correctly include v0.14.0 and removes pytest-mock, which was erroneously added as a runtime dependency.
This release contains no code changes from the previous v1.6.0 release.
Full Changelog: v1.6.0...v1.6.1
v1.6.0
v1.6.0 - Experimental Multimodal Support
This release:
- ⚗️ Adds highly experimental multimodal support for granite vision models; there are no performance or accuracy guarantees and no documented support yet (see the sketch below)
- ⬆️ Bumps vllm support to include v0.14.0
- 🐛 Fixes a bug where structured output requests would crash the server
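As a rough illustration only, a request against the experimental vision path might look like this when pointed at a running OpenAI-compatible server; the model id and image URL are placeholder assumptions:

```python
# Minimal sketch, assuming a vllm-spyre server already serving a granite vision
# model at localhost:8000. The model id and image URL are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ibm-granite/granite-vision-model",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```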
What's Changed
- 📝 Update granite logs by @joerunde in #637
- ⚡ Use ty for type checking instead of mypy by @joerunde in #620
- 🔧 set 32MB HDMA collective override by @joerunde in #638
- [CP][PC] rip out block reservation for chunked prefill by @yannicks1 in #622
- Multimodal / Granite Vision Support by @alex-jw-brooks in #614
- Add v0.14.0 support by @rafvasq in #636
- 🔥 remove vllm 0.10.2 compatibility by @tjohnson31415 in #660
- fix: drop structured output from request to avoid crash by @tjohnson31415 in #657
New Contributors
- @alex-jw-brooks made their first contribution in #614
Full Changelog: v1.5.0...v1.6.0
v1.5.0
v1.5.0 - Granite 4 dense support
- ✨ Adds support for granite 4 dense models (see the sketch below)
- 🚀 Adds support for vllm 0.13.0
- 🚀 Adds support for fms 1.6.0
- 📝 Adds documentation about prefix caching and corrects confusing log messages
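For anyone wanting to try the new model family, a minimal offline load might look like the sketch below; the granite 4 dense checkpoint id is an assumption, not taken from these notes:

```python
# Minimal sketch: loading a granite 4 dense model offline. The model id is a
# hypothetical placeholder; check the project docs for supported checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-dense")  # hypothetical model id
print(llm.generate(["Hi"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```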
What's Changed
- Fix broken references by @rafvasq in #624
- Set default max_num_batched_tokens to 1k for CP by @maxdebayser in #619
- 🚀 Add vllm 0.13.0 support by @joerunde in #623
- 📝 update docs on batching modes by @joerunde in #629
- Update fms to version 1.6.0 by @maxdebayser in #631
- 🔒 uv lock fms 1.6.0 by @joerunde in #632
- Update pyproject vllm max version to 0.13.0 by @tjohnson31415 in #633
- Update logging around chunked prefill enablement by @tjohnson31415 in #628
- ✨ support granite 4 dense by @joerunde in #635
Full Changelog: v1.4.3...v1.5.0
v1.4.3
Prefix caching bugfix
- Fixes a small bug where an incorrect prefix block would be masked out during prefill in some cases
What's Changed
- Unit tests for chunked prefill model runner with prefix caching by @maxdebayser in #615
Full Changelog: v1.4.2...v1.4.3
v1.4.2
⚡⚡⚡ Prefix caching performance improvements ⚡⚡⚡
This release greatly improves the performance of prefix caching on vllm-spyre by:
- Deduplicating KV cache blocks where matching prefixes cannot be used from cache due to compiler constraints
- Eliminating idle iterations during cached prefills
- Increasing the size of the KV cache on standard TP=4 deployments of granite-3.3-8b-instruct
🐛 This release also fixes a couple of bugs that could crash the server at runtime when prefix caching was enabled
👀 It also corrects the KV cache metrics to reflect the real state of the cache on the Spyre cards
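One quick way to see the corrected metrics is to scrape the server's Prometheus endpoint, as in the sketch below; the /metrics path is standard vLLM behavior, but the exact metric names are version-dependent assumptions:

```python
# Minimal sketch: print KV cache usage gauges from a running vllm-spyre server.
# Assumes the standard vLLM /metrics endpoint on localhost:8000; exact metric
# names (e.g. vllm:kv_cache_usage_perc) vary across vLLM versions.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "cache_usage" in line and not line.startswith("#"):
            print(line)
```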
What's Changed
- [PC][test] add additional test cases by @yannicks1 in #591
- ⚗️ try to correct the KV cache stats by @joerunde in #609
- Deduplicate prefill by @maxdebayser in #603
- [CP][CB][SB] improve vllm debug logs by @yannicks1 in #612
- [CP] Fix TKV constraint bug by @sducouedic in #608
- 🎨 Swap to ruff and prek by @joerunde in #532
- 📝 Update docs and scripts for new prek integration by @joerunde in #617
- ⚡ Reduce idle prefills for better throughput by @joerunde in #613
- 🐛 fix critical prefix cache bug by @joerunde in #616
- Fix compatibility issue when trust_remote_code==True by @maxdebayser in #618
- Increase default KV cache size for granite 8b by @yannicks1 in #556
Full Changelog: v1.4.1...v1.4.2
v1.4.1
v1.4.1: vLLM compatibility release
- Adds compatibility for vllm <= 0.12.0
- Fixes a bug where vllm 0.10.2 would crash
What's Changed
- ⬆️ bump vllm to 0.11.2 by @joerunde in #600
- [Docs] Remove readthedocs config, update site/contributing link by @rafvasq in #601
- [fix] backwards compatibility for 0.10.2 by @yannicks1 in #605
- doc: fix readme link protocol by @tjohnson31415 in #602
- ⬆️ Add support for vllm 0.12.0 by @joerunde in #604
Full Changelog: v1.4.0...v1.4.1
v1.4.0
v1.4.0 - Initial Prefix Caching Support
This release:
- Adds an experimental implementation of prefix caching (see the sketch below)
- Adds support for vllm v0.11.1 and updates the default locked version
- Upgrades the AFTU dev dependency lock to v0.5.0 to support chunked prefill tests on Spyre hardware
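A minimal way to opt in might look like the sketch below; using vLLM's standard enable_prefix_caching engine argument here is an assumption, so treat this as orientation rather than documentation:

```python
# Minimal sketch: opting in to the experimental prefix caching. Relying on
# vLLM's standard enable_prefix_caching engine argument is an assumption; the
# model id matches the granite-3.3-8b-instruct mentioned in the v1.4.2 notes.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"  # continuous batching, per the v1.3.0 notes

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct", enable_prefix_caching=True)
```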
What's Changed
- ✨ vllm support for 0.11.1 release by @joerunde in #546
- 🔥 remove prints by @joerunde in #594
- ⬆️ aftu to v0.5.0 by @tjohnson31415 in #595
- 🔥 Limit CI usage by @joerunde in #596
- 🐛 Fix main CI by @joerunde in #598
- ✨ Add prefix caching by @maxdebayser in #586
- 📝 add doc section on VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS by @tjohnson31415 in #593
Full Changelog: v1.3.0...v1.4.0
v1.3.0
This release adds support for chunked prefill for non-quantized models, which can be enabled with:
`VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1`
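For offline use, the same switches might be applied as in the sketch below; only the two environment variables come from this release, while the model id and the 1024-token chunk budget (see #619 and #592) are illustrative:

```python
# Minimal sketch: chunked prefill for an unquantized model. Only the two
# environment variables come from these notes; the model id and the chunk
# budget via max_num_batched_tokens are illustrative assumptions.
import os

os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"
os.environ["VLLM_SPYRE_USE_CB"] = "1"  # set alongside chunked prefill, per these notes

from vllm import LLM

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",  # placeholder model id
    max_num_batched_tokens=1024,                  # prefill chunk budget
)
```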
What's Changed
- feat: chunked prefill spyre model runner by @wallashss in #552
- [tests] cleanup: remove temporary hack by @yannicks1 in #555
- ChunkedPrefillSpyreScheduler: No Interleaving by @sducouedic in #554
- fix: left padding of prompts less than chunk size by @wallashss in #557
- feat: left padding from model runner to scheduler by @wallashss in #559
- [CP] rewrite scheduler constraints for chunked prefill (🐛 fix) by @yannicks1 in #560
- [CB] remove decode/prefill prioritization heuristic by @yannicks1 in #561
- Bugfix: padding block cannot be reused with chunked prefill by @sducouedic in #563
- [CP] scheduler constraints typo by @yannicks1 in #565
- test: add maybe_xfail for quantized micro static batch logprobs checks by @tjohnson31415 in #566
- [CP] fix empty model runner output by @yannicks1 in #570
- [CB] remove env var VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by @yannicks1 in #562
- [CP] optimal chunked prefill scheduler constraints by @yannicks1 in #564
- [CP] Simplify code by @maxdebayser in #572
- [CB] tighten constraint max model length decode sequences by @yannicks1 in #573
- Set default chunk size to 4k for granite 3 8b TP4 by @tjohnson31415 in #571
- Interleave chunked prefills with single decoding steps by @sducouedic in #558
- feat/fix: add finish_requests to handle removal from ongoing_prefills by @tjohnson31415 in #577
- fix: check only decoding requests in _satisfies_last_chunk_constraints by @tjohnson31415 in #576
- tests: include chunked prefill on existing tests by @wallashss in #574
- Add step tests for chunked prefill by @maxdebayser in #575
- docs: chunked prefill updated documentation by @wallashss in #578
- [Docs] Prep and publish GH Pages doc by @rafvasq in #579
- [Docs] Update GH artifact versions by @rafvasq in #581
- [Docs] Add workflow files to docs action triggers by @rafvasq in #582
- [Docs] Avoid multiple artifacts by @rafvasq in #583
- fix test_compare_graphs_chunked_prefill by @tjohnson31415 in #580
- [Docs] Use mkdocs gh-deploy by @rafvasq in #584
- [Docs] Update links to documentation by @rafvasq in #587
- [PC] Refactor CB model runner to use vLLMs block pool by @maxdebayser in #585
- [Docs] Add note about move by @rafvasq in #589
- test: a few test configuration updates to have chunked prefill tests pass on Spyre by @tjohnson31415 in #588
- Update default granite 8b chunk size to 1024 by @tjohnson31415 in #592
- fix time logging and other small things by @yannicks1 in #590
Full Changelog: v1.2.3...v1.3.0
v1.2.3
Includes a change required to support Torch >= 2.8
What's Changed
- fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551
Full Changelog: v1.2.2...v1.2.3