Releases: vllm-project/vllm-spyre
v1.7.0
v1.7.0 FP8 support with chunked prefill
This release:
- 📚 Supports FP8 quantization with chunked prefill and prefix caching (see the sketch below)
- 🐛 Fixes a bug in our graph comparison tests on Spyre
- ⬆️ Adds vllm 0.14.1 support
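For orientation, here is a minimal sketch of the new FP8 path; the checkpoint id is a hypothetical placeholder, and only the two environment variables come from these notes:

```python
# Minimal sketch, not a documented configuration: run an FP8-quantized model
# with chunked prefill on Spyre. The checkpoint id below is hypothetical.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"               # continuous batching
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"  # chunked prefill (added in v1.3.0)

from vllm import LLM, SamplingParams

llm = LLM(model="my-org/granite-3.3-8b-instruct-FP8")  # hypothetical FP8 checkpoint
outputs = llm.generate(["Hello, Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```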
What's Changed
- ⬆️ Support vllm 0.14.1 by @joerunde in #663
- test: fix test_compare_graphs by @tjohnson31415 in #671
- ⚡ Implement fp8 with chunked prefill with static scaling by @joerunde in #661
Full Changelog: v1.6.1...v1.7.0
v1.6.1
v1.6.1 Dependency-only hotfix release
This release updates the vllm dependency range to correctly include v0.14.0 and removes pytest-mock, which was erroneously added as a runtime dependency.
This release contains no code changes from the previous v1.6.0 release.
Full Changelog: v1.6.0...v1.6.1
v1.6.0
v1.6.0 - Experimental Multimodal Support
This release:
- ⚗️ Adds highly experimental multimodal support for granite vision models; there are no performance or accuracy guarantees and no documented support yet (see the sketch below)
- ⬆️ Bumps vllm support to include v0.14.0
- 🐛 Fixes a bug where structured output requests would crash the server
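As a rough illustration only, a request against the experimental vision path might look like this when pointed at a running OpenAI-compatible server; the model id and image URL are placeholder assumptions:

```python
# Minimal sketch, assuming a vllm-spyre server already serving a granite vision
# model at localhost:8000. The model id and image URL are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ibm-granite/granite-vision-model",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```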
What's Changed
- 📝 Update granite logs by @joerunde in #637
- ⚡ Use ty for type checking instead of mypy by @joerunde in #620
- 🔧 set 32MB HDMA collective override by @joerunde in #638
- [CP][PC] rip out block reservation for chunked prefill by @yannicks1 in #622
- Multimodal / Granite Vision Support by @alex-jw-brooks in #614
- Add v0.14.0 support by @rafvasq in #636
- 🔥 remove vllm 0.10.2 compatibility by @tjohnson31415 in #660
- fix: drop structured output from request to avoid crash by @tjohnson31415 in #657
New Contributors
- @alex-jw-brooks made their first contribution in #614
Full Changelog: v1.5.0...v1.6.0
v1.5.0
v1.5.0 - Granite 4 dense support
- ✨ Adds support for granite 4 dense models (see the sketch below)
- 🚀 Adds support for vllm 0.13.0
- 🚀 Adds support for fms 1.6.0
- 📝 Adds documentation about prefix caching and corrects confusing log messages
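For anyone wanting to try the new model family, a minimal offline load might look like the sketch below; the granite 4 dense checkpoint id is an assumption, not taken from these notes:

```python
# Minimal sketch: loading a granite 4 dense model offline. The model id is a
# hypothetical placeholder; check the project docs for supported checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-dense")  # hypothetical model id
print(llm.generate(["Hi"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```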
What's Changed
- Fix broken references by @rafvasq in #624
- Set default max_num_batched_tokens to 1k for CP by @maxdebayser in #619
- 🚀 Add vllm 0.13.0 support by @joerunde in #623
- 📝 update docs on batching modes by @joerunde in #629
- Update fms to version 1.6.0 by @maxdebayser in #631
- 🔒 uv lock fms 1.6.0 by @joerunde in #632
- Update pyproject vllm max version to 0.13.0 by @tjohnson31415 in #633
- Update logging around chunked prefill enablement by @tjohnson31415 in #628
- ✨ support granite 4 dense by @joerunde in #635
Full Changelog: v1.4.3...v1.5.0
v1.4.3
Prefix caching bugfix
- Fixes a small bug where an incorrect prefix block would be masked out during prefill in some cases
What's Changed
- Unit tests for chunked prefill model runner with prefix caching by @maxdebayser in #615
Full Changelog: v1.4.2...v1.4.3
v1.4.2
⚡⚡⚡ Prefix caching performance improvements ⚡⚡⚡
This release greatly improves the performance of prefix caching on vllm-spyre by:
- Deduplicating KV cache blocks where matching prefixes cannot be used from cache due to compiler constraints
- Eliminating idle iterations during cached prefills
- Increasing the size of the KV cache on standard TP=4 deployments of granite-3.3-8b-instruct
🐛 This release also fixes a couple of bugs that could crash the server at runtime when prefix caching was enabled
👀 It also corrects the KV cache metrics to reflect the real state of the cache on the Spyre cards
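One quick way to see the corrected metrics is to scrape the server's Prometheus endpoint, as in the sketch below; the /metrics path is standard vLLM behavior, but the exact metric names are version-dependent assumptions:

```python
# Minimal sketch: print KV cache usage gauges from a running vllm-spyre server.
# Assumes the standard vLLM /metrics endpoint on localhost:8000; exact metric
# names (e.g. vllm:kv_cache_usage_perc) vary across vLLM versions.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "cache_usage" in line and not line.startswith("#"):
            print(line)
```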
What's Changed
- [PC][test] add additional test cases by @yannicks1 in #591
- ⚗️ try to correct the KV cache stats by @joerunde in #609
- Deduplicate prefill by @maxdebayser in #603
- [CP][CB][SB] improve vllm debug logs by @yannicks1 in #612
- [CP] Fix TKV constraint bug by @sducouedic in #608
- 🎨 Swap to ruff and prek by @joerunde in #532
- 📝 Update docs and scripts for new prek integration by @joerunde in #617
- ⚡ Reduce idle prefills for better throughput by @joerunde in #613
- 🐛 fix critical prefix cache bug by @joerunde in #616
- Fix compatibility issue when trust_remote_code==True by @maxdebayser in #618
- Increase default KV cache size for granite 8b by @yannicks1 in #556
Full Changelog: v1.4.1...v1.4.2
v1.4.1
v1.4.1: vLLM compatibility release
- Adds compatibility for vllm <= 0.12.0
- Fixes a bug where vllm 0.10.2 would crash
What's Changed
- ⬆️ bump vllm to 0.11.2 by @joerunde in #600
- [Docs] Remove readthedocs config, update site/contributing link by @rafvasq in #601
- [fix] backwards compatibility for 0.10.2 by @yannicks1 in #605
- doc: fix readme link protocol by @tjohnson31415 in #602
- ⬆️ Add support for vllm 0.12.0 by @joerunde in #604
Full Changelog: v1.4.0...v1.4.1
v1.4.0
v1.4.0 - Initial Prefix Caching Support
This release:
- Adds an experimental implementation of prefix caching (see the sketch below)
- Adds support for vllm v0.11.1 and updates the default locked version
- Upgrades the AFTU dev dependency lock to v0.5.0 to support chunked prefill tests on Spyre hardware
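A minimal way to opt in might look like the sketch below; using vLLM's standard enable_prefix_caching engine argument here is an assumption, so treat this as orientation rather than documentation:

```python
# Minimal sketch: opting in to the experimental prefix caching. Relying on
# vLLM's standard enable_prefix_caching engine argument is an assumption; the
# model id matches the granite-3.3-8b-instruct mentioned in the v1.4.2 notes.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"  # continuous batching, per the v1.3.0 notes

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct", enable_prefix_caching=True)
```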
What's Changed
- ✨ vllm support for 0.11.1 release by @joerunde in #546
- 🔥 remove prints by @joerunde in #594
- ⬆️ aftu to v0.5.0 by @tjohnson31415 in #595
- 🔥 Limit CI usage by @joerunde in #596
- 🐛 Fix main CI by @joerunde in #598
- ✨ Add prefix caching by @maxdebayser in #586
- 📝 add doc section on VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS by @tjohnson31415 in #593
Full Changelog: v1.3.0...v1.4.0
v1.3.0
This release adds support for chunked prefill for non-quantized models, which can be enabled with:
`VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1`
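For offline use, the same switches might be applied as in the sketch below; only the two environment variables come from this release, while the model id and the 1024-token chunk budget (see #619 and #592) are illustrative:

```python
# Minimal sketch: chunked prefill for an unquantized model. Only the two
# environment variables come from these notes; the model id and the chunk
# budget via max_num_batched_tokens are illustrative assumptions.
import os

os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"
os.environ["VLLM_SPYRE_USE_CB"] = "1"  # set alongside chunked prefill, per these notes

from vllm import LLM

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",  # placeholder model id
    max_num_batched_tokens=1024,                  # prefill chunk budget
)
```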
What's Changed
- feat: chunked prefill spyre model runner by @wallashss in #552
- [tests] cleanup: remove temporary hack by @yannicks1 in #555
- ChunkedPrefillSpyreScheduler: No Interleaving by @sducouedic in #554
- fix: left padding of prompts less than chunk size by @wallashss in #557
- feat: left padding from model runner to scheduler by @wallashss in #559
- [CP] rewrite scheduler constraints for chunked prefill (🐛 fix) by @yannicks1 in #560
- [CB] remove decode/prefill prioritization heuristic by @yannicks1 in #561
- Bugfix: padding block cannot be reused with chunked prefill by @sducouedic in #563
- [CP] scheduler constraints typo by @yannicks1 in #565
- test: add maybe_xfail for quantized micro static batch logprobs checks by @tjohnson31415 in #566
- [CP] fix empty model runner output by @yannicks1 in #570
- [CB] remove env var VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by @yannicks1 in #562
- [CP] optimal chunked prefill scheduler constraints by @yannicks1 in #564
- [CP] Simplify code by @maxdebayser in #572
- [CB] tighten constraint max model length decode sequences by @yannicks1 in #573
- Set default chunk size to 4k for granite 3 8b TP4 by @tjohnson31415 in #571
- Interleave chunked prefills with single decoding steps by @sducouedic in #558
- feat/fix: add finish_requests to handle removal from ongoing_prefills by @tjohnson31415 in #577
- fix: check only decoding requests in _satisfies_last_chunk_constraints by @tjohnson31415 in #576
- tests: include chunked prefill on existing tests by @wallashss in #574
- Add step tests for chunked prefill by @maxdebayser in #575
- docs: chunked prefill updated documentation by @wallashss in #578
- [Docs] Prep and publish GH Pages doc by @rafvasq in #579
- [Docs] Update GH artifact versions by @rafvasq in #581
- [Docs] Add workflow files to docs action triggers by @rafvasq in #582
- [Docs] Avoid multiple artifacts by @rafvasq in #583
- fix test_compare_graphs_chunked_prefill by @tjohnson31415 in #580
- [Docs] Use mkdocs gh-deploy by @rafvasq in #584
- [Docs] Update links to documentation by @rafvasq in #587
- [PC] Refactor CB model runner to use vLLMs block pool by @maxdebayser in #585
- [Docs] Add note about move by @rafvasq in #589
- test: a few test configuration updates to have chunked prefill tests pass on Spyre by @tjohnson31415 in #588
- Update default granite 8b chunk size to 1024 by @tjohnson31415 in #592
- fix time logging and other small things by @yannicks1 in #590
Full Changelog: v1.2.3...v1.3.0
v1.2.3
Includes a change required to support Torch >= 2.8
What's Changed
- fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551
Full Changelog: v1.2.2...v1.2.3