v1.3.0
This release adds support for chunked prefill on non-quantized models. Enable it by setting the following environment variables:
VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1
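When vLLM is launched from a Python script rather than a shell, the same two variables can be set programmatically. The following is a minimal sketch under stated assumptions: the helper name `chunked_prefill_env` is hypothetical and not part of vLLM or the Spyre plugin, and only the two variable names and their value `1` come from this release note.

```python
import os


def chunked_prefill_env(base=None):
    """Return a copy of `base` (default: os.environ) with both flags set.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    env = dict(os.environ if base is None else base)
    # Variable names and values are taken from the release note above.
    env["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"
    env["VLLM_SPYRE_USE_CB"] = "1"
    return env


if __name__ == "__main__":
    env = chunked_prefill_env()
    # Show only the flags relevant to this release note.
    print({k: v for k, v in env.items() if k.startswith("VLLM_SPYRE_USE_")})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the server process, which leaves the parent process environment untouched.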
What's Changed
- feat: chunked prefill spyre model runner by @wallashss in #552
- [tests] cleanup: remove temporary hack by @yannicks1 in #555
- ChunkedPrefillSpyreScheduler: No Interleaving by @sducouedic in #554
- fix: left padding of prompts less than chunk size by @wallashss in #557
- feat: left padding from model runner to scheduler by @wallashss in #559
- [CP] rewrite scheduler constraints for chunked prefill (🐛 fix) by @yannicks1 in #560
- [CB] remove decode/prefill prioritization heuristic by @yannicks1 in #561
- Bugfix: padding block cannot be reused with chunked prefill by @sducouedic in #563
- [CP] scheduler constraints typo by @yannicks1 in #565
- test: add maybe_xfail for quantized micro static batch logprobs checks by @tjohnson31415 in #566
- [CP] fix empty model runner output by @yannicks1 in #570
- [CB] remove env var VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by @yannicks1 in #562
- [CP] optimal chunked prefill scheduler constraints by @yannicks1 in #564
- [CP] Simplify code by @maxdebayser in #572
- [CB] tighten constraint max model length decode sequences by @yannicks1 in #573
- Set default chunk size to 4k for granite 3 8b TP4 by @tjohnson31415 in #571
- Interleave chunked prefills with single decoding steps by @sducouedic in #558
- feat/fix: add finish_requests to handle removal from ongoing_prefills by @tjohnson31415 in #577
- fix: check only decoding requests in _satisfies_last_chunk_constraints by @tjohnson31415 in #576
- tests: include chunked prefill on existing tests by @wallashss in #574
- Add step tests for chunked prefill by @maxdebayser in #575
- docs: chunked prefill updated documentation by @wallashss in #578
- [Docs] Prep and publish GH Pages doc by @rafvasq in #579
- [Docs] Update GH artifact versions by @rafvasq in #581
- [Docs] Add workflow files to docs action triggers by @rafvasq in #582
- [Docs] Avoid multiple artifacts by @rafvasq in #583
- fix test_compare_graphs_chunked_prefill by @tjohnson31415 in #580
- [Docs] Use mkdocs gh-deploy by @rafvasq in #584
- [Docs] Update links to documentation by @rafvasq in #587
- [PC] Refactor CB model runner to use vLLMs block pool by @maxdebayser in #585
- [Docs] Add note about move by @rafvasq in #589
- test: a few test configuration updates to have chunked prefill tests pass on Spyre by @tjohnson31415 in #588
- Update default granite 8b chunk size to 1024 by @tjohnson31415 in #592
- fix time logging and other small things by @yannicks1 in #590
Full Changelog: v1.2.3...v1.3.0