fix: warn about non-obvious chunked prefill budget behavior #29531
Open
ntny wants to merge 1 commit into
chunked_prefill_size is easy to read as only a per-request chunk size, while max_prefill_tokens looks like the batch-level budget. In practice, when chunked prefill is enabled, both parameters cap the prefill batch budget and the effective limit is min(max_prefill_tokens, chunked_prefill_size). Add a warning for the confusing case where max_prefill_tokens is greater than chunked_prefill_size, so users can understand the non-obvious effective limit from startup logs.
Signed-off-by: Anton Pechenin <ntny1986@gmail.com>
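The rule described above can be sketched in a few lines of Python. This is only an illustration, not the code from this PR: the function names `effective_prefill_budget` and `warn_if_budget_is_capped` are hypothetical, and treating `None` or a non-positive `chunked_prefill_size` as "chunked prefill disabled" is an assumption made for the example.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def effective_prefill_budget(
    max_prefill_tokens: int, chunked_prefill_size: Optional[int]
) -> int:
    """Return the prefill batch budget implied by the two settings.

    Hypothetical helper. Assumption: None or a non-positive value means
    chunked prefill is disabled, so only max_prefill_tokens applies.
    """
    if chunked_prefill_size is None or chunked_prefill_size <= 0:
        return max_prefill_tokens
    # With chunked prefill enabled, both settings cap the budget.
    return min(max_prefill_tokens, chunked_prefill_size)


def warn_if_budget_is_capped(
    max_prefill_tokens: int, chunked_prefill_size: Optional[int]
) -> bool:
    """Log a warning when chunked_prefill_size lowers the budget.

    Returns True if a warning was emitted, False otherwise.
    """
    budget = effective_prefill_budget(max_prefill_tokens, chunked_prefill_size)
    if budget < max_prefill_tokens:
        logger.warning(
            "max_prefill_tokens (%d) is greater than chunked_prefill_size (%d); "
            "the effective prefill budget is %d tokens.",
            max_prefill_tokens,
            chunked_prefill_size,
            budget,
        )
        return True
    return False
```

For example, under these assumptions `warn_if_budget_is_capped(16384, 8192)` logs a warning and reports an effective budget of 8192 tokens, while `warn_if_budget_is_capped(4096, 8192)` stays silent because `max_prefill_tokens` is already the smaller value.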
Motivation
`chunked_prefill_size` can be read as only a per-request chunk size, while `max_prefill_tokens` looks like the batch-level budget. In practice, when chunked prefill is enabled, both parameters cap the prefill batch budget, and the effective limit is `min(max_prefill_tokens, chunked_prefill_size)`.
Modifications
Added a startup warning when `max_prefill_tokens > chunked_prefill_size` to make the effective prefill budget explicit.
Accuracy Tests
Not applicable. This change only adds a warning.
Speed Tests and Profiling
Not applicable. This change only adds a warning.
Checklist
Not applicable. This change only adds a warning.
CI States
Latest PR Test (Base): ⏳ Run #28304069792
Latest PR Test (Extra): ⏳ Run #28304069702