Initial version of prefix caching for Llama3 70B Galaxy #35904
Draft
viktorpusTT wants to merge 18 commits into main.
The review diff below sets up the prefill index variables for the two prefill cases; it is reformatted here for readability:

```python
# Chunked prefill path: indices are relative to the current chunk.
prefill_chunk_start_idx = chunk_start_idx
prefill_start_pos = chunk_start_idx  # Python int for attention (SDPA path, program config)
prefill_get_last_token = last_token_idx_relative  # RELATIVE index for slicing within chunk
prefill_last_token_idx = last_token_idx_relative

# Non-chunked prefill path: the prompt starts at position 0 and the absolute
# last-token index is used directly.
prefill_chunk_start_idx = 0
prefill_start_pos = 0
prefill_get_last_token = last_token_idx
prefill_last_token_idx = last_token_idx
```
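For context, the relative index is presumably the absolute last-token position offset by the start of the current chunk. A minimal sketch of that relationship, using hypothetical example values and the variable names from the snippet above (not the PR's actual code):

```python
# Hypothetical sketch: mapping an absolute last-token position to the
# chunk-relative index used above. chunk_start_idx, chunk_size and
# last_token_idx are assumed example values.
chunk_start_idx = 2048   # first absolute position covered by this prefill chunk
chunk_size = 1024        # number of tokens processed in this chunk
last_token_idx = 2303    # absolute index of the last prompt token

# RELATIVE index for slicing the current chunk's output
last_token_idx_relative = last_token_idx - chunk_start_idx
assert 0 <= last_token_idx_relative < chunk_size
print(last_token_idx_relative)  # 255
```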
Ticket
tenstorrent/vllm#268
Problem description
Automatic prefix caching (https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/#enabling-apc-in-vllm) allows cached KV entries to be re-used when a new prompt shares a prefix with a previously processed prompt. This significantly reduces time to first token (TTFT) when multiple users share a common prefix and when a conversation with a user continues over many turns.
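For reference, a minimal sketch of enabling APC on the vLLM side; the model name and prompts are placeholders, and the linked docs are the authoritative source:

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching so prompts that share a prefix reuse cached KV entries.
# The model name is a placeholder; the actual deployment targets Llama3 70B on Galaxy.
llm = LLM(model="meta-llama/Llama-3.1-70B", enable_prefix_caching=True)

shared_prefix = "<long shared document or system prompt>"
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The first call populates the KV cache for the shared prefix; the second call
# reuses it, which is what reduces time to first token (TTFT).
llm.generate(shared_prefix + "\nQuestion 1: ...", sampling_params)
llm.generate(shared_prefix + "\nQuestion 2: ...", sampling_params)
```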
What's changed
Following the previous prefix caching changes, this PR adds support for the Llama3 70B Galaxy optimized model.
Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always; the latter extends that with additional ones. Use your best judgement in deciding which is the most appropriate for your PR.
- `models-mandatory` preset (runs: Device perf regressions and Frequent model and ttnn tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Unit tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Quick tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)