Initial version of prefix caching for Llama3 70B Galaxy #35904

Draft
viktorpusTT wants to merge 18 commits into main from viktorpus/70b-galaxy-prefix-caching

Conversation

@viktorpusTT (Contributor) commented on Jan 15, 2026

Ticket

tenstorrent/vllm#268

Problem description

Automatic prefix caching (https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/#enabling-apc-in-vllm) allows cached KV entries to be reused when a new prompt shares a prefix with a previously processed prompt. This significantly reduces time to first token (TTFT) when multiple users share a common prompt prefix, or when a multi-turn conversation with a single user is continued.
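For reference, automatic prefix caching is enabled on the vLLM side via the `enable_prefix_caching` engine argument. The snippet below is a minimal sketch of the feature itself, not of this PR; the model name and prompts are illustrative, and TT-specific engine arguments are omitted:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: enable automatic prefix caching (APC) in vLLM.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", enable_prefix_caching=True)

# Both prompts share a long prefix (e.g. a system prompt plus a document).
shared_prefix = "You are a helpful assistant. Here is the document:\n..."
params = SamplingParams(temperature=0.0, max_tokens=64)

# The second request reuses the KV cache computed for the shared prefix,
# so only the new suffix needs to be prefilled, reducing TTFT.
print(llm.generate([shared_prefix + "\nSummarize it."], params))
print(llm.generate([shared_prefix + "\nList the key points."], params))
```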

What's changed

Building on the previous prefix-caching changes, this PR adds prefix-caching support for the Llama3 70B Galaxy optimized model.

Checklist

  • All post-commit tests
  • Blackhole Post commit
  • cpp-unit-tests
  • New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). The "Choose your pipeline" workflows make it easy to run multiple kinds of tests in a single run; each offers a models-mandatory and a models-extended preset. The former includes a minimal set of tests that should always be run; the latter extends it with additional ones. Use your best judgement to decide which is most appropriate for your PR.

viktorpusTT force-pushed the viktorpus/70b-galaxy-prefix-caching branch from ca1eac3 to 134230d on January 15, 2026 15:21
skhorasganiTT self-requested a review on January 15, 2026 15:43
viktorpusTT force-pushed the viktorpus/70b-galaxy-prefix-caching branch from 134230d to 638453b on January 30, 2026 12:26
viktorpusTT force-pushed the viktorpus/70b-galaxy-prefix-caching branch 2 times, most recently from 688fdd4 to 250c8c8 on February 4, 2026 13:51
viktorpusTT force-pushed the viktorpus/70b-galaxy-prefix-caching branch from 250c8c8 to 5e92be5 on February 4, 2026 14:09
viktorpusTT force-pushed the viktorpus/70b-galaxy-prefix-caching branch from d545675 to 8dac492 on February 11, 2026 16:28
```python
# Excerpt from the prefill path (branch condition elided in this excerpt).
# Chunked prefill with a cached/offset prefix: the chunk starts at an absolute
# position, but slicing within the chunk uses a RELATIVE last-token index.
prefill_chunk_start_idx = chunk_start_idx
prefill_start_pos = chunk_start_idx  # Python int for attention (SDPA path, program config)
prefill_get_last_token = last_token_idx_relative  # RELATIVE index for slicing within chunk
prefill_last_token_idx = last_token_idx_relative

# Single-chunk prefill from position 0: no offset, absolute indices are used directly.
prefill_chunk_start_idx = 0
prefill_start_pos = 0
prefill_get_last_token = last_token_idx
prefill_last_token_idx = last_token_idx
```
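To illustrate why the chunked path needs a chunk-relative index while attention still needs the absolute start position, here is a minimal sketch of the index arithmetic with a cached prefix. All names and values below are hypothetical; they are not the identifiers or sizes used in this PR:

```python
# Illustrative sketch of chunked prefill indexing with a cached prefix.
prompt_len = 2048          # total prompt tokens (hypothetical)
num_cached_tokens = 1536   # tokens already in the KV cache from a shared prefix
chunk_size = 512           # prefill chunk size (hypothetical)

# Only the uncached tail of the prompt is prefilled, one chunk at a time.
chunk_start_idx = num_cached_tokens   # absolute position where this chunk starts
last_token_idx = prompt_len - 1       # absolute index of the last prompt token

# Slicing within the chunk must use an index relative to the chunk start,
# while attention still receives the absolute start position.
last_token_idx_relative = last_token_idx - chunk_start_idx  # 511 here

assert 0 <= last_token_idx_relative < chunk_size
```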