Initial version of prefix caching for Llama3 70B Galaxy #35904
Draft
viktorpusTT wants to merge 18 commits into main.
The review diff below sets up the prefill index variables for the two prefill cases; it is reformatted here for readability:

```python
# Chunked prefill path: indices are relative to the current chunk.
prefill_chunk_start_idx = chunk_start_idx
prefill_start_pos = chunk_start_idx  # Python int for attention (SDPA path, program config)
prefill_get_last_token = last_token_idx_relative  # RELATIVE index for slicing within chunk
prefill_last_token_idx = last_token_idx_relative

# Non-chunked prefill path: the prompt starts at position 0 and the absolute
# last-token index is used directly.
prefill_chunk_start_idx = 0
prefill_start_pos = 0
prefill_get_last_token = last_token_idx
prefill_last_token_idx = last_token_idx
```
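For context, the relative index is presumably the absolute last-token position offset by the start of the current chunk. A minimal sketch of that relationship, using hypothetical example values and the variable names from the snippet above (not the PR's actual code):

```python
# Hypothetical sketch: mapping an absolute last-token position to the
# chunk-relative index used above. chunk_start_idx, chunk_size and
# last_token_idx are assumed example values.
chunk_start_idx = 2048   # first absolute position covered by this prefill chunk
chunk_size = 1024        # number of tokens processed in this chunk
last_token_idx = 2303    # absolute index of the last prompt token

# RELATIVE index for slicing the current chunk's output
last_token_idx_relative = last_token_idx - chunk_start_idx
assert 0 <= last_token_idx_relative < chunk_size
print(last_token_idx_relative)  # 255
```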
Ticket
tenstorrent/vllm#268
Problem description
Automatic prefix caching (https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/#enabling-apc-in-vllm) allows cached KV entries to be re-used when a new prompt shares a prefix with a previously processed prompt. This significantly reduces time to first token (TTFT) when multiple users share a common prefix and when a conversation with a user continues over many turns.
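For reference, a minimal sketch of enabling APC on the vLLM side; the model name and prompts are placeholders, and the linked docs are the authoritative source:

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching so prompts that share a prefix reuse cached KV entries.
# The model name is a placeholder; the actual deployment targets Llama3 70B on Galaxy.
llm = LLM(model="meta-llama/Llama-3.1-70B", enable_prefix_caching=True)

shared_prefix = "<long shared document or system prompt>"
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The first call populates the KV cache for the shared prefix; the second call
# reuses it, which is what reduces time to first token (TTFT).
llm.generate(shared_prefix + "\nQuestion 1: ...", sampling_params)
llm.generate(shared_prefix + "\nQuestion 2: ...", sampling_params)
```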
What's changed
Following the previous prefix caching changes, this PR adds support for the Llama3 70B Galaxy optimized model.
Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always; the latter extends that with additional ones. Use your best judgement in deciding which is the most appropriate for your PR.
- `models-mandatory` preset (runs: Device perf regressions and Frequent model and ttnn tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Unit tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Quick tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)