
Conversation

@jiqing-feng (Contributor) commented Dec 15, 2025

CPU can also use the paged cache with eager or sdpa attention:
python continuous_batching_simple.py --attn sdpa

Without this change, the command above fails with:

Error in generation loop: unsupported operand type(s) for -: 'NoneType' and 'int'
Traceback (most recent call last):
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/continuous_api.py", line 1017, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 191, in __init__
    num_blocks, max_batch_tokens = memory_handler.infer_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 481, in infer_num_blocks_and_max_batch_tokens
    num_blocks, max_batch_tokens = self.compute_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 522, in compute_num_blocks_and_max_batch_tokens
    cache_memory = self.get_available_memory(max_memory_percent)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 456, in get_available_memory
    available_memory = total - max(allocated, reserved)
                       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
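The traceback shows that `get_available_memory` assumes the device memory queries return integers, but on CPU `total` comes back as `None`, so `total - max(allocated, reserved)` raises a TypeError. A minimal sketch of a guarded fix under that assumption (the function signature and the CPU fallback budget here are illustrative, not the actual transformers implementation):

```python
def get_available_memory(total, allocated, reserved, cpu_fallback=8 * 1024**3):
    """Return the memory budget for the paged cache, in bytes.

    On CPU there is no device allocator, so `total`/`allocated`/`reserved`
    may all be None; fall back to a configurable byte budget instead of
    subtracting from a None total (the bug in the traceback above).
    """
    if total is None:
        # No device memory stats available (e.g. running on CPU):
        # use the fallback budget rather than device accounting.
        return cpu_fallback
    # Treat missing allocated/reserved counters as zero usage.
    return total - max(allocated or 0, reserved or 0)
```

The same guard could instead query host memory (e.g. via `psutil.virtual_memory().available`) to size the fallback dynamically; a fixed budget is just the simplest illustration.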

@remi-or (Collaborator) commented Dec 15, 2025

Hi @jiqing-feng , thanks for the contribution! Just letting you know that CPU-compatible continuous batching is not a priority right now, so even though this PR is small, it will not be reviewed right away. I am cautious about two things:

  1. How device_map="auto" behaves and how it affects the model's partitioning across devices
  2. The lack of tests / benchmarks. We have a small template for continuous batching PRs, as in [CB] Easy optimizations for continuous batching #42839; if you can follow it, that would be great.

Will get to reviewing this as soon as I have the bandwidth, thank you!

Signed-off-by: jiqing-feng <[email protected]>
