
Conversation

@jiqing-feng (Contributor) commented Dec 15, 2025

CPU can also use the paged cache with eager or sdpa attention:
python continuous_batching_simple.py --attn sdpa

Without this change, the command above fails with:

Error in generation loop: unsupported operand type(s) for -: 'NoneType' and 'int'
Traceback (most recent call last):
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/continuous_api.py", line 1017, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 191, in __init__
    num_blocks, max_batch_tokens = memory_handler.infer_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 481, in infer_num_blocks_and_max_batch_tokens
    num_blocks, max_batch_tokens = self.compute_num_blocks_and_max_batch_tokens(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 522, in compute_num_blocks_and_max_batch_tokens
    cache_memory = self.get_available_memory(max_memory_percent)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiqing/transformers/src/transformers/generation/continuous_batching/cache.py", line 456, in get_available_memory
    available_memory = total - max(allocated, reserved)
                       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
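The traceback shows that `get_available_memory` assumes the device memory queries return integers, but on CPU `total` comes back as `None`, so `total - max(allocated, reserved)` raises a TypeError. A minimal sketch of a guarded fix under that assumption (the function signature and the CPU fallback budget here are illustrative, not the actual transformers implementation):

```python
def get_available_memory(total, allocated, reserved, cpu_fallback=8 * 1024**3):
    """Return the memory budget for the paged cache, in bytes.

    On CPU there is no device allocator, so `total`/`allocated`/`reserved`
    may all be None; fall back to a configurable byte budget instead of
    subtracting from a None total (the bug in the traceback above).
    """
    if total is None:
        # No device memory stats available (e.g. running on CPU):
        # use the fallback budget rather than device accounting.
        return cpu_fallback
    # Treat missing allocated/reserved counters as zero usage.
    return total - max(allocated or 0, reserved or 0)
```

The same guard could instead query host memory (e.g. via `psutil.virtual_memory().available`) to size the fallback dynamically; a fixed budget is just the simplest illustration.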

@remi-or (Collaborator) commented Dec 15, 2025

Hi @jiqing-feng , thanks for the contribution! Just letting you know that CPU-compatible continuous batching is not a priority right now, so even though this PR is small, it will not be reviewed right away. I am cautious about two things:

  1. How device_map="auto" behaves and how it affects the model's partitioning across devices
  2. The lack of tests / benchmarks. We have a small template for continuous batching PRs, as in [CB] Easy optimizations for continuous batching #42839; if you can follow it, that would be great.

Will get to reviewing this as soon as I have the bandwidth, thank you!

Signed-off-by: jiqing-feng <[email protected]>
