
🚨 [generate] Never use cache_position anymore in generation#44816

Open
Cyrilvallez wants to merge 11 commits into main from fully-remove-cache-pos-from-generate

Conversation


@Cyrilvallez Cyrilvallez commented Mar 18, 2026

What does this PR do?

As per the title. This is the last of many PRs to remove cache_position. At this point, all models have already been updated to not use it, and it is fully ignored in all modeling code. This PR therefore removes its creation and usage in generate, so it is no longer passed as a kwarg anywhere.
This is fully safe, as all models already ignore it.

Note: the 🚨 marker is ONLY FOR REMOTE CODE. On the main repo, all models were previously adapted as explained, so there are no BC issues. For remote code, however, as with most things, this can break if the code uses cache_position in a weird way and does not provide a creation fallback inside the model.
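For remote-code authors, such a creation fallback can be as simple as deriving the positions from the past length. A minimal sketch (hypothetical helper name; plain Python lists stand in for the torch tensors used in practice):

```python
def cache_position_fallback(seq_length, past_length=0, cache_position=None):
    # Hypothetical fallback for a remote-code model: recreate the 1D
    # cache_position (a plain list here; torch.arange in practice) when
    # generate() no longer supplies it as a kwarg.
    if cache_position is None:
        cache_position = list(range(past_length, past_length + seq_length))
    return cache_position
```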

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez Cyrilvallez changed the title [generate] Never use cache_position anymore in generation [generate] 🚨 Never use cache_position anymore in generation Mar 18, 2026
@Cyrilvallez Cyrilvallez changed the title [generate] 🚨 Never use cache_position anymore in generation 🚨 [generate] Never use cache_position anymore in generation Mar 18, 2026
@zucchini-nlp zucchini-nlp left a comment


While I still remember it, let's remove it from the docs as well please, and if needed add correct examples with the cache

@Cyrilvallez
Member Author

run-slow: dia

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/dia"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit   | Description
RUN     | 37af686e | workflow commit (merge commit)
PR      | 5cb41c5b | branch commit (from PR)
main    | 4ec84a02 | base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@huggingface huggingface deleted a comment from github-actions bot Mar 18, 2026
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: csm, dia, ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, janus, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl

Contributor

@vasqu vasqu left a comment


Not approving yet because I want to discuss the deprecation a bit more:

  1. I still found references where, imo, they shouldn't be there.
  2. Do we keep cache positions in the generate preparation (as an alias for position ids)? I think we will get a remote-model apocalypse otherwise, and vLLM already showed how brittle this is.
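To illustrate point 2, a toy sketch of where position ids and cache positions coincide and where they do not (hypothetical helper; plain lists instead of tensors): without left padding the two are the same consecutive range, which is what would make an alias viable.

```python
def positions_with_padding(attention_mask_row, past_seen=0):
    # cache_position indexes slots in the cache and ignores padding,
    # while position_ids skip padded slots; with no padding they coincide.
    seq_len = len(attention_mask_row)
    cache_position = list(range(past_seen, past_seen + seq_len))
    position_ids, seen = [], 0
    for m in attention_mask_row:
        position_ids.append(seen if m else 0)
        seen += m
    return cache_position, position_ids
```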

Comment on lines -2071 to -2073
# build `cache_position` on the fly
seq_length = inputs["input_ids"].shape[1]
inputs = self.model._get_initial_cache_position(seq_length, self.model.device, inputs)
Contributor


Just for my peace of mind, can we run-slow with whisper?

Comment on lines -933 to -941
# Cache position (always 1D)
if (cache_position := model_kwargs.get("cache_position")) is not None:
    next_cache_position = (
        torch.arange(num_new_tokens, dtype=cache_position.dtype, device=cache_position.device)
        + cache_position[-1]
        + 1
    )
    next_cache_position = torch.cat((cache_position, next_cache_position))
    model_kwargs["cache_position"] = next_cache_position
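For reference, the removed extension logic boils down to appending consecutive positions after the last existing one; a pure-Python mirror (lists instead of torch tensors):

```python
def extend_cache_position(cache_position, num_new_tokens):
    # Mirror of the removed torch logic: the new positions continue
    # directly after the last existing cache position.
    next_positions = [cache_position[-1] + 1 + i for i in range(num_new_tokens)]
    return cache_position + next_positions
```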
Contributor


I think complete removal of cache position will be too breaking for remote code, and there is still quite a lot of it out there, just looking at all the vLLM stuff we had to fix 😭

Shouldn't position ids be the same as cache positions now? What do you think about passing this as an alias kwarg as well? We really need to check with a remote model, e.g. the deepseek v3 remote code maybe?
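The alias idea could look roughly like this in the input-preparation step (hypothetical helper; assumes 2D position_ids that are identical across the batch, so row 0 can stand in as the 1D cache_position):

```python
def add_cache_position_alias(model_inputs):
    # Hypothetical alias kwarg: reuse row 0 of position_ids as the 1D
    # cache_position, so unmodified remote code keeps receiving it.
    position_ids = model_inputs.get("position_ids")
    if position_ids is not None and model_inputs.get("cache_position") is None:
        model_inputs = {**model_inputs, "cache_position": position_ids[0]}
    return model_inputs
```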

Contributor


General question: Are we deprecating it everywhere?

I think I still see a few occurrences:

  • Mask creation
  • Within this executorch integration
  • Models
    • Lfm2
    • Ministral3
    • Mistral4
  • Tests

Imo, only the mask occurrence might be critical and might need to be kept a bit longer. Wdyt?
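On why mask creation is the sticky occurrence: the causal mask only needs each query token's absolute position, whether that comes from cache_position or from past_length plus an arange. A toy sketch (plain nested lists instead of tensors):

```python
def causal_mask_rows(query_positions, kv_length):
    # A key/value slot is visible iff its index does not exceed the
    # query token's absolute position.
    return [[1 if kv <= q else 0 for kv in range(kv_length)] for q in query_positions]
```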



4 participants