Generate: remove most decoder-only LLMs `prepare_inputs_for_generation` #33870

gante · 2024-10-01T16:25:16Z

What does this PR do?

Part of step 6 in #32685
Follow-up to #33677

This PR:

revises GenerationMixin.prepare_inputs_for_generation so as to handle models WITHOUT the Cache refactor, prepare token_type_ids, and forward arbitrary kwargs
because of 1., we can remove this function from most decoder-only LLMs 🧹🤗🧹 All decoder-only LLMs were checked
added a comment on each overwrite occurring in decoder-only LLMs, for our future selves

✅ slow tests were ran on llama and gpt2

HuggingFaceDocBuilderDev · 2024-10-01T16:25:30Z

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

Add the run-slow label to the PR
When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma separated list of all the models to be tested, i.e. [run_slow] model_to_test_1, model_to_test_2
- If the pull request affects a lot of models, put at most 10 models in the commit message
A transformers maintainer will then approve the workflow to start the tests

(For maintainers) The documentation for slow tests CI on PRs is here.

gante · 2024-10-01T16:27:59Z

src/transformers/generation/utils.py

@@ -350,47 +350,69 @@ def prepare_inputs_for_generation(
        attention_mask: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        cache_position: Optional[torch.LongTensor] = None,
-        position_ids: Optional[torch.LongTensor] = None,


Not all models expect this one. We now inspect the signature to determine whether we need to generate them on the fly

gante · 2024-10-01T16:28:31Z

src/transformers/generation/utils.py

-        use_cache: bool = True,
-        num_logits_to_keep: Optional[int] = None,


these are moved to kwargs. We now forward kwargs to the model inputs :)

HuggingFaceDocBuilderDev · 2024-10-01T16:49:27Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

akshit397a

This is up to mark working efficiently

zucchini-nlp

Wow, so much code killed, thanks!

zucchini-nlp · 2024-10-03T08:27:16Z

src/transformers/models/blenderbot/modeling_blenderbot.py

+        # Overwritten -- model logic breaks when `inputs_embeds` are passed from this function
+


Just curious: does that mean blenderbot cannot generate from inputs embeds and it cannot be fixed? I see many models touched here didn't pass further inputs embeds, so that mean after this PR all of them will support generation from embeddings. So interesting to see why this model failed

I see many models touched here didn't pass further inputs embeds, so that mean after this PR all of them will support generation from embeddings.

Precisely! Many models will get this feature for free as part of these deletions 💛

Just curious: does that mean blenderbot cannot generate from inputs embeds and it cannot be fixed?

No clue, I didn't dive deeper :) Failed in inputs_embeds tests -> pasted this comment. I don't think these combos of model/feature are worth the dive, so I left this low-information (but better than nothing) note

Actually the test was just flaky! I've added flakiness protection to the failing test and deleted a few more cases :)

zucchini-nlp · 2024-10-03T08:39:12Z

tests/models/paligemma/test_modeling_paligemma.py

+    @unittest.skip(reason="TODO (@joao): fix me -- failing to produce similar results")
+    def test_static_cache_matches_dynamic(self):
+        pass
+


I think this was marked flaky for VLMs in one of the other PRs

With this PR, it becomes a failure all the times 👀 I have no idea why (didn't dive)

super sad, i started diving a while ago and that seems related to paligemma's weird masking for prefix/suffix. I'll see if I can get time to spot the bug

gante · 2024-10-03T16:09:46Z

tests/test_modeling_common.py

@@ -2837,7 +2837,7 @@ def test_inputs_embeds_matches_input_ids(self):

    def test_inputs_embeds_matches_input_ids_with_generate(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-        for model_class in self.all_model_classes:
+        for model_class in self.all_generative_model_classes:


(this test calls generate)

ArthurZucker · 2024-10-04T11:15:30Z

src/transformers/generation/utils.py

+        if (
+            attention_mask is not None
+            and kwargs.get("position_ids") is None
+            and "position_ids" in set(inspect.signature(self.forward).parameters.keys())


quick Q, how fast is this / is it slowing down generation?

we can store the inspect result if needed otherwise!

It's not too bad, but can be improved, yes. On my machine, this adds 0.024ms per generated token (small, but not negligible). If we cache the inspect.signature, we reduce it by 100x.

We actually make several inspect.signature(foward) calls in generate and other bits of the codebase, I think it makes sense to store the inspect as a cached model property (e.g. model.forward_signature). WDYT? If you agree, I'll open a follow-up PR with this change

For completeness, script to measure the impact of caching this call:

import time import inspect from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("distilgpt2") # Fresh inspect all_times = [] for _ in range(1000): start = time.time() "position_ids" in set(inspect.signature(model.forward).parameters.keys()) all_times.append(time.time() - start) print(sum(all_times) / len(all_times)) # Cached inspect signature_keys = set(inspect.signature(model.forward).parameters.keys()) all_times = [] for _ in range(1000): start = time.time() "position_ids" in signature_keys all_times.append(time.time() - start) print(sum(all_times) / len(all_times))

makes sense

ArthurZucker · 2024-10-04T11:15:40Z

src/transformers/generation/utils.py


+        # 4. Create missing `position_ids` on the fly


ArthurZucker · 2024-10-04T11:16:17Z

src/transformers/generation/utils.py

+        ):
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            kwargs["position_ids"] = position_ids  # placed in kwargs for further processing (see below)


seen in other PRs, that it needed to be sliced to seq_length no? -seq_len:

Yes, slicing happens in the code block after this one. That code block abstracts slicing to other input names (e.g. token_type_ids needs to be sliced exactly like position_ids -- and we can add more to this list as needed 🤗 )

ArthurZucker · 2024-10-04T11:16:42Z

src/transformers/generation/utils.py

+        for key, value in kwargs.items():
+            if key not in model_inputs:
+                model_inputs[key] = value


not sure this is super efficient TBH!

Its run time is negligible, even if kwargs contains a handful of entries (usually it will only contain one or two). At most 0.001 ms per call :P

On the plus side, this code block will allow us to generalize this function to VLMs 😉 I think that's worth the super small cost.

import time import torch all_times = [] for _ in range(1000): model_inputs = {str(i): i for i in range(10)} kwargs = {'a': 1, 'b': 2, 'c': torch.zeros((100, 100)), "0": 12, "1": 3546} start = time.time() for key, value in kwargs.items(): if key not in model_inputs: model_inputs[key] = value all_times.append(time.time() - start) print(sum(all_times) / len(all_times))

ArthurZucker

Okay good for me, let's fix generate tests if related

gante · 2024-10-09T11:15:41Z

before merging, ran locally:

py.test tests/models/ -k greedy -- all PT tests passing
RUN_SLOW=1 py.test tests/models/gpt2/test_modeling_gpt2.py
RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py (there are two failing tests, but they come from main)

The error in PEFT is occurring after this transformers change: huggingface/transformers#33870 Now, in our tests, some model_kwargs no longer necessarily contain past_key_values, resulting in a KeyError. We now account for this possibility. Affected models were opt and gpt2.

…n` (huggingface#33870)

The error in PEFT is occurring after this transformers change: huggingface/transformers#33870 Now, in our tests, some model_kwargs no longer necessarily contain past_key_values, resulting in a KeyError. We now account for this possibility. Affected models were opt and gpt2.

gante commented Oct 1, 2024

View reviewed changes

akshit397a approved these changes Oct 1, 2024

View reviewed changes

gante requested review from ArthurZucker and zucchini-nlp October 2, 2024 09:52

zucchini-nlp approved these changes Oct 3, 2024

View reviewed changes

gante commented Oct 3, 2024

View reviewed changes

gante added 11 commits October 3, 2024 16:59

up to ctrl

39aff36

delete many prepare_inputs_for_generation

860910b

fix test

c70c0e2

fix paligemma tests after this pr's changes

26dcd92

ignore labels

9fbaebb

handle empty outputs

8f272b3

skip test :(

7b66d42

add test

d4bcbd1

fix flakiness in test; remove a few more :D

3edca4c

handle edge case without crashing

73ceafa

ok, this one doesn't work

d9ad07e

gante force-pushed the gpt2_prepare branch from a09a4b2 to d9ad07e Compare October 3, 2024 17:00

ArthurZucker reviewed Oct 4, 2024

View reviewed changes

ArthurZucker approved these changes Oct 4, 2024

View reviewed changes

gante and others added 4 commits October 7, 2024 10:57

Merge branch 'main' into gpt2_prepare

215d5f7

make fixup

3aeccb1

Merge branch 'main' into gpt2_prepare

bb5428a

test all

11fd910

gante merged commit 295a90c into huggingface:main Oct 9, 2024
24 checks passed

gante deleted the gpt2_prepare branch October 9, 2024 11:15

gante mentioned this pull request Oct 9, 2024

tracker: move prepare_inputs_for_generation into the generation mixin 🧹 #32685

Closed

8 tasks

BenjaminBossan mentioned this pull request Oct 9, 2024

FIX Prompt learning with latest transformers error huggingface/peft#2140

Merged

gante mentioned this pull request Oct 9, 2024

Generate: move prepare_inputs_for_generation in encoder-decoder llms #34048

Merged

echarlaix mentioned this pull request Oct 30, 2024

transformers 4.46 huggingface/optimum#2078

Merged

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024

Generate: remove most decoder-only LLMs `prepare_inputs_for_generatio…

ee718e1

…n` (huggingface#33870)

yafshar mentioned this pull request Feb 4, 2025

Add _prepare_inputs_for_generation huggingface/optimum-habana#1743

Merged

3 tasks

		use_cache: bool = True,
		num_logits_to_keep: Optional[int] = None,

		# Overwritten -- model logic breaks when `inputs_embeds` are passed from this function

Generate: remove most decoder-only LLMs prepare_inputs_for_generation #33870

Generate: remove most decoder-only LLMs prepare_inputs_for_generation #33870

Uh oh!

Conversation

gante commented Oct 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Oct 1, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented Oct 1, 2024

Uh oh!

akshit397a left a comment

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gante Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gante Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

gante commented Oct 9, 2024

Uh oh!

Uh oh!

Uh oh!

Generate: remove most decoder-only LLMs `prepare_inputs_for_generation` #33870

Generate: remove most decoder-only LLMs `prepare_inputs_for_generation` #33870

gante commented Oct 1, 2024 •

edited

Loading

gante Oct 4, 2024 •

edited

Loading

gante Oct 4, 2024 •

edited

Loading