Labels: bug (Something isn't working), help wanted (Extra attention is needed)
Description
While using the cache (past_key_values) during speculative decoding, or even plain autoregressive decoding, the generated tokens can come out garbled and nonsensical. Because of this behavior, speculative sampling is slowed down (sometimes ending up slower than autoregressive decoding).
speculative_generate edits the cache by pruning the last tokens whenever a rejection happens, so I first suspected the errors came from that.
However, generation is also wrong in autoregressive_generate, even though the cache there is never edited or pruned.
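For reference, here is a minimal sketch of the kind of pruning I mean, assuming the legacy tuple cache format (one `(key, value)` pair per layer, tensors shaped `[batch, n_heads, seq_len, head_dim]`); the function name and toy shapes are mine, not from the actual code:

```python
import torch

def prune_cache(past_key_values, n_keep):
    """Keep only the first n_keep cached positions along the sequence axis.
    Assumes the legacy tuple-of-tuples cache format with tensors shaped
    [batch, n_heads, seq_len, head_dim]."""
    return tuple(
        (k[:, :, :n_keep, :], v[:, :, :n_keep, :])
        for k, v in past_key_values
    )

# Toy cache: 2 layers, batch 1, 4 heads, 7 cached positions, head_dim 8.
cache = tuple(
    (torch.randn(1, 4, 7, 8), torch.randn(1, 4, 7, 8)) for _ in range(2)
)

# After rejecting the last 2 drafted tokens, keep only the 5 accepted ones.
pruned = prune_cache(cache, 5)
```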
That leads me to think:
- Am I using past_key_values wrongly (passing it as a forward parameter and reading the new KV cache from the output)?
- Or is this a problem coming from my transformers/torch versions (latest stable)?
- Or is this an issue in transformers itself?
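To make the first question concrete, here is the contract I believe is correct, sketched with a toy stand-in model (ToyCachedModel is hypothetical, not the real transformers API): after the prefill pass, only the newly sampled token(s) are fed forward together with the cache. Feeding the full sequence again while also passing past_key_values is, as far as I understand, a common way to get garbled output.

```python
class ToyCachedModel:
    """Stand-in for a cached decoder, just to illustrate the contract.
    The 'cache' is simply the list of positions covered so far, and the
    'logit' is the last input token plus one."""

    def __call__(self, input_ids, past_key_values=None):
        cache_len = 0 if past_key_values is None else len(past_key_values)
        # The returned cache must cover every token processed so far.
        new_cache = list(range(cache_len + len(input_ids)))
        next_token = input_ids[-1] + 1
        return next_token, new_cache

model = ToyCachedModel()
prompt = [10, 11, 12]

# Prefill: full prompt, no cache yet.
token, cache = model(prompt)
generated = [token]

for _ in range(3):
    # Incremental step: pass ONLY the new token alongside the cache,
    # never the whole sequence again.
    token, cache = model([generated[-1]], past_key_values=cache)
    generated.append(token)
```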
I would greatly appreciate any help or advice here. Thanks!