Generation using cache gives weird sentences #1

@romsto

Description

When using the cache (past_key_values) during speculative decoding, or even plain autoregressive decoding, the generated tokens can come out garbled and nonsensical. Because of this behavior, speculative sampling is slowed down (sometimes even ending up slower than autoregressive decoding).

speculative_generate edits the cache by pruning the last tokens when a rejection happens, so I first thought the errors came from that.
But the generation is also wrong in autoregressive_generate, even though the cache is never edited or pruned there.
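For context, the pruning step mentioned above works roughly like this: after a rejection, the cache entries written for the rejected draft tokens must be dropped so the cache length again matches the accepted prefix. This is only an illustrative sketch with a list-based cache (`prune_cache` and its shape are my own names, not the repo's actual speculative_generate; real HF caches are per-layer key/value tensors cropped along the sequence axis):

```python
# Toy KV-cache pruning after a speculative-decoding rejection (illustrative only).
def prune_cache(past_key_values, n_accepted_total):
    # Keep cache entries for the accepted prefix; drop entries for rejected
    # draft tokens so cache length == number of accepted tokens.
    keys, values = past_key_values
    return keys[:n_accepted_total], values[:n_accepted_total]

cache = (list(range(10)), list(range(10)))  # 10 cached positions
cache = prune_cache(cache, 7)               # last 3 draft tokens rejected
assert len(cache[0]) == 7 and len(cache[1]) == 7
```

If the cache is not cropped to exactly the accepted length, subsequent forward calls see stale positions and the positions of new tokens no longer line up, which is one plausible source of garbled output.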

That leads me to think one of the following:

  • Am I using past_key_values wrongly (passing it as a forward parameter and reading the updated KV cache from the output)?
  • Or is this a problem coming from my transformers/torch versions (latest stable)?
  • Or is this an issue from transformers itself?

I would greatly appreciate any help or advice here! Thanks.
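Since the question hinges on whether "pass the cache into forward, read the updated cache from the output" is the right pattern, here is a framework-free sketch of that loop: a toy single-head attention with a KV cache, checking that decoding one position at a time through the cache gives exactly the same outputs as recomputing attention over the full prefix. All names are illustrative; nothing here is the repo's or transformers' actual API:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all cached keys/values.
    d = len(q)
    scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in keys])
    return [sum(w * v[j] for w, v in zip(scores, values)) for j in range(d)]

def step(x, cache):
    # One decoding step: append this position's key/value to the cache,
    # then attend over everything cached so far (mimics passing
    # past_key_values in and reading the updated cache out).
    keys, values = cache
    keys.append(x)   # toy model: key = value = input vector
    values.append(x)
    return attend(x, keys, values), (keys, values)

random.seed(0)
seq = [[random.random() for _ in range(4)] for _ in range(6)]

# Incremental decoding: feed ONE new position at a time, carrying the cache.
cache = ([], [])
incremental = []
for x in seq:
    out, cache = step(x, cache)
    incremental.append(out)

# Full recompute: position i attends over prefix 0..i with no cache reuse.
full = [attend(seq[i], seq[:i + 1], seq[:i + 1]) for i in range(len(seq))]

assert all(abs(a - b) < 1e-9
           for o1, o2 in zip(incremental, full)
           for a, b in zip(o1, o2))
```

The common real-world pitfall this pattern guards against: when past_key_values is supplied, only the new token(s) should be passed to forward; feeding the full input_ids again, or not extending the attention mask to cover the cached length, misaligns positions and produces exactly the kind of garbled text described above.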

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), help wanted (Extra attention is needed)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions