Labels: bug (Something isn't working), help wanted (Extra attention is needed)
Description
While using the cache (past_key_values) during speculative decoding, or even plain autoregressive decoding, the generated tokens can come out garbled and nonsensical. Because of this behavior, speculative sampling is slowed down (sometimes ending up slower than autoregressive decoding).
speculative_generate edits the cache by pruning the last tokens whenever a rejection happens, so I first suspected the errors came from that.
However, generation is also wrong in autoregressive_generate, even though the cache there is never edited or pruned.
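For reference, here is a minimal sketch of the kind of pruning I mean, assuming the legacy tuple cache format (one `(key, value)` pair per layer, tensors shaped `[batch, n_heads, seq_len, head_dim]`); the function name and toy shapes are mine, not from the actual code:

```python
import torch

def prune_cache(past_key_values, n_keep):
    """Keep only the first n_keep cached positions along the sequence axis.
    Assumes the legacy tuple-of-tuples cache format with tensors shaped
    [batch, n_heads, seq_len, head_dim]."""
    return tuple(
        (k[:, :, :n_keep, :], v[:, :, :n_keep, :])
        for k, v in past_key_values
    )

# Toy cache: 2 layers, batch 1, 4 heads, 7 cached positions, head_dim 8.
cache = tuple(
    (torch.randn(1, 4, 7, 8), torch.randn(1, 4, 7, 8)) for _ in range(2)
)

# After rejecting the last 2 drafted tokens, keep only the 5 accepted ones.
pruned = prune_cache(cache, 5)
```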
That leads me to think:
- Am I using past_key_values wrongly (passing it as a forward parameter and reading the new KV cache from the output)?
- Or is this a problem coming from my transformers/torch versions (latest stable)?
- Or is this an issue in transformers itself?
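To make the first question concrete, here is the contract I believe is correct, sketched with a toy stand-in model (ToyCachedModel is hypothetical, not the real transformers API): after the prefill pass, only the newly sampled token(s) are fed forward together with the cache. Feeding the full sequence again while also passing past_key_values is, as far as I understand, a common way to get garbled output.

```python
class ToyCachedModel:
    """Stand-in for a cached decoder, just to illustrate the contract.
    The 'cache' is simply the list of positions covered so far, and the
    'logit' is the last input token plus one."""

    def __call__(self, input_ids, past_key_values=None):
        cache_len = 0 if past_key_values is None else len(past_key_values)
        # The returned cache must cover every token processed so far.
        new_cache = list(range(cache_len + len(input_ids)))
        next_token = input_ids[-1] + 1
        return next_token, new_cache

model = ToyCachedModel()
prompt = [10, 11, 12]

# Prefill: full prompt, no cache yet.
token, cache = model(prompt)
generated = [token]

for _ in range(3):
    # Incremental step: pass ONLY the new token alongside the cache,
    # never the whole sequence again.
    token, cache = model([generated[-1]], past_key_values=cache)
    generated.append(token)
```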
I would greatly appreciate any help or advice here. Thanks!