The paper you linked focuses on expanding context length for long inputs,
but you seem to be highlighting uses for acceleration, or adventure role-play?
https://github.com/wawawario2/text-generation-webui/
Is this it?

https://arxiv.org/abs/2212.10947
It's pretty tricky to implement a prototype, because I had to add a lot of dirty code to huggingface transformers. Sorry, I can't provide a working patch unless I fork transformers, at least for now.
It should be noted that it's almost mandatory to perform the self-attention calculation in float32 precision; otherwise the model has trouble tracking multiple context windows.
Because of the way PCW works, if this is implemented properly we could cache the context past_key_value_states to accelerate conversations with chatbots.
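To illustrate the precision point, here is a minimal sketch of computing attention probabilities in float32 and casting back afterwards. The function and argument names are illustrative, not the actual patch:

```python
import torch

def attention_probs_fp32(query, key):
    """Self-attention probabilities computed in float32.

    Upcasting the score matmul and softmax is cheap relative to the rest
    of the model, and avoids the half-precision rounding that makes it
    hard to keep several parallel context windows apart.
    (Illustrative sketch, assuming (batch, heads, seq, head_dim) tensors.)
    """
    scale = query.shape[-1] ** -0.5
    # Upcast to float32 before the dot product and softmax.
    scores = torch.matmul(query.float(), key.float().transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    # Cast back to the model's working dtype afterwards.
    return probs.to(query.dtype)
```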
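Since each window is encoded independently, its key/value states only need to be computed once and can then be reused across turns. A hedged sketch of merging per-window caches (names are hypothetical; shapes follow the usual `(batch, heads, seq, head_dim)` layout):

```python
import torch

def merge_window_caches(window_caches):
    """Concatenate cached (key, value) states from several pre-encoded
    context windows along the sequence axis, layer by layer.

    window_caches: list over windows; each entry is a list over layers
    of (key, value) tensors shaped (batch, heads, seq, head_dim).
    Returns one merged per-layer cache that can be fed back to the model
    as past_key_value_states, so the windows are never re-encoded.
    (Illustrative sketch, not the actual patch.)
    """
    merged = []
    for layer_kvs in zip(*window_caches):  # same layer across all windows
        keys = torch.cat([k for k, _ in layer_kvs], dim=2)
        values = torch.cat([v for _, v in layer_kvs], dim=2)
        merged.append((keys, values))
    return merged
```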
The following result is sampled from LLaMA-30B (gptq-4bit) with two extra context windows.
Input
Output
Touched Files:
Simply hijack the use_cache mechanism and insert parallel (key_states, value_states).
The core revision is pretty simple, but it took me hours to integrate it into the webui.
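A minimal sketch of what "hijacking use_cache" could look like at one attention step: the pre-encoded windows' (key, value) pairs are spliced in ahead of the current states, just as past states are on the normal cache path. All names and shapes are illustrative, not the actual revision:

```python
import torch

def attend_with_parallel_windows(query, key_states, value_states, parallel_kvs):
    """One attention step with extra (key, value) pairs from parallel
    context windows concatenated in front of the current states.

    parallel_kvs: list of (key, value) tensors from pre-encoded windows,
    each shaped like key_states/value_states: (batch, heads, seq, head_dim).
    """
    keys = torch.cat([k for k, _ in parallel_kvs] + [key_states], dim=2)
    values = torch.cat([v for _, v in parallel_kvs] + [value_states], dim=2)
    scale = query.shape[-1] ** -0.5
    # Scores in float32, matching the precision note above.
    scores = torch.matmul(query.float(), keys.float().transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1).to(query.dtype)
    return torch.matmul(probs, values)
```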