The paper you linked focuses on expanding context length for long inputs,
but you seem to be highlighting uses for acceleration, or adventure role-play?
https://github.com/wawawario2/text-generation-webui/
Is this it?

https://arxiv.org/abs/2212.10947
It's pretty tricky to implement a prototype, because I had to add a lot of dirty code to huggingface transformers. Sorry, I can't provide a working patch unless I fork transformers, at least for now.
It should be noted that it's almost mandatory to perform the self-attention calculation in float32 precision; otherwise the model has trouble tracking multiple context windows.
Because of the way PCW works, if this is implemented properly we could cache the context past_key_value_states to accelerate conversations with chatbots.
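To illustrate the precision point, here is a minimal sketch of computing attention probabilities in float32 and casting back afterwards. The function and argument names are illustrative, not the actual patch:

```python
import torch

def attention_probs_fp32(query, key):
    """Self-attention probabilities computed in float32.

    Upcasting the score matmul and softmax is cheap relative to the rest
    of the model, and avoids the half-precision rounding that makes it
    hard to keep several parallel context windows apart.
    (Illustrative sketch, assuming (batch, heads, seq, head_dim) tensors.)
    """
    scale = query.shape[-1] ** -0.5
    # Upcast to float32 before the dot product and softmax.
    scores = torch.matmul(query.float(), key.float().transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    # Cast back to the model's working dtype afterwards.
    return probs.to(query.dtype)
```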
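Since each window is encoded independently, its key/value states only need to be computed once and can then be reused across turns. A hedged sketch of merging per-window caches (names are hypothetical; shapes follow the usual `(batch, heads, seq, head_dim)` layout):

```python
import torch

def merge_window_caches(window_caches):
    """Concatenate cached (key, value) states from several pre-encoded
    context windows along the sequence axis, layer by layer.

    window_caches: list over windows; each entry is a list over layers
    of (key, value) tensors shaped (batch, heads, seq, head_dim).
    Returns one merged per-layer cache that can be fed back to the model
    as past_key_value_states, so the windows are never re-encoded.
    (Illustrative sketch, not the actual patch.)
    """
    merged = []
    for layer_kvs in zip(*window_caches):  # same layer across all windows
        keys = torch.cat([k for k, _ in layer_kvs], dim=2)
        values = torch.cat([v for _, v in layer_kvs], dim=2)
        merged.append((keys, values))
    return merged
```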
The following result is sampled from LLaMA-30B (gptq-4bit) with two extra context windows.
Input
Output
Touched Files:
Simply hijack the use_cache mechanism and insert parallel (key_states, value_states).
The core revision is pretty simple, but it took me hours to integrate it into the webui.
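A minimal sketch of what "hijacking use_cache" could look like at one attention step: the pre-encoded windows' (key, value) pairs are spliced in ahead of the current states, just as past states are on the normal cache path. All names and shapes are illustrative, not the actual revision:

```python
import torch

def attend_with_parallel_windows(query, key_states, value_states, parallel_kvs):
    """One attention step with extra (key, value) pairs from parallel
    context windows concatenated in front of the current states.

    parallel_kvs: list of (key, value) tensors from pre-encoded windows,
    each shaped like key_states/value_states: (batch, heads, seq, head_dim).
    """
    keys = torch.cat([k for k, _ in parallel_kvs] + [key_states], dim=2)
    values = torch.cat([v for _, v in parallel_kvs] + [value_states], dim=2)
    scale = query.shape[-1] ** -0.5
    # Scores in float32, matching the precision note above.
    scores = torch.matmul(query.float(), keys.float().transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1).to(query.dtype)
    return torch.matmul(probs, values)
```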