Replies: 2 comments · 1 reply
-
Hi @slewie, one guess here is that your transformer models are running on CPU instead of GPU when loaded with the default settings. For reference, you can pass a device map explicitly:

```python
from guidance import models

gpt = models.Transformers('gpt2', device_map='auto')  # `auto` or any other device map
```
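A quick way to sanity-check device placement (a sketch assuming a CUDA setup; `hf_device_map` is the attribute `accelerate` sets on dispatched Hugging Face models, not a guidance-specific API):

```python
import torch
from transformers import AutoModelForCausalLM

print(torch.cuda.is_available())  # False here would explain CPU-only speeds

# Loading the same checkpoint directly shows where accelerate placed the weights
hf_model = AutoModelForCausalLM.from_pretrained('gpt2', device_map='auto')
print(hf_model.hf_device_map)     # e.g. {'': 0} when everything fits on GPU 0
```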
-
I think there is a more interesting problem: model size has no impact on computation time. I loaded two models:

```python
model_llama34 = models.Transformers("Phind/Phind-CodeLlama-34B-v2", echo=False, device_map='balanced')
model_llama7 = models.Transformers("codellama/CodeLlama-7b-Instruct-hf", echo=False, device_map='balanced')
```

and tried prompts of different lengths.
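A minimal way to time the two models side by side (a sketch assuming guidance's `+ gen()` composition API, not the original benchmark; the prompt is hypothetical):

```python
import time
from guidance import gen

prompt = "Write a function that reverses a string.\n"  # hypothetical prompt

for name, model in [("34B", model_llama34), ("7B", model_llama7)]:
    start = time.perf_counter()
    lm = model + prompt + gen(max_tokens=64)  # generate up to 64 new tokens
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```

If the 7B and 34B timings really do come out nearly identical, that would point at a bottleneck outside the forward pass (e.g. tokenization or the token-by-token control loop) rather than at the models themselves.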
-
Hello! I tried to classify some texts using guidance, but it was slow, so I tried plain transformers, and it turned out that guidance was about 5 times slower.
Code: *(screenshot in the original post)*
Results for that model:
- Guidance: *(timing screenshot)*
- transformers: *(timing screenshot)*
Also, with guidance there is practically no difference between model sizes; they run in about the same time. I tried the big CodeLlama and got the same results.
Why does this problem occur?
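For context, a classification loop of the kind described might look like the sketch below; it is hypothetical (the model, inputs, and label set are assumptions, not the original code), using guidance's `select` to constrain the output to a fixed label set:

```python
import time
from guidance import models, select

lm_base = models.Transformers('gpt2', echo=False, device_map='auto')

texts = ["I loved this movie!", "Terrible service, never again."]  # hypothetical inputs
labels = ["positive", "negative"]                                  # hypothetical label set

start = time.perf_counter()
for text in texts:
    # select() restricts generation to one of the given options
    lm = lm_base + f"Text: {text}\nSentiment: " + select(labels, name="label")
    print(text, "->", lm["label"])
print(f"guidance: {time.perf_counter() - start:.2f}s")
```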