Intel NPU operation related #1081

Open

@Oneul-hyeon

Description

Hello,

I want to run an on-device sLM on the NPU in my "Intel(R) Core(TM) Ultra 5" processor.

However, while I have confirmed that the code below works on the CPU and iGPU, no answer is produced on the NPU.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

def make_template(context):
    instruction = f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.

#Context:{context}

#Translation:"""

    messages = [{"role": "user", "content": instruction}]

    # Build the prompt tensor using the model's chat template
    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")

    return input_ids

def translate(context):
    input_ids = make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)

    # Strip the prompt tokens and decode only the generated continuation
    answer = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)

    return answer.rstrip()

if __name__ == "__main__":
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens = 1024
    do_sample = False
    temperature = 0.1
    top_p = 0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days? A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''

    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()

    print("generated_text:", generated_text)

    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    # Average latency per generated token (seconds/token)
    avg_time_per_token = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')

    print(f"Total Inference Time : {total_time} s")
    print(f"Average time per token: {avg_time_per_token:.4f} seconds/token")
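To make the CPU/iGPU/NPU comparison repeatable, here is a compact debugging sketch. `check_devices` is a hypothetical helper (not part of optimum-intel); it assumes the same packages and model id as the script above and simply reports whether each device returns any new tokens:

```python
# Hypothetical debugging helper: load the same OpenVINO model on each
# device in turn and report how many new tokens it generates.
# Assumes `optimum-intel` and `transformers` are installed and the
# model id from the issue is reachable.
def check_devices(model_id, prompt, devices=("CPU", "GPU", "NPU")):
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for device in devices:
        model = OVModelForCausalLM.from_pretrained(model_id, device=device)
        outputs = model.generate(input_ids, max_new_tokens=8, do_sample=False)
        # Decoder-only models return prompt + continuation, so subtract
        # the prompt length to count newly generated tokens
        n_new = outputs.shape[-1] - input_ids.shape[-1]
        print(f"{device}: generated {n_new} new token(s)")

# Usage (downloads the model, so not run here):
# check_devices("AIFunOver/gemma-2-2b-it-openvino-8bit", "Hello")
```

A device that compiles the model but produces 0 new tokens would confirm the symptom is NPU-specific rather than a prompt or decoding issue.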

However, the NPU does appear among the devices currently available to OpenVINO.

[screenshot: list of available OpenVINO devices, including NPU]
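For reference, the same device list can be queried from Python. A minimal sketch, assuming the `openvino` package is installed (`FULL_DEVICE_NAME` is a standard OpenVINO device property):

```python
# Enumerate the devices OpenVINO can see, to confirm that "NPU" is
# actually listed. Falls back to a message if `openvino` is missing.
import importlib.util

if importlib.util.find_spec("openvino") is not None:
    import openvino as ov

    core = ov.Core()
    devices = core.available_devices  # e.g. ['CPU', 'GPU', 'NPU']
    print("Available devices:", devices)
    for d in devices:
        # FULL_DEVICE_NAME expands the short name to the marketing name
        print(d, "->", core.get_property(d, "FULL_DEVICE_NAME"))
else:
    devices = []
    print("openvino is not installed in this environment")
```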

If there is a way to use the NPU, could you tell me how?

Thank you.
