Hello
I want to run an on-device sLM (small language model) on the NPU built into my "Intel(R) Core(TM) Ultra 5" processor.
However, although I confirmed that the code below works on the CPU and the iGPU, no answer is output when the device is set to NPU.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time


def make_template(context):
    # Wrap the meeting transcript in a translation instruction and apply the chat template.
    instruction = f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.
#Context:{context}
#Translation:"""
    messages = [{"role": "user", "content": f"{instruction}"}]
    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")
    return input_ids


def translate(context):
    # Uses the module-level model, tokenizer, and generation settings defined below.
    input_ids = make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)
    # Decode only the newly generated tokens, skipping the prompt.
    answer = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return answer.rstrip()


if __name__ == "__main__":
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    # Works with the CPU and the iGPU; with "npu" no answer is produced.
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens = 1024
    do_sample = False
    temperature = 0.1
    top_p = 0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days? A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''

    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()
    print("generated_text:", generated_text)

    # Token count is approximated by re-tokenizing the decoded answer.
    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    avg_token_speed = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')
    print(f"Total Inference Time : {total_time} s")
    print(f"Average token generation speed: {avg_token_speed:.4f} seconds/token")
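The timing at the end of the script reports seconds per token. When comparing CPU, iGPU, and NPU runs it can help to report tokens per second as well. The following is only a minimal sketch of that calculation; the helper name `throughput_stats` is hypothetical and is not part of optimum-intel or transformers.

```python
def throughput_stats(total_seconds, num_tokens):
    """Return latency (seconds/token) and throughput (tokens/second)."""
    if num_tokens <= 0:
        # Nothing was generated, e.g. the run that produced no answer.
        return {"seconds_per_token": float("inf"), "tokens_per_second": 0.0}
    seconds_per_token = total_seconds / num_tokens
    # Guard against a zero elapsed time from a coarse clock.
    tokens_per_second = num_tokens / total_seconds if total_seconds > 0 else float("inf")
    return {
        "seconds_per_token": seconds_per_token,
        "tokens_per_second": tokens_per_second,
    }


if __name__ == "__main__":
    # Example with made-up numbers: 40 tokens generated in 10 seconds.
    print(throughput_stats(10.0, 40))
```

In the script above, it could be called as `throughput_stats(total_time, num_generated_tokens)` after the generation finishes.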
The NPU does appear in the list of devices available to OpenVINO, though.
Is there a way to run this model on the NPU? If so, could you tell me how?
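For anyone reproducing this, the device list can be checked with something like the sketch below. It assumes the `openvino` Python package is installed and uses its `Core().available_devices` property; the helper name `list_openvino_devices` is hypothetical, and the exact device names returned depend on the machine and driver.

```python
def list_openvino_devices():
    """Return the device names OpenVINO reports, or [] if openvino is missing."""
    try:
        import openvino as ov  # assumption: the openvino package is installed
    except ImportError:
        return []
    # available_devices is a list of strings; the contents depend on the system.
    return list(ov.Core().available_devices)


if __name__ == "__main__":
    devices = list_openvino_devices()
    print("OpenVINO devices:", devices)
    print("NPU listed:", any(name.startswith("NPU") for name in devices))
```

This only confirms that the runtime sees the NPU plugin. It does not show whether a given model can be compiled for and run on it.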
Thank you.