Intel NPU operation related #1081

Open

@Oneul-hyeon

Description

Hello,

I want to run an on-device sLM on the NPU in my "Intel(R) Core(TM) Ultra 5" processor.

However, while I have confirmed that the code below works on the CPU and iGPU, no answer is produced on the NPU.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

def make_template(context):
    instruction = f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.

#Context:{context}

#Translation:"""

    messages = [{"role": "user", "content": instruction}]

    # Build the prompt tensor using the model's chat template
    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")

    return input_ids

def translate(context):
    input_ids = make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)

    # Strip the prompt tokens and decode only the generated continuation
    answer = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)

    return answer.rstrip()

if __name__ == "__main__":
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens = 1024
    do_sample = False
    temperature = 0.1
    top_p = 0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days? A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''

    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()

    print("generated_text:", generated_text)

    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    # Average latency per generated token (seconds/token)
    avg_time_per_token = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')

    print(f"Total Inference Time : {total_time} s")
    print(f"Average time per token: {avg_time_per_token:.4f} seconds/token")
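To make the CPU/iGPU/NPU comparison repeatable, here is a compact debugging sketch. `check_devices` is a hypothetical helper (not part of optimum-intel); it assumes the same packages and model id as the script above and simply reports whether each device returns any new tokens:

```python
# Hypothetical debugging helper: load the same OpenVINO model on each
# device in turn and report how many new tokens it generates.
# Assumes `optimum-intel` and `transformers` are installed and the
# model id from the issue is reachable.
def check_devices(model_id, prompt, devices=("CPU", "GPU", "NPU")):
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for device in devices:
        model = OVModelForCausalLM.from_pretrained(model_id, device=device)
        outputs = model.generate(input_ids, max_new_tokens=8, do_sample=False)
        # Decoder-only models return prompt + continuation, so subtract
        # the prompt length to count newly generated tokens
        n_new = outputs.shape[-1] - input_ids.shape[-1]
        print(f"{device}: generated {n_new} new token(s)")

# Usage (downloads the model, so not run here):
# check_devices("AIFunOver/gemma-2-2b-it-openvino-8bit", "Hello")
```

A device that compiles the model but produces 0 new tokens would confirm the symptom is NPU-specific rather than a prompt or decoding issue.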

However, the NPU does appear among the devices currently available to OpenVINO.

[screenshot: list of available OpenVINO devices, including NPU]
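For reference, the same device list can be queried from Python. A minimal sketch, assuming the `openvino` package is installed (`FULL_DEVICE_NAME` is a standard OpenVINO device property):

```python
# Enumerate the devices OpenVINO can see, to confirm that "NPU" is
# actually listed. Falls back to a message if `openvino` is missing.
import importlib.util

if importlib.util.find_spec("openvino") is not None:
    import openvino as ov

    core = ov.Core()
    devices = core.available_devices  # e.g. ['CPU', 'GPU', 'NPU']
    print("Available devices:", devices)
    for d in devices:
        # FULL_DEVICE_NAME expands the short name to the marketing name
        print(d, "->", core.get_property(d, "FULL_DEVICE_NAME"))
else:
    devices = []
    print("openvino is not installed in this environment")
```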

If there is a way to use the NPU, could you tell me how?

Thank you.
