
CUMULATIVE THROUGHPUT performance drop #1204

Open
@SearchSavior

Description


Hello!

I am testing OpenVINO multi-GPU inference and getting really poor performance. I want to understand why disabling stateful inference causes performance to degrade so severely, and I am interested in contributing.

With stateful disabled and CUMULATIVE_THROUGHPUT enabled I get ~13.77 t/s on 2x Arc A770s.

With stateful enabled and LATENCY I get ~25 t/s on 1x Arc A770.

Test code:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time


model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ns-ov" # Converted with stateful disabled
# model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ov" # Converted with stateful enabled 

device = "AUTO:GPU.0,GPU.1" # Here we are using the AUTO plugin prefix
#device = "GPU.1"

ov_config = {
    "PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT",  # the high-level performance hint for multi-GPU
    # "PERFORMANCE_HINT": "LATENCY"
}

model = OVModelForCausalLM.from_pretrained(model_dir, device=device, ov_config=ov_config)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "This is a test of the multi gpu performance hint.?"

inputs = tokenizer(prompt, return_tensors="pt")


start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
end_time = time.time()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Calculate performance metrics
input_length = len(inputs.input_ids[0])
output_length = len(outputs[0])
new_tokens = output_length - input_length
total_time = end_time - start_time
tokens_per_second = new_tokens / total_time

print(f"Generated {new_tokens} new tokens in {total_time:.2f} seconds")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")

Am I missing something here? Additionally, it's hard to tell on Linux whether the weights are actually distributed across the devices; from htop I can only infer that we are not using CPU/system memory (see the sketch below for one way to check). I want to implement support for this runtime feature in my project OpenArc but have not been able to get it working, especially for models whose compressed weights fit within the VRAM budget but whose uncompressed weights would exceed it.
