Description
Hello!
I am testing OpenVINO for multi-GPU inference and getting really poor performance. I want to understand why disabling stateful inference causes performance to degrade so severely, and I am interested in contributing.
With stateful disabled and CUMULATIVE_THROUGHPUT enabled I get ~13.77 t/s on 2x Arc A770s.
With stateful enabled and LATENCY I get ~25 t/s on 1x Arc A770.
Test code:
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ns-ov"  # Converted with stateful disabled
# model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ov"  # Converted with stateful enabled

device = "AUTO:GPU.0,GPU.1"  # Here we are using the AUTO plugin prefix
# device = "GPU.1"

ov_config = {
    "PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT",  # High-level performance hint for multi-GPU
    # "PERFORMANCE_HINT": "LATENCY",
}

model = OVModelForCausalLM.from_pretrained(model_dir, device=device, ov_config=ov_config)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "This is a test of the multi gpu performance hint.?"
inputs = tokenizer(prompt, return_tensors="pt")

start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
end_time = time.time()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Calculate performance metrics
input_length = len(inputs.input_ids[0])
output_length = len(outputs[0])
new_tokens = output_length - input_length
total_time = end_time - start_time
tokens_per_second = new_tokens / total_time

print(f"Generated {new_tokens} new tokens in {total_time:.2f} seconds")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")
```
Am I missing something here? Additionally, it's hard to tell on Linux whether the weights are actually distributed across both devices; from htop I can only infer that we are not falling back to CPU/system memory. I want to implement support for this runtime feature in my project OpenArc but have not been able to get it working, especially for models whose compressed weights fit within the VRAM budget but whose uncompressed weights exceed it.