Description
Hello!
I am testing OpenVINO for multi-GPU inference and getting really poor performance. I want to understand why disabling stateful inference causes performance to degrade so severely, and I am interested in contributing.
With stateful disabled and CUMULATIVE_THROUGHPUT enabled I get ~13.77 t/s on 2x Arc A770s.
With stateful enabled and LATENCY I get ~25 t/s on 1x Arc A770.
Test code:
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ns-ov"  # Converted with stateful disabled
# model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ov"  # Converted with stateful enabled

device = "AUTO:GPU.0,GPU.1"  # Here we are using the AUTO plugin prefix
# device = "GPU.1"

ov_config = {
    "PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT",  # High-level performance hint for multi-GPU
    # "PERFORMANCE_HINT": "LATENCY",
}

model = OVModelForCausalLM.from_pretrained(model_dir, device=device, ov_config=ov_config)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "This is a test of the multi gpu performance hint.?"
inputs = tokenizer(prompt, return_tensors="pt")

start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
end_time = time.time()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Calculate performance metrics
input_length = len(inputs.input_ids[0])
output_length = len(outputs[0])
new_tokens = output_length - input_length
total_time = end_time - start_time
tokens_per_second = new_tokens / total_time

print(f"Generated {new_tokens} new tokens in {total_time:.2f} seconds")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")
```
Am I missing something here? Additionally, it's hard to tell on Linux whether the weights are actually distributed across both devices; from htop I can only infer that we are not falling back to CPU/system memory. I want to implement support for this runtime feature in my project OpenArc but have not been able to get it working, especially for models whose compressed weights fit within the VRAM budget but whose uncompressed weights exceed it.