Description
To Reproduce
Inference script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed

model_name = "/home/pzl/models/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Initialize the DeepSpeed inference engine with kernel injection
ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.half,
                                     replace_with_kernel_inject=True)

input_text = "DeepSpeed is?"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_length=10)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
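For reference, a minimal variant (an assumption, not part of the original report) that moves the tokenized inputs onto the engine's device before generation, since the script above leaves them on CPU:

```python
# Hypothetical tweak: place inputs on the same device as the injected model.
# ds_engine.module is the underlying Hugging Face model, whose .device
# attribute reflects where init_inference placed its weights.
inputs = {k: v.to(ds_engine.module.device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_length=10)
```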
Expected behavior
The script should print a correct, coherent completion of the prompt.
ds_report output
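For completeness, the environment report can be collected with DeepSpeed's `ds_report` command:

```
ds_report
```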
Screenshots
System info (please complete the following information):
- OS: Ubuntu 22.04
- GPU count and types: 0
- Python version: 3.10