Describe the bug
HybridEngine llama2 70B generation has two bugs:
- When inference_tp_size == 1, the generated results are incorrect.
- When inference_tp_size > 1, generation fails with the following error:
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 593, in _prepare_decoder_attention_mask
expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
The size of tensor a (12) must match the size of tensor b (48) at non-singleton dimension 0
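For context, here is a minimal sketch of the kind of setup that hits this. It is hypothetical, not a verified repro: the checkpoint path, batch size, and config values are placeholders, with the `hybrid_engine` keys taken from the DeepSpeed config docs and `engine.generate` following the hybrid engine's generate wrapper.

```python
# Hypothetical minimal repro sketch -- paths and config values are placeholders.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16)

ds_config = {
    "train_batch_size": 8,  # placeholder; must equal world size * micro batch * grad accum
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,
        # inference_tp_size == 1 -> wrong generations; > 1 -> the shape error above
        "inference_tp_size": 1,
        "max_out_tokens": 256,
    },
}

engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.eval()  # switch the hybrid engine into inference mode

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(engine.device)
with torch.no_grad():
    # With inference_tp_size > 1 this raises the tensor-size mismatch in
    # _prepare_decoder_attention_mask shown above.
    output = engine.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```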
System info (please complete the following information):
- transformers: 4.31.0
- deepspeed: 0.10.3