Dev#1
Conversation
extend the computation of the T5 relative position embedding to work also on input position ids and not just [0,..ntokens-1]
mosheraboh
left a comment
There was a problem hiding this comment.
Hi @michalozeryflato
Looks good, I've added few questions inline.
| values = self.relative_attention_bias(relative_position_bucket) # shape (query_length, key_length, num_heads) | ||
| values = values.permute([2, 0, 1]).unsqueeze(0) # shape (1, num_heads, query_length, key_length) | ||
| if position_ids is not None: | ||
| values = values.permute([0, 3, 1, 2]) |
There was a problem hiding this comment.
Why? can you explain in a comment what each dimension is?
There was a problem hiding this comment.
values.shape was originally (query_length, key_length, num_heads) - see original comment.
I extended it to account for the case when position_ids are given in input.
So shape is (num_batches - optional, query_length, key_length, num_heads).
Added comments in the code.
| memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :] | ||
| device = relative_attention_bias.weight.device | ||
| if position_ids is not None: | ||
| context_position = position_ids.unsqueeze(2) |
There was a problem hiding this comment.
what is the shape before unsqueeze?
There was a problem hiding this comment.
When position_ids are given we have additional batch dimension: shape =(num_batches, key_and_query_length)
I added this in a comment:
context_position = position_ids.unsqueeze(2) # shape (num_batches, key_and_query_lnegth) -> (num_batches, key_and_query_lnegth, 1)
memory_position = position_ids.unsqueeze(1) # shape (num_batches, key_and_query_lnegth) -> (num_batches, 1, key_and_query_lnegth)
| def __init__(self, config): | ||
| super().__init__() | ||
| self.EncDecAttention = T5Attention(config, has_relative_attention_bias=False) | ||
| self.EncDecAttention = T5Attention(config, relative_position_embedding_definitions=None) |
There was a problem hiding this comment.
So you didn't play with this attention mechanism?
There was a problem hiding this comment.
I extended T5Attention to support various relative position encodings either original, None, New (e.g. with position ids), or any combination of original and new.
There was a problem hiding this comment.
In this cases = Note that it always used has_relative_attention_bias=False, so I kept the same performance - did not add the relative position embedding when it did not exist before. We can discuss this
| output_attentions: Optional[bool] = None, | ||
| output_hidden_states: Optional[bool] = None, | ||
| return_dict: Optional[bool] = None, | ||
| encoder_position_ids_dict: Optional[Dict[str, Tuple[torch.LongTensor,str]]] = None, |
There was a problem hiding this comment.
What is the shape of this tensor?
There was a problem hiding this comment.
added a comment:
shape of each LongTensor (num_batches, n_input_tokens)
What does this PR do?
Fixes # (issue)
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.