Dev #1

Draft
michalozeryflato wants to merge 5 commits into main from dev

@michalozeryflato

Owner

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Extend the computation of the T5 relative position embedding
so that it also works with input position ids,
not just [0, ..., ntokens-1].
@michalozeryflato michalozeryflato marked this pull request as ready for review October 4, 2023 08:20

@mosheraboh mosheraboh left a comment

Collaborator

Hi @michalozeryflato,
Looks good. I've added a few questions inline.

values = self.relative_attention_bias(relative_position_bucket)  # shape (query_length, key_length, num_heads)
values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
if position_ids is not None:
    values = values.permute([0, 3, 1, 2])

Collaborator

Why? Can you explain in a comment what each dimension is?

Owner Author

values.shape was originally (query_length, key_length, num_heads); see the original comment.
I extended it to cover the case where position_ids are passed as input,
so the shape is now (num_batches [optional], query_length, key_length, num_heads).
I added comments in the code.
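
The dimension bookkeeping described above can be checked without running the model. This is only a rough sketch: `permute_shape` is a hypothetical helper (not part of the PR or of PyTorch) that mimics how `permute` reorders a shape tuple, the sizes are made up, and it assumes the two `permute` calls in the diff are alternative paths rather than applied one after the other.

```python
def permute_shape(shape, dims):
    """Return the shape that tensor.permute(dims) would produce (hypothetical helper)."""
    if sorted(dims) != list(range(len(shape))):
        raise ValueError("dims must be a permutation of the axes")
    # Output axis i takes its size from input axis dims[i].
    return tuple(shape[d] for d in dims)


# Illustrative sizes only; none of these numbers come from the PR.
query_length, key_length, num_heads, num_batches = 4, 4, 8, 2

# Original path (no position_ids):
# (query_length, key_length, num_heads)
#   -> permute([2, 0, 1]) -> (num_heads, query_length, key_length)
#   -> unsqueeze(0)       -> (1, num_heads, query_length, key_length)
original = (1,) + permute_shape((query_length, key_length, num_heads), [2, 0, 1])

# Assumed position_ids path:
# (num_batches, query_length, key_length, num_heads)
#   -> permute([0, 3, 1, 2]) -> (num_batches, num_heads, query_length, key_length)
batched = permute_shape(
    (num_batches, query_length, key_length, num_heads), [0, 3, 1, 2]
)

print(original)
print(batched)
```

Under these assumptions both paths end in the same layout, (batch, num_heads, query_length, key_length), and differ only in whether the leading dimension is 1 or num_batches.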

memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
device = relative_attention_bias.weight.device
if position_ids is not None:
    context_position = position_ids.unsqueeze(2)

Collaborator

What is the shape before unsqueeze?

Owner Author

When position_ids are given, we have an additional batch dimension: shape = (num_batches, key_and_query_length).
I added this in a comment:

context_position = position_ids.unsqueeze(2)  # shape (num_batches, key_and_query_length) -> (num_batches, key_and_query_length, 1)
memory_position = position_ids.unsqueeze(1)  # shape (num_batches, key_and_query_length) -> (num_batches, 1, key_and_query_length)
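
To illustrate what the two unsqueeze calls set up, here is a minimal pure-Python sketch of the pairwise offsets they lead to. It assumes the PR keeps T5's usual `memory_position - context_position` subtraction, which is not shown in this thread; `relative_positions`, `default_ids`, and `custom_ids` are hypothetical names used only for this example.

```python
def relative_positions(position_ids):
    """Pairwise position offsets per batch entry.

    position_ids: nested list of shape (num_batches, seq_len).
    Returns a nested list of shape (num_batches, seq_len, seq_len) with
    out[b][q][k] = position_ids[b][k] - position_ids[b][q].

    Assumed PyTorch equivalent:
    position_ids.unsqueeze(1) - position_ids.unsqueeze(2)
    """
    return [
        [[k_pos - q_pos for k_pos in row] for q_pos in row]
        for row in position_ids
    ]


# Default positions [0, ..., ntokens-1] reproduce the usual offset matrix.
default_ids = [[0, 1, 2, 3]]
default_rel = relative_positions(default_ids)

# Custom position ids (made-up values) give offsets that reflect the gap.
custom_ids = [[0, 1, 5, 6]]
custom_rel = relative_positions(custom_ids)

print(default_rel[0][1])
print(custom_rel[0][1])
```

With the default ids the offset between neighbouring tokens is always 1, whereas with the custom ids the offset from the second token to the third becomes 4, which is the kind of case the batch dimension is meant to carry.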

 def __init__(self, config):
     super().__init__()
-    self.EncDecAttention = T5Attention(config, has_relative_attention_bias=False)
+    self.EncDecAttention = T5Attention(config, relative_position_embedding_definitions=None)

Collaborator

So you didn't play with this attention mechanism?

Owner Author

I extended T5Attention to support several relative position encodings: the original one, none, a new one (e.g. based on position ids), or any combination of original and new.

Owner Author

In this case, note that it always used has_relative_attention_bias=False, so I kept the same behavior: I did not add a relative position embedding where none existed before. We can discuss this.

output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
encoder_position_ids_dict: Optional[Dict[str, Tuple[torch.LongTensor,str]]] = None,

Collaborator

What is the shape of this tensor?

Owner Author

Added a comment:
shape of each LongTensor: (num_batches, n_input_tokens)

@michalozeryflato michalozeryflato marked this pull request as draft October 24, 2023 15:50
2 participants