Skip to content

Feature/mlm #22

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Feature/mlm #22

wants to merge 3 commits into from

Conversation

bobbycxy
Copy link
Collaborator

@bobbycxy bobbycxy commented Jun 5, 2024

  1. Added: masked language modelling (mlm) dataloader - called MLMDataloader - that masks inputs (80% masked, 10% randomised, 10% untouched),
  2. updated the cross_entropy to ignore ignore_index=-1, because -1 is the masked token id we'll use in the target (y) labels.
  3. updated the build_trainer.py to have MLMDataloader in the dataloader dictionary.
  4. created a mlm specific config.yaml file.

@DylanASHillier
Copy link
Collaborator

@LeonGuertler this is a beautful example of how a PR should be --> list of changes

@@ -75,7 +75,7 @@ def compute_perplexity(logits, y, char_lengths, mask=None):
# flatten both
logits = logits.view(-1, logits.size(-1))
y = y.view(-1)
loss = torch.nn.functional.cross_entropy(logits, y, reduction="none")
loss = torch.nn.functional.cross_entropy(logits, y, reduction="none", ignore_index=-1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are 100% sure this has no impacts elsewhere? should be okay but..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Ignore_index's default value is -100. So, changing it from -100 to -1 would only impact existing label (y) assignments of -1 or -100.

  1. I searched our code base for the assignment of -100, there was none.
  2. I searched our code base for the assignment of -1. Apart from the new MLMDataLoader using label[~mask]=-1, other assignments were used in arguments such as 'dim=-1', or just as index values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

embedder:
tokenizer_type: gpt2
embedding_model_type: generic
dataset_name: simple_en_wiki
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to stlm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, can I clarify what should be changed to stlm?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dataset_name, but its okay, this will all be reworked later

@DylanASHillier
Copy link
Collaborator

I propose closing this, as we decided not to do MLM for now... feel free to rework this if you like, should be pretty simple to update, and happy to merge in since it doesn't add much code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants