Feature/mlm #22

bobbycxy · 2024-06-05T09:12:11Z

Added a masked language modelling (MLM) dataloader, MLMDataloader, that masks inputs (80% masked, 10% randomised, 10% untouched).
Updated cross_entropy to use ignore_index=-1, because -1 is the id we'll use in the target (y) labels for positions that should be excluded from the loss.
Updated build_trainer.py to include MLMDataloader in the dataloader dictionary.
Created an MLM-specific config.yaml file.
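
For readers unfamiliar with the 80/10/10 scheme, here is a minimal, dependency-free sketch of how such masking is commonly done. It is not the MLMDataloader code from this PR: the function name `mask_tokens`, the parameters `mask_token_id`, `vocab_size`, and the 15% selection rate `select_prob` are all assumptions made for illustration. Only the 80/10/10 split and the use of -1 as the ignored label come from the description above.

```python
import random

IGNORE_INDEX = -1  # label value the loss is told to skip (taken from this PR)


def mask_tokens(token_ids, mask_token_id, vocab_size, select_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence of token ids.

    Hypothetical helper: select_prob and mask_token_id are assumptions,
    not values read from the PR.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    # Every position starts as "ignored"; only selected positions get a real label.
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the input token untouched
    return inputs, labels


# Usage with made-up ids: a 1000-token vocabulary whose last id acts as the mask.
inputs, labels = mask_tokens([5, 17, 42, 8, 99], mask_token_id=999, vocab_size=1000)
```

Positions that were not selected keep their original input token and carry the label -1, which is what lets the loss skip them.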

DylanASHillier · 2024-06-05T09:59:15Z

@LeonGuertler this is a beautiful example of how a PR should be: a list of changes

DylanASHillier · 2024-06-05T09:59:58Z

trainers/loss_fn.py

@@ -75,7 +75,7 @@ def compute_perplexity(logits, y, char_lengths, mask=None):
    # flatten both
    logits = logits.view(-1, logits.size(-1))
    y = y.view(-1)
-    loss = torch.nn.functional.cross_entropy(logits, y, reduction="none")
+    loss = torch.nn.functional.cross_entropy(logits, y, reduction="none", ignore_index=-1)


Are we 100% sure this has no impact elsewhere? Should be okay, but...

Yes. The default value of ignore_index is -100, so changing it from -100 to -1 would only affect existing label (y) assignments of -1 or -100.

I searched our code base for assignments of -100; there were none.

I searched our code base for assignments of -1. Apart from the new MLMDataloader using label[~mask]=-1, the other occurrences were in arguments such as 'dim=-1', or as index values.
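
To make the effect of the change concrete, here is a dependency-free sketch that mimics the documented behaviour of `ignore_index` when `reduction="none"`: ignored positions get a loss of exactly 0 and the rest get the usual negative log-softmax. The helper `cross_entropy_none` and the toy logits are made up for illustration; this is not the code in trainers/loss_fn.py.

```python
import math


def cross_entropy_none(logits, targets, ignore_index=-100):
    """Per-position cross-entropy, mimicking reduction="none" (illustrative only)."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            losses.append(0.0)  # ignored positions contribute no loss
            continue
        # Numerically stable log-sum-exp over the vocabulary dimension.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[target])
    return losses


# Toy batch: three positions, vocabulary of three; the middle label is -1.
logits = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2], [0.0, 0.0, 0.0]]
targets = [0, -1, 2]
losses = cross_entropy_none(logits, targets, ignore_index=-1)

# Averaging over every position dilutes the loss with the zeros;
# averaging over the kept positions does not.
kept = [l for l, t in zip(losses, targets) if t != -1]
mean_over_kept = sum(kept) / len(kept)
mean_over_all = sum(losses) / len(losses)
```

One thing to watch with `reduction="none"`: the zeros for ignored positions still count toward a plain mean over all positions. Whether compute_perplexity is affected depends on how it aggregates `loss` after this line, which the hunk above does not show.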

DylanASHillier · 2024-06-05T10:00:19Z

configs/full_configs/mlm_baseline.yaml

+  embedder:
+    tokenizer_type: gpt2
+    embedding_model_type: generic
+    dataset_name: simple_en_wiki


change this to stlm

sorry, can I clarify what should be changed to stlm?

the dataset_name, but it's okay, this will all be reworked later

DylanASHillier · 2024-09-05T07:20:59Z

I propose closing this, as we decided not to do MLM for now... feel free to rework this if you like, should be pretty simple to update, and happy to merge in since it doesn't add much code

bobbycxy added 3 commits June 5, 2024 14:43

added feature/mlm

0c7952f

Merge branch 'main' of https://github.com/LeonGuertler/SuperTinyLanguageModels into feature/mlm

7c21f42

added mlm

805ed7c

bobbycxy assigned DylanASHillier Jun 5, 2024

DylanASHillier reviewed Jun 5, 2024
