After reading this document, please look at this notebook.
Welcome to your last exercise in this learning journey 🙂 We provide you with a notebook containing an example of how to load a pretrained model with the Hugging Face library and use it for generation tasks. We would like you to do the same with other models and evaluate them on other datasets.
In particular, in this exercise you will master loading pretrained models (like T5), writing your own decoding algorithms, and investigating what happens under the hood by interpreting the models' decisions. You will become (almost) an expert on three tasks: machine translation, summarization and question answering.
The notebook contains some guiding examples based on the BART model. BART is a pretrained model that needs fine-tuning on the target task to perform well. The cool thing about T5 is that it is trained jointly on many tasks, both supervised and unsupervised, such as language modeling, translation, summarization and question answering, by recasting all tasks as "text-to-text" problems. "For example, automatic summarization is done by feeding in a document followed by the text “Summarize:” and then the summary is predicted via autoregressive decoding." In this exercise you will have to add those task tokens to the model input yourself to be able to use it as a summarization model.
git clone https://github.com/huggingface/transformers.git
%cd transformers
pip install .
- You will use the `t5-small` pretrained model from Hugging Face: https://huggingface.co/t5-small
- The documentation of the T5 class can be found here: https://huggingface.co/transformers/model_doc/t5.html
To learn more about the T5 model:
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (https://arxiv.org/pdf/1910.10683.pdf)
- Stanford guest lecture about T5: http://web.stanford.edu/class/cs224n/slides/cs224n-2021-lecture14-t5.pdf
- Translation: `bible_para` (https://huggingface.co/datasets/bible_para), `ted_talks_iwslt` (https://huggingface.co/datasets/ted_talks_iwslt)
- Summarization: `cnn-dailymail` (https://huggingface.co/datasets/cnn_dailymail)
- Question answering: BoolQ dataset (https://huggingface.co/datasets/boolq)
Note: t5-small cannot handle sequences longer than 512 tokens (max_length); you need to preprocess your datasets accordingly, as done above in the tokenizer.
For each task you should prepend a task-specific prefix to the input (e.g. "translate English to German: " to translate an English input into German). The task prefixes can be found in the config: https://huggingface.co/t5-small/blob/main/config.json
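As an illustration of the prefixing step, here is a minimal sketch. The helper name `add_task_prefix` and the `TASK_PREFIXES` table are hypothetical, and the prefix strings are assumptions that should be checked against the config linked above.

```python
# Hypothetical lookup table: verify every prefix string against
# task_specific_params in the t5-small config.json before relying on it.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}


def add_task_prefix(text: str, task: str) -> str:
    """Prepend the task prefix so the model knows which task to perform."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"no prefix registered for task {task!r}")
    return TASK_PREFIXES[task] + text.strip()


print(add_task_prefix("The house is wonderful.", "translation_en_to_de"))
```

The prefixed string is what you would then pass to the tokenizer.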
OUTPUT: print some examples from the test/validation split of each task, showing the input, the model output and the target reference.
- Summarization: ROUGE (https://www.aclweb.org/anthology/W04-1013.pdf)
- MT: BLEU (https://www.aclweb.org/anthology/P02-1040/)
- Question answering: exact match and macro-F1 (https://arxiv.org/pdf/1606.05250.pdf)
- For these metrics you will need a tokenizer; you can use an existing implementation of the Moses tokenizer.
- Implement the evaluation metrics: BLEU, ROUGE
- Select ~1000 sentences from each of the datasets (use the `test` split when available, or the `validation` split otherwise)
- Table1: Evaluate your model on those metrics
- Table2: As a sanity check of your implementation, use existing online implementations of those metrics and compare their results with those of your implementation.
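To make the metric definitions concrete, here is a rough sketch of ROUGE-N and an unsmoothed, single-reference, sentence-level BLEU over already tokenized input. It is only a starting point under those assumptions: the BLEU paper defines the score at corpus level, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall, precision and F1 from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


def sentence_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times the brevity penalty.

    Unsmoothed: returns 0.0 as soon as one n-gram order has no match.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum((cand & ref).values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(rouge_n("the cat".split(), reference, n=1))
print(sentence_bleu(reference, reference))
```

Comparing these numbers with an existing implementation on the same tokenization is exactly the sanity check asked for in Table2.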
Now you are not allowed to use the existing implementation of the function model.generate. Read here about the different usages of this function, including many decoding algorithms: beam search, sampling, top-k and nucleus sampling.
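The sampling strategies mentioned above all reshape the next-token distribution before drawing from it. The sketch below shows this on a plain list of logits for a single decoding step. It is an illustration, not the model.generate implementation, and all function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p,
    zero out the rest and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token id after temperature scaling and nucleus filtering."""
    probs = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


print(nucleus_filter([0.5, 0.3, 0.2], top_p=0.7))
```

In a full decoder this step would be repeated token by token, feeding each sampled token back into the model until the end-of-sequence token appears.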
- Implement a beam search generation function that takes the beam size as a function parameter.
- Implement a nucleus sampling function that samples from a model, taking top-p as a function parameter.
- Implement a softmax-with-temperature function that samples from a model, taking the temperature (t) as a function parameter.
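For beam search, the bookkeeping is easiest to see on a toy model. In the sketch below, `step_fn` is a hypothetical stand-in for one decoder step (in the exercise it would wrap a forward pass of the model), and `toy_step` is an invented distribution chosen so that greedy decoding misses the best sequence. Length normalisation is left out on purpose.

```python
import math


def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Return the best (sequence, log-probability) pair found by beam search.

    step_fn(prefix) maps a tuple of tokens to {next_token: probability}.
    """
    beams = [((bos,), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                if p > 0:
                    candidates.append((seq + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses ending in eos leave the beam and are stored.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses still compete at the end
    return max(finished, key=lambda c: c[1])


def toy_step(prefix):
    """Invented next-token distributions, for illustration only."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.5, "y": 0.5},
        ("<s>", "b"): {"z": 0.9, "x": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})


print(beam_search(toy_step, "<s>", "</s>", beam_size=1, max_len=5))
print(beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5))
```

With a beam size of 1 the search commits to "a" (probability 0.6) and ends with total probability 0.3, while a beam size of 2 keeps "b" alive and finds the sequence with probability 0.36.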
- Table1: Check the correctness of your implementation. In a table, compare the results obtained from the model.generate function with those of your implementation for different beam sizes (beam search) and top-p values (nucleus sampling).
- Table2: Compare different decoding methods. For summarization, machine translation and question answering, try different decoding methods, for example by changing the top-p value in nucleus sampling, the temperature of the softmax, and the beam size in beam search (for this table only, you are allowed to use model.generate and existing implementations of the evaluation metrics).
- Short report (300 words max): Given the results you obtained above, write a short report containing your conclusions on which decoding algorithm and parameters are best for each task. Why do you think they are the best? Does increasing the beam size usually give better scores? Why or why not?
The goal of this exercise is to understand whether (and how) attention can be used to interpret the model's behaviour.
Select several examples for each task and manually examine the attention patterns for each of those tasks. What are your observations? Are there any differences in the attention patterns? Are there any common patterns?
Plots: You are expected to output plots similar to those in this blogpost (section attention visualization).
We expect you to produce at least three plots showing the following:
- Visualize the attention matrices for each head and each layer.
- Aggregate the attention values across heads/layers.
- Consider examples from different categories that take into account model performance (hard vs. easy examples), input length and different tasks.
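The simplest aggregation in the list above is a plain average over heads, or over heads and layers. The sketch below assumes the attention weights have already been converted to nested Python lists indexed as [layer][head][query][key]. With Hugging Face models, such weights are typically obtained by requesting `output_attentions=True`, but the conversion step is left out here and the function names are illustrative.

```python
def mean_over_heads(layer_attention):
    """Average a [head][query][key] attention tensor over its heads."""
    n_heads = len(layer_attention)
    n_q, n_k = len(layer_attention[0]), len(layer_attention[0][0])
    return [
        [sum(layer_attention[h][q][k] for h in range(n_heads)) / n_heads
         for k in range(n_k)]
        for q in range(n_q)
    ]


def mean_over_layers_and_heads(attention):
    """Average a [layer][head][query][key] tensor over layers and heads."""
    per_layer = [mean_over_heads(layer) for layer in attention]
    # The per-layer matrices are averaged the same way heads were.
    return mean_over_heads(per_layer)


layer = [
    [[1.0, 0.0], [0.0, 1.0]],  # head 0
    [[0.0, 1.0], [0.0, 1.0]],  # head 1
]
print(mean_over_heads(layer))
```

Because every row is a probability distribution, the averaged rows still sum to 1, so the result can be plotted with the same heatmap code as a single head.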
Short report (max 300 words): Below each of the attention plots above, add your comments highlighting the patterns you observe, e.g. common or different patterns across tasks, how those patterns change across layers, individual attention heads versus aggregated attention patterns, and any other observations.
Manual examination gives an intuition of what attention patterns look like. Aggregation metrics make it possible to draw corpus-wide conclusions about the roles of different attention heads. Check [this paper](https://www.aclweb.org/anthology/P19-1580/) for more details. Implement one or two of the "aggregation" metrics proposed in that paper or in [this other paper](https://aclanthology.org/2021.findings-acl.250/). Compare the attention patterns across the tasks.
- Plots and short report: Implement one of these methods for attention aggregation, produce 3 plots showing some of the aspects above, and discuss what you learn from the aggregated attentions.
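As one possible corpus-level score, the sketch below computes a confidence-style value per head: the maximum attention weight of each query position, averaged over positions and then over examples. Treat it as an assumption-laden illustration. The exact definitions (for instance, which tokens are excluded) should be taken from the papers cited above, and the function names are hypothetical.

```python
from collections import defaultdict


def head_confidence(head_attention):
    """Mean over query positions of the largest attention weight in each row."""
    return sum(max(row) for row in head_attention) / len(head_attention)


def rank_heads(examples):
    """Rank heads by their average confidence over a corpus.

    examples: list of attention tensors indexed as [layer][head][query][key].
    Returns [((layer, head), score), ...] sorted from most to least confident.
    """
    totals = defaultdict(float)
    for attention in examples:
        for l, layer in enumerate(attention):
            for h, head in enumerate(layer):
                totals[(l, h)] += head_confidence(head)
    scores = {key: total / len(examples) for key, total in totals.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


example = [[
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0, head 0: sharply peaked
    [[0.5, 0.5], [0.5, 0.5]],  # layer 0, head 1: uniform
]]
print(rank_heads([example]))
```

Running the ranking separately on translation, summarization and question answering inputs gives one score table per task, which can then be plotted side by side.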
- Take any available model on Hugging Face that was trained or fine-tuned specifically for one of the tasks mentioned above (translation, summarization, question answering).
- Perform task 1 and task 2 with those task-adapted models and compare their performance and behaviour to T5.
- Table: On a single task, compare task 1 and task 2 using several of the evaluation metrics and interpretability measures from above (you can use existing implementations of those metrics).
- Short report (300 words max): Comment on what is common and what is different between these models in terms of interpretability and evaluation metrics. Does the fine-tuned model perform better than the T5 model, which was trained on all tasks together? Why would you use one instead of the other?
Neural language generation models are silly: the sequence they consider most likely is often the empty sequence (`<s></s>`). This problem is demonstrated in the following paper: On NMT Search Errors and Model Errors: Cat Got Your Tongue?.
This problem is puzzling many scientists at the moment. One way to overcome it is to sample many outputs from the model and rank them according to their pairwise utility. This is a tractable approximation of a method called Minimum Bayes Risk (MBR) decoding, which has been proposed in the recent work Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation.
In this bonus task we ask you to implement this decoding method, as you did for the decoding methods above. You can use any utility function of your choice; in the paper they use METEOR, and a Python implementation is available online, e.g. here: https://pypi.org/project/textmetrics/
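The sample-and-rank idea can be sketched as follows. Note the swap: instead of METEOR, the example uses a simple unigram F1 as the utility so that it stays self-contained. The sample strings are invented, and in the exercise they would be drawn from the model.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Stand-in utility (not METEOR): F1 over whitespace-separated tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility):
    """Pick the sample with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, cand in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        score = sum(utility(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


# Invented samples, including an empty output.
samples = ["the cat sat", "the cat sat down", "the cat", ""]
print(mbr_decode(samples, unigram_f1))
```

The empty string has zero utility against every other sample, so it is never selected here, which is the behaviour the method is meant to provide.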
- Table: Compare MBR decoding with beam search (beam size = 5, 10 and 15) on the machine translation task above.
- Short report (300 words max): Given the results you obtained above, write a short report containing your conclusions. Which decoding algorithm and parameters are the best? Why is that?
Overall you have two tasks with 8 deliverables, 3 of which are optional:
- Deliverable 1.1 (2 pt)
- Deliverable 1.2 (3 pt)
- Deliverable 1.3 (8 pt)
- Deliverable 2.1 (3 pt)
- Deliverable 2.2 (4 pt)
- Deliverable Bonus 1 (3 pt)
- Deliverable Bonus 2 (5 pt)
- All deliverables are expected to be submitted in a single Colab notebook.
- In your notebook, please highlight each deliverable by its title (e.g. # Deliverable 1.2).
- Please stick to the format of each deliverable (a table, a short report or a plot) as specified above.
- Please name your notebook in the following format: DSBA_EXCERCISE3_FIRSTNAME_LASTNAME (where the first name and last name are those of the person who submits the exercise on behalf of the team).
- Please make sure that your notebook is publicly accessible through the provided URL.
Submit your exercise by filling in the following form (one submission per team): https://forms.gle/8439iGzRF8fZ9GgT6
