- Clone the repository
- Install the following dependencies:
  - NumPy
  - PyTorch (torch)
  - scikit-learn
  - transformers
  - evaluate
  - nnsight
  - PySpark
  - pandas
  - Matplotlib
  - seaborn
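Assuming the standard PyPI package names (the original list does not state them explicitly), the dependencies can be installed in one step with pip:

```shell
pip install numpy torch scikit-learn transformers evaluate nnsight pyspark pandas matplotlib seaborn
```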
Twitter Emotion Dataset: https://www.kaggle.com/datasets/adhamelkomy/twitter-emotion-dataset/data
The models are available here: https://drive.google.com/drive/folders/1D7eiUCLUJFTEy0FTlyFbQirYR2dbRhTw?usp=drive_link

There are two folders:
- model/ - used for all predictions, run tests, etc.
- model_with_tokenizer/ - used primarily in token_analysis

Ensure the two model folders are inside code/.
- Run head_masking_samples.py to obtain the samples (the results are stored in data/samples if you do not want to install PySpark and run the file).
- Run head_masking.py to process the sentences obtained from step 1 on each variant. The results are stored as .npy files in the execution directory under the names head_masking_class_prob_diff.npy and multi_head_masking_class_prob_diff.npy.
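Once head_masking.py has finished, the saved arrays can be inspected with NumPy. A minimal sketch (the dummy array and its shape below are illustrative, not taken from the repo; the real files are produced by head_masking.py):

```python
import numpy as np

# Illustrative: write a dummy array so the load step below is self-contained.
# In practice head_masking.py creates this file in the execution directory.
dummy = np.zeros((12, 12, 6))  # hypothetical shape, e.g. layers x heads x classes
np.save("head_masking_class_prob_diff.npy", dummy)

# Load the per-head class-probability differences for inspection.
diffs = np.load("head_masking_class_prob_diff.npy")
print(diffs.shape)
```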
- Connect to a Python/Jupyter kernel that has the libraries mentioned in the requirements.
- Run the first five cells of code/token_analysis.ipynb. The remainder of the notebook compares the heatmaps of two sentences; feel free to experiment with custom sentences.
- Run common_words.py to obtain the 10 most common words per class and replace them with synonyms (the results are stored in data/replaced_words_text.csv if you do not want to install PySpark and run the file).
- Run the cells of code/synonym_analysis.ipynb to get the probabilities of the original and synonym-replaced sentences.
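The common-words counting step can be sketched without Spark using only the standard library. The toy sentences and class labels below are illustrative stand-ins for the tweet dataset, and top-2 is used instead of the script's top-10 for brevity:

```python
from collections import Counter

# Toy data: (sentence, emotion class) pairs standing in for the tweet dataset.
samples = [
    ("i feel so happy today", "joy"),
    ("happy happy joy", "joy"),
    ("i am so angry right now", "anger"),
    ("angry and frustrated", "anger"),
]

# Count word frequencies separately for each class.
counts = {}
for text, label in samples:
    counts.setdefault(label, Counter()).update(text.split())

# Take the most common words per class (the real script takes the top 10).
top_per_class = {label: [w for w, _ in c.most_common(2)] for label, c in counts.items()}
print(top_per_class)
```

These top words would then be swapped for synonyms before re-scoring the sentences, as synonym_analysis.ipynb does.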
Note: PySpark is required only for sample collection.