Codebase for a project on evaluating supervised and zero-shot approaches to text classification.
To reproduce the environment, install the packages from the YAML file using conda:

    conda env create --file environment.yaml

- `figures/` - output figures.
- `notebooks/` - notebooks to collect the data and run experiments.
  - `i-data-preparation.ipynb` - collect data for the project.
  - `ii-classic-supervised-baselines.ipynb` - establish baselines using "classic" supervised learning approaches to text classification, including Naive Bayes, Logistic Regression and Support Vector Machines.
  - `iii-neural-supervised-baselines.ipynb` - establish baselines using neural supervised learning approaches to text classification, including fastText, Convolutional Neural Network (CNN) and DeBERTa Transformer.
  - `iv-zero-shot-experiments.ipynb` - experiment with DeBERTa for zero-shot classification.
  - `v-reporting-results.ipynb` - create output tables and figures for the paper.
- `src/` - codebase for conducting experiments using different frameworks.
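As a rough illustration of what the "classic" supervised baselines could look like, the sketch below fits Naive Bayes, Logistic Regression and a linear Support Vector Machine on TF-IDF features with scikit-learn. It is not taken from the notebooks: the function names `build_baselines` and `fit_and_score`, the TF-IDF features, the hyperparameters and the toy data are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_baselines():
    """Return one text-classification pipeline per baseline (hypothetical setup)."""
    return {
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "logistic_regression": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        ),
        "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
    }


def fit_and_score(train_texts, train_labels, test_texts, test_labels):
    """Fit every baseline and return its accuracy on the test split."""
    scores = {}
    for name, model in build_baselines().items():
        model.fit(train_texts, train_labels)
        scores[name] = model.score(test_texts, test_labels)
    return scores


# Toy data, made up for the example (1 = positive, 0 = negative).
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a dull and boring mess",
    "terrible plot and awful pacing",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a great and wonderful story", "boring and awful"]
test_labels = [1, 0]

scores = fit_and_score(train_texts, train_labels, test_texts, test_labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Wrapping the vectoriser and the classifier in one pipeline keeps the vocabulary fitted on the training split only, which avoids leaking test-set statistics into the features.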
| # | Name | Task | Train/Validation/Test Examples | Test Set Undersampling<sup>1</sup> | Target Cardinality | Source |
|---|---|---|---|---|---|---|
| 1 | Rotten Tomatoes | Sentiment Analysis | 8,530 / 1,066 / 1,066 | No | 2 | HuggingFace Datasets |
| 2 | IMDb | Sentiment Analysis | 25,000 / 0 / 25,000 | Yes | 2 | HuggingFace Datasets |
| 3 | Yelp-2 | Sentiment Analysis | 560,000 / 0 / 38,000 | Yes | 2 | HuggingFace Datasets |
| 4 | Yelp-5 | Sentiment Analysis | 650,000 / 0 / 50,000 | Yes | 5 | HuggingFace Datasets |
| 5 | SST-5 | Sentiment Analysis | 8,544 / 1,101 / 2,210 | No | 5 | HuggingFace Datasets |
| 6 | DynaSent | Sentiment Analysis | 13,065 / 720 / 720 | No | 3 | HuggingFace Datasets |
| 7 | AG News | News Categorisation | 120,000 / 0 / 7,600 | No | 4 | HuggingFace Datasets |
| 8 | 20 Newsgroups | News Categorisation | 11,314 / 0 / 7,532 | No | 20 | Scikit-learn Datasets |
| 9 | DBpedia14 | Topic Classification | 560,000 / 0 / 70,000 | Yes | 14 | HuggingFace Datasets |
| 10 | Web of Science | Topic Classification | 46,985 / 0 / 0<sup>2</sup> | No | 134 | Mendeley Data |
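The table marks which test sets were undersampled but does not describe the procedure. Purely as an illustration, the sketch below draws a fixed-size random subset per class, which is one common way to shrink a large test set while keeping every label represented. The function name `undersample`, the `per_class` parameter and the seed are hypothetical and may not match what the project actually does.

```python
import random
from collections import defaultdict


def undersample(examples, per_class, seed=0):
    """Keep at most `per_class` randomly chosen examples for each label.

    `examples` is a list of (text, label) pairs. This is an assumed
    procedure for illustration, not the project's documented method.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the examples by label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Sample each group without replacement, capped at the group size.
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        subset.extend(rng.sample(group, min(per_class, len(group))))

    rng.shuffle(subset)  # avoid returning the examples ordered by label
    return subset


# Toy test set, made up for the example: 10 positive and 3 negative reviews.
toy_test_set = [(f"positive review {i}", "pos") for i in range(10)]
toy_test_set += [(f"negative review {i}", "neg") for i in range(3)]

subset = undersample(toy_test_set, per_class=3)
print(len(subset))
```

Sampling per class rather than over the whole set guarantees that rare labels are not dropped by chance, which matters for the datasets with many target classes.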
The figures below report the performance on the test set.