In this research project, we explore the effectiveness of Label Semantic Aware Pre-training (LSAP) on few-shot intent classification tasks. Our aim is to implement the LSAP technique on a series of T5-small models and evaluate their performance across diverse few-shot settings. The original Label Semantic Aware Pre-training paper can be found here.
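As a rough illustration of the idea, label-semantic-aware pre-training turns an (utterance, intent) pair into a text-to-text example whose target is the label's natural-language words. The prompt format below is an assumption for illustration, not the paper's exact format:

```python
def lsap_example(utterance, intent_label):
    # Turn a snake_case intent label into natural-language label words and
    # build a T5-style source/target pair (the prompt prefix is an assumption).
    label_words = intent_label.replace("_", " ")
    source = f"intent classification: {utterance}"
    return source, label_words
```

Training the model to generate the label words, rather than an opaque class index, is what lets label semantics transfer to unseen intents in the few-shot setting.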
This project is managed with Poetry, a dependency manager that replaces pip and handles virtual environments for you.
- Install Poetry (PowerShell):
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
- Add Poetry to PATH by appending this directory (replace NAME with your Windows username):
C:\Users\NAME\AppData\Roaming\pypoetry\venv\Scripts
- Run in project directory:
poetry config virtualenvs.in-project true
poetry install
- To activate the virtual environment, run:
poetry shell
- To deactivate the virtual environment, run:
exit
Alternatively, if pip is installed and configured on your system, you can install the dependencies with pip:
pip install -r requirements.txt
To generate data from scratch:
cd scripts
sh generate_data.sh
To pretrain models (requires configuration based on environment):
cd scripts
sh pretrain.sh
The training arguments can be changed inside pretrain.sh to replicate the different models attempted in our paper.
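To see how such arguments are typically consumed, here is a minimal argparse sketch of a training script's command line. The flag names and defaults are assumptions for illustration, not the actual contents of pretrain.sh:

```python
import argparse

def build_parser():
    # Hypothetical subset of the flags pretrain.sh might forward to the
    # training script; names and defaults here are assumptions.
    parser = argparse.ArgumentParser(description="LSAP pre-training (sketch)")
    parser.add_argument("--model_name_or_path", default="t5-small")
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--num_train_epochs", type=int, default=3)
    return parser

# Example: override only the learning rate and keep the other defaults.
args = build_parser().parse_args(["--learning_rate", "5e-5"])
```

Editing the corresponding flags in pretrain.sh, rather than the Python code, keeps each experiment's configuration in one place.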
To fine-tune models:
cd scripts
sh fine-tune.sh
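Few-shot fine-tuning uses only a handful of labeled examples per intent. A minimal sketch of such k-shot sampling is shown below; the actual sampling logic used by fine-tune.sh is an assumption here:

```python
import random
from collections import defaultdict

def sample_few_shot(examples, k, seed=0):
    # Keep at most k utterances per intent (k-shot split). The real
    # fine-tuning scripts may sample differently; this is a sketch.
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append(text)
    rng = random.Random(seed)  # fixed seed for reproducible splits
    subset = []
    for intent, texts in by_intent.items():
        for text in rng.sample(texts, min(k, len(texts))):
            subset.append((text, intent))
    return subset
```

Fixing the random seed matters in few-shot evaluation, since results can vary noticeably across different sampled support sets.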
Our project relies on several datasets, each playing a key role at a different stage; we give a concise overview of each below.
- PolyAI Bank: The PolyAI Bank dataset contains banking-related utterances. It covers a large number of customer intents and is available through the Hugging Face library.
- WikiHow: The WikiHow dataset is sourced from the WikiHow website. It pairs the longest step in a WikiHow article with the article title (sans "How To") as its intent.
- SNIPS: We use the SNIPS dataset because it is a popular benchmark for intent classification tasks.
- ATIS: ATIS (Airline Travel Information System) contains user queries concerning flight reservations, schedules, and other travel-related subjects. Like the original authors, we use it to evaluate intent classification.
- TOPv2: Finally, the TOPv2 dataset, developed by Facebook AI, encompasses user queries across various domains, including reminders and weather. Following the original authors, we focus on TOPv2 Weather and TOPv2 Reminder for this project.
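The WikiHow pairing described above can be sketched as a small function; the exact preprocessing, including how the "How to" prefix is stripped, is an assumption:

```python
def wikihow_pair(title, steps):
    # Pair the longest step with the article title minus its "How to"
    # prefix, as described above (exact string handling is an assumption).
    intent = title
    for prefix in ("How to ", "How To "):
        if intent.startswith(prefix):
            intent = intent[len(prefix):]
            break
    return max(steps, key=len), intent
```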
To generate the pretraining data, run the following script:
sh data/pretraining/preprocess_data.sh

All of the data used throughout the project is stored under the data folder, in the following layout:
data
├───pretraining
│   ├───dataset (storage for raw data)
│   ├───preprocessed_data (stores tokenized data)
│   ├───polyai-bank
│   │   └───get_data.py (data generator in each dataset)
│   ├───wikihow
│   │   └───get_data.py
│   └───preprocessing.py (tokenizes raw datasets & stores them in preprocessed_data)
└───evaluation
    ├───atis
    ├───snips
    ├───tops_reminder
    ├───tops_weather
    ├───dataset (storage for raw data)
    └───preprocessing.py (stores datasets into dataset folder)
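For illustration, a get_data.py script could dump raw (utterance, intent) pairs into its dataset folder as shown below. The JSON-lines format and the train.jsonl file name are assumptions, not necessarily what the real scripts produce:

```python
import json
import pathlib
import tempfile

def write_raw_pairs(pairs, out_dir):
    # Write (utterance, intent) pairs as one JSON object per line
    # (the on-disk format used by the real scripts is an assumption).
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "train.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for utterance, intent in pairs:
            f.write(json.dumps({"text": utterance, "intent": intent}) + "\n")
    return path

# Demo in a temporary directory so nothing inside the repo is touched.
demo_path = write_raw_pairs([("book a flight to boston", "book_flight")], tempfile.mkdtemp())
```

Keeping raw dumps and tokenized output in separate folders, as the layout above does, lets preprocessing.py be re-run with a different tokenizer without re-downloading anything.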