In this research project, we explore the effectiveness of Label Semantic Aware Pre-training (LSAP) on few-shot intent classification tasks. Our aim is to implement the LSAP technique on a series of T5-small models and evaluate their performance across diverse few-shot settings. The original Label Semantic Aware Pre-training paper can be found here.
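As a rough illustration of the idea, label-semantic-aware pre-training turns an (utterance, intent) pair into a text-to-text example whose target is the label's natural-language words. The prompt format below is an assumption for illustration, not the paper's exact format:

```python
def lsap_example(utterance, intent_label):
    # Turn a snake_case intent label into natural-language label words and
    # build a T5-style source/target pair (the prompt prefix is an assumption).
    label_words = intent_label.replace("_", " ")
    source = f"intent classification: {utterance}"
    return source, label_words
```

Training the model to generate the label words, rather than an opaque class index, is what lets label semantics transfer to unseen intents in the few-shot setting.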
This project is managed with Poetry, a dependency manager that replaces pip and handles virtual environments for you.
- Install Poetry (PowerShell):
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
- Add Poetry to PATH by appending this directory (replace NAME with your Windows username):
C:\Users\NAME\AppData\Roaming\pypoetry\venv\Scripts
- Run in project directory:
poetry config virtualenvs.in-project true
poetry install
- To activate the virtual environment, run:
poetry shell
- To deactivate the virtual environment, run:
exit
Alternatively, if pip is installed and configured on your system, you can install the dependencies with pip:
pip install -r requirements.txt
To generate data from scratch:
cd scripts
sh generate_data.sh
To pretrain models (requires configuration based on environment):
cd scripts
sh pretrain.sh
The training arguments can be changed inside pretrain.sh to replicate the different models attempted in our paper.
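To see how such arguments are typically consumed, here is a minimal argparse sketch of a training script's command line. The flag names and defaults are assumptions for illustration, not the actual contents of pretrain.sh:

```python
import argparse

def build_parser():
    # Hypothetical subset of the flags pretrain.sh might forward to the
    # training script; names and defaults here are assumptions.
    parser = argparse.ArgumentParser(description="LSAP pre-training (sketch)")
    parser.add_argument("--model_name_or_path", default="t5-small")
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--num_train_epochs", type=int, default=3)
    return parser

# Example: override only the learning rate and keep the other defaults.
args = build_parser().parse_args(["--learning_rate", "5e-5"])
```

Editing the corresponding flags in pretrain.sh, rather than the Python code, keeps each experiment's configuration in one place.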
To fine-tune models:
cd scripts
sh fine-tune.sh
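Few-shot fine-tuning uses only a handful of labeled examples per intent. A minimal sketch of such k-shot sampling is shown below; the actual sampling logic used by fine-tune.sh is an assumption here:

```python
import random
from collections import defaultdict

def sample_few_shot(examples, k, seed=0):
    # Keep at most k utterances per intent (k-shot split). The real
    # fine-tuning scripts may sample differently; this is a sketch.
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append(text)
    rng = random.Random(seed)  # fixed seed for reproducible splits
    subset = []
    for intent, texts in by_intent.items():
        for text in rng.sample(texts, min(k, len(texts))):
            subset.append((text, intent))
    return subset
```

Fixing the random seed matters in few-shot evaluation, since results can vary noticeably across different sampled support sets.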
Our project relies on several datasets, each playing a key role at a different stage; we give a concise overview of each below.
- PolyAI Bank: The PolyAI Bank dataset contains banking-related utterances. It covers a large number of customer intents and is available through the Hugging Face library.
- WikiHow: The WikiHow dataset is sourced from the WikiHow website. It pairs the longest step in a WikiHow article with the article title (sans "How To") as its intent.
- SNIPS: We use the SNIPS dataset because it is a popular benchmark for intent classification tasks.
- ATIS: ATIS (Airline Travel Information System) contains user queries concerning flight reservations, schedules, and other travel-related subjects. Like the original authors, we use it to evaluate intent classification.
- TOPv2: Finally, the TOPv2 dataset, developed by Facebook AI, encompasses user queries across various domains, including reminders and weather. Following the original authors, we focus on TOPv2 Weather and TOPv2 Reminder for this project.
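The WikiHow pairing described above can be sketched as a small function; the exact preprocessing, including how the "How to" prefix is stripped, is an assumption:

```python
def wikihow_pair(title, steps):
    # Pair the longest step with the article title minus its "How to"
    # prefix, as described above (exact string handling is an assumption).
    intent = title
    for prefix in ("How to ", "How To "):
        if intent.startswith(prefix):
            intent = intent[len(prefix):]
            break
    return max(steps, key=len), intent
```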
To generate the pretraining data, run the following script:
sh data/pretraining/preprocess_data.sh

All of the data used throughout the project is stored under the data folder, in the following layout:
data
├───pretraining
│   ├───dataset (storage for raw data)
│   ├───preprocessed_data (stores tokenized data)
│   ├───polyai-bank
│   │   └───get_data.py (data generator in each dataset)
│   ├───wikihow
│   │   └───get_data.py
│   └───preprocessing.py (tokenizes raw datasets & stores them in preprocessed_data)
└───evaluation
    ├───atis
    ├───snips
    ├───tops_reminder
    ├───tops_weather
    ├───dataset (storage for raw data)
    └───preprocessing.py (stores datasets into dataset folder)
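For illustration, a get_data.py script could dump raw (utterance, intent) pairs into its dataset folder as shown below. The JSON-lines format and the train.jsonl file name are assumptions, not necessarily what the real scripts produce:

```python
import json
import pathlib
import tempfile

def write_raw_pairs(pairs, out_dir):
    # Write (utterance, intent) pairs as one JSON object per line
    # (the on-disk format used by the real scripts is an assumption).
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "train.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for utterance, intent in pairs:
            f.write(json.dumps({"text": utterance, "intent": intent}) + "\n")
    return path

# Demo in a temporary directory so nothing inside the repo is touched.
demo_path = write_raw_pairs([("book a flight to boston", "book_flight")], tempfile.mkdtemp())
```

Keeping raw dumps and tokenized output in separate folders, as the layout above does, lets preprocessing.py be re-run with a different tokenizer without re-downloading anything.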