Label Semantic Aware Pretraining

In this research project, we explore the effectiveness of Label Semantic Aware Pre-training (LSAP) on few-shot intent classification tasks. Our aim is to implement the LSAP technique on a series of T5-small models and evaluate their performance across diverse few-shot settings, reimplementing the original paper, "Label Semantic Aware Pre-training for Few-shot Text Classification" (Mueller et al., 2022).
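
At its core, LSAP turns labeled utterances into a text-to-text pretraining signal: the model reads an utterance and learns to generate its intent label as natural language. The Python sketch below illustrates this pairing; the function and field names are illustrative assumptions, not the repository's actual preprocessing code.

# Hedged sketch: turning (utterance, intent) pairs into text-to-text examples.
def to_lsap_example(utterance: str, intent: str) -> dict:
    # Render snake_case intent names like "card_arrival" as natural language
    # ("card arrival") so the model is exposed to the label's semantics.
    label_text = intent.replace("_", " ")
    return {"input": utterance, "target": label_text}

print(to_lsap_example("When will my new card arrive?", "card_arrival"))
# {'input': 'When will my new card arrive?', 'target': 'card arrival'}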

Setup

This project is managed using Poetry, an alternative to pip with built-in virtual environment management.

  1. Install Poetry (PowerShell):
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
  2. Add Poetry to PATH:
C:\Users\NAME\AppData\Roaming\pypoetry\venv\Scripts
  3. In the project directory, run:
poetry config virtualenvs.in-project true
poetry install
  4. To activate the virtual environment, run:
poetry shell

To deactivate the virtual environment, run:

exit

Pip Setup (Alternative)

Assuming you have pip installed and configured on your system, you can install the dependencies with:

pip install -r requirements.txt

How To Run

To generate data from scratch:

cd scripts
sh generate_data.sh

To pretrain models (the script must be configured for your environment):

cd scripts
sh pretrain.sh

The training arguments inside pretrain.sh can be adjusted to replicate the different models attempted in our paper; a sketch of the kinds of hyperparameters involved follows.
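
As a rough illustration, here is a hedged Python sketch of the kind of hyperparameters pretrain.sh ultimately forwards to the training code, expressed with Hugging Face's Seq2SeqTrainingArguments. Every value and path below is assumed for illustration; the real flags live in scripts/pretrain.sh.

from transformers import Seq2SeqTrainingArguments

# Hypothetical values; consult scripts/pretrain.sh for the real ones.
training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/t5-small-lsap",  # assumed output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    save_strategy="epoch",
    predict_with_generate=True,
)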

To fine-tune models:

cd scripts
sh fine-tune.sh

Data

Our project relies on several datasets, each playing a key role at a different stage; we provide a concise overview of each below.

Pretraining

  1. PolyAI Bank: The PolyAI Bank dataset contains banking-related utterances covering a large number of customer intents and is available via the Hugging Face library (see the loading sketch after this list).

  2. WikiHow: The WikiHow dataset is sourced from the WikiHow website. It pairs the longest step in a WikiHow article with the article title (sans "How To") as its intent.
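
To make those pairs concrete, here is a hedged sketch of loading the banking data with the datasets library. The PolyAI/banking77 hub id and its text/label fields are assumptions; the repository's get_data.py scripts may pull the data differently.

from datasets import load_dataset

# Assumed hub id; see data/pretraining/polyai-bank/get_data.py for the real source.
bank = load_dataset("PolyAI/banking77", split="train")
intent_names = bank.features["label"].names  # e.g. "card_arrival"

pairs = [
    {"input": ex["text"], "target": intent_names[ex["label"]].replace("_", " ")}
    for ex in bank
]
print(pairs[0])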

Evaluation

  1. SNIPS: We use the SNIPS dataset as it is a popular benchmark in intent classification tasks.

  2. ATIS: ATIS (Airline Travel Information System) contains user queries concerning flight reservations, schedules, and other travel-related subjects. Following the original authors, we use it to evaluate intent classification.

  3. TOPv2: Finally, the TOPv2 dataset, developed by Facebook AI, encompasses user queries across various domains, including reminders and weather. Following the original authors, we focus on the TOPv2 Weather and TOPv2 Reminder domains; a sketch of how the resulting label generations are scored follows.
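
Because the fine-tuned T5 models emit intent labels as text, few-shot evaluation reduces to comparing generated label strings against the gold labels. The sketch below shows one hedged way to score this; the repository's actual evaluation code may normalize or match labels differently.

def intent_accuracy(predictions, references):
    # Exact-match accuracy over generated label strings (illustrative only).
    matches = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(predictions, references)
    )
    return matches / len(references)

print(intent_accuracy(["play music", "get weather"], ["play music", "book restaurant"]))  # 0.5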

Generation

To generate the pretraining data, run the following script:

sh data/pretraining/preprocess_data.sh

Data Layout

All data used throughout our project is stored under the data folder, laid out as follows:

data
├───pretraining
│   ├───dataset (storage for raw data)
│   ├───preprocessed_data (stores tokenized data)
│   ├───polyai-bank
│   │   └───get_data.py (data generator in each dataset)
│   ├───wikihow
│   │   └───get_data.py
│   └───preprocessing.py (tokenizes the raw datasets & stores them in preprocessed_data)
└───evaluation
    ├───atis
    ├───snips
    ├───tops_reminder
    ├───tops_weather
    ├───dataset (storage for raw data)
    └───preprocessing.py (stores datasets into the dataset folder)
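
As a rough picture of what the tokenization step produces, here is a hedged Python sketch that converts text pairs into T5-ready features and saves them to disk. The max lengths, column names, and output path are assumptions; preprocessing.py may differ on all of them.

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def tokenize(batch):
    # Tokenize utterances as inputs and natural-language labels as targets.
    features = tokenizer(batch["input"], truncation=True, max_length=128)
    targets = tokenizer(text_target=batch["target"], truncation=True, max_length=16)
    features["labels"] = targets["input_ids"]
    return features

raw = Dataset.from_list([{"input": "When will my new card arrive?", "target": "card arrival"}])
tokenized = raw.map(tokenize, batched=True, remove_columns=["input", "target"])
tokenized.save_to_disk("data/pretraining/preprocessed_data/example")  # assumed path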
