SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics (EMNLP 2024)

zhiwenyou103/SciPrompt

SciPrompt

SciPrompt is a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks, including few-shot and zero-shot settings.

News

This project is developed based on the OpenPrompt framework.

Overall Framework

Installation

To install the necessary Python packages, clone this repo, then create and activate a conda environment and install the requirements:

conda create -n sciprompt python=3.8.12
conda activate sciprompt
pip install -r requirements.txt

Prepare the Required Files and Directories

  • Replace the placeholder paths in the script with actual paths to your data and configuration files:
    • --data_dir should point to your data directory
    • --verbalizer_path should point to your arXiv_knowledgable_verbalizer.txt
    • --semantic_score_path should point to your arXiv_knowledgable_verbalizer_semantic_search_scores.txt
    • --doc_id_path should point to your doc_id.txt
    • --config_path should point to config/arxiv_label_mappings.json
  • Prepare your class label dictionary similar to the .json files in the label_mappings folder
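Before launching a run, it can help to confirm that every path you filled in actually exists. The following is a minimal sketch, not part of the repository: the helper `find_missing_paths` is hypothetical, and the `data/arxiv` location and the bare file names are placeholder assumptions to replace with your own paths. Only the flag names and file names come from the list above.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_paths(paths: Dict[str, str]) -> List[str]:
    """Return the flag names whose paths do not exist on disk."""
    return [flag for flag, location in paths.items() if not Path(location).exists()]


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with your actual locations.
    required = {
        "--data_dir": "data/arxiv",
        "--verbalizer_path": "arXiv_knowledgable_verbalizer.txt",
        "--semantic_score_path": "arXiv_knowledgable_verbalizer_semantic_search_scores.txt",
        "--doc_id_path": "doc_id.txt",
        "--config_path": "config/arxiv_label_mappings.json",
    }
    for flag in find_missing_paths(required):
        print(f"Missing path for {flag}: {required[flag]}")
```

If the script prints nothing, all five paths resolve relative to the directory you ran it from.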

Knowledge Retrieval and Filtering

  • Run through our datasets:

    • Step 1: Change paths in run_retrieval.sh and run bash run_retrieval.sh
    • Step 2: Change paths of the filtering model, retrieved data (from Step 1), and output files in the run_knowledge_filtering.sh script
    • Step 3: Run the filtering script:
      bash run_knowledge_filtering.sh
  • Run using your own dataset:

    • Steps 1 and 2 are the same as above
    • Step 3: Set your dataset name to custom and add the corresponding configs to the dataset_configs dictionary in knowledge_filtering.py (line 206)
    • Run bash run_knowledge_filtering_customized.sh
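To give a rough intuition for what a filtering step does with retrieved terms, here is a minimal sketch of score-based filtering. It is an illustration under stated assumptions, not the logic of knowledge_filtering.py: the function `filter_terms`, the threshold value, and the example terms and scores are all hypothetical, and the repository's scripts may apply different or additional criteria.

```python
from typing import Dict, List, Tuple


def filter_terms(
    scored_terms: Dict[str, List[Tuple[str, float]]],
    threshold: float,
) -> Dict[str, List[str]]:
    """Keep, for each label, only the terms whose score reaches the threshold."""
    kept: Dict[str, List[str]] = {}
    for label, terms in scored_terms.items():
        kept[label] = [term for term, score in terms if score >= threshold]
    return kept


if __name__ == "__main__":
    # Hypothetical retrieved terms with made-up relevance scores.
    retrieved = {
        "cs.CL": [("parsing", 0.82), ("syntax", 0.64), ("weather", 0.11)],
    }
    print(filter_terms(retrieved, threshold=0.5))
    # → {'cs.CL': ['parsing', 'syntax']}
```

A higher threshold keeps fewer, more closely related terms per label; a lower one keeps more terms at the cost of noise.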

Run the main script:

  • Execute scripts for each dataset:
bash run_arxiv.sh
bash run_s2orc.sh
bash run_sdpra.sh
  • Run on your own data. This requires two input files, one containing only the data and one containing only the labels, as in the arXiv setup:
bash run_custom_script.sh

Note: Please modify the required data file paths inside each script before running.
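When preparing the two input files for your own data, a quick consistency check can catch misaligned examples early. The sketch below assumes one example per line in the data file and one label per line in the label file, matched by line number; that layout is an assumption to verify against the arXiv files, and `load_pairs` is a hypothetical helper, not a function from this repository.

```python
from pathlib import Path
from typing import List, Tuple


def load_pairs(data_path: str, label_path: str) -> List[Tuple[str, str]]:
    """Read line-aligned data and label files and pair them up."""
    texts = Path(data_path).read_text(encoding="utf-8").splitlines()
    labels = Path(label_path).read_text(encoding="utf-8").splitlines()
    if len(texts) != len(labels):
        # A count mismatch means at least one example has no matching label.
        raise ValueError(
            f"{len(texts)} data lines but {len(labels)} label lines"
        )
    return list(zip(texts, labels))
```

Calling `load_pairs` on your two files before running bash run_custom_script.sh raises a `ValueError` if the line counts differ, which is cheaper to discover than a failure partway through a run.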

Citation Information

If you use SciPrompt or the Emerging NLP benchmark, please cite:

@inproceedings{you-etal-2024-sciprompt,
    title = "{S}ci{P}rompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics",
    author = "You, Zhiwen  and
      Han, Kanyao  and
      Zhu, Haotian  and
      Ludaescher, Bertram  and
      Diesner, Jana",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.350",
    pages = "6087--6104",
}

Contact Information

If you have any questions, please email [email protected].
