SciPrompt is a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks, including few-shot and zero-shot settings.
- (2025.03.16) The Emerging NLP dataset is available on 🤗 Hugging Face!
- (2025.03.16) Our datasets can be downloaded here.
- (2024.12.01) Download the fine-tuned filtering models.
This project is developed based on the OpenPrompt framework.
To install the necessary Python packages, clone this repo, then create and activate a conda environment and install the requirements:
conda create -n sciprompt python=3.8.12
conda activate sciprompt
pip install -r requirements.txt
- Replace the placeholder paths in the script with actual paths to your data and configuration files:
  - --data_dir should point to your data directory
  - --verbalizer_path should point to your arXiv_knowledgable_verbalizer.txt
  - --semantic_score_path should point to your arXiv_knowledgable_verbalizer_semantic_search_scores.txt
  - --doc_id_path should point to your doc_id.txt
  - --config_path should point to config/arxiv_label_mappings.json
- Prepare your class label dictionary, following the format of the .json files in the label_mappings folder
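Before launching the script, it can help to confirm that every path you filled in actually exists. The following is a minimal sketch and not part of this repo: the helper name check_paths is hypothetical, the keys simply mirror the flags listed above, and the only assumption about the label-mapping file is that it parses as JSON.

```python
import json
import os

# Hypothetical pre-flight helper, not part of the SciPrompt codebase.
# The names mirror the command-line flags listed above.
EXPECTED_FLAGS = (
    "data_dir",
    "verbalizer_path",
    "semantic_score_path",
    "doc_id_path",
    "config_path",
)


def check_paths(paths):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for flag in EXPECTED_FLAGS:
        value = paths.get(flag)
        if not value:
            problems.append(f"--{flag} is not set")
        elif not os.path.exists(value):
            problems.append(f"--{flag} points to a missing path: {value}")

    # Assumption: the label-mapping config is a JSON file.
    # Nothing is assumed about its keys or structure.
    config_path = paths.get("config_path")
    if config_path and os.path.isfile(config_path):
        try:
            with open(config_path, encoding="utf-8") as handle:
                json.load(handle)
        except json.JSONDecodeError as error:
            problems.append(f"--config_path is not valid JSON: {error}")
    return problems
```

Calling check_paths with a dictionary such as {"data_dir": "...", "config_path": "config/arxiv_label_mappings.json", ...} returns an empty list when all five paths exist and the config file parses.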
- Run through our datasets:
  - Step 1: Change paths in run_retrieval.sh and run bash run_retrieval.sh
  - Step 2: Change paths of the filtering model, retrieved data (from Step 1), and output files in the run_knowledge_filtering.sh script
  - Step 3: Run the filtering script: bash run_knowledge_filtering.sh
- Run using your own dataset:
  - Steps 1 and 2 are the same as above
  - Step 3: Set your dataset name to custom and add the corresponding configs to the dataset_configs dictionary in knowledge_filtering.py (line 206)
  - Step 4: Run bash run_knowledge_filtering_customized.sh
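The steps above run the shell scripts one after another, and each step depends on the output of the previous one. As a rough sketch, the sequence could be driven from Python so that a failed step stops the run. The function name run_steps and its runner parameter are hypothetical and not part of this repo; only the script names come from the steps above.

```python
import subprocess


def run_steps(scripts, runner=subprocess.run):
    """Run each shell script with bash, in order, stopping at the first failure.

    Returns the list of scripts that completed successfully.
    The runner parameter exists only so the function can be tested
    without executing real scripts.
    """
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

For the provided datasets, run_steps(["run_retrieval.sh", "run_knowledge_filtering.sh"]) would run retrieval first and start filtering only if retrieval succeeded. The paths inside each script still need to be edited by hand beforehand.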
- Execute scripts for each dataset:
bash run_arxiv.sh
bash run_s2orc.sh
bash run_sdpra.sh
- Run on your own data (this requires two input files, one containing only the data and one containing only the labels, as in the arXiv setup):
bash run_custom_script.sh
Note: Please modify the required data file paths inside each script before running.
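Because the custom setup uses separate data and label files, a mismatch between them is easy to miss. The sketch below pairs the two files and checks that they line up. It assumes one example per line in both files, with line i of the label file belonging to line i of the data file. That layout is an assumption, not something this README specifies, and load_custom_dataset is a hypothetical helper rather than part of the repo.

```python
def load_custom_dataset(data_path, label_path):
    """Pair each text with its label.

    Assumption: both files hold one example per line, in the same order.
    """
    with open(data_path, encoding="utf-8") as handle:
        texts = [line.rstrip("\n") for line in handle]
    with open(label_path, encoding="utf-8") as handle:
        labels = [line.strip() for line in handle]

    if len(texts) != len(labels):
        raise ValueError(
            f"{data_path} has {len(texts)} lines but {label_path} has {len(labels)}"
        )
    return list(zip(texts, labels))
```

If your files use a different layout, adjust the parsing accordingly; the length check is the part worth keeping.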
If you use SciPrompt or the Emerging NLP benchmark, please cite:
@inproceedings{you-etal-2024-sciprompt,
title = "{S}ci{P}rompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics",
author = "You, Zhiwen and
Han, Kanyao and
Zhu, Haotian and
Ludaescher, Bertram and
Diesner, Jana",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.350",
pages = "6087--6104",
}
If you have any questions, please email [email protected].