Link: https://share.streamlit.io/rahul1758/abstracts-simplifier/app.py
The objective of this project is to help researchers with their literature search. Researchers have to skim through many papers to find the ones relevant to their topic, and they usually do so by reading the abstracts. This becomes time-consuming when an abstract lacks a clear structure. This web app uses sequential sentence classification to give abstracts a clear structure, making them easier and quicker to read.
I've always been terrified of reading long articles with huge paragraphs, and content that lacks structure adds to the anxiety of reading through an entire paragraph. I wanted to make reading easier and quicker when I came across this research paper, which does exactly that for abstracts in the medical domain. I've implemented the paper using a state-of-the-art BERT Transformer architecture.
The dataset I am using was prepared by the authors of the research paper. You can download it from their GitHub repository: https://github.com/Franck-Dernoncourt/pubmed-rct There are 2 versions of the dataset:
- Larger: PubMed_200k_RCT, which contains the labelled sentences of roughly 200k abstracts. There is also a variant of this dataset in which the numbers mentioned in the abstracts are replaced by the @ symbol.
- Smaller: PubMed_20k_RCT, which contains the labelled sentences of roughly 20k abstracts. There is also a variant of this dataset in which the numbers mentioned in the abstracts are replaced by the @ symbol.
I've used the smaller version (PubMed_20k_RCT) for this project.
Each abstract in the dataset is represented in the following format:
'###24293578\n' -> id denoting start of abstract of a research paper
(Label)\t(Sentence) -> Label along with each sentence in the abstract
(Label)\t(Sentence)
.
.
'\n' -> denoting the end of abstract of research paper
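The format above can be turned into sentence-label records with a small parser. The following is a minimal sketch, not the project's actual preprocessing code: the function name `parse_abstracts`, the record keys, and the sample text are all hypothetical and chosen only for illustration.

```python
def parse_abstracts(text):
    """Convert raw PubMed RCT text into a list of sentence records.

    Hypothetical helper: the name and the record keys are illustrative.
    """
    records = []
    abstract_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("###"):
            # '###<id>' marks the start of a new abstract.
            abstract_id = line[3:]
        elif not line:
            # A blank line marks the end of the current abstract.
            abstract_id = None
        else:
            # Every other line is a label and a sentence separated by a tab.
            label, _, sentence = line.partition("\t")
            records.append(
                {"abstract_id": abstract_id, "label": label, "sentence": sentence}
            )
    return records


# Made-up sample in the dataset's layout (not real dataset content).
sample = (
    "###24293578\n"
    "OBJECTIVE\tTo test the parser on a toy abstract .\n"
    "METHODS\tWe split each line on the tab character .\n"
    "\n"
)

records = parse_abstracts(sample)
for record in records:
    print(record["abstract_id"], record["label"], record["sentence"], sep=" | ")
```

Keeping the abstract id on every record makes it possible to group sentences by abstract again later, which is needed for any feature that depends on sentence order.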
My approach to solving this problem is as follows:
- Preprocess the data (Converting the raw data into Sentence-Label format)
- Feature Engineering (Added 2 custom features, namely Line_number and Total_lines. The sentences in an abstract are correlated and derive context from each other. The order of the sentences matters a lot, and these 2 features help the model understand the sequence/order of the input sentences.)
- Model Training (I've used a BERT model from TensorFlow Hub that was pre-trained from scratch on MEDLINE/PubMed. Training was done on Google Colab.)
- Evaluate the model
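The feature engineering step can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name `add_position_features`, the lowercase keys, the zero-based `line_number`, and `total_lines` as a plain sentence count are all choices made here for the example, and the sample sentences are invented.

```python
def add_position_features(abstracts):
    """Attach positional features to every sentence.

    `abstracts` is a list of abstracts, each a list of sentences in their
    original order. Hypothetical helper; the conventions (zero-based
    line_number, total_lines as the sentence count) are assumptions.
    """
    rows = []
    for sentences in abstracts:
        total_lines = len(sentences)
        for line_number, sentence in enumerate(sentences):
            rows.append(
                {
                    "sentence": sentence,
                    "line_number": line_number,  # position within the abstract
                    "total_lines": total_lines,  # length of the abstract
                }
            )
    return rows


# Invented example: one abstract with three sentences, one with a single sentence.
example_abstracts = [
    ["First sentence .", "Second sentence .", "Third sentence ."],
    ["Only sentence ."],
]

rows = add_position_features(example_abstracts)
for row in rows:
    print(row["line_number"], row["total_lines"], row["sentence"])
```

Together the two values tell the model where a sentence sits relative to the whole abstract, for example that a sentence is the last of three rather than the third of twelve.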
The Model architecture I've used can be found in this Colab Notebook
- Spacy
- Streamlit
- TensorFlow
- TensorFlow-Text
The code is written in Python 3.8. If you don't have Python installed you can find it here. If you are using an older version of Python, upgrade it first, and make sure you have the latest version of pip. To install the required packages and libraries, run this command in the project directory after cloning the repository:
pip install -r requirements.txt
Then run the following command which runs the Webapp locally:
streamlit run app.py
That's it!!
- Try to improve the F1-score using different architectures. One such approach is described in this research paper: Pretrained Language Models for Sequential Sentence Classification
- PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
- Neural Networks for Joint Sentence Classification in Medical Paper Abstracts
If you have suggestions for improvement or any other queries, you can reach me on the following platform: