Abstracts-Simplifier

Table of Content

Demo
Objective
Motivation
Data
Approach
Packages/Libraries
Installation
To Do
Contact

Demo

Link: https://share.streamlit.io/rahul1758/abstracts-simplifier/app.py

Unstructured Abstract

Structured Abstract

Working

Objective

The objective of this project is to help researchers in their research. Each researcher has to skim through a lot of research papers trying to find the relevant ones for the topic in their mind. And in doing so they have to read the abstracts of papers to filter the relevant ones. But sometimes it becomes time-consuming if the abstract don't have proper structure. This Webapp uses the concept of Sequential Sentence classification to provide appropriate structure to the Abstracts, making reading easier, quicker & efficient.

Motivation

I've always been terrified of reading long articles with huge paragraphs. Plus if the content lacks structure it adds to the anxiety of reading through the entire paragraph. I wanted to make reading easier & quicker, when I came across this Research paper which does the same but for Medical domain abstract. I've implemented the paper using State-of-the-art BERT Transformer architecture.

Data

The dataset I am using was prepared by the authors of the Research paper. You can download it from his Github link: https://github.com/Franck-Dernoncourt/pubmed-rct There are 2 version of the dataset:

Larger: PubMed_200k_RCT which contains 200k labelled sentences of abstracts in total. There is also a version of this dataset where the numbers mentioned in the abstract is replaced by @ symbol.
Smaller: PubMed_20K_RCT which contains 20k labelled sentences of abstracts in total. There is also a version of this dataset where the numbers mentioned in the abstract is replaced by @ symbol.

I've used the Smaller version (PubMed_20K_RCT) for this project.

Approach

Each abstract in the dataset is represent in following format:

'###24293578\n' -> id denoting start of abstract of a research paper
(Label)\t(Sentence) -> Label along with each sentence in the abstract
(Label)\t(Sentence)
.
.
'\n' -> denoting the end of abstract of research paper

Following is my approach in solving this problem:

Preprocess the data (Converting the raw data into Sentence-Label format)
Feature Engineering (Added 2 custom features namely Line_number & Total_lines. The sentences in the abstract are correlated and derive context from each other.The order of the sentences matter a lot and these 2 features will help the model understand the sequence/order of the input sentences.)
Model Training (I've used BERT model that was trained on MEDLINE/PubMed from scratch from TensorFlow Hub. Training was done on Google Colab.)
Evaluate the model

The Model architecture I've used can be found in this Colab Notebook

Packages/Libraries

Spacy
Streamlit
TensorFlow
TensorFlow-Text

Installation

The Code is written in Python 3.8. If you don't have Python installed you can find it here. If you are using a lower version of Python you can upgrade using the pip package, ensuring you have the latest version of pip. To install the required packages and libraries, run this command in the project directory after cloning the repository:

pip install -r requirements.txt

Then run the following command which runs the Webapp locally:

streamlit run app.py

That's it!!

To Do

Try and improve F1-score using different architectures. One of the way is used in this Research paper: Pretrained Language Models for Sequential Sentence Classification

References

Contact

If you have suggestions for improvement or any other query, you can reach me at following platform:

Linkedin

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Training Notebook		Training Notebook
gifs & imgs		gifs & imgs
saved_model		saved_model
.gitattributes		.gitattributes
README.md		README.md
app.py		app.py
data_preprocess.py		data_preprocess.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Abstracts-Simplifier

Table of Content

Demo

Unstructured Abstract

Structured Abstract

Working

Objective

Motivation

Data

Approach

Packages/Libraries

Installation

To Do

References

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Rahul1758/Abstracts-Simplifier

Folders and files

Latest commit

History

Repository files navigation

Abstracts-Simplifier

Table of Content

Demo

Unstructured Abstract

Structured Abstract

Working

Objective

Motivation

Data

Approach

Packages/Libraries

Installation

To Do

References

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages