Contributors: Chen Jiayun, Liu Ziyang, Wang Yiyang
We are a group of early-year undergraduate students, and this project is our first encounter with data science (NLP) topics. Overall, it is a toy model. Feel free to hack on it :).
The project guide is in word_representations_biomedical.ipynb and cord19WordVectors.ipynb.
We get the data from:

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

Decompress the archive after downloading it (e.g. tar -xzf document_parses.tar.gz).
It is vital to set the environment variable NLP_DATA_PATH before running make step-1. Simply pointing it at the root directory of the data extracted above is enough: step-1.py will scan all the subdirectories and convert every file ending with .json.
export NLP_DATA_PATH="/path/to/document_parse"

We need to extract all JSON files from the folder to parse the raw text. We use the os.walk API to scan all files (including those in subdirectories) under the given directory and extract the required content from them. Due to device limitations, we only extracted the 'title' and 'abstract' parts.
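The extraction works roughly as follows. This is a minimal sketch rather than the actual code in step-1.py: the helper names are illustrative, and the metadata/abstract field layout of the CORD-19 parse files is assumed.

```python
import os
import json

def find_json_files(root):
    """Walk root and all subdirectories, yielding the path of every .json file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                yield os.path.join(dirpath, name)

def extract_text(path):
    """Return the title and abstract text of a single CORD-19 parse file."""
    with open(path, "r", encoding="utf-8") as f:
        doc = json.load(f)
    title = doc.get("metadata", {}).get("title", "")
    # the abstract is assumed to be a list of paragraph objects with a "text" field
    abstract = " ".join(p.get("text", "") for p in doc.get("abstract", []))
    return title, abstract

if __name__ == "__main__":
    root = os.environ["NLP_DATA_PATH"]
    for path in find_json_files(root):
        title, abstract = extract_text(path)
        print(title, abstract)
```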
We used three tokenization methods: Python's built-in split function, the NLTK library, and the ByteLevelBPETokenizer from the Hugging Face tokenizers library. The tokenized results are written to the "result" folder (located outside of this GitHub repository).
Check the detailed code in the Makefile, or run make step-2-{method}.
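For illustration, the three approaches look roughly like this. This is a sketch, not the exact code behind make step-2-{method}: the sample sentence and the corpus path result/corpus.txt are assumptions, NLTK needs its punkt data downloaded once, and the BPE tokenizer must first be trained on the extracted text.

```python
import nltk
from tokenizers import ByteLevelBPETokenizer

text = "The spike protein binds to the ACE2 receptor."

# 1. Python's built-in split: plain whitespace tokenization
tokens_split = text.split()

# 2. NLTK word tokenizer (requires nltk.download("punkt") once)
tokens_nltk = nltk.word_tokenize(text)

# 3. Byte-level BPE: train a subword vocabulary on the corpus, then encode
bpe = ByteLevelBPETokenizer()
bpe.train(files=["result/corpus.txt"], vocab_size=30000, min_frequency=2)
tokens_bpe = bpe.encode(text).tokens
```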
We trained two kinds of models:
The definition of the model is in ngram_model.py, along with some magic numbers such as the dimension of the vectors and the particular N for the N-gram.
The code here is primarily derived from the PyTorch word embeddings tutorial,
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
with slight modifications to make it compatible with our tokens.
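A minimal sketch of such an N-gram model, following the tutorial linked above; the values of EMBEDDING_DIM and CONTEXT_SIZE stand in for the magic numbers actually chosen in ngram_model.py.

```python
import torch.nn as nn
import torch.nn.functional as F

CONTEXT_SIZE = 2      # the N - 1 preceding words used as context
EMBEDDING_DIM = 64    # dimension of each word vector

class NGramLanguageModeler(nn.Module):
    """Predict the next word from the embeddings of the previous CONTEXT_SIZE words."""
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # inputs: a tensor of CONTEXT_SIZE word indices
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)
```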
The definition of the model is in skip_gram_model.py.
In fact, we just modified the tutorial code above into another type of training: instead of predicting the next word from its context, the skip-gram model predicts context words from a single center word.
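A rough sketch of that variant (an assumption about what skip_gram_model.py looks like, not a copy of it): the model embeds a center word and scores every vocabulary word as a possible context word.

```python
import torch.nn as nn
import torch.nn.functional as F

class SkipGramModel(nn.Module):
    """Given a center word, predict a distribution over possible context words."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_word_idx):
        embeds = self.embeddings(center_word_idx)   # (batch, embedding_dim)
        scores = self.linear(embeds)                # (batch, vocab_size)
        return F.log_softmax(scores, dim=1)

# Training pairs are (center, context) word indices; nn.NLLLoss matches log_softmax.
model = SkipGramModel(vocab_size=5000, embedding_dim=64)
loss_fn = nn.NLLLoss()
```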
There are two kinds of output: the word embeddings and the saved model.
Check the result directory. The files ending with .json contain each token and its vector.
Check the model directory. The model is saved there for further training.
Neither directory is included in the git repository, but both are created automatically when you run make step-3-ngram or make step-3-skipgram.
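For example, the two outputs could be written like this. This is a sketch with assumed filenames (word_vectors.json, skipgram.pt), and it assumes a trained model plus a word_to_ix vocabulary mapping from the training step.

```python
import json
import os
import torch

os.makedirs("result", exist_ok=True)
os.makedirs("model", exist_ok=True)

# 1. Word embeddings: one vector per token, dumped as JSON
weights = model.embeddings.weight.detach().tolist()
word_vectors = {word: weights[idx] for word, idx in word_to_ix.items()}
with open("result/word_vectors.json", "w", encoding="utf-8") as f:
    json.dump(word_vectors, f)

# 2. Model checkpoint: state dict saved so training can be resumed later
torch.save(model.state_dict(), "model/skipgram.pt")
```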
We use t-SNE to project all the vectors onto a 2D plane and plot them.
We only plotted a subset of points related to biomedical terms to examine the overall distribution trend of the vectors.
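A minimal sketch of that visualization, assuming the word_vectors.json file from the previous step (the output path tsne.png is likewise an assumption):

```python
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

with open("result/word_vectors.json", encoding="utf-8") as f:
    word_vectors = json.load(f)

words = list(word_vectors)
vectors = np.array([word_vectors[w] for w in words])

# Project the high-dimensional embeddings onto 2D with t-SNE
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=6)
plt.savefig("tsne.png")
```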
We identify words that often co-occur by calculating correlation coefficients.
We measure how similar two words are by calculating the dot product of their vectors.
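Both measures can be computed directly from the stored vectors, for example as below; "virus" and "infection" are just illustrative tokens, and word_vectors is the dictionary loaded above.

```python
import numpy as np

def dot_similarity(word_vectors, a, b):
    """Similarity score: the dot product of the two word vectors."""
    va, vb = np.array(word_vectors[a]), np.array(word_vectors[b])
    return float(np.dot(va, vb))

def correlation(word_vectors, a, b):
    """Pearson correlation coefficient between the two word vectors."""
    va, vb = np.array(word_vectors[a]), np.array(word_vectors[b])
    return float(np.corrcoef(va, vb)[0, 1])

print(dot_similarity(word_vectors, "virus", "infection"))
print(correlation(word_vectors, "virus", "infection"))
```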
Even with a relatively small dataset and limited computational resources, we can still observe that the overall performance of the model improves significantly with training. Although the fine-grained distinctions are not yet accurate, we expect the results to improve with more data and computational power.