Contributors: Chen Jiayun, Liu Ziyang, Wang Yiyang
We are a group of early-year undergraduate students, and this project is our first encounter with data science (NLP) topics. Overall, it is a toy model. Feel free to hack on it :).
The project guide is in word_representations_biomedical.ipynb and cord19WordVectors.ipynb.
We get the data from:

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

Decompress the archive after downloading it (e.g. tar -xzf document_parses.tar.gz).
It is vital to set the environment variable NLP_DATA_PATH before running make step-1. Simply pointing it at the root directory of the data extracted above is enough: step-1.py will scan all the subdirectories and convert every file ending with .json.
export NLP_DATA_PATH="/path/to/document_parse"

We need to extract all JSON files from the folder to parse the raw text. We use the os.walk API to scan all files (including those in subdirectories) under the given directory and extract the required content from them. Due to device limitations, we only extracted the 'title' and 'abstract' parts.
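The extraction works roughly as follows. This is a minimal sketch rather than the actual code in step-1.py: the helper names are illustrative, and the metadata/abstract field layout of the CORD-19 parse files is assumed.

```python
import os
import json

def find_json_files(root):
    """Walk root and all subdirectories, yielding the path of every .json file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                yield os.path.join(dirpath, name)

def extract_text(path):
    """Return the title and abstract text of a single CORD-19 parse file."""
    with open(path, "r", encoding="utf-8") as f:
        doc = json.load(f)
    title = doc.get("metadata", {}).get("title", "")
    # the abstract is assumed to be a list of paragraph objects with a "text" field
    abstract = " ".join(p.get("text", "") for p in doc.get("abstract", []))
    return title, abstract

if __name__ == "__main__":
    root = os.environ["NLP_DATA_PATH"]
    for path in find_json_files(root):
        title, abstract = extract_text(path)
        print(title, abstract)
```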
We used three tokenization methods: Python's built-in split function, the NLTK library, and the ByteLevelBPETokenizer from the Hugging Face tokenizers library. The tokenized results are written to the "result" folder (located outside of this GitHub repository).
Check the detailed code in the Makefile, or run make step-2-{method}.
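For illustration, the three approaches look roughly like this. This is a sketch, not the exact code behind make step-2-{method}: the sample sentence and the corpus path result/corpus.txt are assumptions, NLTK needs its punkt data downloaded once, and the BPE tokenizer must first be trained on the extracted text.

```python
import nltk
from tokenizers import ByteLevelBPETokenizer

text = "The spike protein binds to the ACE2 receptor."

# 1. Python's built-in split: plain whitespace tokenization
tokens_split = text.split()

# 2. NLTK word tokenizer (requires nltk.download("punkt") once)
tokens_nltk = nltk.word_tokenize(text)

# 3. Byte-level BPE: train a subword vocabulary on the corpus, then encode
bpe = ByteLevelBPETokenizer()
bpe.train(files=["result/corpus.txt"], vocab_size=30000, min_frequency=2)
tokens_bpe = bpe.encode(text).tokens
```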
We trained two kinds of models:
The definition of the model is in ngram_model.py, along with some magic numbers such as the dimension of the vectors and the particular N for the N-gram.
The code here is primarily derived from the PyTorch word embeddings tutorial,
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
with slight modifications to make it compatible with our tokens.
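A minimal sketch of such an N-gram model, following the tutorial linked above; the values of EMBEDDING_DIM and CONTEXT_SIZE stand in for the magic numbers actually chosen in ngram_model.py.

```python
import torch.nn as nn
import torch.nn.functional as F

CONTEXT_SIZE = 2      # the N - 1 preceding words used as context
EMBEDDING_DIM = 64    # dimension of each word vector

class NGramLanguageModeler(nn.Module):
    """Predict the next word from the embeddings of the previous CONTEXT_SIZE words."""
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # inputs: a tensor of CONTEXT_SIZE word indices
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)
```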
The definition of the model is in skip_gram_model.py.
In fact, we just modified the tutorial code above into another type of training: instead of predicting the next word from its context, the skip-gram model predicts context words from a single center word.
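A rough sketch of that variant (an assumption about what skip_gram_model.py looks like, not a copy of it): the model embeds a center word and scores every vocabulary word as a possible context word.

```python
import torch.nn as nn
import torch.nn.functional as F

class SkipGramModel(nn.Module):
    """Given a center word, predict a distribution over possible context words."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_word_idx):
        embeds = self.embeddings(center_word_idx)   # (batch, embedding_dim)
        scores = self.linear(embeds)                # (batch, vocab_size)
        return F.log_softmax(scores, dim=1)

# Training pairs are (center, context) word indices; nn.NLLLoss matches log_softmax.
model = SkipGramModel(vocab_size=5000, embedding_dim=64)
loss_fn = nn.NLLLoss()
```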
There are two kinds of output: the word embeddings and the saved model.
Check the result directory. The files ending with .json contain each token and its vector.
Check the model directory. The model is saved there for further training.
Neither directory is included in the git repository, but both are created automatically when you run make step-3-ngram or make step-3-skipgram.
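For example, the two outputs could be written like this. This is a sketch with assumed filenames (word_vectors.json, skipgram.pt), and it assumes a trained model plus a word_to_ix vocabulary mapping from the training step.

```python
import json
import os
import torch

os.makedirs("result", exist_ok=True)
os.makedirs("model", exist_ok=True)

# 1. Word embeddings: one vector per token, dumped as JSON
weights = model.embeddings.weight.detach().tolist()
word_vectors = {word: weights[idx] for word, idx in word_to_ix.items()}
with open("result/word_vectors.json", "w", encoding="utf-8") as f:
    json.dump(word_vectors, f)

# 2. Model checkpoint: state dict saved so training can be resumed later
torch.save(model.state_dict(), "model/skipgram.pt")
```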
We use t-SNE to project all the vectors onto a 2D plane and plot them.
We only plotted a subset of points related to biomedical terms to examine the overall distribution trend of the vectors.
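A minimal sketch of that visualization, assuming the word_vectors.json file from the previous step (the output path tsne.png is likewise an assumption):

```python
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

with open("result/word_vectors.json", encoding="utf-8") as f:
    word_vectors = json.load(f)

words = list(word_vectors)
vectors = np.array([word_vectors[w] for w in words])

# Project the high-dimensional embeddings onto 2D with t-SNE
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=6)
plt.savefig("tsne.png")
```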
We identify words that often co-occur by calculating correlation coefficients.
We measure how similar two words are by calculating the dot product of their vectors.
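Both measures can be computed directly from the stored vectors, for example as below; "virus" and "infection" are just illustrative tokens, and word_vectors is the dictionary loaded above.

```python
import numpy as np

def dot_similarity(word_vectors, a, b):
    """Similarity score: the dot product of the two word vectors."""
    va, vb = np.array(word_vectors[a]), np.array(word_vectors[b])
    return float(np.dot(va, vb))

def correlation(word_vectors, a, b):
    """Pearson correlation coefficient between the two word vectors."""
    va, vb = np.array(word_vectors[a]), np.array(word_vectors[b])
    return float(np.corrcoef(va, vb)[0, 1])

print(dot_similarity(word_vectors, "virus", "infection"))
print(correlation(word_vectors, "virus", "infection"))
```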
Even with a relatively small dataset and limited computational resources, we can still observe that the overall performance of the model improves significantly with training. Although the fine-grained distinctions are not yet accurate, we expect the results to improve with more data and computational power.