Neal W Morton edited this page Nov 5, 2019 · 17 revisions

Installation

Windows 10

First, install Windows Subsystem for Linux and a Linux distribution such as Ubuntu.

Install tools for compiling code:

sudo apt-get install make
sudo apt-get install gcc
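To confirm that the build tools are available before compiling, a small check like the following may help. This is a minimal sketch; `check_tools` is a hypothetical helper name and is not part of wiki2vec.

```shell
# Hypothetical helper (not part of wiki2vec): report which of the given
# commands can be found on the PATH. Returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage: check_tools make gcc
```

If either tool is reported as missing, rerun the corresponding install command before continuing.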

Then, follow the general instructions below, using the Ubuntu terminal.

General instructions

Download the project code from GitHub. In the terminal:

git clone https://github.com/prestonlab/wiki2vec.git

Install the required Python packages:

pip install numpy
pip install nltk

Then, set up the package:

cd wiki2vec/word2vec
make
cd ..
. setup.sh

This will compile a modified version of the word2vec code, which is needed to read the vectors provided by Google. It will also add the necessary scripts to your PATH and PYTHONPATH. Note that setup.sh must be sourced (i.e., . setup.sh) each time you open a new terminal window, because the PATH and PYTHONPATH changes apply only to the current shell session. Alternatively, you can add the source command to your $HOME/.bashrc file so that it runs automatically each time you start a new terminal.
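One way to add the source command to your startup file is sketched below. `add_wiki2vec_setup` is a hypothetical helper, and the clone location you pass to it is your own choice. The sketch also assumes that setup.sh should be sourced from inside the wiki2vec directory, as in the steps above, so the line it writes changes into that directory first and then returns.

```shell
# Hypothetical helper (not part of wiki2vec): append a line to a bash
# startup file that sources setup.sh from the wiki2vec clone, unless
# that exact line is already present.
add_wiki2vec_setup() {
    repo_dir="$1"                    # path to your wiki2vec clone (assumption)
    bashrc="${2:-$HOME/.bashrc}"     # startup file to edit
    # Assumption: setup.sh is sourced from inside the repository directory.
    line="pushd \"$repo_dir\" > /dev/null && . setup.sh && popd > /dev/null"
    touch "$bashrc"
    # -F matches the text literally; -x requires a whole-line match.
    if ! grep -qxF -- "$line" "$bashrc"; then
        printf '%s\n' "$line" >> "$bashrc"
    fi
}

# Example usage: add_wiki2vec_setup "$HOME/wiki2vec"
```

Because the helper checks for an existing line first, running it more than once does not add duplicates.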

Downloading data

Download a Wikipedia dump for whatever date is relevant for your study:

curl -o enwiki-pages-articles.xml.bz2 [URL to dump XML file]

For example:

curl -o enwiki-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20191020/enwiki-20191020-pages-articles.xml.bz2
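To fetch a dump from a different date, the URL can be built from the date alone. The sketch below assumes that other dumps follow the same URL pattern as the example above; `dump_url` is a hypothetical helper name. Older dumps may no longer be hosted, so check that the date you need is still listed on the dumps site.

```shell
# Hypothetical helper (not part of wiki2vec): build the dump URL for a
# date given as YYYYMMDD, following the pattern of the example URL.
dump_url() {
    printf 'https://dumps.wikimedia.org/enwiki/%s/enwiki-%s-pages-articles.xml.bz2\n' "$1" "$1"
}

dump_date=20191020   # replace with the dump date relevant to your study
dump_url "$dump_date"

# To download:
# curl -o enwiki-pages-articles.xml.bz2 "$(dump_url "$dump_date")"
```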

Download the word2vec vectors provided by Google and decompress them:

curl -o GoogleNews-vectors-negative300.bin.gz https://dl.dropboxusercontent.com/s/dmwk9qwwkirboc8/GoogleNews-vectors-negative300.bin.gz
gunzip GoogleNews-vectors-negative300.bin.gz
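As a quick sanity check on the decompressed file, you can print its header. This sketch relies on the word2vec binary format starting with an ASCII line giving the vocabulary size and the vector size; `w2v_header` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of wiki2vec): print the header line of a
# word2vec binary file, which has the form "<vocabulary size> <vector size>".
w2v_header() {
    head -n 1 "$1"
}

# Example usage: w2v_header GoogleNews-vectors-negative300.bin
```

For this file, the second number should be 300, matching the vector size in the file name. A header that looks garbled suggests an incomplete download or a file that has not been decompressed.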

Using wiki2vec to measure conceptual similarity

See the tutorial for instructions on modeling the semantic similarity of a set of 60 famous people and 60 famous places.
