It is written for Python 2.7.8 and is based on YAJL and its Python bindings, yajl-py.
/data
/parser
These installation instructions are for Ubuntu 14.04 with Python 2.7.8.
- Install ruby and cmake, which are build dependencies of YAJL
sudo apt-get install ruby cmake
- Clone YAJL
git clone https://github.com/lloyd/yajl.git
- Go to the cloned repo and run
./configure && make install
- Optionally, create a virtualenv at this step so that the system Python configuration is not disturbed
- Activate the virtual environment and install yajl-py
pip install yajl-py==2.1.1
- Ensure that YAJL and yajl-py are the same version (2.1.1)
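Before running the parser, it can help to confirm that the YAJL shared library is visible to the dynamic linker, since yajl-py has to load it at runtime. The following is a minimal sketch using only the standard library; the helper name `yajl_library_name` is made up for this example and is not part of yajl-py. If it reports nothing, running `sudo ldconfig` after `make install` may be needed on Ubuntu, though that depends on where the library was installed.

```python
from ctypes.util import find_library


def yajl_library_name():
    """Return the library name the linker resolves for libyajl, or None."""
    # find_library takes the name without the "lib" prefix or file suffix.
    return find_library("yajl")


name = yajl_library_name()
if name is None:
    print("libyajl was not found on the library search path")
else:
    print("libyajl resolved as " + name)
```

This only checks that a library named yajl can be located; it does not verify the version.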
Download the sample data and store it in /data
After installing the prerequisite libraries, you can run the tweet parser with
python /parser/tweet_parser.py > /data/video_ids.txt
Instead of storing the complete URLs, the parser stores only the 11-character Video IDs. Given a Video ID, it is trivial to construct the URL by prepending http://www.youtube.com/watch?v= to it.
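As an illustration, a URL can be rebuilt from a stored ID as sketched below. This assumes the standard `watch?v=` URL form and that IDs are 11 characters drawn from letters, digits, `-` and `_`; the names `WATCH_PREFIX`, `VIDEO_ID_RE` and `video_url` are made up for this example and are not taken from the parser.

```python
import re

# Assumed URL form; the ID is placed after this prefix.
WATCH_PREFIX = "http://www.youtube.com/watch?v="

# Assumed ID shape: exactly 11 characters from A-Z, a-z, 0-9, "-" and "_".
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}\Z")


def video_url(video_id):
    """Build a watch URL from a Video ID, rejecting malformed IDs."""
    if not VIDEO_ID_RE.match(video_id):
        raise ValueError("not an 11-character video ID: %r" % (video_id,))
    return WATCH_PREFIX + video_id


print(video_url("dQw4w9WgXcQ"))
```

When reading IDs from /data/video_ids.txt, strip the trailing newline from each line before passing it to `video_url`.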
Duplicates can be removed, keeping the first occurrence of each ID and printing the result to standard output, with
awk '!seen[$0]++' /data/video_ids.txt
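The same order-preserving deduplication can be done in Python if awk is not available. This is a minimal sketch of the idea, not part of the parser; the function name `unique_lines` is made up for this example.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Small in-memory demonstration with placeholder IDs.
sample = ["aaaaaaaaaaa", "bbbbbbbbbbb", "aaaaaaaaaaa", "ccccccccccc"]
for video_id in unique_lines(sample):
    print(video_id)
```

To apply it to the real output, pass an open file object for /data/video_ids.txt instead of `sample`; like the awk command, it keeps the first occurrence and leaves the input file unchanged.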