shahkushan1/tweet_parser

High Volume JSON Parser Example

This example is written for Python 2.7.8 and is built on YAJL and its Python bindings, yajl-py.

Directory Structure

/data
/parser

Installation

These installation instructions are for Ubuntu 14.04 with Python 2.7.8.

  1. Install ruby and cmake, which YAJL needs in order to build: sudo apt-get install ruby cmake
  2. Clone YAJL: git clone https://github.com/lloyd/yajl.git
  3. Go to the cloned repo, then build and install it: ./configure && make install
  4. Optionally, create a sandboxed virtualenv at this point so that the system configuration is not disturbed.
  5. Activate the virtual environment and install yajl-py: pip install yajl-py==2.1.1
  6. Ensure that YAJL and yajl-py are both at version 2.1.1.

How to use it

Download the sample data and store it in /data.

After installing the prerequisite libraries, run the tweet parser with:

python /parser/tweet_parser.py > /data/video_ids.txt

Instead of storing the complete URLs, the parser stores only the 11-character video IDs. Given a video ID, it is trivial to reconstruct the URL by prefixing it with a YouTube base URL such as http://www.youtube.com/watch?v=.
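As a rough illustration of the ID-to-URL round trip described above, here is a minimal sketch in plain Python. It is not taken from tweet_parser.py: the function names, the regular expression, and the assumption that IDs consist of letters, digits, "-" and "_" are all illustrative.

```python
import re

# Assumption: a video ID is exactly 11 characters drawn from letters,
# digits, "-" and "_", and appears either after "v=" in a watch URL or
# directly after "youtu.be/". Other URL shapes are not handled.
VIDEO_ID_RE = re.compile(
    r"(?:youtube\.com/watch\?(?:.*&)?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def extract_video_id(url):
    """Return the 11-character video ID found in url, or None."""
    match = VIDEO_ID_RE.search(url)
    return match.group(1) if match else None


def build_video_url(video_id, base="http://www.youtube.com/watch?v="):
    """Rebuild a watch URL by prefixing the ID with a base URL."""
    return base + video_id
```

With these helpers, each line of /data/video_ids.txt could be passed to build_video_url to recover a full link.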

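Duplicates can be removed with the following command, which prints each distinct ID once, in order of first appearance, to standard output: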

awk '!seen[$0]++' /data/video_ids.txt
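For readers who prefer to stay in Python, the same order-preserving de-duplication can be sketched as below. This is an illustrative equivalent of the awk one-liner, not part of the repository; the name unique_lines is hypothetical.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    import sys

    # Reads IDs from standard input and writes the distinct ones to
    # standard output, mirroring what the awk command prints.
    for line in unique_lines(sys.stdin):
        sys.stdout.write(line)
```

Like the awk version, this keeps every distinct line in memory, so memory use grows with the number of unique IDs rather than with the total input size.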
