Skip to content

SabreLabs/gbta

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

GBTA2014

Script order

  1. Replace Windows line termination from Raw Redis output with Unix EOL
$ awk -f scripts/win2unixeol.awk /path/to/raw/redis.dump > /path/to/unix'd/eol/output.dump
  1. Strip out all non-JSON valid lines (we don't try to fix partial transciption errors)
$ ruby scripts/strip_blanks.rb /path/to/unix'd/eol.output /path/to/valid/tweet/per/line.txt
  1. Create valid Tre-View capable JSON
$ ruby scripts/tre_view_raw_tweets.rb /path/to/valid/tweet/per/line.output /path/to/tre_viewed/tweets.json
  1. (Optional, tar it up)
$ tar -jvcf [search_term].tar.bz2 /path/to/valid/tweet/per/line.txt /path/to/tre_viewed/tweets.json

Once step #2 is completed, the data is technically ready for a full batch load - each line is a valid Tweet encoded via JSON. Step 3 is simply for tre-viewed purposes, and is completely optional. Step 4 is also optional, but recommended do to overall size.

About

A small handful of scripts for parsing raw twitter data related to GBTA2014

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published