Materials from a three-hour lecture I gave to UCLA Statistics 404 (Professional Master's in Applied Statistics) on the Unix command line, Python 🐍, and PyData.
If you wish to work with the raw Reddit comments data, you can download them on a monthly basis from here.
As Reddit's usage grows, the files grow with it. As of February 2017, a single month's compressed .bz2 file is upwards of 8 GB.
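For example, a single month can be fetched with wget. This is only a sketch: the URL below is a placeholder, so substitute the actual address and file name from the archive linked above.

# hypothetical URL; monthly comment dumps are typically named RC_YYYY-MM.bz2
wget https://example.com/reddit/comments/RC_2017-02.bz2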
Once you've downloaded a dump file, you can decompress it with
bunzip2 file.bz2
This takes a while; as of February 2017, each month uncompresses to a single file of roughly 35 GB.
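If you just want to peek at the data before committing to a full decompression, one option (a small sketch, not part of the lecture slides) is to stream the archive with bzcat and stop after a few lines:

# print the first 5 comments without decompressing the whole archive
bzcat file.bz2 | head -n 5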
You can concatenate multiple months into one file if you wish, using the cat command.
cat file_1 file_2 > mybigfile
Then you can use a tool like jq to parse the JSON without reading it into Python first.
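Each line of a dump is a single comment stored as a JSON object, so jq can convert it to a TSV directly. A minimal sketch (the input file name and the choice of fields are just examples):

# pull author, subreddit, and score for each comment into tab-separated columns
jq -r '[.author, .subreddit, .score] | @tsv' mybigfile > comments.tsv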
To follow along with the lecture, you don't need to download any raw data.
The file political_comments_sorted.tsv is the file created on slide 60; it is sorted by username. It can also be used for all previous slides, but due to space and bandwidth constraints I only provide this one version.
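If you want to reproduce the sorting step yourself, a hedged sketch with GNU sort follows; it assumes the unsorted file is named political_comments.tsv and that the username is the first tab-separated column, neither of which is confirmed above.

# sort by the first (username) column, using tab as the field separator
sort -t$'\t' -k1,1 political_comments.tsv > political_comments_sorted.tsv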
The files goodusers.dat and usernames.dat are intermediate files created on slides 55 and 50, respectively, in case you wish to skip the code on those slides.
The final dataset that does not contain any bots is in the file political_comments_clean.tsv.