mklosi/word-count

Welcome to WordCount, a "Hello World" for big data!

Overview

The goal of this project is to produce a list of word counts for the words that appear in the bodies of the Enron emails, ignoring the headers. Only words that appear 10 or more times are included. The list is sorted by count in descending order, and the output is a single CSV file.
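
The counting, filtering, and sorting described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the Spark job's actual code: the function names `count_words` and `to_csv` are hypothetical, splitting on whitespace is an assumed tokenization, and the alphabetical tie-break is an assumption added to make the order deterministic.

```python
import csv
import io
from collections import Counter

MIN_COUNT = 10  # threshold stated in the overview


def count_words(bodies, min_count=MIN_COUNT):
    """Return (word, count) pairs with count >= min_count, highest count first."""
    counts = Counter()
    for body in bodies:
        # Assumption: a "word" is any whitespace-separated token, left unmodified.
        counts.update(body.split())
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by count descending; ties are broken alphabetically (an assumption).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


def to_csv(rows):
    """Render the rows as CSV text, one 'word,count' line per row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(count_words(bodies))` on a list of email bodies yields the kind of single CSV output the job is meant to produce.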

How to run with Spark on a local machine
  • Download the Enron email dataset from here
  • Uncompress the dataset to a local location. This will be your $input location.
  • If you don't have sbt and/or Spark, install them now using Homebrew. In your terminal, enter
    • brew install sbt to install sbt
    • brew install apache-spark to install Spark
  • Clone this repo to your local machine by running git clone https://github.com/mklosi/word-count.git
  • cd into the cloned project directory and build the project, along with its fat jar, using sbt assembly
  • When the build is done, run spark-submit --master local[*] --driver-memory 2g --class WordCountJob <$projectdir>/target/scala-2.11/word-count-assembly-0.1.jar --input-path "<$input path>" --output-path "<$output path>"
Assumptions
  • Each file represents a single email message.
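
Under this assumption, reading the dataset amounts to walking the input directory and treating every file as one message. A minimal Python sketch of that idea follows; the helper name `read_messages` is hypothetical, and the lenient UTF-8 decoding is an assumption, since the dataset's encoding is not stated here.

```python
from pathlib import Path


def read_messages(input_dir):
    """Yield (path, text) for every regular file under input_dir, one message per file."""
    for path in sorted(Path(input_dir).rglob("*")):
        if path.is_file():
            # Assumption: decode leniently so that odd bytes do not abort the run.
            yield path, path.read_text(encoding="utf-8", errors="replace")
```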
Issues
  • From the limited number of example emails I inspected, the headers for each email appear to be largely consistent: 15 lines of headers, followed by one empty line, followed by the email body. Some emails have missing headers, but that does not affect the word counts generated by this job.
  • In many cases, headers are mixed in with email bodies, for example when an email body quotes previous emails in a thread. This job does not handle those cases.
  • This job also does not stem words or remove stop words (or any characters). This could be handled in future releases.
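
One way to separate the body from the headers, given the layout described above, is to cut at the first empty line rather than after a fixed number of lines, so a message with fewer header lines is still handled. The Python sketch below illustrates that approach; it is an assumption about how such a split could work, not a description of the job's code, and `extract_body` is a hypothetical name. Like the job, it does not detect headers quoted inside the body.

```python
def extract_body(message):
    """Return the text after the first empty line, or the whole message if there is none."""
    lines = message.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "":
            # Everything after the first blank line is treated as the body.
            return "\n".join(lines[i + 1:])
    # Assumption: with no blank separator, treat the entire message as body text.
    return message
```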
