The goal of this project is to produce a list of word counts for the words that appear in the bodies of the Enron emails (ignoring the headers). Only words that appear 10 or more times are included, and the list is sorted by count in descending order. Output is a single CSV file.
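The core of such a job can be sketched in a few Spark transformations. This is a minimal illustration, not the job's actual code: the object name, paths, and whitespace tokenization are assumptions made here for the example.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the word-count pipeline described above.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    val sc = spark.sparkContext

    // wholeTextFiles yields (filePath, fileContent) pairs, one per email file.
    val emails = sc.wholeTextFiles("path/to/input") // placeholder path

    val counts = emails
      .map { case (_, content) =>
        // The body starts after the first blank line that ends the headers.
        val idx = content.indexOf("\n\n")
        if (idx >= 0) content.substring(idx + 2) else content
      }
      .flatMap(_.split("\\s+"))        // naive whitespace tokenization
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, n) => n >= 10 }               // keep words seen 10+ times
      .sortBy({ case (_, n) => n }, ascending = false) // count descending

    counts.map { case (w, n) => s"$w,$n" }.saveAsTextFile("path/to/output")
    spark.stop()
  }
}
```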
- Download the Enron email dataset from here
- Uncompress the dataset to a local location. This will be your $input location.
- If you don't have sbt and/or Spark, install them with Homebrew. In your terminal, run

```
brew install sbt
```

to install sbt, and

```
brew install apache-spark
```

to install Spark.
- Clone this repo to your local machine by running

```
git clone https://github.com/mklosi/word-count.git
```
- cd into the cloned project dir and build the project, along with its fat jar, using

```
sbt assembly
```
- When done, run

```
spark-submit --master local[*] --driver-memory 2g --class WordCountJob <$projectdir>/target/scala-2.11/word-count-assembly-0.1.jar --input-path "<$input path>" --output-path "<$output path>"
```
- Each file represents a single email message.
- From the limited number of example emails I inspected, the headers appear largely consistent: 15 lines of headers, followed by one empty line, followed by the email body. Some emails have missing headers, but that does not affect the word counts generated by this job.
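The "headers, then one blank line, then body" layout above suggests splitting on the first blank line rather than counting a fixed 15 lines, so emails with missing headers still parse. A small sketch (the function name is illustrative, not from the job's code):

```scala
// Split an email into its body, assuming the body begins after the
// first blank line. Emails with fewer (or no) headers still split
// at the first blank line; emails with no blank line are returned whole.
def extractBody(raw: String): String = {
  val lines = raw.split("\n", -1)
  val blank = lines.indexWhere(_.trim.isEmpty)
  if (blank >= 0) lines.drop(blank + 1).mkString("\n") else raw
}
```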
- In many cases, headers are mixed into the email bodies, for example when an email body quotes previous emails in a thread. This job does not handle those cases.
- This job also does not stem words, remove stop words, or strip any characters. This could be handled in future releases.
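If stop-word removal were added in a future release, it could slot into the pipeline as one extra filter before counting. A hypothetical sketch (the stop-word list here is a tiny illustrative sample, not a real one):

```scala
// Hypothetical future enhancement: drop common stop words before counting.
val stopWords = Set("the", "a", "an", "and", "or", "to", "of") // illustrative only
def removeStopWords(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(w => stopWords.contains(w.toLowerCase))
```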