The goal of this project is to produce a list of word counts for the words that appear in the bodies of the Enron emails (ignoring the headers). Only words that appear 10 or more times are included, and the list is sorted by count in descending order. Output is a single CSV file.
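The core of such a job can be sketched in a few Spark transformations. This is a minimal illustration, not the job's actual code: the object name, paths, and whitespace tokenization are assumptions made here for the example.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the word-count pipeline described above.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    val sc = spark.sparkContext

    // wholeTextFiles yields (filePath, fileContent) pairs, one per email file.
    val emails = sc.wholeTextFiles("path/to/input") // placeholder path

    val counts = emails
      .map { case (_, content) =>
        // The body starts after the first blank line that ends the headers.
        val idx = content.indexOf("\n\n")
        if (idx >= 0) content.substring(idx + 2) else content
      }
      .flatMap(_.split("\\s+"))        // naive whitespace tokenization
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, n) => n >= 10 }               // keep words seen 10+ times
      .sortBy({ case (_, n) => n }, ascending = false) // count descending

    counts.map { case (w, n) => s"$w,$n" }.saveAsTextFile("path/to/output")
    spark.stop()
  }
}
```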
- Download the Enron email dataset from here
- Uncompress the dataset to a local location. This will be your $input location.
- If you don't have sbt and/or Spark, install them with Homebrew. In your terminal, run

```
brew install sbt
```

to install sbt, and

```
brew install apache-spark
```

to install Spark.
- Clone this repo to your local machine by running

```
git clone https://github.com/mklosi/word-count.git
```
- cd into the cloned project dir and build the project, along with its fat jar, using

```
sbt assembly
```
- When done, run

```
spark-submit --master local[*] --driver-memory 2g --class WordCountJob <$projectdir>/target/scala-2.11/word-count-assembly-0.1.jar --input-path "<$input path>" --output-path "<$output path>"
```
- Each file represents a single email message.
- From the limited number of example emails I inspected, the headers appear largely consistent: 15 lines of headers, followed by one empty line, followed by the email body. Some emails have missing headers, but that does not affect the word counts generated by this job.
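The "headers, then one blank line, then body" layout above suggests splitting on the first blank line rather than counting a fixed 15 lines, so emails with missing headers still parse. A small sketch (the function name is illustrative, not from the job's code):

```scala
// Split an email into its body, assuming the body begins after the
// first blank line. Emails with fewer (or no) headers still split
// at the first blank line; emails with no blank line are returned whole.
def extractBody(raw: String): String = {
  val lines = raw.split("\n", -1)
  val blank = lines.indexWhere(_.trim.isEmpty)
  if (blank >= 0) lines.drop(blank + 1).mkString("\n") else raw
}
```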
- In many cases, headers are mixed into the email bodies, for example when an email body quotes previous emails in a thread. This job does not handle those cases.
- This job also does not stem words, remove stop words, or strip any characters. This could be handled in future releases.
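If stop-word removal were added in a future release, it could slot into the pipeline as one extra filter before counting. A hypothetical sketch (the stop-word list here is a tiny illustrative sample, not a real one):

```scala
// Hypothetical future enhancement: drop common stop words before counting.
val stopWords = Set("the", "a", "an", "and", "or", "to", "of") // illustrative only
def removeStopWords(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(w => stopWords.contains(w.toLowerCase))
```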