stat-search

Introduction

This project uses Apache Zeppelin and its built-in Spark to analyze the CSV file from getstat.com.

Approach

Spark is the de facto solution for analytics on large data sets in a distributed computing environment. In this project we process a tiny sample of data using a local (single-instance) deployment of Spark, which means that it runs on a single machine with all the data available locally. Spark can also be deployed and configured to run on a cluster, in which case it automatically launches executors that operate on data local to each compute instance. We have every reason to believe that even a modestly sized cluster will be able to process billions of rows of data every day.
Spark is used to process the CSV file directly, without transformation. Once loaded, the data can be saved as a Parquet file. Reading the CSV straight from file works well, Parquet is better, and memory is best. In this project, we cache the data in Spark's memory.
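The load, save, and cache steps described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the object name `LoadStatCsv`, its method names, and the two path arguments are hypothetical, and it assumes the CSV reader built into Spark 2.x (`spark.read.csv`) and a file with a header row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper; names and paths are illustrative, not taken from the notebook.
object LoadStatCsv {
  // Read a CSV file that has a header row, letting Spark infer the column types.
  def loadCsv(spark: SparkSession, csvPath: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvPath)

  // Write the DataFrame as Parquet (replacing any earlier copy) and read it back.
  def toParquet(spark: SparkSession, df: DataFrame, parquetPath: String): DataFrame = {
    df.write.mode("overwrite").parquet(parquetPath)
    spark.read.parquet(parquetPath)
  }

  def main(args: Array[String]): Unit = {
    // Local, single-machine Spark, matching the deployment described above.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stat-search-load")
      .getOrCreate()

    val raw = loadCsv(spark, args(0))                  // args(0): path to the CSV file
    val stats = toParquet(spark, raw, args(1)).cache() // args(1): Parquet output path

    // cache() is lazy; the first action (count) populates the in-memory copy.
    println(s"rows: ${stats.count()}")
    spark.stop()
  }
}
```

Inside a Zeppelin paragraph the `spark` session already exists, so only the bodies of the two helper methods would be needed. Note that Spark's Parquet writer rejects column names containing spaces and some punctuation, so CSV headers may need renaming before the save step.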
Spark is very powerful and offers a rich set of functionality through Python or Scala. The language of choice here is Scala, although the solution code is quite simple and mainly uses operations that could also have been implemented in SQL. Far more powerful operations, such as map-reduce, can be used to solve more complex problems such as word counting.
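For comparison with the SQL-like operations, here is a minimal sketch of the word-counting case using Spark's RDD API. It is not part of the notebook; the object name `WordCount`, the method `countWords`, and the sample sentences are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical example; not taken from the project's notebook.
object WordCount {
  // Map step: split each line into lower-cased words and pair each word with 1.
  // Reduce step: sum the 1s for every distinct word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word-count")
      .getOrCreate()

    // Made-up input; a real job would read lines from a file instead.
    val lines = spark.sparkContext.parallelize(
      Seq("spark runs locally", "spark runs on a cluster"))

    // Bring the (small) result to the driver and print it, most frequent first.
    countWords(lines)
      .collect()
      .sortBy { case (_, n) => -n }
      .foreach { case (word, n) => println(s"$word\t$n") }

    spark.stop()
  }
}
```

On a cluster the same code runs unchanged: the map step executes on each partition where the data lives, and only the per-word partial sums are shuffled for the reduce step.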

Installation

It is recommended that these steps are followed in a Linux environment such as Ubuntu or Red Hat.
You will need to download Zeppelin. Use version 0.7.1. This comes with a built-in version of Spark v2.1.0 which is what we shall use here. (It is easy to install another version of Spark and point Zeppelin to that too.)
First, copy the zeppelin-env.sh file provided in this project into Zeppelin's conf folder. This configures Zeppelin to run on port 8098 and specifies a dependency on the spark-csv package, which provides support for reading CSV files.
Start Zeppelin by entering: ./zeppelin-0.7.1-bin-all/bin/zeppelin-daemon.sh start
Wait for it to start, then browse to localhost:8098.
Import the getstat.json notebook.
Modify the "Load Data into Spark" paragraph so that the file path points to the getstat CSV file.
Since this notebook includes the results of the previous run, you don't need to do anything else to see the output. However, if you wish to experiment with the notebook, for example to modify paragraphs or to see the notebook running, then follow the next step:
While it is possible to run the whole notebook from the "triangle" icon near the top, I find it easier to click the "triangle" for each paragraph individually. It can be found to the right of each paragraph.

Notes

The notebook takes advantage of Zeppelin's built-in visualization mechanisms, which are interactive.
Zeppelin is a data scientist's tool. Its purpose is to explore and process data, typically in an interactive manner. With a little effort, the Scala code here could be rewritten as a class whose methods provide specific behaviours or functions against the data. The intent would be to package the functionality into a JAR and incorporate it into a production pipeline.
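As a rough sketch of what such a class might look like, the example below wraps a DataFrame and exposes two simple aggregations. Everything in it is hypothetical: the class name `StatAnalysis`, its methods, and the column names passed in by the caller are illustrative and do not reflect the notebook's actual paragraphs or the getstat CSV schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical wrapper class; names and behaviour are illustrative only.
class StatAnalysis(stats: DataFrame) {
  // Number of rows for each distinct value of the given column, largest first.
  def rowsPer(column: String): DataFrame =
    stats
      .groupBy(col(column))
      .count()
      .withColumnRenamed("count", "rows")
      .orderBy(col("rows").desc)

  // Mean of a numeric column within each group of another column.
  def averageBy(groupColumn: String, valueColumn: String): DataFrame =
    stats
      .groupBy(col(groupColumn))
      .agg(avg(col(valueColumn)).as(s"avg_$valueColumn"))
}
```

A pipeline job could then build a `StatAnalysis` from the loaded data and call only the methods it needs, for example `new StatAnalysis(df).rowsPer("someColumn")`, where `someColumn` stands for whichever column is of interest. Keeping the column names as parameters avoids hard-coding a schema into the JAR.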

Sample Screenshot

Here is a sample output from the notebook:
