bigdata-example-project

This project uses KMeans clustering with the Euclidean distance measure to group similar data points into 8 clusters, and then reports the sum of squared errors (SSE) of the resulting clusters.
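
To illustrate the SSE reported by this project, here is a minimal plain-Python sketch. It is not the project's Scala/MLlib code. It assumes that SSE means the sum, over all points, of the squared Euclidean distance to the nearest cluster center. The function names and the toy data are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sum_squared_error(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    total = 0.0
    for p in points:
        nearest = min(euclidean(p, c) for c in centroids)
        total += nearest ** 2
    return total

# Hypothetical toy data: two groups of 2-D points and their centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at distance 1 from its nearest centroid.
print(sum_squared_error(points, centroids))  # 4.0
```

A lower SSE means points sit closer to their assigned centers. Because KMeans starts from randomly chosen centers, the value can differ from run to run.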

The objective is to run the analysis algorithm on an OpenStack cloud by automating the major steps with Ansible. The Ansible scripts create the VMs, set up the Hadoop cluster, install the required software, retrieve the dataset and upload it to HDFS, and copy the analysis code to the master node of the Hadoop cluster. We then log in to the master node, run the analysis code on the data in HDFS, retrieve the results, and show the output of the algorithm.

Results: When the KMeans algorithm was run for 30 iterations on 13,700+ records with 8 clusters, the resulting sum of squared errors (SSE) was around 6300 ± 500. This range comes from running the project from scratch multiple times.

Implementation: The entry point for this project is launch.sh, located in /src. The /src/twitter/ directory contains the main source code:

site.yml
|--software.yml  // install the necessary software on the VM
|--dataset.yml   // retrieve the dataset and upload it to HDFS
|--analysis.yml  // copy the analysis code-base

These Ansible scripts install the necessary software on the VM, retrieve the dataset and upload it to HDFS, and copy the following analysis code base to the master node:

main.sh
|--twitter.sbt
|--kmeans.demo.scala

To learn how to run this project, refer to the installation.rst file. To see a sample video demo of this project, click at -

References:

Academic learnings from CSCI.I590.Topics In Informatics: Projects On Big Data Software by Professor Geoffrey Charles Fox
The sample dataset of Emotion Vectors for tweets was obtained from my previous work
The KMeans usage is based on MLlib KMeans
