bigdata-example-project

This project uses KMeans clustering with the Euclidean distance measure to group similar data points into 8 clusters, and then reports the sum squared error (SSE) of the resulting clusters.
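To make the approach concrete, here is a minimal sketch of KMeans with Euclidean distance and an SSE report. It is written in plain Python for illustration only and is not the project's code (the project's analysis lives in kmeans.demo.scala). The function names `kmeans` and `sum_squared_error`, the `seed` parameter, and the choice of random initial centers are assumptions made for this sketch.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=8, iterations=30, seed=0):
    """Plain Lloyd-style KMeans; returns the final list of cluster centers.

    Assumption: initial centers are k distinct points picked at random.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers


def sum_squared_error(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_euclidean(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    # Tiny made-up dataset, used only to exercise the functions.
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    found = kmeans(data, k=2, iterations=30, seed=1)
    print(found, sum_squared_error(data, found))
```

The defaults `k=8` and `iterations=30` mirror the settings described in this README; everything else is illustrative.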

The objective is to run the analysis algorithm on an OpenStack cloud by automating the major steps with Ansible. Ansible scripts create the VMs, set up the Hadoop cluster, install the required software, retrieve the dataset and upload it to HDFS, and copy the analysis code to the master node of the Hadoop cluster. The remaining steps are to log in to the master node, run the analysis code on the data in HDFS, retrieve the results, and show the output of the algorithm.

youtube1

Results: When the KMeans algorithm was run for 30 iterations on 13,700+ records with 8 clusters, the resulting sum squared error (SSE) was around 6300 ± 500. We ran the project from scratch multiple times.
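For reference, the SSE of a clustering is conventionally defined as follows, where the symbols $C_j$ (the set of points assigned to cluster $j$) and $\mu_j$ (the center of cluster $j$) are notation introduced here rather than taken from the project, and $k = 8$ in this setup:

$$
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

KMeans converges to a local minimum that depends on the initial centers, so the SSE can differ between runs, which is consistent with the spread reported above.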

Implementation: The entry point for this project is launch.sh, located in /src. The /src/twitter/ directory contains the main source code:

site.yml
|--software.yml  // install necessary software on the VM
|--dataset.yml   // retrieve the dataset and upload it to HDFS
|--analysis.yml  // copy the analysis code-base

These scripts install the necessary software on the VM, retrieve the dataset and upload it to HDFS, and copy the following analysis code-base:

main.sh
|--twitter.sbt
|--kmeans.demo.scala

to the master node.

To learn how to run this project, refer to the installation.rst file. To see a sample video demo of this project, click the link below:

youtube2

References: