The aim of the k-means clustering algorithm is to take a given dataset and partition its observations into k clusters. Deciding where these clusters lie is computationally difficult, but each cluster's mean should be representative of the data points assigned to it. K-means is an unsupervised learning method, and the hope is to gain insight into data whose structure is not apparent up front.
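Formally, given observations $x_1, \dots, x_n$, k-means seeks centers $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:

$$\min_{\mu_1, \dots, \mu_k} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2$$

Each point is assigned to its nearest center, and finding the exact optimum of this objective is what makes the problem computationally difficult.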
- Given that the k-means clustering problem is NP-hard and computationally intensive, a distributed approach seems appropriate. By dividing the computational load among many worker nodes, the program can be parallelized and sped up.
- The root node reads in the entire dataset and randomly chooses the k initial cluster means.
- The root node broadcasts the k means to each of the n workers.
- The root node distributes the dataset evenly among the n workers.
- Given its portion of the dataset, each worker assigns each of its points to the nearest of the k means and returns its partial results to the root, which updates the means.
- The algorithm repeats these steps for a fixed number of iterations (a minimal MPI sketch of this loop follows the list).
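The steps above map naturally onto MPI collectives (MPI_Bcast, MPI_Scatter, MPI_Reduce). The following is a minimal sketch of that structure, not the project's actual k-means.c: the 2-D points, dataset size, synthetic data generated at the root (instead of reading from disk), and fixed iteration count are all assumptions made for illustration.

```c
/* Minimal sketch of the loop described above -- NOT the repository's k-means.c.
 * Assumes 2-D points, synthetic data at the root, and a fixed iteration count. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define K     3      /* number of clusters                              */
#define DIM   2      /* dimensionality of each point                    */
#define N     900    /* total points (choose divisible by proc count)   */
#define ITERS 10     /* fixed number of iterations                      */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int local_n = N / nprocs;                 /* points per process */
    double *data = NULL;
    double means[K * DIM];

    if (rank == 0) {
        /* Root "reads" the dataset (here: random points) and picks initial means. */
        data = malloc(N * DIM * sizeof(double));
        for (int i = 0; i < N * DIM; i++) data[i] = rand() / (double)RAND_MAX;
        for (int c = 0; c < K * DIM; c++)  means[c] = rand() / (double)RAND_MAX;
    }

    /* Root distributes an equal slice of the dataset to every process. */
    double *local = malloc(local_n * DIM * sizeof(double));
    MPI_Scatter(data, local_n * DIM, MPI_DOUBLE,
                local, local_n * DIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int it = 0; it < ITERS; it++) {
        /* Root broadcasts the current k means to every worker. */
        MPI_Bcast(means, K * DIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Each worker assigns its points to the nearest mean and accumulates partial sums. */
        double sums[K * DIM] = {0};
        int    counts[K]     = {0};
        for (int i = 0; i < local_n; i++) {
            int best = 0;
            double best_d = 1e300;
            for (int c = 0; c < K; c++) {
                double d = 0;
                for (int j = 0; j < DIM; j++) {
                    double diff = local[i * DIM + j] - means[c * DIM + j];
                    d += diff * diff;
                }
                if (d < best_d) { best_d = d; best = c; }
            }
            counts[best]++;
            for (int j = 0; j < DIM; j++) sums[best * DIM + j] += local[i * DIM + j];
        }

        /* Workers return partial results to the root, which recomputes the means. */
        double all_sums[K * DIM];
        int    all_counts[K];
        MPI_Reduce(sums,   all_sums,   K * DIM, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Reduce(counts, all_counts, K,       MPI_INT,    MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (int c = 0; c < K; c++)
                if (all_counts[c] > 0)
                    for (int j = 0; j < DIM; j++)
                        means[c * DIM + j] = all_sums[c * DIM + j] / all_counts[c];
    }

    if (rank == 0) {
        printf("final means:\n");
        for (int c = 0; c < K; c++)
            printf("  (%.3f, %.3f)\n", means[c * DIM], means[c * DIM + 1]);
        free(data);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
```

A sketch like this is compiled and launched the same way as the real program (e.g. `mpicc sketch.c -o sketch && mpirun -n 3 ./sketch`); each process handles N / nprocs points, which assumes N is divisible by the process count.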
- Docker Engine
- Docker base image of Alpine Linux - Credit to NLKNguyen for taking the time to create this image
- MPICH - included in Docker base image
- Install Docker
- Make sure Docker is running:
$ sudo service docker start
- Download NLKNguyen's prebuilt Docker base image:
$ docker pull nlknguyen/alpine-mpich
$ docker pull nlknguyen/alpine-mpich:onbuild
- While in the directory containing the source code, run the following to launch the image:
$ docker run --rm -it -v $(pwd):/project nlknguyen/alpine-mpich
- With the container running, compile the source code:
$ mpicc k-means.c -o k-means
- Now run the program with the desired number of processes, passing that same number as a command-line argument right after the program name:
$ mpirun -n 3 ./k-means 3
- Creating too many worker nodes in the hope of further speeding up parallelization can actually impede progress. There must be a balance between the workload distributed to each worker and the number of workers. If there are more workers than necessary, much of the effort is spent distributing the work (message passing) rather than computing the actual workload. Thus one of the defining characteristics of how well a distributed program performs is its message complexity: how many messages are sent back and forth across the network.
- Further optimize the algorithm by reducing message complexity, e.g. by leaning on collective operations (see the sketch below).
- Deploy on an Amazon AWS cluster and test.
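One possible way to reduce message complexity, assuming the iteration loop looks like the Bcast/Reduce pattern sketched earlier, is to replace the per-iteration Reduce-to-root followed by a broadcast of the updated means with a single MPI_Allreduce per buffer, so every process recomputes the means locally. This is a hypothetical refactor of that sketch, not something the current code is known to do:

```c
/* Hypothetical replacement for the end of the iteration loop in the sketch
 * above: MPI_Allreduce gives every rank the global sums and counts, so each
 * rank updates `means` itself and the separate broadcast of new means goes away. */
double all_sums[K * DIM];
int    all_counts[K];
MPI_Allreduce(sums,   all_sums,   K * DIM, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(counts, all_counts, K,       MPI_INT,    MPI_SUM, MPI_COMM_WORLD);
for (int c = 0; c < K; c++)
    if (all_counts[c] > 0)
        for (int j = 0; j < DIM; j++)
            means[c * DIM + j] = all_sums[c * DIM + j] / all_counts[c];
```

This drops the separate broadcast of updated means, leaving the library's (typically tree-based) reductions as the only per-iteration communication.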