k-Means Clustering: A Distributed MPI Implementation

The aim of the k-means clustering algorithm is to partition the observations in a given dataset into k clusters. Deciding where these clusters lie is computationally difficult, but each cluster should be representative of all the data points surrounding it. k-means is an unsupervised learning method that aims to uncover structure in data that may at first appear unrelated.
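
Formally, this is the standard k-means objective: find clusters C_1, ..., C_k with means μ_1, ..., μ_k that minimize the within-cluster sum of squared distances

    J = Σ_{i=1..k} Σ_{x ∈ C_i} ||x - μ_i||²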

Distributed Functionalities

  • Given that k-means clustering is an NP-hard, computationally intensive problem, a distributed approach is appropriate. By dividing the computational load among many worker nodes, the program can be parallelized and sped up.
  • The root node reads in the entire dataset and randomly chooses the initial k means, one per cluster.
  • The root node broadcasts the k means to all n workers.
  • The root node distributes the dataset evenly among the n workers.
  • Given its portion of the dataset, each worker computes its points' assignments against the k cluster means and returns its partial results to the root.
  • The algorithm repeats for X iterations. A minimal sketch of one iteration follows this list.
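
Below is a minimal sketch in C of one such iteration, assuming one-dimensional points, toy constants K and N (with N divisible by the number of processes), and the collectives MPI_Bcast, MPI_Scatter, and MPI_Reduce. It is illustrative only and not the repository's actual k-means.c.

/* Minimal sketch of one distributed k-means iteration over 1-D points.
 * Illustrative only; the repository's k-means.c may differ.
 * Compile: mpicc sketch.c -o sketch   Run: mpirun -n 4 ./sketch */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

#define K 3    /* number of clusters (assumed) */
#define N 12   /* total points; assumed divisible by the process count */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double data[N], means[K];
    if (rank == 0) {
        for (int i = 0; i < N; i++) data[i] = (double)(i * i % 17); /* toy dataset */
        for (int j = 0; j < K; j++) means[j] = data[j];             /* initial means */
    }

    /* Root broadcasts the current k means to every worker. */
    MPI_Bcast(means, K, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Root distributes an equal share of the dataset to each worker. */
    int chunk = N / size;
    double *local = malloc(chunk * sizeof *local);
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* Each worker assigns its points to the nearest mean and
     * accumulates per-cluster sums and counts. */
    double sums[K] = {0};
    int counts[K] = {0};
    for (int i = 0; i < chunk; i++) {
        int best = 0;
        double bestd = DBL_MAX;
        for (int j = 0; j < K; j++) {
            double d = (local[i] - means[j]) * (local[i] - means[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        sums[best] += local[i];
        counts[best]++;
    }

    /* Workers return partial results; root combines them and updates the means. */
    double gsums[K];
    int gcounts[K];
    MPI_Reduce(sums, gsums, K, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(counts, gcounts, K, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int j = 0; j < K; j++) {
            if (gcounts[j] > 0) means[j] = gsums[j] / gcounts[j];
            printf("mean[%d] = %f\n", j, means[j]);
        }

    free(local);
    MPI_Finalize();
    return 0;
}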

Frameworks, modules, libraries, etc.:

  • MPI (MPICH, provided by the nlknguyen/alpine-mpich Docker image)
  • Docker
  • C (compiled with mpicc)

Deployment and running the program:

  1. Install Docker.
  2. Make sure Docker is running:
$ sudo service docker start
  3. Download NLKNguyen's prebuilt Docker base image:
$ docker pull nlknguyen/alpine-mpich
$ docker pull nlknguyen/alpine-mpich:onbuild
  4. While in the directory containing the source code, run the following to launch the image:
$ docker run --rm -it -v $(pwd):/project nlknguyen/alpine-mpich
  5. With the Docker image running, compile the source code:
$ mpicc k-means.c -o k-means
  6. Run the program with the desired number of processes. Include an argument on the command line right after the program name to match the number of processes:
$ mpirun -n 3 ./k-means 3

Current Limitations

  • Creating too many worker nodes in hopes of further speeding up the computation can actually impede it. There must be a balance between the workload distributed to each worker and the number of workers. If there are more workers than necessary, much of the effort is spent distributing the work (message passing) rather than computing the actual workload. Thus a defining characteristic of how well the distributed program runs is its message complexity: how many messages are sent back and forth across the network.
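
As a hedged illustration of reducing message rounds (an optimization idea, not something the repository necessarily implements), the reduce-then-rebroadcast pattern in the sketch above could be collapsed into a single MPI_Allreduce per buffer, so every rank obtains the global totals and updates the means locally without a separate broadcast at the start of the next iteration:

/* Drop-in replacement for the two MPI_Reduce calls in the sketch above.
 * MPI_IN_PLACE makes every rank both contribute and receive the totals,
 * so the updated means need not be re-broadcast on the next iteration. */
MPI_Allreduce(MPI_IN_PLACE, sums, K, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(MPI_IN_PLACE, counts, K, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
for (int j = 0; j < K; j++)
    if (counts[j] > 0) means[j] = sums[j] / counts[j];  /* every rank updates */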

Future Goals

  • Further optimize the algorithm by reducing message complexity
  • Deploy on an Amazon AWS cluster and test
