The aim of the k-means clustering algorithm is to take a given dataset and partition its observations into k clusters. Deciding where these clusters lie is computationally difficult, but each cluster's mean should be representative of the data points assigned to it. K-means is an unsupervised learning method, and the hope is to gain insight into data whose structure is not apparent up front.
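Formally, given observations $x_1, \dots, x_n$, k-means seeks centers $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:

$$\min_{\mu_1, \dots, \mu_k} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2$$

Each point is assigned to its nearest center, and finding the exact optimum of this objective is what makes the problem computationally difficult.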
- Given that the k-means clustering problem is NP-hard and computationally intensive, a distributed approach seems appropriate. By dividing the computational load among many worker nodes, the program can be parallelized and sped up.
- The root node reads in the entire dataset and randomly chooses the k initial cluster means.
- The root node broadcasts the k means to each of the n workers.
- The root node distributes the dataset evenly among the n workers.
- Given its portion of the dataset, each worker assigns each of its points to the nearest of the k means and returns its partial results to the root, which updates the means.
- The algorithm repeats these steps for a fixed number of iterations (a minimal MPI sketch of this loop follows the list).
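The steps above map naturally onto MPI collectives (MPI_Bcast, MPI_Scatter, MPI_Reduce). The following is a minimal sketch of that structure, not the project's actual k-means.c: the 2-D points, dataset size, synthetic data generated at the root (instead of reading from disk), and fixed iteration count are all assumptions made for illustration.

```c
/* Minimal sketch of the loop described above -- NOT the repository's k-means.c.
 * Assumes 2-D points, synthetic data at the root, and a fixed iteration count. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define K     3      /* number of clusters                              */
#define DIM   2      /* dimensionality of each point                    */
#define N     900    /* total points (choose divisible by proc count)   */
#define ITERS 10     /* fixed number of iterations                      */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int local_n = N / nprocs;                 /* points per process */
    double *data = NULL;
    double means[K * DIM];

    if (rank == 0) {
        /* Root "reads" the dataset (here: random points) and picks initial means. */
        data = malloc(N * DIM * sizeof(double));
        for (int i = 0; i < N * DIM; i++) data[i] = rand() / (double)RAND_MAX;
        for (int c = 0; c < K * DIM; c++)  means[c] = rand() / (double)RAND_MAX;
    }

    /* Root distributes an equal slice of the dataset to every process. */
    double *local = malloc(local_n * DIM * sizeof(double));
    MPI_Scatter(data, local_n * DIM, MPI_DOUBLE,
                local, local_n * DIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int it = 0; it < ITERS; it++) {
        /* Root broadcasts the current k means to every worker. */
        MPI_Bcast(means, K * DIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Each worker assigns its points to the nearest mean and accumulates partial sums. */
        double sums[K * DIM] = {0};
        int    counts[K]     = {0};
        for (int i = 0; i < local_n; i++) {
            int best = 0;
            double best_d = 1e300;
            for (int c = 0; c < K; c++) {
                double d = 0;
                for (int j = 0; j < DIM; j++) {
                    double diff = local[i * DIM + j] - means[c * DIM + j];
                    d += diff * diff;
                }
                if (d < best_d) { best_d = d; best = c; }
            }
            counts[best]++;
            for (int j = 0; j < DIM; j++) sums[best * DIM + j] += local[i * DIM + j];
        }

        /* Workers return partial results to the root, which recomputes the means. */
        double all_sums[K * DIM];
        int    all_counts[K];
        MPI_Reduce(sums,   all_sums,   K * DIM, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Reduce(counts, all_counts, K,       MPI_INT,    MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (int c = 0; c < K; c++)
                if (all_counts[c] > 0)
                    for (int j = 0; j < DIM; j++)
                        means[c * DIM + j] = all_sums[c * DIM + j] / all_counts[c];
    }

    if (rank == 0) {
        printf("final means:\n");
        for (int c = 0; c < K; c++)
            printf("  (%.3f, %.3f)\n", means[c * DIM], means[c * DIM + 1]);
        free(data);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
```

A sketch like this is compiled and launched the same way as the real program (e.g. `mpicc sketch.c -o sketch && mpirun -n 3 ./sketch`); each process handles N / nprocs points, which assumes N is divisible by the process count.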
- Docker Engine
- Docker base image of Alpine Linux - Credit to NLKNguyen for taking the time to create this image
- MPICH - included in Docker base image
- Install Docker
- Make sure Docker is running:
$ sudo service docker start
- Download NLKNguyen's prebuilt Docker base image:
$ docker pull nlknguyen/alpine-mpich
$ docker pull nlknguyen/alpine-mpich:onbuild
- While in the directory containing the source code, run the following to launch the image:
$ docker run --rm -it -v $(pwd):/project nlknguyen/alpine-mpich
- With the container running, compile the source code:
$ mpicc k-means.c -o k-means
- Now run the program with the desired number of processes, passing that same number as a command-line argument right after the program name:
$ mpirun -n 3 ./k-means 3
- Creating too many worker nodes in the hope of further speeding up parallelization can actually impede progress. There must be a balance between the workload distributed to each worker and the number of workers. If there are more workers than necessary, much of the effort is spent distributing the work (message passing) rather than computing the actual workload. Thus one of the defining characteristics of how well a distributed program performs is its message complexity: how many messages are sent back and forth across the network.
- Further optimize the algorithm by reducing message complexity, e.g. by leaning on collective operations (see the sketch below).
- Deploy on an Amazon AWS cluster and test.
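One possible way to reduce message complexity, assuming the iteration loop looks like the Bcast/Reduce pattern sketched earlier, is to replace the per-iteration Reduce-to-root followed by a broadcast of the updated means with a single MPI_Allreduce per buffer, so every process recomputes the means locally. This is a hypothetical refactor of that sketch, not something the current code is known to do:

```c
/* Hypothetical replacement for the end of the iteration loop in the sketch
 * above: MPI_Allreduce gives every rank the global sums and counts, so each
 * rank updates `means` itself and the separate broadcast of new means goes away. */
double all_sums[K * DIM];
int    all_counts[K];
MPI_Allreduce(sums,   all_sums,   K * DIM, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(counts, all_counts, K,       MPI_INT,    MPI_SUM, MPI_COMM_WORLD);
for (int c = 0; c < K; c++)
    if (all_counts[c] > 0)
        for (int j = 0; j < DIM; j++)
            means[c * DIM + j] = all_sums[c * DIM + j] / all_counts[c];
```

This drops the separate broadcast of updated means, leaving the library's (typically tree-based) reductions as the only per-iteration communication.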