Efficient top-K Vector Similarity Search Queries on the MNIST Image Dataset using Custom Locality-Sensitive Hashing 🔢
Suppose we have two sets of entities represented by vectors (images in our case), namely a query set and a train set. We define our problem as the task of finding, for each image of the query set, its K most similar images (nearest neighbours) in the train set.
In order to produce results in a time-efficient manner, we employ probabilistic methods that group similar items (clustering) and allow comparisons to be restricted, with high probability, to smaller groups of similar items. We implement the following methods:
- Locality-Sensitive Hashing - random vectors following a normal distribution are chosen from the hyperspace defined by the images; these act as borders for subspaces of the initial space, and the train images are distributed among those subspaces. For each query image, its target subspace is inferred and the image is compared only to its neighbours within that area, so the similarity search touches only a small part of the initial train set (a minimal code sketch of such a hash family is given below).
- Hypercube Random Projection - a method using an LSH family of functions for randomized projection of the vectors onto a Hamming cube of $\log(n)$ dimensionality, where $n$ is the number of pixels of the images. It examines points assigned to the same or nearby vertices of the Hamming cube.
- Clustering - in the simple version the centroids are chosen in the images' hyperspace using a probabilistic algorithm that maximizes the chance of an even distribution. Images are assigned to the corresponding clusters using Lloyd's algorithm, and the centroids are then updated using K-Medians. We also develop two clustering methods based on range search:
1. LSH Range Searching - emulates an iterative increase of the range within which we search for neighbours of the centroids; in each iteration the points found are assigned to their nearest centroid
2. Hypercube Range Searching - similar to the above approach, but using the Hamming hypercube
The clusters are updated until the centroids' deviation from their previous positions falls within a specified threshold. The quality of the produced clusters is evaluated through the Silhouette score.
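To make the bucketing behind the LSH and Hypercube methods concrete, here is a minimal, self-contained C++ sketch. It is not the project's actual implementation; the names (`ProjectionHash`, `lsh_bucket`, `cube_vertex`) and constants are illustrative assumptions, but the structure follows the classic random-projection hash family described above.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// One random-projection hash: h(p) = floor((v . p + t) / w), where v is a
// random vector with normally distributed coordinates and t is uniform in [0, w).
struct ProjectionHash {
    std::vector<double> v;
    double t, w;

    ProjectionHash(std::size_t dim, double window, std::mt19937& gen)
        : v(dim), t(0.0), w(window) {
        std::normal_distribution<double> normal(0.0, 1.0);
        std::uniform_real_distribution<double> uniform(0.0, window);
        for (auto& x : v) x = normal(gen);
        t = uniform(gen);
    }

    long operator()(const std::vector<double>& p) const {
        double dot = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) dot += v[i] * p[i];
        return static_cast<long>(std::floor((dot + t) / w));
    }
};

// Amplified LSH key: mix k projection hashes into one bucket id for a hash table.
std::size_t lsh_bucket(const std::vector<ProjectionHash>& hs,
                       const std::vector<double>& p,
                       std::size_t table_size) {
    std::size_t key = 0;
    for (const auto& h : hs)
        key = key * 31 + static_cast<std::size_t>(h(p));
    return key % table_size;
}

// Hypercube vertex: each h_i(p) is mapped to a single bit f_i(h_i(p)) in {0,1}
// (here via a deterministic hash of the value) and the bits form the vertex id.
std::size_t cube_vertex(const std::vector<ProjectionHash>& hs,
                        const std::vector<double>& p) {
    std::size_t vertex = 0;
    for (std::size_t i = 0; i < hs.size(); ++i) {
        std::size_t bit = std::hash<long>{}(hs[i](p)) & 1u;
        vertex |= bit << i;
    }
    return vertex;
}
```

Query points are then compared exhaustively only against train points that land in the same bucket (or, for the cube, on the same or a nearby vertex).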
We use the famous MNIST dataset. It consists of 60,000 images of 784 pixels each (28x28), where every pixel is represented by a number in the range [0, 255].
- Training Dataset : train-images-idx3-ubyte - contains the 60,000 training images of the MNIST dataset
- Query Dataset : t10k-images-idx3-ubyte - the standard MNIST test set, with 10,000 images in total, used as queries
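For reference, each idx3-ubyte file starts with a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by the raw pixel bytes. A loader might look like the sketch below; the function names are placeholders and the project's own reader may differ.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read one 32-bit big-endian integer from the stream.
static uint32_t read_be32(std::ifstream& in) {
    unsigned char b[4];
    in.read(reinterpret_cast<char*>(b), 4);
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8)  |  uint32_t(b[3]);
}

// Load every image of an idx3-ubyte file as a flat vector of pixel values.
std::vector<std::vector<uint8_t>> load_idx3_images(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    uint32_t magic = read_be32(in);   // 2051 identifies an idx3 image file
    uint32_t count = read_be32(in);   // number of images (60,000 or 10,000)
    uint32_t rows  = read_be32(in);   // 28
    uint32_t cols  = read_be32(in);   // 28
    if (magic != 2051) throw std::runtime_error("not an idx3-ubyte file");

    std::vector<std::vector<uint8_t>> images(count,
                                             std::vector<uint8_t>(rows * cols));
    for (auto& img : images)
        in.read(reinterpret_cast<char*>(img.data()), img.size());
    return images;
}
```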
Once we have identified the candidate neighbours in the training set, we have to measure how similar they are to the query image. We do that by comparing the two vectors with a distance metric: the Manhattan distance in the case of LSH, and the Hamming distance for the Hypercube. When we run a range search, the same two metrics are used to decide which points fall within the given radius.
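As an illustration, the two metrics can be computed along the following lines (a minimal sketch with hypothetical function names, not the project's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Manhattan (L1) distance between two images stored as flat pixel vectors.
long manhattan_distance(const std::vector<uint8_t>& a,
                        const std::vector<uint8_t>& b) {
    long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += std::labs(static_cast<long>(a[i]) - static_cast<long>(b[i]));
    return sum;
}

// Hamming distance between two hypercube vertices (vertex ids as bit strings):
// the number of coordinates in which the two vertices differ.
int hamming_distance(std::size_t u, std::size_t v) {
    std::size_t diff = u ^ v;
    int count = 0;
    while (diff) { count += static_cast<int>(diff & 1u); diff >>= 1; }
    return count;
}
```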
In order to compile the code, simply open the command line and enter: make
. The compilation produces 3 executables with distinct names, one for each of the methods described above. You can run the implemented algorithms as follows:
- LSH:
./lsh -d <input file> -q <query file> -k <number of hash tables> -L <number of projection functions> -o <output file> -N <number of nearest neighbours> -R <radius>
- Hypercube:
./cube -d <input file> -q <query file> -k <hypercube dimensions> -M <maximum check points> -probes <number of probes> -o <output file> -N <number of nearest neighbours> -R <radius>
- Clustering:
./cluster -i <input file> -c <configuration file> -o <output file> -complete <optional> -m <clustering method: Classic / LSH / HyperCube>
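For example, a typical LSH run over the MNIST files above could look like ./lsh -d train-images-idx3-ubyte -q t10k-images-idx3-ubyte -k 4 -L 5 -o output.txt -N 5 -R 10000 (the numeric values here are purely illustrative, not prescribed defaults).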
We include a README PDF file that contains an extensive description of the project setup, the inner workings of the respective algorithms, and our design choices. It also describes the contents of each source file. This PDF is written in Greek!
Big thanks to my fellow collaborator: Anastasios Melidonis
Built as part of the course: Software Development for Algorithmic Problems, Winter of 2020. University of Athens, DiT.