KNN

KNN algorithm implementation in Python

Understanding the Algorithm

The K-Nearest Neighbours (KNN) algorithms is a data classification methods. It comes from the assumption that an individual will most likely look like its closest individuals. We can summmarize like this : Birds of the same feather flock together".

From a mathematical perspective, we'll try to find, for a given point, the individuals with the minimal distance (eq. his neighbors). The factor K comes into play here as it represents the number of individuals that we'll select. For example, if K= 3, we'll select the 3 nearest neighbors of our individual. Finally, the predicted class will be the class with the most occurence in the neighbors list.

If you want more details, you can look at this page.

Implementation

This classification Algorithm was used during a competition in my school to predict unknown data. To avoid disruption linked to scale of our data, I decided to perform a normalization. The negative side is that we put every variable on equal terms whereas some might be more significant than others.

In order to implement our algorithm, I also needed to evaluate the distance between two points. Several methods exist like the Manhattan distance or the Tchebychev distance but I have selected the Euclidian distance for its performance rate. Then, I simply assign the most represented class to the data that we are predicting.

def distance_euclidienne(indData,indTest):
  dist_ind=0
  for j in range(len(indTest-1)):
    dist_ind+=sqrt((indTest[j]-indData[j])**2)
  return dist_ind

def knn(indTest,data,k):
  data["distance"] = [distance_euclidienne(data.iloc[i,:],indTest) for i in range(len(data))]
  return data.sort_values(by="distance")[:k]

def prediction_classification(voisins_proches):
  return max(set(voisins_proches["Class"].tolist()),key=voisins_proches["Class"].tolist().count)

Assessing our Model

I created a confusion matrix to understand how successful my methods were on my dataset.

The KNN algorithms is a non-parametric method which means that the only unknown parameter is K. In order to find the best value I plotted how efficient was my algorithm depending on the number k to optimize it. The best option seems to be K=4.

Note

The report is in french but do not hesitate to contact me about this projetct whether in english or french.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Debes_Hugo_Knn.ipynb.txt		Debes_Hugo_Knn.ipynb.txt
Debes_Hugo_Knn.py.txt		Debes_Hugo_Knn.py.txt
LICENSE		LICENSE
README.md		README.md
Rapport Hugo DEBES.pdf		Rapport Hugo DEBES.pdf
data.csv		data.csv
finalTest.csv		finalTest.csv
preTest.csv		preTest.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

KNN

Understanding the Algorithm

Implementation

Assessing our Model

Note

About

Uh oh!

Releases

Packages

License

hugodebes/KNN

Folders and files

Latest commit

History

Repository files navigation

KNN

Understanding the Algorithm

Implementation

Assessing our Model

Note

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages