Implementing K-Nearest Neighbors in Your Machine Learning Model

Imagine we have a set of labeled and unlabeled data, and we want to build a classifier that takes the unlabeled data as input and outputs a label for each point.

In this situation, we need to build a classification model that learns from the already-labeled data (training data). Later, we'll use that model to predict labels for the unlabeled data (test data).

This type of machine learning is called supervised learning: we feed the algorithm data that already carries the correct labels.

In doing so, we're showing the algorithm that groups exist, and which data points belong to which group.

There are many supervised learning models. Examples include Support Vector Machines (SVM), logistic regression, decision trees, factorization machines, random forests, and K-Nearest Neighbors (KNN), which will be the focus of this article.

KNN is a non-parametric technique. To classify a new data point, it looks at the point's k nearest neighbors, where k is a number we choose, and assigns the point to the class most common among them. It primarily works by implementing the following steps.

First, it calculates the distance between the new point and every point in the training data. Second, it finds the k training points that are closest based on those distances. Finally, the class is chosen by a majority vote among those k surrounding points.
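To make these steps concrete, here's a minimal from-scratch sketch in NumPy. The knn_predict function and the toy data below are purely illustrative; later in the article we'll use scikit-learn's built-in implementation instead.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of those points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two well-separated 2D classes, labeled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # prints 0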

K is a positive integer whose value we choose. If k = 1, the new data point is simply assigned to the class of its single nearest neighbor.

The choice of k is very important in KNN: a larger k reduces the effect of noise, but it also makes the boundaries between classes less distinct. To choose an optimal k, you can use GridSearchCV, which performs an exhaustive search over specified parameter values.

Picture a scatter plot in which black and red points represent two different classes of data, and a new blue point that we need to classify as either red or black. If k = 1, KNN will simply pick the single nearest point and assign the blue point to that point's class. If k > 1, a majority vote among the k nearest neighbors decides the class.

We're going to work through a practical example using Python's scikit-learn. We'll need pandas, which we'll use for working with dataframes; numpy, which helps us work with arrays; and scikit-learn, a machine learning package for Python that provides algorithms like KNN.
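If you don't already have these packages, they can typically be installed with pip (we'll also use seaborn and matplotlib later for plotting; adjust the command to your own environment):

pip install numpy pandas scikit-learn seaborn matplotlib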

Building our KNN model

When using scikit-learn's KNN classifier, we're provided with the KNeighborsClassifier() class, which takes eight optional parameters. Let's go through them one by one.

  1. n_neighbors — This integer parameter sets k, the number of neighbors the algorithm uses. By default k = 5, and in practice values between roughly 3 and 10 often work well.
  2. weights — Since the prediction is made from the votes of the k nearest points, all other points in the dataset are ignored, which results in a discontinuous decision function. One way to smooth this out is by weighting the neighbors. If we don't set weights, uniform is used automatically, which weights all points in each neighborhood equally. The other built-in option is distance, which gives closer neighbors more influence than ones further away.
  3. algorithm — auto is the default, but there are other options: kd_tree, ball_tree, and brute. The two tree-based options help execute fast nearest neighbor searches in KNN; the main practical difference between them is that ball_tree works with more distance metrics than kd_tree.
  4. Other method parameters, illustrated in the sketch after this list, include:

a). leaf_size — (default = 30) which is passed to kd_tree and ball_tree. This affects the speed of construction and query.

b). p — the power parameter for the Minkowski metric. If p = 2 it is equivalent to using Euclidean distance, and if p = 1 it is equivalent to using Manhattan distance.

c). metric — the distance metric to use for the tree (minkowski by default)

d). metric_params — additional keyword arguments for the metric function

e). n_jobs — which is the number of parallel jobs to run for neighbors search.
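As a quick illustration of how these parameters fit together, here's a sketch that constructs a classifier with several of them set explicitly. The specific values are examples only, not recommendations; in practice you'd tune them (for instance with GridSearchCV, as shown later in this article).

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(
    n_neighbors=5,          # k, the number of neighbors that vote
    weights='distance',     # closer neighbors get more influence
    algorithm='ball_tree',  # or 'kd_tree', 'brute', 'auto'
    leaf_size=30,           # passed to ball_tree / kd_tree
    p=2,                    # p=2 -> Euclidean, p=1 -> Manhattan
    metric='minkowski',     # distance metric used by the tree
    n_jobs=-1               # use all CPU cores for the neighbor search
)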

K-Nearest Neighbors case study

In our case study, we’re going to use two datasets to show how KNN can be used to create a model and later make a prediction based on the k-nearest neighbors of the test dataset. The first dataset we’re going to use is the commonly-used Iris dataset. This dataset has 150 instances, and each instance has a class of either setosa, versicolor, or virginica (types of flowers). Every type of flower has 50 instances.

The second case study will involve trying to build a KNN model to predict whether a person will be a defaulter or not in a credit scoring system. We’ll use two predictor variables (age, loan amount) and one target variable (default).

Using KNN to classify the Iris dataset

Let's start with our first case study using the Iris dataset. First we'll import the packages we need for this project: numpy, plus scikit-learn's datasets, model_selection, and neighbors modules. The datasets module contains the Iris dataset. The model_selection module will help us prevent overfitting by partitioning the data into training and testing sets. The neighbors module gives us the KNN implementation in Python.

import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset and inspect the shapes of the features and labels
iris = load_iris()
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

Scikit-learn's Iris dataset is already divided into iris.data and iris.target. iris.data has 4 columns, which are our predictor variables: sepal length, sepal width, petal length, and petal width. iris.target holds the label for each row, which can be setosa, versicolor, or virginica. Below is a scatter plot visualizing the distribution of the labeled points in 2D, using petal length and petal width.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris data as a dataframe for plotting (kept separate from the sklearn `iris` object above)
iris_df = sns.load_dataset("iris")
sns.lmplot(x="petal_length", y="petal_width", data=iris_df, hue="species", fit_reg=False, legend=False)
plt.legend()
plt.show()

We can now split our data into training and test data using scikit-learn’s train_test_split function. Since we don’t have a large dataset, we’ll use 75:25 as a ratio of training to testing, which is scikit-learn’s default training:test ratio.

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25)

After splitting the data, we can now build our classifier using the sklearn.neighbors.KNeighborsClassifier class. We'll pass our k as n_neighbors=13, along with the weights and the type of algorithm to use.

neigh = neighbors.KNeighborsClassifier(n_neighbors=13, weights='uniform', algorithm='auto')

Finally, let's fit the training data, then use that model to predict the labels of our test set. Using score(), we can check how accurately our model predicts the test data.

neigh.fit(X_train, y_train)
print(neigh.predict(X_test))
print(neigh.score(X_test, y_test))

Getting optimal parameters

To get more accurate predictions on your test data, you'll need optimal parameters. These can be found with GridSearchCV, available in scikit-learn's model_selection module, which runs an exhaustive, cross-validated search over a grid of specified parameter values.

from sklearn.model_selection import GridSearchCV

Let's create a dictionary of the parameter values to try: k_range is the range of k values, in this case 1 to 30, and weight holds the two weighting options, uniform and distance.

k_range = range(1,31)
weight = ['uniform','distance']
param_grid = dict(n_neighbors=k_range, weights=weight)

Now let's pass our classifier and param_grid to GridSearchCV, along with cv, an integer specifying the number of cross-validation folds, and scoring, the metric used to evaluate predictions on the held-out folds.

neigh = neighbors.KNeighborsClassifier()
grid = GridSearchCV(neigh, param_grid, cv=10, scoring='accuracy')
grid.fit(iris.data, iris.target)

# Finally, let's print the best score and best parameters
print(grid.best_score_)
print(grid.best_params_)

Our results show that the best score is 0.98, obtained when k is set to 13 and weights to uniform.

Building a credit scoring model using KNN

In this case study, we're going to classify whether a person aged 43 who borrowed a loan of $60,000 is going to repay the loan or default. Our labels are 1 for default and 0 for repay. First we'll create a numpy array of training data, with age and amount borrowed as our predictor variables and default as the label.

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
# Each row is [age, amount borrowed, default label]
data = np.array([
[25,40000,0],
[35,60000,0],
[45,80000,0],
[20,20000,0],
[35,120000,0],
[52,18000,0],
[23,95000,1],
[40,62000,1],
[60,100000,1],
[48,220000,1],
[33,150000,1]
])

We can now convert our array to a pandas dataframe and separate our predictor variables from the labels using pandas' drop:

df = pd.DataFrame(data = data, columns=['age','amount','default'])
X = df.drop(['default'],axis=1)
y = df.drop(['age','amount'],axis=1)

Let's create our test data as a numpy array:

test = np.array([
[43,60000]
])

Finally, we will create a KNN classifier and use it to classify our test data:

clf = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X, y.values.ravel())
print(clf.predict(test))

Our KNN model predicts that this person will not default. This is simply because the model found the person's features (age and loan amount) to be more similar to those of borrowers who repaid their loans than to those of borrowers who defaulted.
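If you want to see which training examples drove this prediction, one option is scikit-learn's kneighbors() method, which returns the distances and indices of the nearest neighbors; the snippet below reuses the clf, test, df, and y objects defined above.

distances, indices = clf.kneighbors(test)
print(df.iloc[indices[0]])           # the 5 training examples most similar to our test person
print(y.values.ravel()[indices[0]])  # their default labels; the majority vote gives the prediction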

Conclusion

KNN is an effective machine learning algorithm that can be used in credit scoring, prediction of cancer cells, image recognition, and many other applications. Its main advantages are that it's easy to implement and works well with small datasets.

However, KNN also has disadvantages. In particular, it doesn't scale well to large datasets: for every test point, the distance to every training point must be computed, which uses a lot of memory and makes predictions slow.
