Understand the Fundamentals of the K-Nearest Neighbors (KNN) Algorithm

An introduction to the famous machine learning algorithm KNN using scikit-learn

K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both regression and classification. Its way of operating can be summed up as follows:

To make a prediction, KNN doesn't compute a predictive model from a training dataset the way logistic or linear regression does. There is no actual learning phase, which is why KNN is usually categorized as a lazy learning method: it simply keeps the whole dataset and uses it, at prediction time, to produce a result.

KNN is used in many fields:

  • Computer Vision: say you have an image and want to find other images similar to it. KNN can be used, among many other approaches, to retrieve and rank the images that most closely resemble the input image.
  • Marketing: imagine you've just bought a baby stroller on Amazon. The next day, you receive an email suggesting you might be interested in a set of pacifiers. E-commerce companies can target you with similar products, or with products bought by other customers who have the same buying habits as you.
  • Content recommendation: probably the most interesting example is Spotify and its legendary recommendation engine. KNN is used in many content recommendation engines; even though more powerful systems are now available, KNN is still relevant today.

How does KNN make a prediction?

To make a prediction, the KNN algorithm uses the entire dataset. For a new observation that isn't part of the dataset and whose output value we want to predict, the algorithm looks for the K instances in the dataset that are closest to that observation.

Then, for these K neighbors, the algorithm uses their output values to calculate the variable y of the observation we want to predict.

In other words:

  • If KNN is used for a regression problem, the mean (or median) of the y values of the K closest observations is used for the prediction.
  • If KNN is used for a classification problem, the mode (the value that appears most often) of the y values of the K closest observations is used for the prediction.
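
To make this concrete, here's a minimal sketch (plain Python, with made-up neighbor outputs) of how the outputs of K = 3 neighbors would be aggregated in each case:

```python
from statistics import mean, median
from collections import Counter

# Hypothetical outputs of the K = 3 nearest neighbors
neighbor_values = [12.0, 15.0, 14.0]        # regression: numeric targets
neighbor_labels = ["red", "red", "green"]   # classification: class labels

# Regression: aggregate with the mean (or the median)
print(mean(neighbor_values))    # 13.666...
print(median(neighbor_values))  # 14.0

# Classification: aggregate with the mode (most frequent label)
print(Counter(neighbor_labels).most_common(1)[0][0])  # 'red'
```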

Understand the algorithm

Input data:

  • A data set D.
  • A distance function d.
  • An integer K, the number of neighbors to consider.

Steps:

For a new observation X for which we want to predict its output variable y:

  1. Calculate all the distances between this observation X and the other observations of the dataset D.
  2. Retain the K observations of the dataset D that are closest to X, according to the distance function d.
  3. Take the y values of these K retained observations:
    1. If it's a regression problem, calculate the mean (or median) of these y values.
    2. If it's a classification problem, calculate the mode of these y values.
  4. Return the value calculated in step 3 as the value predicted by KNN for observation X.

We can summarize the functioning of KNN in pseudo-code.
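
Here's a minimal Python sketch of that pseudo-code. A generic distance function is assumed, and the goal is readability rather than efficiency:

```python
from collections import Counter

def knn_predict(X_train, y_train, x_new, k, distance, classification=True):
    # 1. Compute the distance between x_new and every observation in D
    distances = [distance(x, x_new) for x in X_train]

    # 2. Keep the indices of the K observations closest to x_new
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]

    # 3. Collect the outputs y of those K neighbors
    neighbor_outputs = [y_train[i] for i in nearest]

    # 4. Aggregate: mode for classification, mean for regression
    if classification:
        return Counter(neighbor_outputs).most_common(1)[0][0]
    return sum(neighbor_outputs) / k
```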

Similarity calculation in the KNN algorithm

As we have just seen in the pseudocode above, KNN needs a function to calculate the distance between two observations. The closer two points are to each other, the more similar they are, and vice versa.

There are several distance calculation functions, like the following:
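
Among the most commonly used, and the ones referenced in the rest of this article, are the Euclidean, Manhattan (taxicab), and Minkowski distances:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

$$d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Note that the Minkowski distance generalizes the other two: it reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.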

We choose the distance function according to the type of data we're handling. For quantitative data of the same type (for example: weight, wages, size, shopping cart amount), the Euclidean distance is a good candidate. Taxicab geometry (the Manhattan distance) is a good measure to use when the input variables are not of the same type (for example: age, sex, length, weight).

It can be a great algorithmic exercise to code them from scratch, but there are plenty of implementations available: ML libraries in Python, Swift, Java, and other languages perform these calculations internally. You just have to indicate the distance measure you want to use.
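
As a rough illustration of that "from scratch" exercise, here's what the Euclidean and Manhattan distances look like in plain NumPy (two numeric vectors of the same length are assumed):

```python
import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # Sum of the absolute coordinate differences
    a, b = np.asarray(a), np.asarray(b)
    return np.sum(np.abs(a - b))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(manhattan_distance([0, 0], [3, 4]))  # 7
```

In scikit-learn you'd typically skip this and simply pass the metric's name to the estimator, for example KNeighborsClassifier(metric="manhattan").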

How to choose the K value?

To select the value of K that suits your data, we run the KNN algorithm multiple times with different values of K. We then choose the K that reduces the number of errors while maintaining the algorithm's ability to make accurate predictions on data it hasn't seen before.

Here are a few things to keep in mind:

  • When we decrease the value of K to 1, our predictions become less stable. Imagine for a moment that K = 1 and that a query point is surrounded by several red points and one green point (the top left corner of Figure 1), with the green point being the closest neighbor. Logically, we'd think the query point is probably red, but because K = 1, the KNN algorithm incorrectly predicts that it's green.
  • Conversely, as we increase the value of K, our predictions become more stable thanks to majority voting (or averaging), so we're more likely to make accurate predictions, up to a certain point. Beyond that point, we start to see an increasing number of errors; that's how we know we've pushed the value of K too far.
  • In cases where we take a majority vote among the labels (for example, by choosing the mode in a classification problem), we generally choose an odd number for K to avoid ties.

The choice of the K value varies according to the given dataset. As a general rule, the fewer neighbors we use (a small K), the more sensitive the model is to noise in the training data, and the more prone it is to overfitting.

Conversely, the more neighbors we use (a large K), the more stable and reliable our prediction will be, up to a point. If we push this to K = N, with N being the number of observations, the model predicts the overall majority (or average) everywhere: it underfits and can no longer capture the structure of the data or generalize usefully to observations it hasn't yet seen.

The image above on the left represents points in a 2D plane with three possible labels (red, green, blue). For the 5-NN classifier (K = 5), the boundaries between each region are fairly smooth and regular.

For the 1-NN classifier, on the other hand, the boundaries are "chaotic" and irregular. This comes from the fact that the algorithm tries to place every blue point in a blue region and every red point in a red region, a typical case of overfitting.

For this example, we'd prefer the 5-NN classifier over the 1-NN classifier, because it generalizes better to observations it hasn't seen.

Scikit-learn example

We'll use sklearn's built-in "breast cancer wisconsin" dataset.
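
Its features are measurements computed from images of cell nuclei (mean radius, mean texture, mean perimeter, and so on), and the target is binary: malignant or benign. Assuming scikit-learn is already installed (installation is covered in the next step), you can list them yourself:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.feature_names)  # the 30 numeric features
print(data.target_names)   # ['malignant' 'benign']
print(data.data.shape)     # (569, 30)
```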

Download the following packages:
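
The snippets below only rely on scikit-learn, NumPy, and Matplotlib; one way to install them is:

```
pip install scikit-learn numpy matplotlib
```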

Import the following functions:
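
A set of imports that covers everything used below (the article's exact import list may differ slightly):

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```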

Load and split the data:
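
Continuing with those imports, here's one way to do it (the 70/30 split ratio and the fixed random seed are assumptions, not values taken from the article):

```python
# Load the dataset into a feature matrix X and a target vector y
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 30% of the observations as a test set; stratify keeps the
# class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```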

The classes are distributed unevenly, but that's not a problem; it simply reflects the fact that some outcomes are less likely to happen than others:
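
A quick way to check that distribution, continuing from the snippet above:

```python
# Count the observations in each class (index 0 = malignant, 1 = benign)
for name, count in zip(data.target_names, np.bincount(y)):
    print(name, int(count))
# malignant 212
# benign 357
```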

Iterate through 10 different values of K:
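
A sketch of that loop, scoring each model on the held-out test set (the evaluation metric here, plain accuracy, is an assumption):

```python
# Fit one KNN classifier per value of K and record its test accuracy
k_values = list(range(1, 11))
accuracies = []

for k in k_values:
    # The default metric is Minkowski with p = 2, i.e. the Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))
```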

Inspect and choose the best K:
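
For example, by plotting the accuracy against K and keeping the value that scores highest (the exact scores will depend on how the data was split):

```python
# Plot test accuracy as a function of K
plt.plot(k_values, accuracies, marker="o")
plt.xlabel("K (number of neighbors)")
plt.ylabel("Test accuracy")
plt.show()

# Keep the K with the highest accuracy
best_k = k_values[int(np.argmax(accuracies))]
print("Best K:", best_k)
```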

In Figure 4, we notice that the maximum accuracy is 0.965 and appears for K = {5, 6, 8}. We can conclude that the optimal K value is 5. Since we're using the Minkowski distance with p = 2 (scikit-learn's default, which is equivalent to the Euclidean distance), you can iterate over other K values, but also try different distance functions, like the Manhattan distance.

Limitations of KNN

KNN is a fairly simple algorithm to understand, mainly because it doesn't need to fit a model to make a prediction. The trade-off is that it must keep all the observations in memory in order to make that prediction, so you have to pay attention to the size of the input dataset.

Also, the choice of the method for calculating the distance as well as the number of neighbors K may not be immediately obvious. You might have to try several combinations and tune the algorithm to get a satisfactory result for your use case.

Conclusion

With this article, I tried to give a simple explanation of the KNN algorithm. Here's what we learned:

  • KNN stores the entire dataset to make a prediction.
  • KNN doesn't calculate any predictive model; it's part of the lazy learning family of algorithms.
  • KNN makes predictions just in time (on the fly) by calculating the similarity between an input observation and the other observations in the dataset.

Thank you for reading this article. If you have any questions, don’t hesitate to send me an email at [email protected].

