KNN Algorithm: Understanding the Concept
We all know the basic idea of machine learning: we are given data, and using that data we build a model that predicts the output for new data. Machine learning is all about prediction, and a prediction can be right or wrong; the focus is on making predictions that are correct to a useful extent. For training on data we have many different algorithms, and these algorithms are designed for different types of data. Based on the data we are given, we select the most suitable algorithm to train the model. The accuracy of our model largely depends on the selected algorithm, so a big part of machine learning is knowing which algorithm to use for which kind of dataset.
Once you know how to select the right algorithm for a problem, you can say you know the ABCs of machine learning.
Let’s not waste time and get to today’s topic: the KNN algorithm.
The KNN algorithm is suited to a dataset that is labeled, noise-free, and small. It should be small because KNN is a ‘lazy learner’, i.e. it doesn’t learn a discriminative function from the training set; it simply stores the training data and defers all computation until prediction time.
KNN — K-Nearest Neighbors — is one of the simplest supervised machine learning algorithms and is mostly used for classification problems. KNN stores all available cases and classifies a new case based on a similarity measure.
k in KNN is a parameter that refers to the number of nearest neighbours to include in the majority voting process.
Let’s make things clear with an example. Suppose we are given a dataset of drinks with their sulphur dioxide and chloride levels, and based on these two features each drink is classified as red or white.
The graph for sulphur dioxide level and chloride level is given below.
We can see that drinks with a higher sulphur dioxide content and a lower chloride content are classified as white, and vice versa.
Now we are given a new drink and have to predict its colour. Here we will use the KNN algorithm to solve the problem, and for that we need to decide the factor k. Suppose this time we take k = 5.
Then we find the 5 closest neighbours of the new point, and the majority class among those neighbours becomes the class of the new point. So the new drink is classified as red, since 4 out of 5 of its neighbours are red.
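As a tiny sketch of just the voting step, here it is in Python, using the 4-red, 1-white split from the example above:

```python
from collections import Counter

# Labels of the 5 nearest neighbours of the new drink (as in the example)
neighbour_labels = ["red", "red", "white", "red", "red"]

votes = Counter(neighbour_labels)
print(votes.most_common(1)[0][0])  # prints "red": the majority class wins 4-1
```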
How do we choose the factor ‘k’?
The KNN algorithm is based on feature similarity. Choosing the right value of k is a process called parameter tuning, and it is important for better accuracy.
Different values of k can give different results for the same dataset; an example of this is shown below.
Let’s now talk about how to find the value of k for a given dataset.
A common rule of thumb is k = sqrt(n), where n is the total number of data points; k should always be odd to avoid ties between two classes of data. A sketch of this rule in code is given below.
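Here is a minimal sketch of that rule of thumb in Python (the helper name choose_k is just for illustration):

```python
import math

def choose_k(n):
    """Rule-of-thumb k: round sqrt(n), then force it to be odd."""
    k = max(1, round(math.sqrt(n)))
    if k % 2 == 0:
        k += 1  # an odd k avoids ties between two classes
    return k

print(choose_k(9))    # 3
print(choose_k(100))  # 11
```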
How does the KNN algorithm work?
Consider a dataset with two variables, height (cm) and weight (kg), where each point is classified as Normal or Underweight. The dataset is given below.
The graphical representation is given below.
Based on the given data, we have to classify the point (57 kg, 170 cm) as Normal or Underweight using KNN.
To find the nearest neighbours, we will calculate the Euclidean distance.
According to the Euclidean distance formula, the distance between two points in a plane with coordinates (x,y) and (a,b) is given by:
dist(d) = sqrt((x - a)^2 + (y - b)^2)
Let’s calculate it to understand clearly.
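In Python, the calculation looks like this (the stored point (60, 175) here is purely illustrative, not taken from the table above):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points p and q."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Distance from an illustrative stored point (60 kg, 175 cm)
# to the query point (57 kg, 170 cm):
print(euclidean((60, 175), (57, 170)))  # sqrt(3^2 + 5^2) ≈ 5.83
```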
Here there are 9 data points, so following the rule above the value of k will be sqrt(9) = 3.
From the calculated Euclidean distances we can see that all 3 of the closest neighbours are classified as Normal, so the point (57, 170) is also classified as Normal.
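To tie the whole procedure together, here is a minimal from-scratch sketch in Python. The training points below are illustrative placeholders shaped like the dataset in the figure, not the actual values:

```python
import math
from collections import Counter

def knn_classify(training_data, query, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    training_data: list of ((weight, height), label) pairs
    query: a (weight, height) tuple
    """
    # 1. Compute the Euclidean distance from the query to every stored point.
    distances = [(math.dist(point, query), label) for point, label in training_data]
    # 2. Sort by distance and keep the k closest points.
    nearest = sorted(distances)[:k]
    # 3. Majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative training points (NOT the actual table from the figure above).
data = [
    ((51, 167), "Underweight"),
    ((62, 182), "Normal"),
    ((69, 176), "Normal"),
    ((64, 173), "Normal"),
    ((65, 172), "Normal"),
    ((56, 174), "Underweight"),
    ((58, 169), "Normal"),
    ((57, 173), "Normal"),
    ((55, 170), "Normal"),
]
print(knn_classify(data, (57, 170), k=3))  # "Normal" on this sample data
```

Because KNN is a lazy learner, all of the work above happens at prediction time; there is no separate training step beyond storing the data.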
That was all about the KNN algorithm; we will see its implementation in detail in the next blog.
Till then,
Happy Reading