# AnalyticBridge

A Data Science Central Community

# A useful classification technique - k-nearest neighbours

k-nearest neighbours is part of supervised learning that has been used in many applications in the field of data mining, statistical pattern recognition, image processing and many others. Some successful
applications are including recognition of handwriting, classifying the nearest
segment for customers etc.

ü  Features

o   All instances correspond to points in an n-dimensional Euclidean space

o   Classification is delayed till a new instance arrives

o   Classification done by comparing feature vectors of the different points

o   Target function may be discrete or real-valued

The k-nearest neighbours algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space. K-nearest
neighbor is a supervised learning algorithm where the result of new instance
query is classified based on majority of K-nearest neighbor category. The purpose
of this algorithm is to classify a new object based on attributes and training
samples. The classifiers do not use any model to fit and only based on memory.
Given a query point, we find K number of objects or (training points) closest
to the query point. The classification is using majority vote among the
classification of the K objects. Any ties can be broken at random. K Nearest
neighbor algorithm used neighborhood classification as the prediction value of
the new query instance. k is a positive integer, typically small. If k = 1,
then the object is simply assigned to the class of its nearest neighbor.

K nearest neighbor algorithm is very simple. It works based on minimum distance from the query instance to the training samples to determine the K-nearest neighbors. After we gather K nearest neighbors, we take
simple majority of these K-nearest neighbors to be the prediction of the query
instance.

ü  An arbitrary instance is represented by           (a1(x), a2(x), a3(x),.., an(x))

o   ai(x) denotes features

ü  Euclidean distance between two instances

o   d(xi, xj)=sqrt (sum for r=1 to n (ar(xi) - ar(xj))2)

The similar method can be used for regression, by simply assigning the property value for the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of
the neighbors, so that the nearer neighbors contribute more to the average than
the more distant ones. (A common weighting scheme is to give each neighbor a
weight of 1/d, where d is the distance to the neighbor. This
scheme is a generalization of linear interpolation.)

The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the
algorithm, though no explicit training step is required. The k-nearest
neighbor algorithm is sensitive to the local structure of the data.

Reference:

http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

KNN can be done using PROC DISCRIM in SAS also:

http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
http://www.sascommunity.org/wiki/%3D%3D_How_to_implement_K-Nearest-...

Views: 3232

Comment

Join AnalyticBridge

Comment by JouNa on March 3, 2014 at 1:27pm

hi  what you think if combined K Means and KNN ?

Comment by Minethedata on January 3, 2011 at 4:56am
hi anjanita, i think the k can be found out by cross-validation. as i understand 5-fold cross validation, i carry out modeling using k =2 for all combinations of the training data and test data. Similarly i carry out modeling using all values of k for all combinations of training and test data. After this the set with maximum accuracy is picked up and the k corresponding to it is selected. Please correct me if I am wrong here in understanding of the procedure.