Data Mining Note 3 - K Nearest Neighbors

This chapter discusses k-Nearest Neighbors (kNN) as a methodology for both classification and regression problems. The kNN method serves as a basic and easy-to-understand foundational machine learning and data mining technique. It is an excellent baseline that also admits many extensions, and it usually performs reasonably well, and sometimes very well, when compared to more sophisticated techniques.

kNN classification means that the estimated class of a vector $\mathbf{x}$ is the most frequent class label in the neighborhood of $\mathbf{x}$. kNN classifiers are inherently multi-class, and are used extensively in applications such as image processing, character recognition and general pattern recognition tasks. For kNN regression, the estimated response value of a vector $\mathbf{x}$ is the average of the response values in the neighborhood of $\mathbf{x}$.

Principle for kNN Classification: The reasonable class/category for a given object is the most prevalent class among its nearest neighbors.
Principle for kNN Regression: The reasonable prediction of the response value for a given object is the average of the response values of its nearest neighbors.

General Steps for kNN: Comparison of Classification and Regression

  1. Choose a distance for measuring how far a given point is from another
  2. Set the size of the neighborhood k
  3. Compute the distance from each existing point to the new point
  4. Identify the class labels of the k points closest/nearest to the new point (classification)
    Identify the response values of the k points closest/nearest to the new point (regression)
  5. Assign the most frequent label to the new point (classification)
    Compute the average of the response values of those k neighbors as the best estimate of the new point (regression)


Distance

We first introduce some of the most commonly used distances, which will be applied later in kNN.

  • Euclidean distance: also known as the $l_2$ distance

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) = \sqrt{\sum_{l=1}^q(x_{il}-x_{jl})^2} = ||\mathbf{x}_i-\mathbf{x}_j||_2 \end{align*}$$
  • Manhattan distance (city block): also known as the $l_1$ distance

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) = \sum_{l=1}^q|x_{il}-x_{jl}| = ||\mathbf{x}_i-\mathbf{x}_j||_1 \end{align*}$$
  • Maximum distance: also known as the infinity distance

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) = \underset{l=1,...,q}{\text{max}}|x_{il}-x_{jl}| = ||\mathbf{x}_i-\mathbf{x}_j||_\infty \end{align*}$$
  • Minkowski distance: also known as $l_p$ distance

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) = \Big\{\sum_{l=1}^q|x_{il}-x_{jl}|^p \Big\}^{1/p} \end{align*}$$
  • Canberra distance:

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) =\sum_{l=1}^q \frac{ |x_{il}-x_{jl}| }{ |x_{il}|+|x_{jl}| } \end{align*}$$
  • Jaccard/Tanimoto distance: for binary vectors, i.e. $\mathbf{x}_i\in \{0,1\}^q$

    $$\begin{align*} d(\mathbf{x}_i,\mathbf{x}_j) =1- \frac{ \mathbf{x}_i\cdot \mathbf{x}_j }{ ||\mathbf{x}_i||^2 + ||\mathbf{x}_j||^2 - \mathbf{x}_i\cdot \mathbf{x}_j} \end{align*}$$
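
The distances above are straightforward to compute. The sketch below (plain NumPy; the function names are just illustrative) shows one possible implementation of each, assuming numeric vectors of equal length.

```python
import numpy as np

def euclidean(x, y):
    """l2 distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """l1 / city-block distance."""
    return np.sum(np.abs(x - y))

def maximum(x, y):
    """l-infinity (maximum) distance."""
    return np.max(np.abs(x - y))

def minkowski(x, y, p=3):
    """lp distance (p=2 recovers Euclidean, p=1 Manhattan)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def canberra(x, y):
    """Canberra distance (assumes no coordinate pair is simultaneously zero)."""
    return np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))

def jaccard(x, y):
    """Jaccard/Tanimoto distance for 0/1 vectors."""
    dot = np.dot(x, y)
    return 1.0 - dot / (np.sum(x ** 2) + np.sum(y ** 2) - dot)
```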


kNN Classification

Detailed Steps for kNN Classification

$\mathcal{D} = \Big\{(\mathbf{x}_1, Y_1),\ldots, (\mathbf{x}_n, Y_n) \Big\}$, with $\mathbf{x}_i\in \mathcal{X}^q$, $Y_i\in \{1,\ldots,g\}$

  1. Choose the value of k and the distance to be used
  2. Let $\mathbf{x}^*$ be a new point. Compute $d_i^* = d(\mathbf{x}^*,\mathbf{x}_i)$ for $i=1,2,\ldots,n$
  3. Rank all the distances $d^*_i$ in increasing order: $$\begin{align*} d^*_{(1)} \leq d^*_{(2)} \leq \ldots \leq d^*_{(k)} \leq d^*_{(k+1)} \leq \ldots \leq d^*_{(n)} \end{align*}$$
  4. Form $\mathcal{V}_k(\mathbf{x}^*)$, the k-neighborhood of $\mathbf{x}^*$: $$\begin{align*} \mathcal{V}_k(\mathbf{x}^*) = \Big\{ \mathbf{x}_i: d(\mathbf{x}^*,\mathbf{x}_i)\leq d^*_{(k)} \Big\} \end{align*}$$
  5. Compute the predicted response $\hat{Y}^*$ as $$\begin{align*} \hat{Y}^*_{kNN} &= \text{Most frequent label in } \mathcal{V}_k(\mathbf{x}^*) \\ &= \hat{f}^*_{kNN}(\mathbf{x}^*) = \underset{j\in\{1,\ldots,g\}}{\texttt{argmax}} \Big\{ p_j^{(k)}(\mathbf{x}^*) \Big\}, \\ \text{where } p_j^{(k)}(\mathbf{x}^*) &= \frac{1}{k}\sum_{\mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) } I(Y_i=j) \text{ estimates the probability that } \mathbf{x}^* \text{ belongs to class } j \text{ based on } \mathcal{V}_k(\mathbf{x}^*) \end{align*}$$
  • Note: [Posterior probability estimate] The quantity $p_j^{(k)}(\mathbf{x}^*) = \frac{1}{k}\sum_{\mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) } I(Y_i=j)$ can be regarded as a rough estimate of $\pi_j (\mathbf{x}^*) = Pr[Y^*=j\mid \mathbf{x}^* ]$, the posterior probability of class membership of $\mathbf{x}^*$
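
The following minimal sketch implements these classification steps in Python/NumPy; the function name `knn_classify` and the choice of Euclidean distance are illustrative assumptions, not part of the definition above. It returns both the predicted label and the estimated class probabilities $p_j^{(k)}(\mathbf{x}^*)$.

```python
import numpy as np

def knn_classify(X_train, y_train, x_star, k=5):
    """kNN classification: predicted label plus class-probability estimates."""
    # Step 2: distances from x_star to every training point (Euclidean here)
    d = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))
    # Steps 3-4: the k nearest training points form the neighborhood V_k(x_star)
    neighborhood = np.argsort(d)[:k]
    # Step 5: estimated posterior probabilities p_j^(k)(x_star) and the argmax label
    classes = np.unique(y_train)
    probs = np.array([(y_train[neighborhood] == j).mean() for j in classes])
    return classes[np.argmax(probs)], dict(zip(classes, probs))
```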

Comments:

  • k-Nearest Neighbors (kNN) essentially performs classification by voting for the most frequent label among the k nearest neighbors of $\mathbf{x}^*$.
  • kNN provides the most basic form of nonparametric classification.
  • Since the fundamental building block of kNN is the distance measure, one can easily perform classification beyond the traditional setting where the predictors are numeric. For instance, classification with kNN can be readily performed on binary indicator attributes (see the sketch after these comments)
    $$ \mathbf{x}^* = (x^*_{1},\ldots, x^*_{p} )^T \in \{ 0,1\}^p $$
  • kNN classifiers are inherently multi-class, and are used in many applications.
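
To illustrate the point about indicator attributes, the hypothetical snippet below classifies binary 0/1 vectors using the Jaccard/Tanimoto distance introduced earlier (the function name and the toy data are made up for illustration; all-zero vectors are assumed away to avoid division by zero).

```python
import numpy as np

def knn_classify_binary(X_train, y_train, x_star, k=3):
    """kNN on 0/1 indicator attributes using the Jaccard/Tanimoto distance."""
    dots = X_train @ x_star
    # For binary vectors, ||x||^2 is simply the number of ones in x
    d = 1.0 - dots / (X_train.sum(axis=1) + x_star.sum() - dots)
    neighborhood = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[neighborhood], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage with binary feature vectors
X = np.array([[1, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]])
y = np.array(["A", "B", "A", "B"])
print(knn_classify_binary(X, y, np.array([1, 0, 1, 0]), k=3))
```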


kNN Regression

Detailed Steps for kNN Regression

$\mathcal{D} = \Big\{(\mathbf{x}_1, Y_1),\ldots, (\mathbf{x}_n, Y_n) \Big\}$, with $\mathbf{x}_i\in \mathcal{X}^q$, $Y_i\in \mathbb{R}$

  1. Choose the value of k and the distance to be used
  2. Let $\mathbf{x}^*$ be a new point. Compute $d_i^* = d(\mathbf{x}^*,\mathbf{x}_i)$ for $i=1,2,\ldots,n$
  3. Rank all the distances $d^*_i$ in increasing order: $$\begin{align*} d^*_{(1)} \leq d^*_{(2)} \leq \ldots \leq d^*_{(k)} \leq d^*_{(k+1)} \leq \ldots \leq d^*_{(n)} \end{align*}$$
  4. Form $\mathcal{V}_k(\mathbf{x}^*)$, the k-neighborhood of $\mathbf{x}^*$: $$\begin{align*} \mathcal{V}_k(\mathbf{x}^*) = \Big\{ \mathbf{x}_i: d(\mathbf{x}^*,\mathbf{x}_i)\leq d^*_{(k)} \Big\} \end{align*}$$
  5. Compute the predicted response $\hat{Y}^*$ as $$\begin{align*} \hat{Y}^*_{kNN} &= \hat{f}^*_{kNN}(\mathbf{x}^*)\\ &= \frac{1}{k}\sum_{\mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) } Y_i \\ &= \frac{1}{k}\sum_{i=1}^n Y_i \, I( \mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) ) \end{align*}$$
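
A corresponding minimal sketch for kNN regression (again with Euclidean distance and an illustrative function name) simply averages the neighbors' responses:

```python
import numpy as np

def knn_regress(X_train, y_train, x_star, k=5):
    """kNN regression: average response of the k nearest neighbors of x_star."""
    # Distances from x_star to all training points
    d = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors (the neighborhood V_k(x_star))
    neighborhood = np.argsort(d)[:k]
    # Predicted response: mean of the neighbors' responses
    return y_train[neighborhood].mean()
```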

Comments:

  • k-Nearest Neighbors (kNN) essentially performs regression by averaging the responses of the k nearest neighbors of $\mathbf{x}^*$.
  • kNN provides the most basic form of nonparametric regression.
  • Since the fundamental building block of kNN is the distance measure, one can easily perform regression beyond the traditional setting where the predictors are numeric. For instance, regression with kNN can be readily performed on binary indicator attributes
    $$ \mathbf{x}^* = (x^*_{1},\ldots, x^*_{p} )^T \in \{ 0,1\}^p $$
  • kNN regression performs a form of smoothing (filtering).
  • The kNN estimate of the response for $\mathbf{x}^*$ is an estimator of the average response, namely the conditional expectation of $Y$ given $\mathbf{x}^*$:
    $$ \hat{Y}^*_{kNN} =\widehat{\mathbb{E}[Y^*|\mathbf{x}^*]} $$


Basic kNN & Weighted kNN

Limitation of Basic kNN

  • Equidistance: All neighbors are given the same contribution to the estimate of the response. In the estimated probability

    $$\begin{align*} p_j^{(k)}(\mathbf{x}^*) & = \frac{1}{k}\sum_{\mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) } I(Y_i=j) =\sum_{\mathbf{x}_i \in \mathcal{V}_k(\mathbf{x}^*) } w_i I(Y_i=j), \\ \text{the weight } w_i = \frac{1}{k}=\texttt{constant }&\text{for all points in } \mathcal{V}_k(\mathbf{x}^*) \text{, regardless of how far they are from } \mathbf{x}^* \end{align*}$$
  • No model, weak interpretability: There is no underlying model, and therefore no interpretation of the response relative to the predictor variables. There is also no training step, since everything happens at prediction time; for this reason, kNN is referred to as a lazy method.

  • Computationally intensive: Predictions are computationally very intensive, because for each new observation the whole dataset must be traversed to compute the response


Extension: Weighted kNN

kNN classification can be improved by weighting the votes as a function of the distance from $\mathbf{x}^*$. The weights are defined so as to preserve convexity, i.e. $\sum_{i=1}^k w_i = 1$. Some of the common weighting schemes include:

  • Exponential Decay:$$\begin{align*} w_i = \frac{e^{-d_i^*}}{\sum_{l=1}^k e^{-d_l^*} } \end{align*}$$
  • Inverse Distance:$$\begin{align*} w_i = \frac{ \frac{1}{1+d_i^*}}{\sum_{l=1}^k \frac{1}{1+d_l^*} } \end{align*}$$
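
A minimal sketch of distance-weighted voting with the two schemes above (the function name `weighted_knn_classify` is an illustrative assumption):

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x_star, k=5, scheme="exponential"):
    """Distance-weighted kNN vote using exponential-decay or inverse-distance weights."""
    d = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))
    neighborhood = np.argsort(d)[:k]
    d_k = d[neighborhood]
    raw = np.exp(-d_k) if scheme == "exponential" else 1.0 / (1.0 + d_k)
    w = raw / raw.sum()                      # normalize so the weights sum to one
    # Weighted vote: accumulate the weights of the neighbors in each class
    classes = np.unique(y_train)
    scores = np.array([w[y_train[neighborhood] == j].sum() for j in classes])
    return classes[np.argmax(scores)]
```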


Effects of k, n, p, and the Distance

Effect of k

  • k controls the complexity of the underlying classifier, with small k yielding very complex classifiers and large k yielding rather simple ones.

  • If k is small, the estimated class is determined from very few neighbors, so the resulting kNN classifier will have very low bias but very high variance. Take the example of k=1: the decision boundary will perfectly separate the classes on the training set, but will perform poorly on the test set.

  • If k is large, the estimated class is determined from very many neighbors, possibly from far and wide, so the resulting kNN classifier will have very large bias but very low variance. In fact, for k close to n, the classifier essentially predicts the overall majority class everywhere, and the decision boundary becomes trivially simple.

  • The determination of an optimal k requires trading off bias against variance:
    Determine k by cross-validation (see the sketch below)
    Determine k by direct minimization of the estimated prediction error on a suitably chosen test set
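
A hedged sketch of choosing k by cross-validation is given below; it uses plain leave-one-out estimation of the misclassification error, and the helper names (`loocv_error`, `choose_k`) are illustrative.

```python
import numpy as np

def loocv_error(X, y, k):
    """Leave-one-out misclassification error of kNN for a given k."""
    errors = 0
    for i in range(len(y)):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                        # exclude the held-out point itself
        neighborhood = np.argsort(d)[:k]
        labels, counts = np.unique(y[neighborhood], return_counts=True)
        errors += labels[np.argmax(counts)] != y[i]
    return errors / len(y)

def choose_k(X, y, k_grid=(1, 3, 5, 7, 9, 11)):
    """Pick the k in the grid with the smallest leave-one-out error."""
    return min(k_grid, key=lambda k: loocv_error(X, y, k))
```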

Effect of n

As stated before, kNN is a lazy method: everything happens at the prediction step. Thus the sample size n plays a crucial role, and a large n makes prediction computationally intensive.

Effect of p

The dimensionality p of the input space is felt only by the function that computes the distances. If that computation is optimized, kNN should be largely unaffected by this dimensionality computationally (the statistical effect is another matter; see the Curse of Dimensionality below).

Effect of distance

Some distances are more robust to extreme observations than others.

Pros & Cons of kNN

Strength

  • The kNN method is intuitively appealing and very easy to understand, explain, program/code and interpret
  • The kNN method provides a decent estimate of $Pr[Y = j|x]$, the posterior probability of class membership
  • The kNN method easily handles missing values (by restricting distance calculations to the subspace of observed attributes)
  • As the number of training samples grows, the asymptotic misclassification error rate is bounded above by twice the Bayes risk $R^*$:
    $$ \lim_{n\rightarrow \infty} R(\hat{f}_n^{(kNN)}) \leq 2R^* $$
  • The kNN method is naturally suitable for sequential/incremental machine learning.
  • The kNN method is also suitable where the hypothesis space is variable in size.
  • The kNN method can handle non-numeric data as long as the distance can be defined.
  • The kNN method can handle mixed types of data as long as the distance is computed as a hybrid or combination of distances.
  • The kNN method is inherently multi-class. This is very important because, for some other methods, going beyond binary classification requires sophisticated mathematics. kNN also accommodates very flexible decision boundaries.

Weakness

  • The computational complexity of kNN is very high at prediction time. Specifically, it is $\mathcal{O}(nmp)$, where n is the training set size, m is the test set size and p is the number of predictor variables. This means that kNN requires a large amount of memory, and therefore does NOT scale well. This failure in scalability is addressed using various heuristics and strategies.
  • kNN methods suffer from the Curse of Dimensionality (COD). When p is large and n is small (short, fat data: the dimension of the space is very high with few observations), the concept of nearness becomes meaningless to the point of being ill-defined, because the "neighborhood" becomes very large. Thus, the nearest neighbor could be very far away when the space is high-dimensional and there are very few observations.
  • The kNN method does not yield a model, and therefore has no parameters to help explain why the method performs as it does.
  • The kNN method is heavily affected by the local structure, and it is very sensitive to both irrelevant and correlated features. Unless the distance is well chosen and properly calibrated, kNN methods will be sensitive to outliers and all sorts of noise in the data.
  • Unless the distance is used in some way to weight the neighbors, more frequent classes will dominate in the determination of the estimated label. This means one has to be careful with kNN when one class has a far larger proportion of observations than the others.
  • The measurement scale of each variable affects the kNN method more than it affects most other methods, so the data are typically rescaled first (e.g., by standardizing, unitizing, or squeezing the data into the unit cube); a minimal sketch follows this list.
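
Because of this sensitivity to measurement scale, predictors are commonly rescaled before distances are computed. The sketch below standardizes each column to zero mean and unit variance using training-set statistics only; this is one common choice among several (unitizing to [0, 1] is another), not a prescribed step of the kNN method.

```python
import numpy as np

def standardize(X_train, X_new):
    """Standardize columns using the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant columns
    return (X_train - mu) / sigma, (X_new - mu) / sigma
```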


Application of kNN

  • Handwritten Digit Recognition is usually the first task in some Data Analytics competitions.
  • Text Mining, and specifically text categorization/classification, has made successful use of the k-Nearest Neighbors approach
  • Credit Scoring is another application that has been tackled with k-Nearest Neighbors classification
  • Disease Diagnostics has also been tackled using k-Nearest Neighbors classifiers