
How to choose the right distance metric in KNN?

Last Updated : 15 Jan, 2025

Choosing the right distance metric is crucial for the K-Nearest Neighbors (KNN) algorithm, which is used for both classification and regression tasks. The distance metric determines how the algorithm measures proximity between data points when searching for the nearest neighbors, so it directly impacts model accuracy and performance. The most common distance metrics are Euclidean, Manhattan, Minkowski, Chebyshev and cosine distance.

Here's a brief overview of each of them:

1. Euclidean Distance (L2 Norm)

Euclidean distance is the most commonly used metric and is set as the default in many libraries, including Python's Scikit-learn. It measures the straight-line distance between two points in a multi-dimensional space.

\textbf{Euclidean Distance:} d_{\text{Euclidean}}(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

where \mathbf{p} = (p_1, p_2, \dots, p_n) and \mathbf{q} = (q_1, q_2, \dots, q_n) are two points in n-dimensional space.
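For a quick sanity check, the straight-line distance can be computed directly with NumPy. This is a minimal sketch; the two 3-dimensional points below are made up purely for illustration:

```python
import numpy as np

# Two illustrative points in 3-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Square the coordinate-wise differences, sum them, take the square root
d_euclidean = np.sqrt(np.sum((p - q) ** 2))
print(d_euclidean)  # 5.0
```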

2. Manhattan Distance (L1 Norm)

Manhattan distance, also known as the taxicab or city block distance, measures the distance traveled along the grid-like streets of a city. It is the sum of the absolute differences between the corresponding coordinates of two points.

\textbf{Manhattan Distance (L1 Norm):} d_{\text{Manhattan}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i|

where \mathbf{p} = (p_1, p_2, \dots, p_n) and \mathbf{q} = (q_1, q_2, \dots, q_n).
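With the same made-up example points, the L1 norm gives a larger value, since every coordinate difference contributes its full absolute value instead of being combined along a straight line (a minimal sketch):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Sum of absolute coordinate-wise differences (L1 norm)
d_manhattan = np.sum(np.abs(p - q))
print(d_manhattan)  # 7.0
```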

3. Minkowski Distance

Minkowski distance is a generalized form that can be adjusted to give different distances based on the value of 'p'. When p=1, it becomes Manhattan distance, and when p=2, it becomes Euclidean distance.

\textbf{Minkowski Distance:} d_{\text{Minkowski}}(\mathbf{p}, \mathbf{q}, p) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}

where p is a parameter, and \mathbf{p} = (p_1, p_2, \dots, p_n) and \mathbf{q} = (q_1, q_2, \dots, q_n) .
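A small sketch shows how the order parameter interpolates between the previous two metrics. The helper function and example points below are illustrative, not part of any library API:

```python
import numpy as np

def minkowski_distance(u, v, order=2):
    """Minkowski distance with the given order parameter (hypothetical helper)."""
    return np.sum(np.abs(u - v) ** order) ** (1.0 / order)

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(minkowski_distance(u, v, order=1))  # 7.0  -> same as Manhattan
print(minkowski_distance(u, v, order=2))  # 5.0  -> same as Euclidean
print(minkowski_distance(u, v, order=3))  # ~4.498
```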

4. Chebyshev Distance (Maximum Norm)

Chebyshev distance calculates the maximum absolute difference along any dimension. It is useful in scenarios where the maximum difference is critical.

\textbf{Chebyshev Distance (Maximum Norm):} d_{\text{Chebyshev}}(\mathbf{p}, \mathbf{q}) = \max_{i} |p_i - q_i|

where \mathbf{p} = (p_1, p_2, \dots, p_n) and \mathbf{q} = (q_1, q_2, \dots, q_n).
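Again using the same illustrative points, the Chebyshev distance keeps only the single largest coordinate difference:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Largest absolute difference along any single dimension (L-infinity norm)
d_chebyshev = np.max(np.abs(p - q))
print(d_chebyshev)  # 4.0
```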

5. Cosine Similarity

Cosine similarity measures how similar two vectors are based on the cosine of the angle between them: a value of 1 means the vectors point in the same direction, while 0 means they are orthogonal. Cosine distance is defined as 1 minus the cosine similarity, so for non-negative feature vectors it ranges from 0 (highly similar) to 1 (completely different). It is commonly used in text analytics to compare document similarity by word frequency. The formula is:

\cos \theta = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}

This formula gives the cosine similarity between the two vectors, and 1 - cos θ gives their cosine distance.
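A short sketch with made-up "word count" vectors shows the key property: a vector and a scaled copy of it have cosine similarity of about 1 (cosine distance of about 0), because only direction matters, not magnitude:

```python
import numpy as np

# Illustrative "word count" vectors: b is simply a scaled copy of a
a = np.array([3.0, 1.0, 0.0])
b = np.array([6.0, 2.0, 0.0])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim

print(cos_sim)   # ~1.0 -> same direction, magnitude is ignored
print(cos_dist)  # ~0.0
```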

Let's break down the distance metrics and illustrate them with a visual comparison:

[Figure: Visualization of each of the metrics individually]

The image illustrates various distance metrics between two points, A and B, on a 2D coordinate plane. The Euclidean distance is the straight line between them, while Manhattan is the sum of the horizontal and vertical movements. Minkowski (p=3) is a generalization of Euclidean, and Chebyshev is the maximum difference along either axis.

Choosing the Right Distance Metric in KNN

| Distance Metric | When to Use | Use Case Scenario |
|---|---|---|
| Euclidean Distance | Continuous numerical data; when the data is well-scaled. | Predicting house prices based on square footage and number of bedrooms; image recognition where pixel values are continuous features. |
| Manhattan Distance | Data with features on a grid (e.g., city streets); when data is less sensitive to outliers. | Delivery routing for trucks following city grids; robot navigation through a grid with restricted movement (only vertical or horizontal); infrastructure planning and transportation networks. |
| Minkowski Distance | When you need a flexible metric that can represent different distances; when you want to tune the parameter 'p' for customization. | Analyzing weather data like temperature, humidity, and wind speed to predict the likelihood of rain; choosing between Euclidean and Manhattan behavior depending on the problem's spatial relationships. |
| Chebyshev Distance | When the maximum difference between coordinates is important; when features represent movements along a grid with equal importance. | Measuring the maximum number of moves a piece can make in any direction in a board game; robot movement where diagonal and straight moves are equally important (e.g., chess, checkers). |
| Cosine Similarity | When the direction of the vectors is more important than their magnitude. | Text analysis, image retrieval, and recommendation systems. (Note: cosine similarity is not a distance metric but a similarity measure, yet it is often used in similar contexts.) |
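In practice, the metric is usually treated as a hyperparameter to try. The sketch below is one typical workflow, not the only way: it uses scikit-learn's KNeighborsClassifier on the built-in Iris dataset with k = 5 (both arbitrary choices for illustration) and compares several metrics with cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

for metric in ["euclidean", "manhattan", "chebyshev", "cosine"]:
    # Scale features first: distance-based models are sensitive to feature ranges
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, metric=metric))
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"{metric:>10}: mean CV accuracy = {scores.mean():.3f}")
```

Passing metric='minkowski' together with the p parameter covers both the Euclidean (p=2) and Manhattan (p=1) cases, and the cosine metric typically forces a brute-force neighbor search because tree-based indexes do not support it.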


