Data Analytics Blog

Data Analytics Case Studies, WhyTos, HowTos, Interviews, News, Events, Jobs and more...# K Nearest Neighbors and its Application in Python

## K Nearest Neighbors algorithm

The K Nearest Neighbors algorithm (KNN) is an elementary but important machine learning algorithm. KNN can be used for both classification and regression predictive problems. The reason for the popularity of K Nearest Neighbors can be attributed to its easy interpretation and low calculation time.

Suppose we have a set of many data points blue and red and we want to classify the a new black coloured point as either red or blue.

What the algorithm does, is find the 3 nearest neighbours (if K=3), and check the colour of these three nearest neighbours. As a majority of the three nearest neighbours are red the point is classified as red. The K parameter decides how many of the nearest neighbours have to be considered to determine the property of our unknown point.

You might be wondering how are these nearest neighbours found. This is where the mathematics comes in.

K Nearest Neighbors uses a similarity metric to determine the nearest neighbours. This similarity metric is more often than not the Euclidean distance between our unknown point and the other points in the dataset. The general formula for Euclidean distance is

where q1 to qn represent the attribute values for one observation and p1 to pn represent the attribute values for the other observation.

We divide the set into training set(for which the labels or values are known) and test set(for which we need to predict the values/labels. The test set is used for computing the accuracy of the model.

K Nearest Neighbors can be applied to both continuous and discrete data.

For discrete data, like the above example, it takes the label which has the majority among it’s K Nearest Neighbors. The following steps are involved:-

- Taking each entry in the test set.
- Finding its euclidean distance from each entry in the training set.
- Appending the calculated distance to a new column ‘distance’ in the training set.
- Randomly shuffling the resulting set.
- Sorting the set in ascending order of distance.
- Choosing the first 10 entries(if K=10) i.e. the five nearest neighbours.
- Finding the label(1 or 0) with the majority among those 10 entries.
- Appending the resultant predicted price to a new column ‘predicted label’ in the test set.
- Finding accuracy i.e. the percentage of labels in the test predicted correctly.

It works in a similar way for continuous data. For each entry in our test set we will iterate over all the entries in the training set and calculate the euclidean distance. Thus, we are calculating the euclidean distances between each entry in our test set and all entries in our training set. After the euclidean distance has been calculated we, make a ‘distance’ column in our training set, shuffle the resulting set and sort them in ascending order of distance. Choose the first 3(if K=3) or first 5(if K=5) entries of the sorted dataset and find the mean of their ‘price’ column. So in simpler terms we are-

- Taking each entry in the test set.
- Finding its euclidean distance from each entry in the training set.
- Appending the calculated distance to a new column ‘distance’ in the training set.
- Randomly shuffling the resulting set.
- Sorting the set in ascending order of distance.
- Choosing the first 5 entries(let K=5) i.e. the five nearest neighbours.
- Calculating the mean of their ‘value’ column which is the predicted value.
- Appending the resultant predicted value to a new column ‘predicted value’ in the test set.
- Finding the MAE(mean absolute error) for the test set.

This is how the euclidean distance is calculated between two observations:

Now, we are ready to understand this better with an application.

## Application of K Nearest Neighbors

### Predicting house rent using AirBnb database

AirBnB is a marketplace for short-term rentals that allows you to list part or all of your living space for others to rent. You can rent everything from a room in an apartment to your entire house on AirBnB. Over the years, Airbnb has grown to become a popular alternative to hotels.

One challenge that hosts looking to rent their living space face is determining the optimal nightly rent price. This is where the dataset on all other listings of a particular place helps.The dataset for Washington D.C. can be downloaded from this link.

In this problem, we will assume that we have a house in Washington D.C. that we want to put up on Airbnb but we are facing a problem as to deciding its nightly rent. If we choose the rentas too high, then renters will find a cheaper alternative and if we rent it too low we might miss out on making a profit. K Nearest Neighbors can help us here.

We can find a few listings that are similar to ours, average the listed price for the ones most similar to ours, and set our listing price to this calculated average price.

Let’s begin.

### Reading and cleaning the dataset

Let us start by importing all the required libraries. sklearn library provides us with the KMeans implementation class. Pandas and numpy are used for dataframe manipulation.

```
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import cross_val_score,KFold
from sklearn.neighbors import KNeighborsRegressor
```

Let’s read in the dataset into a pandas dataframe and display the columns

```
dc_listings = pd.read_csv('listings.csv.gz')
dc_listings.columns
```

The dataset contains a lot of columns many of which are irrelevant to the price of the house.Columns such as city,city_code,country,latitude, longitude etc. don’t serve any useful information to us if we want to predict prices.

For predicting the rent what are the most common factors that come to head.

- The no. of people the house can accommodate.
- The no. of bedrooms it has.
- The no. of bathrooms it has.
- The no. of reviews which shows the credibility of the host.
- The no. of reviews per month

These 5 columns along with the column ‘price’ which we need to predict are kept in the dataframe and the rest are ommited.

```
dc_listings = dc_listings[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month','price',]]
dc_listings.head()
```

It looks like the prices are in ‘string’ or ‘object’ format. To convert it into the numeric format, we strip the \$ sign from each price.

If we look further, we will find some prices are like $1,000. So we also remove the ‘,’ from all the prices before finally converting the all the values in the column to ‘float’ type.

```
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace('$',''))
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace(',',''))
dc_listings['price'] = dc_listings['price'].astype('float')
dc_listings = dc_listings.dropna(axis=0)
```

Once, the price column is fixed let us check if there are any missing values in the dataframe.

`dc_listings.info()`

There are indeed some missing values, in the ‘bedrooms’,’bathrooms’, ‘review_scores_rating’ and ‘number_of_reviews’ columns. We drop the rows in the dataframe with any missing values.

```
dc_listings = dc_listings.dropna(axis=0)
dc_listings.info()
```

This is how our dataframe looks now

```
dc_listings.head()
```

As all the columns have attributes that are on different scales we will normalize them. For normalizing we subtract from each value in the column, the mean of that column and dvide by the standard deviation of that column.

```
normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
normalized_listings['price'] = dc_listings['price']
normalized_listings.head()
```

Now we are ready to use K Nearest Neighbors. We will choose K as 7. This means that for each house in our test set we will look at 7 similar houses in the training set to decide the price.

### Building the Knn model

```
knn = KNeighborsRegressor(n_neighbors = 7,algorithm = 'brute')
knn
```

We will shuffle the normalized listings set by using the ‘sample’ function and then choose 75 percent of the rows for our training set and the rest 25 percent for our test set.

```
normalized_listings = normalized_listings.sample(frac=1).reset_index(drop=True)
train_df = normalized_listings.iloc[0:int(0.75*len(normalized_listings))]
test_df = normalized_listings.iloc[int(0.75*len(normalized_listings)):]
```

We will now train our knn model using the training data and fit our test data onto it. We will specify ‘price’ as the target column which needs to be predicted.

```
train_features = train_df[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month']]
target = train_df['price']
knn.fit(train_features,target)
predictions = knn.predict(test_df[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month']])
test_df['predicted price'] = predictions
```

Let us now calculate the Mean Absolute Error(MAE) and the actual prices in the test set.

The reason we check the Mean Absolute Error(MAE) instead of the root Mean Square Error(RMSE) is that in RMSE, even small error of $10 is penalised to a larger extent. Thus MAE is a more true representation of our error

```
mae = np.absolute(test_df['price'] - test_df['predicted price']).sum()/test_df.shape[0]
print(mae)
```

Our prices are off by 38 dollars on an average which is not that big.

### Making Predictions

Let us assume we have a flat which can accommodate 4 people, has 2 bedrooms, and 2 bathrooms. Also, since it is a new flat, it does not have any number_of_reviews and reviews_per_month.

Let us store all our data into a pandas series in the same corresponding order as our dataset.

Then we call the predicted function on it.

```
our_flat = pd.DataFrame([], columns = ['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month'])
our_flat.loc[1,:] = [4,3,2,0,0]
predicted_price = knn.predict(our_flat)
print("We should list our flat at a nightly rate of $%d."%(predicted_price))
```