K Nearest Neighbors and its Application in Python

K Nearest Neighbors algorithm

The K Nearest Neighbors algorithm (KNN) is an elementary but important machine learning algorithm. KNN can be used for both classification and regression problems. Its popularity can be attributed to how easy it is to interpret and to the fact that it needs no separate training step, although predictions can become slow on large datasets because every query is compared against the stored training points.

Suppose we have a set of data points, some blue and some red, and we want to classify a new black point as either red or blue.

What the algorithm does is find the 3 nearest neighbours (if K=3) and check their colours. If a majority of the three nearest neighbours are red, the point is classified as red. The K parameter decides how many of the nearest neighbours are considered when determining the property of our unknown point.

You might be wondering how these nearest neighbours are found. This is where the mathematics comes in.

K Nearest Neighbors uses a similarity metric to determine the nearest neighbours. More often than not, this metric is the Euclidean distance between our unknown point and the other points in the dataset. The general formula for Euclidean distance is

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2}$$

where q1 to qn represent the attribute values for one observation and p1 to pn represent the attribute values for the other observation.

We divide the dataset into a training set (for which the labels or values are known) and a test set (for which we need to predict the values or labels). The test set is used for computing the accuracy of the model.

K Nearest Neighbors can be applied to both continuous and discrete data.

For discrete data, like the above example, it takes the label that has the majority among its K nearest neighbours. The following steps are involved:

  1. Taking each entry in the test set.
  2. Finding its euclidean distance from each entry in the training set.
  3. Appending the calculated distance to a new column ‘distance’ in the training set.
  4. Randomly shuffling the resulting set.
  5. Sorting the set in ascending order of distance.
  6. Choosing the first 10 entries (if K=10), i.e. the ten nearest neighbours.
  7. Finding the label (1 or 0) with the majority among those 10 entries.
  8. Appending the resultant predicted label to a new column ‘predicted label’ in the test set.
  9. Finding accuracy i.e. the percentage of labels in the test predicted correctly.
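
The steps above can be sketched in a few lines of pandas. This is a minimal illustration rather than a reference implementation: the toy data, the column names `x1`, `x2` and `label`, and the helper `knn_classify` are all made up for the example.

```python
import numpy as np
import pandas as pd

def knn_classify(train, test_row, feature_cols, label_col, k, seed=0):
    """Predict a label for one test row by majority vote among its k nearest training rows."""
    # Steps 2-3: Euclidean distance from the test row to every training row
    diffs = (train[feature_cols].to_numpy(dtype=float)
             - test_row[feature_cols].to_numpy(dtype=float))
    scored = train.assign(distance=np.sqrt((diffs ** 2).sum(axis=1)))
    # Step 4: shuffle so that ties in distance are not always broken by row order
    scored = scored.sample(frac=1, random_state=seed)
    # Steps 5-6: stable sort by distance, then keep the k nearest rows
    nearest = scored.sort_values('distance', kind='mergesort').head(k)
    # Step 7: majority label (if labels tie, mode() lists them sorted and we take the first)
    return nearest[label_col].mode().iloc[0]

# Hypothetical data: two well-separated clusters labelled 0 and 1
train = pd.DataFrame({
    'x1': [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    'x2': [1.0, 0.9, 1.1, 5.0, 5.1, 4.9],
    'label': [0, 0, 0, 1, 1, 1],
})
test = pd.DataFrame({'x1': [1.1, 5.1], 'x2': [1.0, 5.0], 'label': [0, 1]})

# Steps 1 and 8: predict for each test entry and store the result
test['predicted label'] = [
    knn_classify(train, row, ['x1', 'x2'], 'label', k=3)
    for _, row in test.iterrows()
]
# Step 9: fraction of test labels predicted correctly
accuracy = (test['predicted label'] == test['label']).mean()
print(accuracy)
```

The shuffle only matters when several training rows are exactly equally far from the test point, which is why the sort that follows it is a stable one.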

It works in a similar way for continuous data. For each entry in our test set, we iterate over all the entries in the training set and calculate the Euclidean distance to each of them. We then store these distances in a ‘distance’ column in our training set, shuffle the resulting set and sort it in ascending order of distance. Finally, we choose the first 3 (if K=3) or first 5 (if K=5) entries of the sorted dataset and take the mean of their ‘price’ column. In simpler terms, we are:

  1. Taking each entry in the test set.
  2. Finding its euclidean distance from each entry in the training set.
  3. Appending the calculated distance to a new column ‘distance’ in the training set.
  4. Randomly shuffling the resulting set.
  5. Sorting the set in ascending order of distance.
  6. Choosing the first 5 entries (if K=5), i.e. the five nearest neighbours.
  7. Calculating the mean of their ‘value’ column which is the predicted value.
  8. Appending the resultant predicted value to a new column ‘predicted value’ in the test set.
  9. Finding the MAE(mean absolute error) for the test set.
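
For the continuous case only the last steps change: the prediction is the mean of the neighbours' values, and the error is measured with MAE. A small sketch, again with invented data and a hypothetical helper `knn_regress`:

```python
import numpy as np
import pandas as pd

def knn_regress(train, test_row, feature_cols, value_col, k, seed=0):
    """Predict a value for one test row as the mean over its k nearest training rows."""
    diffs = (train[feature_cols].to_numpy(dtype=float)
             - test_row[feature_cols].to_numpy(dtype=float))
    scored = train.assign(distance=np.sqrt((diffs ** 2).sum(axis=1)))
    scored = scored.sample(frac=1, random_state=seed)          # random tie-breaking
    nearest = scored.sort_values('distance', kind='mergesort').head(k)
    return nearest[value_col].mean()                           # step 7: mean of 'value'

# Hypothetical one-feature dataset
train = pd.DataFrame({
    'size': [1.0, 1.5, 2.0, 3.0, 4.5],
    'value': [50.0, 60.0, 100.0, 150.0, 210.0],
})
test = pd.DataFrame({'size': [1.2, 3.2], 'value': [58.0, 130.0]})

test['predicted value'] = [
    knn_regress(train, row, ['size'], 'value', k=2)
    for _, row in test.iterrows()
]
# Step 9: mean absolute error over the test set
mae = (test['value'] - test['predicted value']).abs().mean()
print(test['predicted value'].tolist(), mae)  # [55.0, 125.0] 4.0
```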

This is how the euclidean distance is calculated between two observations: 
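
As a small worked example, the distance between two observations can be computed directly with NumPy. The two feature vectors below are invented for illustration.

```python
import numpy as np

# Two hypothetical observations with three attributes each
p = np.array([4.0, 1.0, 1.0])
q = np.array([6.0, 3.0, 2.0])

# Square the attribute-wise differences, sum them, take the square root
distance = np.sqrt(np.sum((q - p) ** 2))
print(distance)  # sqrt(4 + 4 + 1) = 3.0

# np.linalg.norm computes the same quantity
same_distance = np.linalg.norm(q - p)
```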

Now, we are ready to understand this better with an application.

Application of K Nearest Neighbors

Predicting house rent using the Airbnb dataset

Airbnb is a marketplace for short-term rentals that allows you to list part or all of your living space for others to rent. You can rent out everything from a room in an apartment to your entire house on Airbnb. Over the years, Airbnb has grown to become a popular alternative to hotels.

One challenge that hosts face when renting out their living space is determining the optimal nightly rent. This is where a dataset of all the other listings in a particular place helps. The dataset for Washington D.C. can be downloaded from this link.

In this problem, we will assume that we have a house in Washington D.C. that we want to put up on Airbnb, but we are unsure what nightly rent to charge. If we set the rent too high, renters will find a cheaper alternative, and if we set it too low, we might miss out on making a profit. K Nearest Neighbors can help us here.

We can find a few listings that are similar to ours, average the listed price of the most similar ones, and set our listing price to this average.

Let’s begin.

Reading and cleaning the dataset

Let us start by importing all the required libraries. The sklearn library provides the KNeighborsRegressor class, which implements K Nearest Neighbors for regression. pandas and numpy are used for dataframe manipulation.

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import cross_val_score,KFold
from sklearn.neighbors import KNeighborsRegressor

Let’s read the dataset into a pandas dataframe and display the columns.

In [309]:
dc_listings = pd.read_csv('listings.csv.gz')
dc_listings.columns
Out[309]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')

The dataset contains a lot of columns, many of which we will not use for predicting the price. For this simple model we ignore columns such as city, country_code, country, latitude and longitude.

When predicting the rent, which factors come to mind most readily?

  1. The no. of people the house can accommodate.
  2. The no. of bedrooms it has.
  3. The no. of bathrooms it has.
  4. The no. of reviews which shows the credibility of the host.
  5. The no. of reviews per month
These 5 columns, along with the ‘price’ column that we need to predict, are kept in the dataframe and the rest are omitted.
In [310]:
dc_listings = dc_listings[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month','price']].copy()
dc_listings.head()
Out[310]:
accommodates bedrooms bathrooms number_of_reviews reviews_per_month price
0 4 1.0 1.0 0 NaN $160.00
1 6 3.0 3.0 65 2.11 $350.00
2 1 1.0 2.0 1 1.00 $50.00
3 2 1.0 1.0 0 NaN $95.00
4 4 1.0 1.0 0 NaN $50.00

It looks like the prices are stored as strings (the ‘object’ dtype). To convert them to a numeric format, we strip the $ sign from each price.

Looking further, we find that some prices are written like $1,000. So we also remove the ‘,’ from all the prices before finally converting all the values in the column to the ‘float’ type.

In [311]:
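# Strip the currency symbol and the thousands separator, then convert to float
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace('$',''))
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace(',',''))
dc_listings['price'] = dc_listings['price'].astype('float')
# Drop every row that has a missing value in any of the kept columns
dc_listings = dc_listings.dropna(axis=0)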

Once the price column is fixed, let us check whether any missing values remain in the dataframe.

In [312]:
dc_listings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2858 entries, 1 to 3722
Data columns (total 6 columns):
accommodates         2858 non-null int64
bedrooms             2858 non-null float64
bathrooms            2858 non-null float64
number_of_reviews    2858 non-null int64
reviews_per_month    2858 non-null float64
price                2858 non-null float64
dtypes: float64(4), int64(2)
memory usage: 156.3 KB

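The earlier preview showed missing values (NaN) in the ‘reviews_per_month’ column, and other columns may contain some as well. Rows with any missing value are dropped with dropna. Since the previous cell already did this, info() reports 2858 non-null entries in every column, and running dropna again leaves the dataframe unchanged.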

In [313]:
dc_listings = dc_listings.dropna(axis=0)
dc_listings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2858 entries, 1 to 3722
Data columns (total 6 columns):
accommodates         2858 non-null int64
bedrooms             2858 non-null float64
bathrooms            2858 non-null float64
number_of_reviews    2858 non-null int64
reviews_per_month    2858 non-null float64
price                2858 non-null float64
dtypes: float64(4), int64(2)
memory usage: 156.3 KB

This is how our dataframe looks now

In [314]:
dc_listings.head()
Out[314]:
accommodates bedrooms bathrooms number_of_reviews reviews_per_month price
1 6 3.0 3.0 65 2.11 350.0
2 1 1.0 2.0 1 1.00 50.0
8 2 1.0 1.5 1 1.00 38.0
10 4 2.0 1.5 5 0.22 97.0
11 1 1.0 1.0 1 1.00 55.0

As the columns hold attributes on different scales, we will normalize them. To normalize, we subtract the column mean from each value in a column and divide by that column’s standard deviation. The ‘price’ column is the target, so it is restored to its original values afterwards.

In [315]:
normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
normalized_listings['price'] = dc_listings['price']
normalized_listings.head()
Out[315]:
accommodates bedrooms bathrooms number_of_reviews reviews_per_month price
1 1.517219 2.332495 3.286747 1.431422 0.169042 350.0
2 -1.154178 -0.226043 1.422583 -0.577927 -0.457978 50.0
8 -0.619899 -0.226043 0.490501 -0.577927 -0.457978 38.0
10 0.448660 1.053226 0.490501 -0.452343 -0.898586 97.0
11 -1.154178 -0.226043 -0.441581 -0.577927 -0.457978 55.0
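
A quick way to check that this standardization behaves as described is to run it on a tiny frame and look at the resulting column statistics. The data below is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical columns on very different scales
toy = pd.DataFrame({
    'accommodates': [1, 2, 4, 6],
    'number_of_reviews': [0, 10, 50, 200],
})

# Same operation as above: subtract the column mean, divide by the column std
standardized = (toy - toy.mean()) / toy.std()

# Every column now has mean 0 and standard deviation 1 (up to rounding)
print(standardized.mean().round(6))
print(standardized.std().round(6))
```

Note that pandas computes the sample standard deviation (ddof=1) by default, whereas scikit-learn's `StandardScaler` uses the population standard deviation, so the two approaches give slightly different scaled values on small datasets.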

Now we are ready to use K Nearest Neighbors. We will choose K as 7. This means that for each house in our test set we will look at 7 similar houses in the training set to decide the price.

Building the KNN model

In [316]:
knn = KNeighborsRegressor(n_neighbors = 7,algorithm = 'brute')
knn
Out[316]:
KNeighborsRegressor(algorithm='brute', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=7, p=2,
          weights='uniform')
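
The value K=7 is a choice, and other values may work better on a given dataset. One common way to compare candidates is k-fold cross-validation, which is what the `cross_val_score` and `KFold` imports at the top are typically used for. The sketch below runs on synthetic data rather than the Airbnb listings, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features and target, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
mae_by_k = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    # scikit-learn reports MAE as a negative score, so flip the sign
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_absolute_error', cv=kf)
    mae_by_k[k] = -scores.mean()

# Candidate with the lowest cross-validated MAE
best_k = min(mae_by_k, key=mae_by_k.get)
print(mae_by_k, best_k)
```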

We shuffle the normalized listings using the ‘sample’ function, then take 75 percent of the rows for our training set and the remaining 25 percent for our test set.

In [317]:
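# Shuffle the rows, then split them 75% / 25% into training and test sets
normalized_listings = normalized_listings.sample(frac=1).reset_index(drop=True)
split_at = int(0.75*len(normalized_listings))
train_df = normalized_listings.iloc[0:split_at]
test_df = normalized_listings.iloc[split_at:].copy()  # copy, since a column is added to it later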
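
The same shuffle-and-split can also be done with scikit-learn's `train_test_split`. This is an alternative to the manual approach, shown here on a made-up frame; `random_state` is fixed only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical listings, just to demonstrate the split
toy = pd.DataFrame({
    'accommodates': range(1, 9),
    'price': [50.0, 60.0, 75.0, 90.0, 110.0, 140.0, 180.0, 250.0],
})

# 25% of the rows go to the test set, the rest to the training set
toy_train, toy_test = train_test_split(toy, test_size=0.25,
                                       shuffle=True, random_state=42)
print(len(toy_train), len(toy_test))  # 6 2
```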

We will now train our KNN model on the training data and use it to predict prices for the test data. We specify ‘price’ as the target column to be predicted.

In [318]:
train_features = train_df[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month']]
target = train_df['price']
knn.fit(train_features,target)
predictions = knn.predict(test_df[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month']])
test_df['predicted price'] = predictions

Let us now calculate the Mean Absolute Error (MAE) between the predicted and the actual prices in the test set.

We check the MAE rather than the Root Mean Squared Error (RMSE) because RMSE squares each error before averaging, so a few large misses weigh much more heavily than many small ones. MAE treats every dollar of error equally and reads directly as the average amount by which our predictions are off.

In [319]:
mae = np.absolute(test_df['price'] - test_df['predicted price']).sum()/test_df.shape[0]
print(mae)
38.90129870129867
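
To see how the two metrics react to a single large miss, compare them on a handful of invented prices.

```python
import numpy as np

# Hypothetical actual and predicted nightly prices
actual = np.array([100.0, 100.0, 100.0, 100.0])
predicted = np.array([110.0, 90.0, 110.0, 20.0])  # the last prediction is off by $80

errors = actual - predicted                # [-10, 10, -10, 80]
mae = np.mean(np.abs(errors))              # (10 + 10 + 10 + 80) / 4 = 27.5
rmse = np.sqrt(np.mean(errors ** 2))       # sqrt(6700 / 4) = sqrt(1675), about 40.9
print(mae, rmse)
```

The single $80 miss pulls the RMSE well above the MAE, even though three of the four predictions are only $10 off.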

Our prices are off by about 39 dollars on average, which seems acceptable for such a simple model.

Making Predictions

Let us assume we have a flat which can accommodate 4 people and has 3 bedrooms and 2 bathrooms. Since it is a new flat, its number_of_reviews and reviews_per_month are both 0.
Let us store this data in a pandas DataFrame, with the columns in the same order as in our dataset.
Then we call the predict method on it. Keep in mind that the model was trained on normalized features while these values are passed in unscaled, so the resulting figure should be treated as illustrative.

In [320]:
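# One-row frame with the same feature columns, in the same order, as the training data
our_flat = pd.DataFrame([[4, 3, 2, 0, 0]], columns = ['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','reviews_per_month'])
# predict returns an array with one value per row, so take the first element
predicted_price = knn.predict(our_flat)[0]
print("We should list our flat at a nightly rate of $%d."%(predicted_price))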
We should list our flat at a nightly rate of $309.
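
When a model is trained on standardized features, a new listing is normally put on the same scale before calling `predict`, using the means and standard deviations of the training data. The sketch below shows the idea on synthetic data. The frame `raw`, its values and the resulting prediction are illustrative only and are not the Airbnb figures.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

features = ['accommodates', 'bedrooms', 'bathrooms']

# Hypothetical listings standing in for the real training data
raw = pd.DataFrame({
    'accommodates': [1, 2, 2, 4, 4, 6, 8],
    'bedrooms':     [1, 1, 1, 2, 2, 3, 4],
    'bathrooms':    [1, 1, 1, 1, 2, 2, 3],
    'price':        [40.0, 60.0, 65.0, 110.0, 130.0, 200.0, 320.0],
})

# Keep the training statistics so that new rows can be scaled the same way
mean, std = raw[features].mean(), raw[features].std()

toy_knn = KNeighborsRegressor(n_neighbors=3, algorithm='brute')
toy_knn.fit((raw[features] - mean) / std, raw['price'])

# Scale the new listing with the training mean and std before predicting
new_flat = pd.DataFrame([[4, 2, 2]], columns=features)
new_flat_scaled = (new_flat - mean) / std
toy_price = toy_knn.predict(new_flat_scaled)[0]
print(toy_price)
```
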
In this post, we learned about an important and effective algorithm. You can apply K Nearest Neighbors for both classification and regression problems. You will learn more about this technique if you practice and implement it on your own. Happy learning.

I am a college student and a data science enthusiast. Connect with me on LinkedIn: www.linkedin.com/in/saksham-malhotra-9bb69513b
