Air Bnb Data Cleaning

18 minute read

#importing relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import re
import string
from scipy import stats

Here is the Link to Data

# Reading Data 
df = pd.read_csv('singapore_airbnb_dirty_data.csv')
#Understanding the data 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7907 entries, 0 to 7906
Data columns (total 16 columns):
id                                7895 non-null object
name                              7863 non-null object
host_id                           7907 non-null object
host_name                         7907 non-null object
neighbourhood_group               7841 non-null object
neighbourhood                     7870 non-null object
latitude                          7907 non-null float64
longitude                         7907 non-null float64
room_type                         7890 non-null object
price                             7907 non-null int64
minimum_nights                    7891 non-null object
number_of_reviews                 7907 non-null int64
last_review                       5149 non-null object
reviews_per_month                 5149 non-null float64
calculated_host_listings_count    7907 non-null int64
availability_365                  7521 non-null float64
dtypes: float64(4), int64(3), object(9)
memory usage: 988.5+ KB
#Some of the columns should be category type. Most of them are object type as data is dirty and inconsistent
#Columns to be converted to Category: neighbourhood_group, room_type
#Missing values in Each column
print('Column Name' + '\t\t\t' + 'Null Values')
df.apply(lambda x: sum(x.isnull()),axis=0)
Column Name			Null Values





id                                  12
name                                44
host_id                              0
host_name                            0
neighbourhood_group                 66
neighbourhood                       37
latitude                             0
longitude                            0
room_type                           17
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64
#Checking Unique Identifiers in our Data
print('Column Name' + '\t\t\t' + 'IsUnique')
df.apply(lambda x: x.is_unique,axis=0)
Column Name			IsUnique





id                                False
name                              False
host_id                           False
host_name                         False
neighbourhood_group               False
neighbourhood                     False
latitude                          False
longitude                         False
room_type                         False
price                             False
minimum_nights                    False
number_of_reviews                 False
last_review                       False
reviews_per_month                 False
calculated_host_listings_count    False
availability_365                  False
dtype: bool
#Handling ID Column
#The above output shows that ID is not unique. Hence, it can be dropped
df = df.drop('id',axis=1)
#Improvement 1 - ID column Dropped
df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                 66
neighbourhood                       37
latitude                             0
longitude                            0
room_type                           17
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Neighbourhood Column

#Neighbourhood group values for records that are missing neighbourhood
print(df[df['neighbourhood'].isnull()]['neighbourhood_group'].value_counts())
Central Region       31
East Region           3
North-East Region     2
North Region          1
Name: neighbourhood_group, dtype: int64
#Looking for neighbourhood value that occurs the most to be used as value in place of null group values
print(df[df['neighbourhood_group'] == 'Central Region']['neighbourhood'].value_counts())
Kallang             1034
Geylang              972
Rochor               531
Novena               529
Outram               472
Bukit Merah          463
Downtown Core        426
River Valley         357
Queenstown           260
Tanglin              206
Singapore River      174
Marine Parade        171
Orchard              135
Newton               133
Bukit Timah          127
Toa Payoh             99
Museum                62
Bishan                56
Southern Islands      16
Marina South           1
Name: neighbourhood, dtype: int64
#Replacing Central Region Neighbourhood values with Kallang
cond = (df['neighbourhood_group'] == 'Central Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Kallang'
print(df[df['neighbourhood_group'] == 'East Region']['neighbourhood'].value_counts())
Bedok        369
Pasir Ris     71
Tampines      63
Name: neighbourhood, dtype: int64
#Replacing East Region Neighbourhood values with Bedok
cond = (df['neighbourhood_group'] == 'East Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Bedok'
print(df[df['neighbourhood_group'] == 'North-East Region']['neighbourhood'].value_counts())
Hougang       108
Serangoon      68
Sengkang       67
Ang Mo Kio     56
Punggol        42
Name: neighbourhood, dtype: int64
#Replacing NE Region Neighbourhood values with Hougang
cond = (df['neighbourhood_group'] == 'North-East Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Hougang'
print(df[df['neighbourhood_group'] == 'North Region']['neighbourhood'].value_counts())
Woodlands                  66
Yishun                     52
Sembawang                  40
Central Water Catchment    34
Sungei Kadut                5
Mandai                      3
Lim Chu Kang                1
Name: neighbourhood, dtype: int64
#Replacing North Region Neighbourhood values with Woodlands
cond = (df['neighbourhood_group'] == 'North Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Woodlands'
#Neighbourhood Column is now cleaned
df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                 66
neighbourhood                        0
latitude                             0
longitude                            0
room_type                           17
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Neighbourhood Group Column

print(df[df['neighbourhood_group'].isnull()]['neighbourhood'].value_counts())
Geylang             12
River Valley         5
Rochor               4
Bukit Merah          4
Tanglin              4
Novena               4
Kallang              3
Bukit Timah          3
Queenstown           3
Outram               3
Downtown Core        2
Toa Payoh            2
Bedok                2
Newton               1
Jurong West          1
Ang Mo Kio           1
Bukit Panjang        1
Serangoon            1
Yishun               1
Choa Chu Kang        1
Bukit Batok          1
Bishan               1
Hougang              1
Singapore River      1
Sembawang            1
Orchard              1
Southern Islands     1
Clementi             1
Name: neighbourhood, dtype: int64
df['neighbourhood_group'].value_counts()
Central Region       6255
West Region           535
East Region           506
North-East Region     343
North Region          202
Name: neighbourhood_group, dtype: int64
#Code to replace the missing neighbourhood group values
x = df['neighbourhood'].unique()
for val in x:
    i = str(df[df['neighbourhood'] == val]['neighbourhood_group'].value_counts())
    c  = re.match('\w+(?:-\w+)+\s.[a-z]*|.[a-z]*\s.[a-z]*',i)
    cond = (df['neighbourhood'] == val) & (df['neighbourhood_group'].isnull())
    df.loc[cond, 'neighbourhood_group'] = c[0]
    
df['neighbourhood_group'].value_counts()
Central Region       6309
West Region           540
East Region           508
North-East Region     346
North Region          204
Name: neighbourhood_group, dtype: int64
#Neighbourhood Group column is cleaned
df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                           17
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Room Type Column

df['room_type'].value_counts()
Entire home/apt    4126
Private room       3368
Shared room         392
2                     1
4                     1
1                     1
3                     1
Name: room_type, dtype: int64
#There are data entry errors in Room type column. We dont know what this encoding means so we will drop these values as 
#they are small in number.
df = df[~df['room_type'].isin(['1','2','3','4'])]
df['room_type'].value_counts()
Entire home/apt    4126
Private room       3368
Shared room         392
Name: room_type, dtype: int64
df = df.astype({"room_type":'category'}) 
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7903 entries, 4 to 7906
Data columns (total 15 columns):
name                              7859 non-null object
host_id                           7903 non-null object
host_name                         7903 non-null object
neighbourhood_group               7903 non-null object
neighbourhood                     7903 non-null object
latitude                          7903 non-null float64
longitude                         7903 non-null float64
room_type                         7886 non-null category
price                             7903 non-null int64
minimum_nights                    7887 non-null object
number_of_reviews                 7903 non-null int64
last_review                       5145 non-null object
reviews_per_month                 5145 non-null float64
calculated_host_listings_count    7903 non-null int64
availability_365                  7517 non-null float64
dtypes: category(1), float64(4), int64(3), object(7)
memory usage: 934.0+ KB
df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                           17
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64
#There are some errors in the data entry that need to be removed
df['neighbourhood_group'].value_counts()
Central Region       6308
West Region           540
East Region           507
North-East Region     346
North Region          202
Name: neighbourhood_group, dtype: int64
df = df.astype({"neighbourhood_group":'category', "room_type":'category'}) 
df.groupby(['neighbourhood_group', 'room_type']).size()
neighbourhood_group  room_type      
Central Region       Entire home/apt    3733
                     Private room       2217
                     Shared room         346
East Region          Entire home/apt     130
                     Private room        364
                     Shared room          11
North Region         Entire home/apt      53
                     Private room        140
                     Shared room           9
North-East Region    Entire home/apt      64
                     Private room        270
                     Shared room          10
West Region          Entire home/apt     146
                     Private room        377
                     Shared room          16
dtype: int64
#We will cater the missing room type values based on the neighbourhood group of the records
df[df["room_type"].isnull()]['neighbourhood_group']
2252       Central Region
2253    North-East Region
2254          East Region
2255       Central Region
2256    North-East Region
2257       Central Region
2258       Central Region
2259       Central Region
7560       Central Region
7561          West Region
7562       Central Region
7563       Central Region
7564       Central Region
7565       Central Region
7566       Central Region
7567       Central Region
7568          East Region
Name: neighbourhood_group, dtype: category
Categories (5, object): [Central Region, East Region, North Region, North-East Region, West Region]
#Where region is central - Entire home/apt
#Where region is East - Private room
#Where region is North - Private room
#Where region is North-East - Private room
#Where region is West - Private room
#Where region is central - Entire home/apt
c1 = (df['neighbourhood_group'] == 'Central Region') & (df['room_type'].isnull())
df.loc[c1, 'room_type'] = 'Entire home/apt'

#Where region is East - Private room
c2 = (df['neighbourhood_group'] == 'East Region') & (df['room_type'].isnull())
df.loc[c2, 'room_type'] = 'Private room'

#Where region is North - Private room
c3 = (df['neighbourhood_group'] == 'North Region') & (df['room_type'].isnull())
df.loc[c3, 'room_type'] = 'Private room'

#Where region is North-East - Private room
c4 = (df['neighbourhood_group'] == 'North-East Region') & (df['room_type'].isnull())
df.loc[c4, 'room_type'] = 'Private room'

#Where region is West - Private room
c5 = (df['neighbourhood_group'] == 'West Region') & (df['room_type'].isnull())
df.loc[c5, 'room_type'] = 'Private room'

df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                      16
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Minimum Nights

df['minimum_nights'] = pd.to_numeric(df['minimum_nights'], errors='coerce') # Replace incorrect values with NaT
df['minimum_nights'].isnull().value_counts()
False    7884
True       19
Name: minimum_nights, dtype: int64
pd.isnull(df['minimum_nights']).sum()
19
df['minimum_nights'].fillna(0, inplace=True)
df.apply(lambda x: sum(x.isnull()),axis=0)
name                                44
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       2758
reviews_per_month                 2758
calculated_host_listings_count       0
availability_365                   386
dtype: int64
df['name'].unique()
array(['B&B  Room 1 near Airport & EXPO', 'Room 2-near Airport & EXPO',
       '3rd level Jumbo room 5 near EXPO', ...,
       '[ Farrer Park ] New City Fringe CBD Mins to MRT',
       'Cheap Master Room in Central of Singapore',
       'Amazing room with private bathroom walk to Orchard'], dtype=object)
#It is impossible to cater null name values. So we will drop them
df = df.dropna(subset=['name'])
df.isnull().sum()
name                                 0
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       2740
reviews_per_month                 2740
calculated_host_listings_count       0
availability_365                   386
dtype: int64
df = df[~df['room_type'].isin(['1','666'])]
df['host_id'] = pd.to_numeric(df['host_id'], errors='coerce') # Replace incorrect values with NaT
df = df.dropna(subset=['host_id'])
df.isnull().sum()
name                                 0
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       2740
reviews_per_month                 2740
calculated_host_listings_count       0
availability_365                   386
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7857 entries, 4 to 7906
Data columns (total 15 columns):
name                              7857 non-null object
host_id                           7857 non-null float64
host_name                         7857 non-null object
neighbourhood_group               7857 non-null category
neighbourhood                     7857 non-null object
latitude                          7857 non-null float64
longitude                         7857 non-null float64
room_type                         7857 non-null category
price                             7857 non-null int64
minimum_nights                    7857 non-null float64
number_of_reviews                 7857 non-null int64
last_review                       5117 non-null object
reviews_per_month                 5117 non-null float64
calculated_host_listings_count    7857 non-null int64
availability_365                  7471 non-null float64
dtypes: category(2), float64(6), int64(3), object(4)
memory usage: 875.0+ KB
#So far we have cleaned - 
#ID; HOST_ID; NEIGHBOURHOOD_GROUP; NEIGHBOURHOOD; ROOM_TYPE; MINIMUM_NIGHTS

Host Name

#- Step 1- It is a string variable. It should not contain numeric values 
def eliminate_numeric_values(item):
    if isinstance(item, str):
        return item if not re.match(r"^[0-9]+$", item) else pd.NaT
    else:
        return pd.NaT
df['host_name'] = df['host_name'].map(eliminate_numeric_values)
df = df.dropna(subset =['host_name'])
#Step 1 
df['host_name'] = df['host_name'].str.strip()
df['host_name'] = df['host_name'].replace(' ','')
df['host_name'] = df['host_name'].replace("' ",'')
df['host_name'] = df['host_name'].replace("'",'')
#This will do a basic clean up to manage spaces
#Function that checks whether the entry contains English Alphabets
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
#Function that removes Non-Ascii Characters
def remove_non_ascii_1(text):
    return ''.join([i if ord(i) < 128 else '' for i in text])
#Cleaning the Host Name Column
my_list = df['host_name']
counter = 0
for var in my_list:
    #print(var)
    if isEnglish(var):
        continue
    else:
        val = remove_non_ascii_1(var)
        #print(val)
        counter = counter + 1
        my_list.replace(var,val)

        #Replacing all the empty strings with 0 to find accurately the total values changes

df['host_name'] = my_list
df['host_name'].fillna(value=0,inplace=True)

Name Column

df['name'] = df['name'].map(eliminate_numeric_values)
#Step 1 
df['name'] = df['name'].str.strip()
df['name'] = df['name'].replace(' ','')
df['name'] = df['name'].replace("' ",'')
df['name'] = df['name'].replace("'",'')
#This will do a basic clean up to manage spaces
df['name'] = df['name'].str.encode('ascii', 'ignore').str.decode('ascii')
#Cleaning the Host Name Column
list_2 = df['name']
counter = 0
for var in list_2:
    #print(var)
    if isEnglish(var):
        continue
    else:
        val = remove_non_ascii_1(var)
        #print(val)
        counter = counter + 1
        list_2.replace(var,val)

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-145-96ccfbe311e6> in <module>
      4 for var in list_2:
      5     #print(var)
----> 6     if isEnglish(var):
      7         continue
      8     else:


<ipython-input-137-1962dfc97554> in isEnglish(s)
      2 def isEnglish(s):
      3     try:
----> 4         s.encode(encoding='utf-8').decode('ascii')
      5     except UnicodeDecodeError:
      6         return False


AttributeError: 'NaTType' object has no attribute 'encode'
list_2
4                         B&B  Room 1 near Airport & EXPO
5                              Room 2-near Airport & EXPO
6                        3rd level Jumbo room 5 near EXPO
7                  Long stay at The Breezy East "Leopard"
8                 Long stay at The Breezy East "Plumeria"
9                 Long stay at The Breezy East "Red Palm"
10      Conveniently located City Room!( (Phone number...
11                  15 mins to Outram MRT Single Room (B)
12                                 Booking for 3 bedrooms
13                         5 mins walk from Newton subway
14                     20 Mins to Sentosa @ Hilltop ! (8)
15                 Accomo@ REDHILL-INSEAD, NTU,NUS -Mu(D)
16           10 mins to Redhill MRT @ Mini Orange Room(5)
17                      Budget short stay room near  EXPO
18        Double room in an Authentic Peranakan Shophouse
19                        5mins from Newton Train Station
20                            5 mins walk from Newton MRT
21                   Heritage Apartment Room near Orchard
22                        Master Bedroom 4-Airport & EXPO
23                     Master Bedroom in Newly Built Flat
24                  Comfortable Condo Room - Boon Lay MRT
25                      Boutique 1 bedroom entiere appart
26                     BEST CITY LIVING WITH GA RESIDENCE
27                  Quiet ensuite room in Holland Village
28                    HDB housing in prime area near town
29                                  Homestay at Serangoon
30        2 Bdrm Spacious Condo_SeaView _10 Mins to Beach
31                                    East Coast Boutique
32                                 Cosy Furnished bedroom
33                         Bedroom for one. Good sea view
                              ...                        
7877      Modern Loft (3 Mins to MRT) - (Retractable Bed)
7878                             10 reason to book-studio
7879                            10 reason to stay -studio
7880         Common room at Orchard/Somerset/Central area
7881     Spacious resort living with private balcony/pond
7882     Lavender Queenbed Rm Heritage windows: 5mwalkMRT
7883                                               ,5,1,,
7884    Oriental Holiday Hotel Style 1 BR Condo at Orc...
7885               Soho living at the Clarke Quay Central
7886    Tampines mrt 3 min away/near airport/ 3 month ...
7887            Cozy loft studio apartment bedroom unit .
7888    Ensuite room in quiet estate in Ang Mo Kio nr MRT
7889    Compact apartment near supermarket/mrt/bus/foo...
7890    Luxurious stay , 3 bedroom apartment at Orchar...
7891              Cozy Unit Nearby Orchard and Ting Bahru
7892    Luxury 1-bed condo unit in the heart of Marina...
7893                             Comfortable House Studio
7894               Convenient Loft Space in Pasir Panjang
7895                Studio Loft Closed to Kent Ridge Park
7896                        V Close-to-CBD En-Suite Room2
7897        New Small Room @Orchard/Somerset/Central Area
7898      Well connected 2 bedroom 2 bathroom apartment !
7899    SMALL ROOM FOR ONE @SOMERSET/ORCHARD/CENTRAL AREA
7900                                                     
7901                  2 PAX LOFT Close To Kent Ridge Park
7902    Loft 2 pax near Haw Par / Pasir Panjang. Free ...
7903                           3bedroom luxury at Orchard
7904      [ Farrer Park ] New City Fringe CBD Mins to MRT
7905            Cheap Master Room in Central of Singapore
7906    Amazing room with private bathroom walk to Orc...
Name: name, Length: 7856, dtype: object
df['name'] = list_2
df['name'].isnull().sum()
2
df.isnull().sum()
name                                 2
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       2740
reviews_per_month                 2740
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Last Review

#Last Review
# Lets see how many rows will be affected if we replace incorrect values with NaT
affected_date_records = pd.to_datetime(df['last_review'], errors='coerce').isnull().sum() - df['last_review'].isnull().sum()
print("Affected Order date records=%d" % affected_date_records)
Affected Order date records=0
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce') # Replace incorrect values with NaT
df.isnull().sum()
name                                 2
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       2740
reviews_per_month                 2740
calculated_host_listings_count       0
availability_365                   386
dtype: int64
review_date = df['last_review'].apply(lambda x: x.year)
review_date.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x264b5cada90>

png

review_date_day = df['last_review'].apply(lambda x: x.day)
review_date_day.hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x264b67878d0>

png

review_date_month = df['last_review'].apply(lambda x: x.month)
review_date_month.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x264b62c8198>

png

df['last_review_year'] = df['last_review'].apply(lambda x: x.year)
df.boxplot(column='last_review_year', sym='o', return_type='axes')
df = df.drop('last_review_year', axis=1)
#NO OUTLIERS IN THE DATE

png

#The only reasonable substitute for missing last review values is to use mode (most occuring value)
#df['last_review'] = pd.to_datetime(df['last_review']).dt.date
#val = df.last_review.mode()
#df['last_review']=pd.to_datetime(df['last_review'].fillna(val
df['last_review'].fillna(df['last_review'].mode()[0], inplace=True)                                                          
df.isnull().sum()
name                                 2
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                          0
reviews_per_month                 2740
calculated_host_listings_count       0
availability_365                   386
dtype: int64

Reviews Per Month

print(df[df['reviews_per_month'].isnull()]['neighbourhood'].value_counts())
Geylang                    344
Kallang                    343
Novena                     224
Bukit Merah                201
Rochor                     173
Downtown Core              145
Outram                     132
Queenstown                 115
Bedok                      114
River Valley                98
Tanglin                     76
Singapore River             64
Jurong East                 47
Jurong West                 47
Clementi                    46
Bukit Timah                 41
Newton                      41
Toa Payoh                   39
Marine Parade               37
Ang Mo Kio                  35
Hougang                     34
Orchard                     34
Pasir Ris                   31
Bukit Batok                 27
Sengkang                    27
Woodlands                   26
Central Water Catchment     23
Bishan                      23
Yishun                      22
Sembawang                   22
Choa Chu Kang               22
Serangoon                   17
Tampines                    17
Punggol                     17
Bukit Panjang               12
Museum                      11
Southern Islands            10
Sungei Kadut                 2
Lim Chu Kang                 1
Name: neighbourhood, dtype: int64
#One approach is to find the average reviews per month with respect to neighbourhood and replace null values with average
df['reviews_per_month'] = df['reviews_per_month'].fillna(df.groupby('neighbourhood')['reviews_per_month'].transform('mean'))

df.isnull().sum()
name                                2
host_id                             0
host_name                           0
neighbourhood_group                 0
neighbourhood                       0
latitude                            0
longitude                           0
room_type                           0
price                               0
minimum_nights                      0
number_of_reviews                   0
last_review                         0
reviews_per_month                   1
calculated_host_listings_count      0
availability_365                  386
dtype: int64
df.dropna(subset = ['reviews_per_month'], inplace = True)
df.isnull().sum()
name                                2
host_id                             0
host_name                           0
neighbourhood_group                 0
neighbourhood                       0
latitude                            0
longitude                           0
room_type                           0
price                               0
minimum_nights                      0
number_of_reviews                   0
last_review                         0
reviews_per_month                   0
calculated_host_listings_count      0
availability_365                  386
dtype: int64
df.groupby('neighbourhood_group')['availability_365'].mean()
neighbourhood_group
Central Region       216.747788
East Region          177.771855
North Region         209.445026
North-East Region    168.764706
West Region          188.258586
Name: availability_365, dtype: float64
df.groupby('neighbourhood')['availability_365'].mean()
neighbourhood
Ang Mo Kio                 169.018519
Bedok                      184.778107
Bishan                     179.763636
Bukit Batok                180.666667
Bukit Merah                212.216401
Bukit Panjang              142.848485
Bukit Timah                158.564516
Central Water Catchment    190.529412
Choa Chu Kang              141.233333
Clementi                   200.978723
Downtown Core              213.786070
Geylang                    196.721294
Hougang                    187.707547
Jurong East                212.579439
Jurong West                195.759398
Kallang                    243.817554
Mandai                     329.500000
Marina South                 0.000000
Marine Parade              132.887417
Museum                     227.616667
Newton                     227.349206
Novena                     234.107280
Orchard                    267.327273
Outram                     225.603037
Pasir Ris                  153.845070
Punggol                    156.578947
Queenstown                 204.949219
River Valley               230.905605
Rochor                     217.173228
Sembawang                  223.921053
Sengkang                   161.650794
Serangoon                  150.854839
Singapore River            261.385621
Southern Islands           197.411765
Sungei Kadut               346.200000
Tampines                   166.616667
Tanglin                    184.005025
Toa Payoh                  158.093750
Tuas                        89.000000
Western Water Catchment    213.750000
Woodlands                  177.967213
Yishun                     230.803922
Name: availability_365, dtype: float64
df.groupby(['neighbourhood_group', 'neighbourhood'])['availability_365'].mean()
neighbourhood_group  neighbourhood          
Central Region       Bishan                     179.763636
                     Bukit Merah                212.216401
                     Bukit Timah                158.564516
                     Downtown Core              213.786070
                     Geylang                    196.721294
                     Kallang                    243.817554
                     Marina South                 0.000000
                     Marine Parade              132.887417
                     Museum                     227.616667
                     Newton                     227.349206
                     Novena                     234.107280
                     Orchard                    267.327273
                     Outram                     225.603037
                     Queenstown                 204.949219
                     River Valley               230.905605
                     Rochor                     217.173228
                     Singapore River            261.385621
                     Southern Islands           197.411765
                     Tanglin                    184.005025
                     Toa Payoh                  158.093750
East Region          Bedok                      184.778107
                     Pasir Ris                  153.845070
                     Tampines                   166.616667
North Region         Central Water Catchment    190.529412
                     Mandai                     329.500000
                     Sembawang                  223.921053
                     Sungei Kadut               346.200000
                     Woodlands                  177.967213
                     Yishun                     230.803922
North-East Region    Ang Mo Kio                 169.018519
                     Hougang                    187.707547
                     Punggol                    156.578947
                     Sengkang                   161.650794
                     Serangoon                  150.854839
West Region          Bukit Batok                180.666667
                     Bukit Panjang              142.848485
                     Choa Chu Kang              141.233333
                     Clementi                   200.978723
                     Jurong East                212.579439
                     Jurong West                195.759398
                     Tuas                        89.000000
                     Western Water Catchment    213.750000
Name: availability_365, dtype: float64
#In availability_365 case we will consider neighbourhood_group for aggregate average as their are less categories in the column
#Hence it is easier to generalize rather than identifying patterns for each neighbourhood
df['availability_365'] = df['availability_365'].fillna(df.groupby('neighbourhood_group')['availability_365'].transform('mean'))
df['availability_365'] = df['availability_365'].apply(np.ceil)
df['availability_365'] = df['availability_365'].astype(int)
df.isnull().sum()
name                              2
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
#Remove negative values in price column
df = df[~df['price'] < 0]
df['Total_Amount_Paid'] = df.price * df.minimum_nights
df.dtypes
name                                      object
host_id                                  float64
host_name                                 object
neighbourhood_group                     category
neighbourhood                             object
latitude                                 float64
longitude                                float64
room_type                               category
price                                      int64
minimum_nights                           float64
number_of_reviews                          int64
last_review                       datetime64[ns]
reviews_per_month                        float64
calculated_host_listings_count             int64
availability_365                           int32
Total_Amount_Paid                        float64
dtype: object
#df.head()

T-Test

#df.dtypes
df['host_id'] = df['host_id'].astype(np.float64)
df['latitude'] = df['latitude'].astype(np.float64)
df['longitude'] = df['longitude'].astype(np.float64)
df['price'] = df['price'].astype(np.int64)
df['minimum_nights'] = df['minimum_nights'].astype(np.float64)
df['reviews_per_month'] = df['reviews_per_month'].astype(np.float64)
df['calculated_host_listings_count'] = df['calculated_host_listings_count'].astype(np.int64)
df['availability_365'] = df['availability_365'].astype(np.int64)

#df.dtypes
num_columns = ['latitude','longitude','price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365','Total_Amount_Paid']
for i in range(len(num_columns)-1):
    for j in range(i+1,len(num_columns)):
        col1 = num_columns[i]
        col2 = num_columns[j]
        t_val, p_val = stats.ttest_ind(df[col1], df[col2])
        print("(%s,%s) => t-value= %s, p-value= %s" % (num_columns[i], num_columns[j], str(t_val), str(p_val)))
(latitude,longitude) => t-value= -170919.36414808003, p-value= 0.0
(latitude,price) => t-value= -43.6321774495353, p-value= 0.0
(latitude,minimum_nights) => t-value= -34.03071097811187, p-value= 5.633678112890004e-245
(latitude,number_of_reviews) => t-value= -34.348954053952745, p-value= 2.2337598176830605e-249
(latitude,reviews_per_month) => t-value= 23.12874867308488, p-value= 2.0824977842009198e-116
(latitude,calculated_host_listings_count) => t-value= -52.749504931210474, p-value= 0.0
(latitude,availability_365) => t-value= -129.95419820608618, p-value= 0.0
(latitude,Total_Amount_Paid) => t-value= -8.707646818327206, p-value= 3.407021249021193e-18
(longitude,price) => t-value= -17.16600299670573, p-value= 1.886697022746164e-65
(longitude,minimum_nights) => t-value= 182.82098719702753, p-value= 0.0
(longitude,number_of_reviews) => t-value= 270.62366722109147, p-value= 0.0
(longitude,reviews_per_month) => t-value= 8651.225763993309, p-value= 0.0
(longitude,calculated_host_listings_count) => t-value= 85.74384346373408, p-value= 0.0
(longitude,availability_365) => t-value= -66.11652424476627, p-value= 0.0
(longitude,Total_Amount_Paid) => t-value= -8.389193338830815, p-value= 5.3070740003692973e-17
(price,minimum_nights) => t-value= 39.18803595198126, p-value= 0.0
(price,number_of_reviews) => t-value= 40.49908780840161, p-value= 0.0
(price,reviews_per_month) => t-value= 43.70288747822712, p-value= 0.0
(price,calculated_host_listings_count) => t-value= 32.95535154190281, p-value= 2.3044901452535385e-230
(price,availability_365) => t-value= -9.463855524145997, p-value= 3.3801865545094383e-21
(price,Total_Amount_Paid) => t-value= -8.18205162356193, p-value= 3.0026444530106522e-16
(minimum_nights,number_of_reviews) => t-value= 7.829323076567633, p-value= 5.216706890349251e-15
(minimum_nights,reviews_per_month) => t-value= 34.60085828310095, p-value= 6.916870345811379e-253
(minimum_nights,calculated_host_listings_count) => t-value= -26.139543271099882, p-value= 1.8168666974263962e-147
(minimum_nights,availability_365) => t-value= -115.05426757955541, p-value= 0.0
(minimum_nights,Total_Amount_Paid) => t-value= -8.657662334400214, p-value= 5.276951024374404e-18
(number_of_reviews,reviews_per_month) => t-value= 35.14422568674903, p-value= 1.5700998993666541e-260
(number_of_reviews,calculated_host_listings_count) => t-value= -33.826538495853185, p-value= 3.599244630469962e-242
(number_of_reviews,availability_365) => t-value= -120.1599548453521, p-value= 0.0
(number_of_reviews,Total_Amount_Paid) => t-value= -8.671774809807381, p-value= 4.664914488636321e-18
(reviews_per_month,calculated_host_listings_count) => t-value= -53.11376816507319, p-value= 0.0
(reviews_per_month,availability_365) => t-value= -130.1216963666084, p-value= 0.0
(reviews_per_month,Total_Amount_Paid) => t-value= -8.708500096403716, p-value= 3.3815987188544218e-18
(calculated_host_listings_count,availability_365) => t-value= -95.93819183393774, p-value= 0.0
(calculated_host_listings_count,Total_Amount_Paid) => t-value= -8.586331202038139, p-value= 9.810885494102484e-18
(availability_365,Total_Amount_Paid) => t-value= -8.05927154882726, p-value= 8.223599409152149e-16

As shown in the results above all p-value for every pair turns out to be less than 0.05 (p-val < 0.05), infact its very small for every pair. Thus, there is a statistically significant difference between all above mentioned numeric columns

1 - Way Anova

#Categorical Columns
categorical = list((set(df.columns) - set(num_columns)) - set(['last_review']))
print(categorical)
['neighbourhood', 'room_type', 'host_name', 'host_id', 'name', 'neighbourhood_group']
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Taking 1-way anova of Sales with all categorical variables
for i in categorical:
    model = ols('Total_Amount_Paid ~ C(Q("%s"))'% i, data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print ("\nAnova => Total_Amount_Paid - %s" % i)
    display(anova_table)

Anova => Total_Amount_Paid - neighbourhood
sum_sq df F PR(>F)
C(Q("neighbourhood")) 3.146232e+10 41.0 0.942424 0.576012
Residual 6.359336e+12 7810.0 NaN NaN
Anova => Total_Amount_Paid - room_type
sum_sq df F PR(>F)
C(Q("room_type")) 1.039528e+10 2.0 6.393996 0.00168
Residual 6.380403e+12 7849.0 NaN NaN
Anova => Total_Amount_Paid - host_name
sum_sq df F PR(>F)
C(Q("host_name")) 1.801270e+12 1818.0 1.30199 4.624713e-13
Residual 4.589516e+12 6031.0 NaN NaN
Anova => Total_Amount_Paid - host_id
sum_sq df F PR(>F)
C(Q("host_id")) 6.331497e+12 2684.0 205.540313 0.0
Residual 5.930145e+10 5167.0 NaN NaN
Anova => Total_Amount_Paid - name
sum_sq df F PR(>F)
C(Q("name")) 6.388899e+12 7320.0 244.018545 0.0
Residual 1.899270e+09 531.0 NaN NaN
Anova => Total_Amount_Paid - neighbourhood_group
sum_sq df F PR(>F)
C(Q("neighbourhood_group")) 1.154940e+10 4.0 3.551678 0.006696
Residual 6.379249e+12 7847.0 NaN NaN

With our null hypothesis being Ho = no difference between population means of Total_Amount_Paid and groups(categorical columns). Above results of 1-way Anova indicates that for room_type, neighbourhood_group, name and host_id p-value turns out to be less than 0.05, so we can reject our null hypothesis thus we can state with 95% confidence that difference between means exist for sales and groups in above mentioned columns


Correlation Heatmap

import matplotlib as plt

matrix = np.triu(df.corr())
sns.heatmap(df.corr(),annot=True, fmt='.1g',vmin=-1, vmax=1, center= 0,cmap='coolwarm',mask=matrix)
<matplotlib.axes._subplots.AxesSubplot at 0x264b7e3f668>

png

The correlation of Total paid with price and minimum nights is quite straight forward. Availability column is somewhat related to minimum nights and host listings count. Rest of the variables are weakly correlated.