Air Bnb Data Cleaning
#importing relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import re
import string
from scipy import stats
# Reading Data
df = pd.read_csv('singapore_airbnb_dirty_data.csv')
#Understanding the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7907 entries, 0 to 7906
Data columns (total 16 columns):
id 7895 non-null object
name 7863 non-null object
host_id 7907 non-null object
host_name 7907 non-null object
neighbourhood_group 7841 non-null object
neighbourhood 7870 non-null object
latitude 7907 non-null float64
longitude 7907 non-null float64
room_type 7890 non-null object
price 7907 non-null int64
minimum_nights 7891 non-null object
number_of_reviews 7907 non-null int64
last_review 5149 non-null object
reviews_per_month 5149 non-null float64
calculated_host_listings_count 7907 non-null int64
availability_365 7521 non-null float64
dtypes: float64(4), int64(3), object(9)
memory usage: 988.5+ KB
#Some of the columns should be category type. Most of them are object type as data is dirty and inconsistent
#Columns to be converted to Category: neighbourhood_group, room_type
#Missing values in Each column
print('Column Name' + '\t\t\t' + 'Null Values')
df.apply(lambda x: sum(x.isnull()),axis=0)
Column Name Null Values
id 12
name 44
host_id 0
host_name 0
neighbourhood_group 66
neighbourhood 37
latitude 0
longitude 0
room_type 17
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
#Checking Unique Identifiers in our Data
print('Column Name' + '\t\t\t' + 'IsUnique')
df.apply(lambda x: x.is_unique,axis=0)
Column Name IsUnique
id False
name False
host_id False
host_name False
neighbourhood_group False
neighbourhood False
latitude False
longitude False
room_type False
price False
minimum_nights False
number_of_reviews False
last_review False
reviews_per_month False
calculated_host_listings_count False
availability_365 False
dtype: bool
#Handling ID Column
#The above output shows that ID is not unique. Hence, it can be dropped
df = df.drop('id',axis=1)
#Improvement 1 - ID column Dropped
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 66
neighbourhood 37
latitude 0
longitude 0
room_type 17
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
Neighbourhood Column
#Neighbourhood group values for records that are missing neighbourhood
print(df[df['neighbourhood'].isnull()]['neighbourhood_group'].value_counts())
Central Region 31
East Region 3
North-East Region 2
North Region 1
Name: neighbourhood_group, dtype: int64
#Looking for neighbourhood value that occurs the most to be used as value in place of null group values
print(df[df['neighbourhood_group'] == 'Central Region']['neighbourhood'].value_counts())
Kallang 1034
Geylang 972
Rochor 531
Novena 529
Outram 472
Bukit Merah 463
Downtown Core 426
River Valley 357
Queenstown 260
Tanglin 206
Singapore River 174
Marine Parade 171
Orchard 135
Newton 133
Bukit Timah 127
Toa Payoh 99
Museum 62
Bishan 56
Southern Islands 16
Marina South 1
Name: neighbourhood, dtype: int64
#Replacing Central Region Neighbourhood values with Kallang
cond = (df['neighbourhood_group'] == 'Central Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Kallang'
print(df[df['neighbourhood_group'] == 'East Region']['neighbourhood'].value_counts())
Bedok 369
Pasir Ris 71
Tampines 63
Name: neighbourhood, dtype: int64
#Replacing East Region Neighbourhood values with Bedok
cond = (df['neighbourhood_group'] == 'East Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Bedok'
print(df[df['neighbourhood_group'] == 'North-East Region']['neighbourhood'].value_counts())
Hougang 108
Serangoon 68
Sengkang 67
Ang Mo Kio 56
Punggol 42
Name: neighbourhood, dtype: int64
#Replacing NE Region Neighbourhood values with Hougang
cond = (df['neighbourhood_group'] == 'North-East Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Hougang'
print(df[df['neighbourhood_group'] == 'North Region']['neighbourhood'].value_counts())
Woodlands 66
Yishun 52
Sembawang 40
Central Water Catchment 34
Sungei Kadut 5
Mandai 3
Lim Chu Kang 1
Name: neighbourhood, dtype: int64
#Replacing North Region Neighbourhood values with Woodlands
cond = (df['neighbourhood_group'] == 'North Region') & (df['neighbourhood'].isnull())
df.loc[cond, 'neighbourhood'] = 'Woodlands'
#Neighbourhood Column is now cleaned
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 66
neighbourhood 0
latitude 0
longitude 0
room_type 17
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
Neighbourhood Group Column
print(df[df['neighbourhood_group'].isnull()]['neighbourhood'].value_counts())
Geylang 12
River Valley 5
Rochor 4
Bukit Merah 4
Tanglin 4
Novena 4
Kallang 3
Bukit Timah 3
Queenstown 3
Outram 3
Downtown Core 2
Toa Payoh 2
Bedok 2
Newton 1
Jurong West 1
Ang Mo Kio 1
Bukit Panjang 1
Serangoon 1
Yishun 1
Choa Chu Kang 1
Bukit Batok 1
Bishan 1
Hougang 1
Singapore River 1
Sembawang 1
Orchard 1
Southern Islands 1
Clementi 1
Name: neighbourhood, dtype: int64
df['neighbourhood_group'].value_counts()
Central Region 6255
West Region 535
East Region 506
North-East Region 343
North Region 202
Name: neighbourhood_group, dtype: int64
#Code to replace the missing neighbourhood group values
x = df['neighbourhood'].unique()
for val in x:
i = str(df[df['neighbourhood'] == val]['neighbourhood_group'].value_counts())
c = re.match('\w+(?:-\w+)+\s.[a-z]*|.[a-z]*\s.[a-z]*',i)
cond = (df['neighbourhood'] == val) & (df['neighbourhood_group'].isnull())
df.loc[cond, 'neighbourhood_group'] = c[0]
df['neighbourhood_group'].value_counts()
Central Region 6309
West Region 540
East Region 508
North-East Region 346
North Region 204
Name: neighbourhood_group, dtype: int64
#Neighbourhood Group column is cleaned
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 17
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
Room Type Column
df['room_type'].value_counts()
Entire home/apt 4126
Private room 3368
Shared room 392
2 1
4 1
1 1
3 1
Name: room_type, dtype: int64
#There are data entry errors in Room type column. We dont know what this encoding means so we will drop these values as
#they are small in number.
df = df[~df['room_type'].isin(['1','2','3','4'])]
df['room_type'].value_counts()
Entire home/apt 4126
Private room 3368
Shared room 392
Name: room_type, dtype: int64
df = df.astype({"room_type":'category'})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7903 entries, 4 to 7906
Data columns (total 15 columns):
name 7859 non-null object
host_id 7903 non-null object
host_name 7903 non-null object
neighbourhood_group 7903 non-null object
neighbourhood 7903 non-null object
latitude 7903 non-null float64
longitude 7903 non-null float64
room_type 7886 non-null category
price 7903 non-null int64
minimum_nights 7887 non-null object
number_of_reviews 7903 non-null int64
last_review 5145 non-null object
reviews_per_month 5145 non-null float64
calculated_host_listings_count 7903 non-null int64
availability_365 7517 non-null float64
dtypes: category(1), float64(4), int64(3), object(7)
memory usage: 934.0+ KB
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 17
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
#There are some errors in the data entry that need to be removed
df['neighbourhood_group'].value_counts()
Central Region 6308
West Region 540
East Region 507
North-East Region 346
North Region 202
Name: neighbourhood_group, dtype: int64
df = df.astype({"neighbourhood_group":'category', "room_type":'category'})
df.groupby(['neighbourhood_group', 'room_type']).size()
neighbourhood_group room_type
Central Region Entire home/apt 3733
Private room 2217
Shared room 346
East Region Entire home/apt 130
Private room 364
Shared room 11
North Region Entire home/apt 53
Private room 140
Shared room 9
North-East Region Entire home/apt 64
Private room 270
Shared room 10
West Region Entire home/apt 146
Private room 377
Shared room 16
dtype: int64
#We will cater the missing room type values based on the neighbourhood group of the records
df[df["room_type"].isnull()]['neighbourhood_group']
2252 Central Region
2253 North-East Region
2254 East Region
2255 Central Region
2256 North-East Region
2257 Central Region
2258 Central Region
2259 Central Region
7560 Central Region
7561 West Region
7562 Central Region
7563 Central Region
7564 Central Region
7565 Central Region
7566 Central Region
7567 Central Region
7568 East Region
Name: neighbourhood_group, dtype: category
Categories (5, object): [Central Region, East Region, North Region, North-East Region, West Region]
#Where region is central - Entire home/apt
#Where region is East - Private room
#Where region is North - Private room
#Where region is North-East - Private room
#Where region is West - Private room
#Where region is central - Entire home/apt
c1 = (df['neighbourhood_group'] == 'Central Region') & (df['room_type'].isnull())
df.loc[c1, 'room_type'] = 'Entire home/apt'
#Where region is East - Private room
c2 = (df['neighbourhood_group'] == 'East Region') & (df['room_type'].isnull())
df.loc[c2, 'room_type'] = 'Private room'
#Where region is North - Private room
c3 = (df['neighbourhood_group'] == 'North Region') & (df['room_type'].isnull())
df.loc[c3, 'room_type'] = 'Private room'
#Where region is North-East - Private room
c4 = (df['neighbourhood_group'] == 'North-East Region') & (df['room_type'].isnull())
df.loc[c4, 'room_type'] = 'Private room'
#Where region is West - Private room
c5 = (df['neighbourhood_group'] == 'West Region') & (df['room_type'].isnull())
df.loc[c5, 'room_type'] = 'Private room'
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 16
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
Minimum Nights
df['minimum_nights'] = pd.to_numeric(df['minimum_nights'], errors='coerce') # Replace incorrect values with NaT
df['minimum_nights'].isnull().value_counts()
False 7884
True 19
Name: minimum_nights, dtype: int64
pd.isnull(df['minimum_nights']).sum()
19
df['minimum_nights'].fillna(0, inplace=True)
df.apply(lambda x: sum(x.isnull()),axis=0)
name 44
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2758
reviews_per_month 2758
calculated_host_listings_count 0
availability_365 386
dtype: int64
df['name'].unique()
array(['B&B Room 1 near Airport & EXPO', 'Room 2-near Airport & EXPO',
'3rd level Jumbo room 5 near EXPO', ...,
'[ Farrer Park ] New City Fringe CBD Mins to MRT',
'Cheap Master Room in Central of Singapore',
'Amazing room with private bathroom walk to Orchard'], dtype=object)
#It is impossible to cater null name values. So we will drop them
df = df.dropna(subset=['name'])
df.isnull().sum()
name 0
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2740
reviews_per_month 2740
calculated_host_listings_count 0
availability_365 386
dtype: int64
df = df[~df['room_type'].isin(['1','666'])]
df['host_id'] = pd.to_numeric(df['host_id'], errors='coerce') # Replace incorrect values with NaT
df = df.dropna(subset=['host_id'])
df.isnull().sum()
name 0
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2740
reviews_per_month 2740
calculated_host_listings_count 0
availability_365 386
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7857 entries, 4 to 7906
Data columns (total 15 columns):
name 7857 non-null object
host_id 7857 non-null float64
host_name 7857 non-null object
neighbourhood_group 7857 non-null category
neighbourhood 7857 non-null object
latitude 7857 non-null float64
longitude 7857 non-null float64
room_type 7857 non-null category
price 7857 non-null int64
minimum_nights 7857 non-null float64
number_of_reviews 7857 non-null int64
last_review 5117 non-null object
reviews_per_month 5117 non-null float64
calculated_host_listings_count 7857 non-null int64
availability_365 7471 non-null float64
dtypes: category(2), float64(6), int64(3), object(4)
memory usage: 875.0+ KB
#So far we have cleaned -
#ID; HOST_ID; NEIGHBOURHOOD_GROUP; NEIGHBOURHOOD; ROOM_TYPE; MINIMUM_NIGHTS
Host Name
#- Step 1- It is a string variable. It should not contain numeric values
def eliminate_numeric_values(item):
if isinstance(item, str):
return item if not re.match(r"^[0-9]+$", item) else pd.NaT
else:
return pd.NaT
df['host_name'] = df['host_name'].map(eliminate_numeric_values)
df = df.dropna(subset =['host_name'])
#Step 1
df['host_name'] = df['host_name'].str.strip()
df['host_name'] = df['host_name'].replace(' ','')
df['host_name'] = df['host_name'].replace("' ",'')
df['host_name'] = df['host_name'].replace("'",'')
#This will do a basic clean up to manage spaces
#Function that checks whether the entry contains English Alphabets
def isEnglish(s):
try:
s.encode(encoding='utf-8').decode('ascii')
except UnicodeDecodeError:
return False
else:
return True
#Function that removes Non-Ascii Characters
def remove_non_ascii_1(text):
return ''.join([i if ord(i) < 128 else '' for i in text])
#Cleaning the Host Name Column
my_list = df['host_name']
counter = 0
for var in my_list:
#print(var)
if isEnglish(var):
continue
else:
val = remove_non_ascii_1(var)
#print(val)
counter = counter + 1
my_list.replace(var,val)
#Replacing all the empty strings with 0 to find accurately the total values changes
df['host_name'] = my_list
df['host_name'].fillna(value=0,inplace=True)
Name Column
df['name'] = df['name'].map(eliminate_numeric_values)
#Step 1
df['name'] = df['name'].str.strip()
df['name'] = df['name'].replace(' ','')
df['name'] = df['name'].replace("' ",'')
df['name'] = df['name'].replace("'",'')
#This will do a basic clean up to manage spaces
df['name'] = df['name'].str.encode('ascii', 'ignore').str.decode('ascii')
#Cleaning the Host Name Column
list_2 = df['name']
counter = 0
for var in list_2:
#print(var)
if isEnglish(var):
continue
else:
val = remove_non_ascii_1(var)
#print(val)
counter = counter + 1
list_2.replace(var,val)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-145-96ccfbe311e6> in <module>
4 for var in list_2:
5 #print(var)
----> 6 if isEnglish(var):
7 continue
8 else:
<ipython-input-137-1962dfc97554> in isEnglish(s)
2 def isEnglish(s):
3 try:
----> 4 s.encode(encoding='utf-8').decode('ascii')
5 except UnicodeDecodeError:
6 return False
AttributeError: 'NaTType' object has no attribute 'encode'
list_2
4 B&B Room 1 near Airport & EXPO
5 Room 2-near Airport & EXPO
6 3rd level Jumbo room 5 near EXPO
7 Long stay at The Breezy East "Leopard"
8 Long stay at The Breezy East "Plumeria"
9 Long stay at The Breezy East "Red Palm"
10 Conveniently located City Room!( (Phone number...
11 15 mins to Outram MRT Single Room (B)
12 Booking for 3 bedrooms
13 5 mins walk from Newton subway
14 20 Mins to Sentosa @ Hilltop ! (8)
15 Accomo@ REDHILL-INSEAD, NTU,NUS -Mu(D)
16 10 mins to Redhill MRT @ Mini Orange Room(5)
17 Budget short stay room near EXPO
18 Double room in an Authentic Peranakan Shophouse
19 5mins from Newton Train Station
20 5 mins walk from Newton MRT
21 Heritage Apartment Room near Orchard
22 Master Bedroom 4-Airport & EXPO
23 Master Bedroom in Newly Built Flat
24 Comfortable Condo Room - Boon Lay MRT
25 Boutique 1 bedroom entiere appart
26 BEST CITY LIVING WITH GA RESIDENCE
27 Quiet ensuite room in Holland Village
28 HDB housing in prime area near town
29 Homestay at Serangoon
30 2 Bdrm Spacious Condo_SeaView _10 Mins to Beach
31 East Coast Boutique
32 Cosy Furnished bedroom
33 Bedroom for one. Good sea view
...
7877 Modern Loft (3 Mins to MRT) - (Retractable Bed)
7878 10 reason to book-studio
7879 10 reason to stay -studio
7880 Common room at Orchard/Somerset/Central area
7881 Spacious resort living with private balcony/pond
7882 Lavender Queenbed Rm Heritage windows: 5mwalkMRT
7883 ,5,1,,
7884 Oriental Holiday Hotel Style 1 BR Condo at Orc...
7885 Soho living at the Clarke Quay Central
7886 Tampines mrt 3 min away/near airport/ 3 month ...
7887 Cozy loft studio apartment bedroom unit .
7888 Ensuite room in quiet estate in Ang Mo Kio nr MRT
7889 Compact apartment near supermarket/mrt/bus/foo...
7890 Luxurious stay , 3 bedroom apartment at Orchar...
7891 Cozy Unit Nearby Orchard and Ting Bahru
7892 Luxury 1-bed condo unit in the heart of Marina...
7893 Comfortable House Studio
7894 Convenient Loft Space in Pasir Panjang
7895 Studio Loft Closed to Kent Ridge Park
7896 V Close-to-CBD En-Suite Room2
7897 New Small Room @Orchard/Somerset/Central Area
7898 Well connected 2 bedroom 2 bathroom apartment !
7899 SMALL ROOM FOR ONE @SOMERSET/ORCHARD/CENTRAL AREA
7900
7901 2 PAX LOFT Close To Kent Ridge Park
7902 Loft 2 pax near Haw Par / Pasir Panjang. Free ...
7903 3bedroom luxury at Orchard
7904 [ Farrer Park ] New City Fringe CBD Mins to MRT
7905 Cheap Master Room in Central of Singapore
7906 Amazing room with private bathroom walk to Orc...
Name: name, Length: 7856, dtype: object
df['name'] = list_2
df['name'].isnull().sum()
2
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2740
reviews_per_month 2740
calculated_host_listings_count 0
availability_365 386
dtype: int64
Last Review
#Last Review
# Lets see how many rows will be affected if we replace incorrect values with NaT
affected_date_records = pd.to_datetime(df['last_review'], errors='coerce').isnull().sum() - df['last_review'].isnull().sum()
print("Affected Order date records=%d" % affected_date_records)
Affected Order date records=0
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce') # Replace incorrect values with NaT
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 2740
reviews_per_month 2740
calculated_host_listings_count 0
availability_365 386
dtype: int64
review_date = df['last_review'].apply(lambda x: x.year)
review_date.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x264b5cada90>
review_date_day = df['last_review'].apply(lambda x: x.day)
review_date_day.hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x264b67878d0>
review_date_month = df['last_review'].apply(lambda x: x.month)
review_date_month.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x264b62c8198>
df['last_review_year'] = df['last_review'].apply(lambda x: x.year)
df.boxplot(column='last_review_year', sym='o', return_type='axes')
df = df.drop('last_review_year', axis=1)
#NO OUTLIERS IN THE DATE
#The only reasonable substitute for missing last review values is to use mode (most occuring value)
#df['last_review'] = pd.to_datetime(df['last_review']).dt.date
#val = df.last_review.mode()
#df['last_review']=pd.to_datetime(df['last_review'].fillna(val
df['last_review'].fillna(df['last_review'].mode()[0], inplace=True)
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 0
reviews_per_month 2740
calculated_host_listings_count 0
availability_365 386
dtype: int64
Reviews Per Month
print(df[df['reviews_per_month'].isnull()]['neighbourhood'].value_counts())
Geylang 344
Kallang 343
Novena 224
Bukit Merah 201
Rochor 173
Downtown Core 145
Outram 132
Queenstown 115
Bedok 114
River Valley 98
Tanglin 76
Singapore River 64
Jurong East 47
Jurong West 47
Clementi 46
Bukit Timah 41
Newton 41
Toa Payoh 39
Marine Parade 37
Ang Mo Kio 35
Hougang 34
Orchard 34
Pasir Ris 31
Bukit Batok 27
Sengkang 27
Woodlands 26
Central Water Catchment 23
Bishan 23
Yishun 22
Sembawang 22
Choa Chu Kang 22
Serangoon 17
Tampines 17
Punggol 17
Bukit Panjang 12
Museum 11
Southern Islands 10
Sungei Kadut 2
Lim Chu Kang 1
Name: neighbourhood, dtype: int64
#One approach is to find the average reviews per month with respect to neighbourhood and replace null values with average
df['reviews_per_month'] = df['reviews_per_month'].fillna(df.groupby('neighbourhood')['reviews_per_month'].transform('mean'))
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 0
reviews_per_month 1
calculated_host_listings_count 0
availability_365 386
dtype: int64
df.dropna(subset = ['reviews_per_month'], inplace = True)
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 0
reviews_per_month 0
calculated_host_listings_count 0
availability_365 386
dtype: int64
df.groupby('neighbourhood_group')['availability_365'].mean()
neighbourhood_group
Central Region 216.747788
East Region 177.771855
North Region 209.445026
North-East Region 168.764706
West Region 188.258586
Name: availability_365, dtype: float64
df.groupby('neighbourhood')['availability_365'].mean()
neighbourhood
Ang Mo Kio 169.018519
Bedok 184.778107
Bishan 179.763636
Bukit Batok 180.666667
Bukit Merah 212.216401
Bukit Panjang 142.848485
Bukit Timah 158.564516
Central Water Catchment 190.529412
Choa Chu Kang 141.233333
Clementi 200.978723
Downtown Core 213.786070
Geylang 196.721294
Hougang 187.707547
Jurong East 212.579439
Jurong West 195.759398
Kallang 243.817554
Mandai 329.500000
Marina South 0.000000
Marine Parade 132.887417
Museum 227.616667
Newton 227.349206
Novena 234.107280
Orchard 267.327273
Outram 225.603037
Pasir Ris 153.845070
Punggol 156.578947
Queenstown 204.949219
River Valley 230.905605
Rochor 217.173228
Sembawang 223.921053
Sengkang 161.650794
Serangoon 150.854839
Singapore River 261.385621
Southern Islands 197.411765
Sungei Kadut 346.200000
Tampines 166.616667
Tanglin 184.005025
Toa Payoh 158.093750
Tuas 89.000000
Western Water Catchment 213.750000
Woodlands 177.967213
Yishun 230.803922
Name: availability_365, dtype: float64
df.groupby(['neighbourhood_group', 'neighbourhood'])['availability_365'].mean()
neighbourhood_group neighbourhood
Central Region Bishan 179.763636
Bukit Merah 212.216401
Bukit Timah 158.564516
Downtown Core 213.786070
Geylang 196.721294
Kallang 243.817554
Marina South 0.000000
Marine Parade 132.887417
Museum 227.616667
Newton 227.349206
Novena 234.107280
Orchard 267.327273
Outram 225.603037
Queenstown 204.949219
River Valley 230.905605
Rochor 217.173228
Singapore River 261.385621
Southern Islands 197.411765
Tanglin 184.005025
Toa Payoh 158.093750
East Region Bedok 184.778107
Pasir Ris 153.845070
Tampines 166.616667
North Region Central Water Catchment 190.529412
Mandai 329.500000
Sembawang 223.921053
Sungei Kadut 346.200000
Woodlands 177.967213
Yishun 230.803922
North-East Region Ang Mo Kio 169.018519
Hougang 187.707547
Punggol 156.578947
Sengkang 161.650794
Serangoon 150.854839
West Region Bukit Batok 180.666667
Bukit Panjang 142.848485
Choa Chu Kang 141.233333
Clementi 200.978723
Jurong East 212.579439
Jurong West 195.759398
Tuas 89.000000
Western Water Catchment 213.750000
Name: availability_365, dtype: float64
#In availability_365 case we will consider neighbourhood_group for aggregate average as their are less categories in the column
#Hence it is easier to generalize rather than identifying patterns for each neighbourhood
df['availability_365'] = df['availability_365'].fillna(df.groupby('neighbourhood_group')['availability_365'].transform('mean'))
df['availability_365'] = df['availability_365'].apply(np.ceil)
df['availability_365'] = df['availability_365'].astype(int)
df.isnull().sum()
name 2
host_id 0
host_name 0
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 0
reviews_per_month 0
calculated_host_listings_count 0
availability_365 0
dtype: int64
#Remove negative values in price column
df = df[~df['price'] < 0]
df['Total_Amount_Paid'] = df.price * df.minimum_nights
df.dtypes
name object
host_id float64
host_name object
neighbourhood_group category
neighbourhood object
latitude float64
longitude float64
room_type category
price int64
minimum_nights float64
number_of_reviews int64
last_review datetime64[ns]
reviews_per_month float64
calculated_host_listings_count int64
availability_365 int32
Total_Amount_Paid float64
dtype: object
#df.head()
T-Test
#df.dtypes
df['host_id'] = df['host_id'].astype(np.float64)
df['latitude'] = df['latitude'].astype(np.float64)
df['longitude'] = df['longitude'].astype(np.float64)
df['price'] = df['price'].astype(np.int64)
df['minimum_nights'] = df['minimum_nights'].astype(np.float64)
df['reviews_per_month'] = df['reviews_per_month'].astype(np.float64)
df['calculated_host_listings_count'] = df['calculated_host_listings_count'].astype(np.int64)
df['availability_365'] = df['availability_365'].astype(np.int64)
#df.dtypes
num_columns = ['latitude','longitude','price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365','Total_Amount_Paid']
for i in range(len(num_columns)-1):
for j in range(i+1,len(num_columns)):
col1 = num_columns[i]
col2 = num_columns[j]
t_val, p_val = stats.ttest_ind(df[col1], df[col2])
print("(%s,%s) => t-value= %s, p-value= %s" % (num_columns[i], num_columns[j], str(t_val), str(p_val)))
(latitude,longitude) => t-value= -170919.36414808003, p-value= 0.0
(latitude,price) => t-value= -43.6321774495353, p-value= 0.0
(latitude,minimum_nights) => t-value= -34.03071097811187, p-value= 5.633678112890004e-245
(latitude,number_of_reviews) => t-value= -34.348954053952745, p-value= 2.2337598176830605e-249
(latitude,reviews_per_month) => t-value= 23.12874867308488, p-value= 2.0824977842009198e-116
(latitude,calculated_host_listings_count) => t-value= -52.749504931210474, p-value= 0.0
(latitude,availability_365) => t-value= -129.95419820608618, p-value= 0.0
(latitude,Total_Amount_Paid) => t-value= -8.707646818327206, p-value= 3.407021249021193e-18
(longitude,price) => t-value= -17.16600299670573, p-value= 1.886697022746164e-65
(longitude,minimum_nights) => t-value= 182.82098719702753, p-value= 0.0
(longitude,number_of_reviews) => t-value= 270.62366722109147, p-value= 0.0
(longitude,reviews_per_month) => t-value= 8651.225763993309, p-value= 0.0
(longitude,calculated_host_listings_count) => t-value= 85.74384346373408, p-value= 0.0
(longitude,availability_365) => t-value= -66.11652424476627, p-value= 0.0
(longitude,Total_Amount_Paid) => t-value= -8.389193338830815, p-value= 5.3070740003692973e-17
(price,minimum_nights) => t-value= 39.18803595198126, p-value= 0.0
(price,number_of_reviews) => t-value= 40.49908780840161, p-value= 0.0
(price,reviews_per_month) => t-value= 43.70288747822712, p-value= 0.0
(price,calculated_host_listings_count) => t-value= 32.95535154190281, p-value= 2.3044901452535385e-230
(price,availability_365) => t-value= -9.463855524145997, p-value= 3.3801865545094383e-21
(price,Total_Amount_Paid) => t-value= -8.18205162356193, p-value= 3.0026444530106522e-16
(minimum_nights,number_of_reviews) => t-value= 7.829323076567633, p-value= 5.216706890349251e-15
(minimum_nights,reviews_per_month) => t-value= 34.60085828310095, p-value= 6.916870345811379e-253
(minimum_nights,calculated_host_listings_count) => t-value= -26.139543271099882, p-value= 1.8168666974263962e-147
(minimum_nights,availability_365) => t-value= -115.05426757955541, p-value= 0.0
(minimum_nights,Total_Amount_Paid) => t-value= -8.657662334400214, p-value= 5.276951024374404e-18
(number_of_reviews,reviews_per_month) => t-value= 35.14422568674903, p-value= 1.5700998993666541e-260
(number_of_reviews,calculated_host_listings_count) => t-value= -33.826538495853185, p-value= 3.599244630469962e-242
(number_of_reviews,availability_365) => t-value= -120.1599548453521, p-value= 0.0
(number_of_reviews,Total_Amount_Paid) => t-value= -8.671774809807381, p-value= 4.664914488636321e-18
(reviews_per_month,calculated_host_listings_count) => t-value= -53.11376816507319, p-value= 0.0
(reviews_per_month,availability_365) => t-value= -130.1216963666084, p-value= 0.0
(reviews_per_month,Total_Amount_Paid) => t-value= -8.708500096403716, p-value= 3.3815987188544218e-18
(calculated_host_listings_count,availability_365) => t-value= -95.93819183393774, p-value= 0.0
(calculated_host_listings_count,Total_Amount_Paid) => t-value= -8.586331202038139, p-value= 9.810885494102484e-18
(availability_365,Total_Amount_Paid) => t-value= -8.05927154882726, p-value= 8.223599409152149e-16
As shown in the results above all p-value for every pair turns out to be less than 0.05 (p-val < 0.05), infact its very small for every pair. Thus, there is a statistically significant difference between all above mentioned numeric columns
1 - Way Anova
#Categorical Columns
categorical = list((set(df.columns) - set(num_columns)) - set(['last_review']))
print(categorical)
['neighbourhood', 'room_type', 'host_name', 'host_id', 'name', 'neighbourhood_group']
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Taking 1-way anova of Sales with all categorical variables
for i in categorical:
model = ols('Total_Amount_Paid ~ C(Q("%s"))'% i, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print ("\nAnova => Total_Amount_Paid - %s" % i)
display(anova_table)
Anova => Total_Amount_Paid - neighbourhood
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("neighbourhood")) | 3.146232e+10 | 41.0 | 0.942424 | 0.576012 |
Residual | 6.359336e+12 | 7810.0 | NaN | NaN |
Anova => Total_Amount_Paid - room_type
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("room_type")) | 1.039528e+10 | 2.0 | 6.393996 | 0.00168 |
Residual | 6.380403e+12 | 7849.0 | NaN | NaN |
Anova => Total_Amount_Paid - host_name
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("host_name")) | 1.801270e+12 | 1818.0 | 1.30199 | 4.624713e-13 |
Residual | 4.589516e+12 | 6031.0 | NaN | NaN |
Anova => Total_Amount_Paid - host_id
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("host_id")) | 6.331497e+12 | 2684.0 | 205.540313 | 0.0 |
Residual | 5.930145e+10 | 5167.0 | NaN | NaN |
Anova => Total_Amount_Paid - name
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("name")) | 6.388899e+12 | 7320.0 | 244.018545 | 0.0 |
Residual | 1.899270e+09 | 531.0 | NaN | NaN |
Anova => Total_Amount_Paid - neighbourhood_group
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Q("neighbourhood_group")) | 1.154940e+10 | 4.0 | 3.551678 | 0.006696 |
Residual | 6.379249e+12 | 7847.0 | NaN | NaN |
With our null hypothesis being Ho = no difference between population means of Total_Amount_Paid and groups(categorical columns). Above results of 1-way Anova indicates that for room_type, neighbourhood_group, name and host_id p-value turns out to be less than 0.05, so we can reject our null hypothesis thus we can state with 95% confidence that difference between means exist for sales and groups in above mentioned columns
Correlation Heatmap
import matplotlib as plt
matrix = np.triu(df.corr())
sns.heatmap(df.corr(),annot=True, fmt='.1g',vmin=-1, vmax=1, center= 0,cmap='coolwarm',mask=matrix)
<matplotlib.axes._subplots.AxesSubplot at 0x264b7e3f668>
The correlation of Total paid with price and minimum nights is quite straight forward. Availability column is somewhat related to minimum nights and host listings count. Rest of the variables are weakly correlated.