Python Data Science for Real Estate & REIT Amsterdam: (Auto) EDA, NLP, Maps & ML

Amsterdam’s (AMS) real estate market is experiencing a resurgence, with property prices rising by double digits year over year since 2013.
Can data science help us understand where we stand in the AMS housing market? Read more here.
Following the recent ML studies, let’s take a closer look at the city’s rental market landscape using data science in Python.
Project 1: Amsterdam Inside Airbnb

From the project website: http://insideairbnb.com/about/

Inside Airbnb is a mission driven project that provides data and advocacy about Airbnb’s impact on residential communities.

We work towards a vision where data and information empower communities to understand, decide and control the role of renting residential homes to tourists.

Project 2: Amsterdam House Price Prediction

The housing prices have been obtained from Pararius.nl as a snapshot in August 2021 (courtesy of Pararius). The original data provided features such as price, floor area and the number of rooms. The data has been further enhanced by utilizing the Mapbox API to obtain the coordinates of each listing.

Table of Contents

An Environment Setup
About Input Datasets in Projects 1-2
Project 1: Interactive Data Analysis with ITables
Projects 1-2: Basic Statistical Data Analysis
Projects 1-2: Exploratory Data Analysis (EDA) & ML
Project 1: SweetViz AutoEDA
Project 1: AutoViz AutoEDA
Project 1: Geospatial EDA
Project 1: NLP Wordcloud Images
Project 1: ML Regression of Review Scores
Project 2: Tuned RF Regression of Prices
Conclusions
Explore More
References
Embed Socials
Infographics

An Environment Setup

Set up a lean, robust data science environment with Miniconda and Conda-Forge
- Download and install Miniconda
- conda create -n my-conda-env
- conda activate my-conda-env
- conda install jupyter
- jupyter notebook
- To install multiple packages from the command line, pass them as a space-delimited list, e.g.:
- pip install package1 package2 …
- conda deactivate (exit)
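
Once the environment is active, a quick check can confirm that the libraries used later in this post are importable. The snippet below is only a sketch: the helper name check_packages and the list of module names are illustrative assumptions, not part of the original setup.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {module_name: True/False}, True when the module can be located."""
    return {name: find_spec(name) is not None for name in names}


# Import names (which can differ from the pip/conda package names) used in this post.
required = ["numpy", "pandas", "seaborn", "matplotlib", "folium",
            "geopandas", "plotly", "nltk", "sklearn", "wordcloud"]

status = check_packages(required)
missing = [name for name, found in status.items() if not found]
print("Missing packages:", missing or "none")
```

Anything reported as missing can then be installed with conda or pip before opening the notebook.
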
Jupyter Notebook: Setting up the working directory YOURPATH

import os
os.chdir('YOURPATH')    # Set working directory
os.getcwd()             # Confirm the current working directory

Basic Python imports and installations

import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt

import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#to make the interactive maps
import folium
from folium.plugins import FastMarkerCluster
import geopandas as gpd
from branca.colormap import LinearColormap

#to make the plotly graphs
import plotly.graph_objs as go
import chart_studio.plotly as py
from plotly.offline import iplot, init_notebook_mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

#text mining
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from wordcloud import WordCloud

About Input Datasets in Projects 1-2

Project 1: File descriptions in the netherlands/Amsterdam folder:
- listings_details.csv: Detailed listings data
- calendar.csv: Detailed calendar data
- reviews_details.csv: Detailed review data
- listings.csv: Summary information and metrics for listings in Amsterdam (good for visualizations).
- reviews.csv: Summary review data and listing ID (to facilitate time-based analytics and visualizations linked to a listing).
- neighbourhoods.csv: Neighborhood list for geo filter. Sourced from city or open source GIS files.
- neighbourhoods.geojson: GeoJSON file of neighborhoods of the city.
Project 2: The file HousingPrices-Amsterdam-August-2021.csv contains 924 rows and 8 columns (Unnamed: 0, Address, Zip, Price, Area, Room, Lon, Lat).

Project 1: Interactive Data Analysis with ITables

Reading the datasets that belong to Project 1

listings1 = pd.read_csv('listings.csv',low_memory=False)
listings2 = pd.read_csv('listings_details.csv',low_memory=False)
calendar1 = pd.read_csv('calendar.csv',low_memory=False)
neighb=pd.read_csv('neighbourhoods.csv',low_memory=False)
reviews1 = pd.read_csv('reviews.csv',low_memory=False)
reviews2 = pd.read_csv('reviews_details.csv',low_memory=False)

listings1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20030 entries, 0 to 20029
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20030 non-null  int64  
 1   name                            19992 non-null  object 
 2   host_id                         20030 non-null  int64  
 3   host_name                       20026 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   20030 non-null  object 
 6   latitude                        20030 non-null  float64
 7   longitude                       20030 non-null  float64
 8   room_type                       20030 non-null  object 
 9   price                           20030 non-null  int64  
 10  minimum_nights                  20030 non-null  int64  
 11  number_of_reviews               20030 non-null  int64  
 12  last_review                     17624 non-null  object 
 13  reviews_per_month               17624 non-null  float64
 14  calculated_host_listings_count  20030 non-null  int64  
 15  availability_365                20030 non-null  int64  
dtypes: float64(4), int64(7), object(5)
memory usage: 2.4+ MB

listings2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20030 entries, 0 to 20029
Data columns (total 96 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                20030 non-null  int64  
 1   listing_url                       20030 non-null  object 
 2   scrape_id                         20030 non-null  int64  
 3   last_scraped                      20030 non-null  object 
 4   name                              19992 non-null  object 
 5   summary                           19510 non-null  object 
 6   space                             14579 non-null  object 
 7   description                       19906 non-null  object 
 8   experiences_offered               20030 non-null  object 
 9   neighborhood_overview             13257 non-null  object 
 10  notes                             9031 non-null   object 
 11  transit                           13635 non-null  object 
 12  access                            12227 non-null  object 
 13  interaction                       11974 non-null  object 
 14  house_rules                       12571 non-null  object 
 15  thumbnail_url                     0 non-null      float64
 16  medium_url                        0 non-null      float64
 17  picture_url                       20030 non-null  object 
 18  xl_picture_url                    0 non-null      float64
 19  host_id                           20030 non-null  int64  
 20  host_url                          20030 non-null  object 
 21  host_name                         20026 non-null  object 
 22  host_since                        20026 non-null  object 
 23  host_location                     19991 non-null  object 
 24  host_about                        11803 non-null  object 
 25  host_response_time                10547 non-null  object 
 26  host_response_rate                10547 non-null  object 
 27  host_acceptance_rate              0 non-null      float64
 28  host_is_superhost                 20026 non-null  object 
 29  host_thumbnail_url                20026 non-null  object 
 30  host_picture_url                  20026 non-null  object 
 31  host_neighbourhood                14222 non-null  object 
 32  host_listings_count               20026 non-null  float64
 33  host_total_listings_count         20026 non-null  float64
 34  host_verifications                20030 non-null  object 
 35  host_has_profile_pic              20026 non-null  object 
 36  host_identity_verified            20026 non-null  object 
 37  street                            20030 non-null  object 
 38  neighbourhood                     18377 non-null  object 
 39  neighbourhood_cleansed            20030 non-null  object 
 40  neighbourhood_group_cleansed      0 non-null      float64
 41  city                              20026 non-null  object 
 42  state                             19903 non-null  object 
 43  zipcode                           19164 non-null  object 
 44  market                            19988 non-null  object 
 45  smart_location                    20030 non-null  object 
 46  country_code                      20030 non-null  object 
 47  country                           20030 non-null  object 
 48  latitude                          20030 non-null  float64
 49  longitude                         20030 non-null  float64
 50  is_location_exact                 20030 non-null  object 
 51  property_type                     20030 non-null  object 
 52  room_type                         20030 non-null  object 
 53  accommodates                      20030 non-null  int64  
 54  bathrooms                         20020 non-null  float64
 55  bedrooms                          20022 non-null  float64
 56  beds                              20023 non-null  float64
 57  bed_type                          20030 non-null  object 
 58  amenities                         20030 non-null  object 
 59  square_feet                       406 non-null    float64
 60  price                             20030 non-null  object 
 61  weekly_price                      2843 non-null   object 
 62  monthly_price                     1561 non-null   object 
 63  security_deposit                  13864 non-null  object 
 64  cleaning_fee                      16401 non-null  object 
 65  guests_included                   20030 non-null  int64  
 66  extra_people                      20030 non-null  object 
 67  minimum_nights                    20030 non-null  int64  
 68  maximum_nights                    20030 non-null  int64  
 69  calendar_updated                  20030 non-null  object 
 70  has_availability                  20030 non-null  object 
 71  availability_30                   20030 non-null  int64  
 72  availability_60                   20030 non-null  int64  
 73  availability_90                   20030 non-null  int64  
 74  availability_365                  20030 non-null  int64  
 75  calendar_last_scraped             20030 non-null  object 
 76  number_of_reviews                 20030 non-null  int64  
 77  first_review                      17624 non-null  object 
 78  last_review                       17624 non-null  object 
 79  review_scores_rating              17391 non-null  float64
 80  review_scores_accuracy            17381 non-null  float64
 81  review_scores_cleanliness         17383 non-null  float64
 82  review_scores_checkin             17369 non-null  float64
 83  review_scores_communication       17378 non-null  float64
 84  review_scores_location            17370 non-null  float64
 85  review_scores_value               17371 non-null  float64
 86  requires_license                  20030 non-null  object 
 87  license                           9 non-null      object 
 88  jurisdiction_names                19358 non-null  object 
 89  instant_bookable                  20030 non-null  object 
 90  is_business_travel_ready          20030 non-null  object 
 91  cancellation_policy               20030 non-null  object 
 92  require_guest_profile_picture     20030 non-null  object 
 93  require_guest_phone_verification  20030 non-null  object 
 94  calculated_host_listings_count    20030 non-null  int64  
 95  reviews_per_month                 17624 non-null  float64
dtypes: float64(21), int64(13), object(62)
memory usage: 14.7+ MB

Making our Pandas DataFrames interactive with ITables 2.0

from itables import init_notebook_mode

init_notebook_mode(all_interactive=True)
from itables import show
show(listings1, buttons=["copyHtml5", "csvHtml5", "excelHtml5"])

This package changes how Pandas and Polars DataFrames are rendered in Jupyter Notebooks. With itables you can display your tables as interactive DataTables that you can sort, paginate, scroll or filter.
ITables is just about how tables are displayed. You can turn it on and off in just two lines, with no other impact on your data workflow.

Projects 1-2: Basic Statistical Data Analysis

Project 1:

ddf = pd.read_csv("listings.csv", index_col= "id")
print(ddf.shape)
(20030, 15)
print(ddf.columns)
Index(['name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
       'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')
ddf.describe().T

Project 2:

ddf1 = pd.read_csv('HousingPrices-Amsterdam-August-2021.csv')
ddf1.head()

HousingPrices-Amsterdam-August-2021 input data table

print(ddf1.shape)
(924, 8)
print(ddf1.columns)
Index(['Unnamed: 0', 'Address', 'Zip', 'Price', 'Area', 'Room', 'Lon', 'Lat'], dtype='object')
ddf1.describe().T

Descriptive statistics of the Project 2 price dataset.

Projects 1-2: Exploratory Data Analysis (EDA) & ML

Project 1

Input data preparation

listings = pd.read_csv("listings.csv", index_col= "id")
listings_details = pd.read_csv("listings_details.csv", index_col= "id", low_memory=False)  # detailed table, also indexed by id
target_columns = ["property_type", "accommodates", "first_review", "review_scores_value", "review_scores_cleanliness", "review_scores_location", "review_scores_accuracy", "review_scores_communication", "review_scores_checkin", "review_scores_rating", "maximum_nights", "listing_url", "host_is_superhost", "host_about", "host_response_time", "host_response_rate", "street", "weekly_price", "monthly_price", "market"]
listings = pd.merge(listings, listings_details[target_columns], on='id', how='left')
listings = listings.drop(columns=['neighbourhood_group'])
listings['host_response_rate'] = pd.to_numeric(listings['host_response_rate'].str.strip('%'))
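
To see what the left merge and the percent-string conversion do, here is a minimal sketch on two toy tables. The values and the names summary, details and merged are made up for illustration; only the column names mirror the Airbnb files.

```python
import pandas as pd

# Toy stand-ins for the summary and detailed listings tables (values are invented).
summary = pd.DataFrame(
    {"neighbourhood": ["Centrum-West", "Zuid", "Noord-Oost"], "price": [150, 120, 90]},
    index=pd.Index([1, 2, 3], name="id"),
)
details = pd.DataFrame(
    {"accommodates": [2, 4], "host_response_rate": ["100%", "85%"]},
    index=pd.Index([1, 2], name="id"),
)

# Left merge on the shared index level "id": every summary row is kept,
# and listings without a detail row get NaN in the detail columns.
merged = pd.merge(summary, details, on="id", how="left")

# "85%" -> 85.0; missing values stay NaN.
merged["host_response_rate"] = pd.to_numeric(merged["host_response_rate"].str.strip("%"))
print(merged)
```

Listing 3 has no detail row here, so its accommodates and host_response_rate come out as NaN, which is worth keeping in mind before averaging those columns.
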

Number of listings by neighborhood

feq=listings['neighbourhood'].value_counts().sort_values(ascending=True)
feq.plot.barh(figsize=(10, 8), color='b', width=1)
plt.title("Number of listings by neighbourhood", fontsize=20)
plt.xlabel('Number of listings', fontsize=12)
plt.show()

AMS Open Street Location Map: number of listings per neighborhood

lats2018 = listings['latitude'].tolist()
lons2018 = listings['longitude'].tolist()
locations = list(zip(lats2018, lons2018))

map1 = folium.Map(location=[52.3680, 4.9036], zoom_start=11.5)
FastMarkerCluster(data=locations).add_to(map1)
map1

Room type counts

freq = listings['room_type'].value_counts().sort_values(ascending=True)
freq.plot.barh(figsize=(15, 3), width=1, color = ["g","b","r"])
plt.show()

Check unique property types in our listings

listings.property_type.unique()
array(['Apartment', 'Townhouse', 'Houseboat', 'Bed and breakfast', 'Boat',
       'Guest suite', 'Loft', 'Serviced apartment', 'House',
       'Boutique hotel', 'Guesthouse', 'Other', 'Condominium', 'Chalet',
       'Nature lodge', 'Tiny house', 'Hotel', 'Villa', 'Cabin',
       'Lighthouse', 'Bungalow', 'Hostel', 'Cottage', 'Tent',
       'Earth house', 'Campsite', 'Castle', 'Camper/RV', 'Barn',
       'Casa particular (Cuba)', 'Aparthotel'], dtype=object)

Comparing property types in Amsterdam

prop = listings.groupby(['property_type','room_type']).room_type.count()
prop = prop.unstack()
prop['total'] = prop.iloc[:,0:3].sum(axis = 1)
prop = prop.sort_values(by=['total'])
prop = prop[prop['total']>=100]
prop = prop.drop(columns=['total'])

prop.plot(kind='barh',stacked=True, color = ["r","b","g"],
              linewidth = 1, grid=True, figsize=(15,8), width=1)
plt.title('Property types in Amsterdam', fontsize=18)
plt.xlabel('Number of listings', fontsize=14)
plt.ylabel("")
plt.legend(loc = 4,prop = {"size" : 13})
plt.rc('ytick', labelsize=13)
plt.show()

Accommodates (number of people)

feq=listings['accommodates'].value_counts().sort_index()
feq.plot.bar(figsize=(10, 8), color='b', width=1, rot=0)
plt.title("Accommodates (number of people)", fontsize=20)
plt.ylabel('Number of listings', fontsize=12)
plt.xlabel('Accommodates', fontsize=12)
plt.show()

Average daily price for a 2-persons accommodation

feq = listings[listings['accommodates']==2]
feq = feq.groupby('neighbourhood')['price'].mean().sort_values(ascending=True)
feq.plot.barh(figsize=(10, 8), color='b', width=1)
plt.title("Average daily price for a 2-persons accommodation", fontsize=20)
plt.xlabel('Average daily price (Euro)', fontsize=12)
plt.ylabel("")
plt.show()

City map average price per neighborhood

adam = gpd.read_file("neighbourhoods.geojson")
feq = feq.reset_index()  # Series -> DataFrame with 'neighbourhood' and 'price' columns
adam = pd.merge(adam, feq, on='neighbourhood', how='left')
adam.rename(columns={'price': 'average_price'}, inplace=True)
adam.average_price = adam.average_price.round(decimals=0)

map_dict = adam.set_index('neighbourhood')['average_price'].to_dict()
color_scale = LinearColormap(['yellow','red'], vmin = min(map_dict.values()), vmax = max(map_dict.values()))

def get_color(feature):
    value = map_dict.get(feature['properties']['neighbourhood'])
    return color_scale(value)

map3 = folium.Map(location=[52.3680, 4.9036], zoom_start=11)
folium.GeoJson(data=adam,
               name='Amsterdam',
               tooltip=folium.features.GeoJsonTooltip(fields=['neighbourhood', 'average_price'],
                                                      labels=True,
                                                      sticky=False),
               style_function= lambda feature: {
                   'fillColor': get_color(feature),
                   'color': 'black',
                   'weight': 1,
                   'dashArray': '5, 5',
                   'fillOpacity':0.5
                   },
               highlight_function=lambda feature: {'weight':3, 'fillColor': get_color(feature), 'fillOpacity': 0.8}).add_to(map3)
map3
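
The colour scale boils down to linear interpolation between two colours. The sketch below illustrates that idea in plain Python; it is not branca's implementation, and the function name linear_color and the grey fallback are assumptions. The fallback matters because a left merge can leave a neighbourhood without an average price (None or NaN) when it has no matching listings.

```python
def linear_color(value, vmin, vmax,
                 start=(255, 255, 0), end=(255, 0, 0), missing="#808080"):
    """Map value in [vmin, vmax] to a hex colour between start (yellow) and end (red)."""
    if value is None or value != value:      # None or NaN (NaN is not equal to itself)
        return missing
    if vmax == vmin:
        t = 0.0
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))                 # clamp values outside the range
    rgb = [round(s + t * (e - s)) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(linear_color(100, 100, 200))           # cheapest neighbourhood -> yellow
print(linear_color(200, 100, 200))           # most expensive -> red
print(linear_color(float("nan"), 100, 200))  # no data -> grey
```

In the same spirit, computing vmin and vmax only from the non-missing averages keeps the scale well defined.
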

Average review score location (at least 10 reviews) vs Average daily price for a 2-persons accommodation

fig = plt.figure(figsize=(20,10))
plt.rc('xtick', labelsize=16)
plt.rc('ytick', labelsize=20)

ax1 = fig.add_subplot(121)
feq = listings[listings['number_of_reviews']>=10]
feq1 = feq.groupby('neighbourhood')['review_scores_location'].mean().sort_values(ascending=True)
ax1=feq1.plot.barh(color='b', width=1)
plt.title("Average review score location (at least 10 reviews)", fontsize=20)
plt.xlabel('Score (scale 1-10)', fontsize=20)
plt.ylabel("")

ax2 = fig.add_subplot(122)
feq = listings[listings['accommodates']==2]
feq2 = feq.groupby('neighbourhood')['price'].mean().sort_values(ascending=True)
ax2=feq2.plot.barh(color='b', width=1)
plt.title("Average daily price for a 2-persons accommodation", fontsize=20)
plt.xlabel('Average daily price (Euro)', fontsize=20)
plt.ylabel("")

plt.tight_layout()
plt.show()

Number of listings available by date

calendar = pd.read_csv('calendar.csv', parse_dates=['date'], low_memory=False)  # parse dates so the .dt accessor works
sum_available = calendar[calendar.available == "t"].groupby(['date']).size().to_frame(name= 'available').reset_index()
sum_available['weekday'] = sum_available['date'].dt.day_name()
sum_available = sum_available.set_index('date')

sum_available.iplot(y='available', mode = 'lines', xTitle = 'Date', yTitle = 'number of listings available',
                   text='weekday', title = 'Number of listings available by date')
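
The aggregation behind this chart can be checked on a tiny example. The frame below is invented (three listings over two days) and the name calendar_toy is hypothetical; the column names follow calendar.csv.

```python
import pandas as pd

# Invented calendar rows: one row per listing per date, "t" = available, "f" = booked.
calendar_toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": ["2018-12-06", "2018-12-06", "2018-12-06",
             "2018-12-07", "2018-12-07", "2018-12-07"],
    "available": ["t", "f", "t", "f", "f", "t"],
})
calendar_toy["date"] = pd.to_datetime(calendar_toy["date"])

# Count the available listings per date, then label each date with its weekday.
available = (
    calendar_toy[calendar_toy["available"] == "t"]
    .groupby("date")
    .size()
    .to_frame(name="available")
    .reset_index()
)
available["weekday"] = available["date"].dt.day_name()
print(available)
```

Dates on which no listing is available drop out of the result entirely, so a gap in the line chart can mean zero availability rather than missing data.
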

Project 2

Let’s consider the Amsterdam House Price Prediction project, including data processing, EDA, feature engineering, and ML regression. Read more here.
Basic imports and reading the input dataset

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
df = pd.read_csv('HousingPrices-Amsterdam-August-2021.csv')
df.head()

The Amsterdam House Price Prediction dataset.

Plotting the feature correlation heatmap

sns.heatmap(df.corr(numeric_only=True))  # correlate numeric columns only (Address and Zip are strings)

Project 2: input data feature correlation heatmap.

Removing undefined values while examining the price outliers with boxplots

df = df.dropna(axis = 0, inplace = False)
sns.boxplot(x='Price', data = df)

Removing price outliers with the IQR thresholds

q1 = df['Price'].quantile(0.25)
q3 = df['Price'].quantile(0.75)
iqr = q3 - q1
max_price = q3 + 1.5 * iqr   # upper IQR fence
outliers = df[df['Price'] >= max_price]
outliers_count = outliers['Price'].count()
df_count = df['Price'].count()
print('Percentage removed: ' + str(round(outliers_count/df_count * 100, 2)) + '%')
Percentage removed: 7.72%
df = df[df['Price'] < max_price].copy()   # keep rows below the fence, matching the outlier definition above
sns.boxplot(x='Price', data = df)
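
The IQR rule is easy to verify by hand. Below is a small sketch in plain Python with made-up prices; the helper names quantile and upper_fence are illustrative. The quantile uses linear interpolation between neighbouring sorted values, which is also the pandas default.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def upper_fence(values, k=1.5):
    """Upper outlier threshold: Q3 + k * IQR."""
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    return q3 + k * (q3 - q1)


prices = [100, 200, 300, 400, 500, 600, 700, 800, 5000]   # invented values
fence = upper_fence(prices)                  # Q1 = 300, Q3 = 700, IQR = 400
kept = [p for p in prices if p < fence]
print(fence, kept)
```

Only the upper fence is applied in this project. The matching lower fence would be Q1 - 1.5 * IQR, which matters little for prices that cannot go below zero.
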

Input data editing and plotting the modified feature correlation heatmap

df['Zip No'] = df['Zip'].apply(lambda x: x.split()[0])     # numeric part of the postcode
df['Letters'] = df['Zip'].apply(lambda x: x.split()[-1])   # letter part of the postcode

def word_separator(string):
    # Keep the leading alphabetic words of an address, i.e. the street name
    words = []
    for element in string.split():
        if element.isalpha():
            words.append(element)
        else:
            break
    return ' '.join(words)

df['Street'] = df['Address'].apply(word_separator)
numerical = ['Price', 'Area', 'Room', 'Lon', 'Lat']
categorical = ['Address', 'Zip No', 'Letters', 'Street']
from sklearn.preprocessing import LabelEncoder
for c in categorical:
    lbl = LabelEncoder()
    df[c] = lbl.fit_transform(df[c])   # replace each category with an integer code
df.drop(['Zip', 'Unnamed: 0', 'Address'], axis=1, inplace=True)
sns.heatmap(df.corr())
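
The feature engineering above has two parts: splitting text fields and turning categories into integers. The sketch below mimics both in plain Python. The addresses and postcode are invented, and label_encode is a simplified stand-in for LabelEncoder (which also numbers the sorted unique values from 0), not its actual code.

```python
def split_zip(zip_code):
    """'1091 CR' -> ('1091', 'CR')."""
    parts = zip_code.split()
    return parts[0], parts[-1]


def street_name(address):
    """Keep the leading alphabetic words; stop at the first token with digits or punctuation."""
    words = []
    for token in address.split():
        if not token.isalpha():
            break
        words.append(token)
    return " ".join(words)


def label_encode(values):
    """Assign integer codes to categories in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


print(split_zip("1091 CR"))
print(street_name("Eerste Voorbeeldstraat 12, Amsterdam"))   # invented address
print(label_encode(["Zuid", "Centrum", "Zuid"]))
```

One caveat: integer codes impose an arbitrary ordering on streets and postcodes. Tree-based models such as Random Forest usually cope with that, while linear models may read it as a real numeric trend.
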

Project 2: input data feature correlation heatmap after data editing.

Preparing data for training and testing supervised ML models

from sklearn.model_selection import train_test_split
X = df.drop('Price', axis =1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.4)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
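
Note that the scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. Here is a minimal sketch of that pattern for a single feature, with invented floor areas and the hypothetical helpers fit_scaler and transform:

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and the population standard deviation from training data."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


train_area = [50, 70, 90, 110]    # invented training values
test_area = [80, 130]             # invented test values

mu, sigma = fit_scaler(train_area)             # statistics come from the training split only
print(transform(train_area, mu, sigma))
print(transform(test_area, mu, sigma))         # test data reuses the training mean and std
```

The scaled test values need not have zero mean or unit variance, and that is expected.
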

Linear Regression (LR)

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
predictions = linreg.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('Linear Regression',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with linear regression

Lasso regression

from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('Lasso',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with Lasso regression

ElasticNet regression

from sklearn.linear_model import ElasticNet
elasticnet = ElasticNet()
elasticnet.fit(X_train, y_train)
predictions = elasticnet.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('ElasticNet',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with ElasticNet regression

Ridge regression

from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('Ridge',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with Ridge regression

Random Forest (RF) Regressor

from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('Random Forest',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with Random Forest regression

XGBoost Regressor

from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)
plt.scatter(y_test,predictions)
plt.title('XGBoost',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

House price prediction with XGBoost regression

Random Forest Hyperparameter Optimization (HPO)

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

random_grid = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['auto', 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; use 1.0 (all features) there
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

random_cv = RandomizedSearchCV(estimator = random_forest, param_distributions = random_grid, n_iter = 100, cv = 10, verbose = 2, n_jobs = -1)
random_cv.fit(X_train, y_train)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
RandomizedSearchCV(cv=10, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   verbose=2)

random_cv.best_params_ 
{'n_estimators': 1400,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': False}

param_grid = {'bootstrap': [True, False],
'max_depth': [60,65,70,75,80],
'min_samples_leaf':[1,2,3],
'min_samples_split': [1,2,3],  # 1 is not a valid value (the minimum is 2), so those candidates fail to fit
'n_estimators': [1750,1760,1770,1780,1790,1800,1810,1820,1830,1840,1850]}
grid_search = GridSearchCV(estimator = random_forest, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

grid_search.fit(X_train,y_train)

Fitting 3 folds for each of 990 candidates, totalling 2970 fits
GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'max_depth': [60, 65, 70, 75, 80],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [1, 2, 3],
                         'n_estimators': [1750, 1760, 1770, 1780, 1790, 1800,
                                          1810, 1820, 1830, 1840, 1850]},
             verbose=2)

grid_search.best_params_
{'bootstrap': True,
 'max_depth': 65,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1840}
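
The fit counts in the two search logs follow directly from the grid sizes. The arithmetic is sketched below; the helper name count_fits is an assumption for illustration.

```python
from math import prod

param_grid = {
    'bootstrap': [True, False],
    'max_depth': [60, 65, 70, 75, 80],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [1, 2, 3],
    'n_estimators': list(range(1750, 1851, 10)),   # 1750, 1760, ..., 1850
}


def count_fits(grid, cv):
    """Grid search tries every combination once per cross-validation fold."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * cv


print(count_fits(param_grid, cv=3))      # candidates and total fits for the grid search

# The randomized search samples a fixed number of candidates instead of the full space.
random_space = 2 * 11 * 2 * 3 * 3 * 10   # sizes of the six lists in random_grid
print(random_space, 100 * 10)            # full space vs. n_iter * cv fits actually run
```

The exhaustive grid over a narrow range costs almost three times as many fits as the randomized search over a much larger space, which is why a coarse randomized pass is commonly run first.
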

tuned_random_forest = RandomForestRegressor(n_estimators = 1750, max_depth = 80, min_samples_leaf = 1, min_samples_split = 2)
tuned_random_forest.fit(X_train, y_train)   # fit the tuned model rather than the default random_forest
predictions = tuned_random_forest.predict(X_test)

cv = cross_val_score(tuned_random_forest, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print("The Random Forest Regressor with tuned parameters has a RMSE of: " + str(abs(cv.mean())**0.5))
The Random Forest Regressor with tuned parameters has a RMSE of: 95830.77429927936

# Calculate testing scores
from sklearn.metrics import mean_absolute_error, r2_score
test_mae = mean_absolute_error(y_test, predictions)
test_mse = mean_squared_error(y_test, predictions)
test_rmse = test_mse ** 0.5   # root of the MSE; avoids the squared argument, deprecated in scikit-learn 1.4
test_r2 = r2_score(y_test, predictions)
print(test_mae)
65426.81317647059
print(test_mse)
9152415095.08142
print(test_rmse)
95668.25541986966
print(test_r2)
0.8319334007165564
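
For reference, the four test scores are related by simple formulas. The sketch below recomputes them in plain Python on invented prices; regression_metrics is a hypothetical helper that follows the standard definitions of MAE, MSE, RMSE and R².

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R^2 computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)      # total sum of squares
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


y_true = [300_000, 400_000, 500_000, 600_000]   # invented prices
y_pred = [310_000, 380_000, 520_000, 590_000]   # invented predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is in the same units as Price and is simply the square root of the MSE. The reported values agree: the square root of 9152415095.08 is about 95668.26.
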

Project 1: SweetViz AutoEDA

Project 1: Reading input data & displaying the sweetviz EDA HTML report

listings = pd.read_csv('listings.csv')

# importing sweetviz
import sweetviz as sv
#analyzing the dataset
advert_report = sv.analyze(listings)
#display the report
advert_report.show_html('listings_sv.html')

SweetViz Report of listings: overview and columns 1-4.

SweetViz Report of listings: Column 5 of 100% missing values

SweetViz Report of listings: Columns 6-9

SweetViz Report of listings: Columns 10-13

SweetViz Report of listings: Columns 14-16

Associations [Only including dataset “DataFrame”]
■ Squares are categorical associations (uncertainty coefficient & correlation ratio) from 0 to 1. The uncertainty coefficient is asymmetrical, (i.e. ROW LABEL values indicate how much they PROVIDE INFORMATION to each LABEL at the TOP).
Circles are the symmetrical numerical correlations (Pearson’s) from -1 to 1. The trivial diagonal is intentionally left blank for clarity.
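
The asymmetry of the uncertainty coefficient is easiest to see on a toy example. The sketch below implements the standard definition, U(X|Y) = (H(X) - H(X|Y)) / H(X). It is not SweetViz's internal code, and the room and neighbourhood values are invented.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) in nats."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): remaining uncertainty in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())


def uncertainty_coefficient(x, y):
    """Theil's U(X | Y): share of the uncertainty in x that y explains (0 to 1)."""
    hx = entropy(x)
    if hx == 0:
        return 1.0
    return (hx - conditional_entropy(x, y)) / hx


room = ["Entire", "Entire", "Private", "Private"]
hood = ["A", "B", "C", "D"]
print(uncertainty_coefficient(room, hood))   # neighbourhood fully determines room type here
print(uncertainty_coefficient(hood, room))   # room type only halves the uncertainty
```

Knowing the neighbourhood pins down the room type in this toy data, but not the other way round, so the two directions give different scores.
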

Project 1: AutoViz AutoEDA

Project 1: Let’s explore the AutoViz EDA library. With just one line of code, you can effortlessly generate multiple informative plots, while addressing Data Quality issues.
Defining the input csv file and loading the AutoViz Class

filename = "listings.csv"
target_variable = "name"
#Load Autoviz
from autoviz import AutoViz_Class
%matplotlib inline

AV = AutoViz_Class()
dft = AV.AutoViz(
    filename,
    sep=",",
    depVar=target_variable,
    dfte=None,
    header=0,
    verbose=2,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=500,
    max_cols_analyzed=20,
    save_plot_dir=None
)
from autoviz import FixDQ
fixdq = FixDQ()

AutoViz Log Output

    max_rows_analyzed is smaller than dataset shape 20030...
        randomly sampled 500 rows from read CSV file
Shape of your Data Set loaded: (500, 16)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
  Printing upto 30 columns (max) in each category:
    Numeric Columns : ['latitude', 'longitude', 'reviews_per_month']
    Integer-Categorical Columns: ['host_id', 'price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']
    String-Categorical Columns: ['neighbourhood']
    Factor-Categorical Columns: []
    String-Boolean Columns: ['room_type']
    Numeric-Boolean Columns: []
    Discrete String Columns: ['host_name', 'last_review']
    NLP text Columns: []
    Date Time Columns: []
    ID Columns: ['id']
    Columns that will not be considered in modeling: ['neighbourhood_group']
    15 Predictors classified...
        2 variable(s) removed since they were ID or low-information variables
        List of variables removed: ['id', 'neighbourhood_group']
Since Number of Rows in data 500 exceeds maximum, randomly sampling 500 rows for EDA...

################ Multi_Classification problem #####################
   Columns to delete:
"   ['neighbourhood_group']"
   Boolean variables %s 
"   ['room_type']"
   Categorical variables %s 
("   ['neighbourhood', 'host_id', 'price', 'minimum_nights', "
 "'number_of_reviews', 'calculated_host_listings_count', 'availability_365', "
 "'room_type']")
   Continuous variables %s 
"   ['latitude', 'longitude', 'reviews_per_month']"
   Discrete string variables %s 
"   ['host_name', 'last_review']"
   Date and time variables %s 
'   []'
   ID variables %s 
"   ['id']"
   Target variable %s 
'   name'
To fix these data quality issues in the dataset, import FixDQ from autoviz...
    All variables classified into correct types.

AutoViz HTML Plots in the new sub-directory name:
- Bar Plots

Box Plots

Dist Plots Numerics

Density histogram of minimum nights vs distribution of room types

Heat Map

Project 1: Geospatial EDA

Let’s follow the Plotly Geospatial EDA.
Basic imports and downloads

import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt

import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#to make the interactive maps
import folium
from folium.plugins import FastMarkerCluster
import geopandas as gpd
from branca.colormap import LinearColormap

#to make the plotly graphs
import plotly.graph_objs as go
import chart_studio.plotly as py
from plotly.offline import iplot, init_notebook_mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

#text mining
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from wordcloud import WordCloud
from json.decoder import JSONDecoder
import plotly.express as px
import plotly.io as pio
import nltk
from nltk.stem import WordNetLemmatizer


nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('omw-1.4', quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english') + ['ha', 'wa', 'br', 'b'])

Preprocessing and plotting functions

def preprocess(text):
    text = list(filter(str.isalpha, word_tokenize(text)))
    text = list(lemmatizer.lemmatize(word) for word in text)
    text = list(word for word in text if word not in stop_words)
    return ' '.join(text)


def draw_wordcloud(texts, max_words=1000, width=1000, height=500):
    wordcloud = WordCloud(background_color='white', max_words=max_words,
                          width=width, height=height)
    joint_texts = ' '.join(list(texts))
    wordcloud.generate(joint_texts)
    return wordcloud.to_image()


def draw_choropleth(neighbourhoods_geojson, feature, stat='mean', only_reviewed=False, title=None):
    # Aggregate the chosen feature per neighbourhood; stat is 'mean', 'median' or 'sum'
    if stat not in ('mean', 'median', 'sum'):
        raise ValueError(f'Unsupported stat: {stat}')
    source = listings_reviewed if only_reviewed else listings
    gb = source.groupby(by='neighbourhood', as_index=False)[feature].agg(stat)
    fig = px.choropleth_mapbox(gb, geojson=neighbourhoods_geojson, color=feature,
                           locations="neighbourhood", featureidkey="properties.neighbourhood",
                           center={"lat": 52.3676, "lon": 4.9041}, title=title or f'{feature} by neighbourhood',
                           mapbox_style="carto-positron", zoom=10, opacity=0.5)
    fig.show()

Reading and preparing the input data of Project 1

calendar = pd.read_csv('calendar.csv')
listings = pd.read_csv('listings.csv')
listings_detailed = pd.read_csv('listings_details.csv')
reviews = pd.read_csv('reviews.csv')
reviews_detailed = pd.read_csv('reviews_details.csv')
neighbourhoods = pd.read_csv('neighbourhoods.csv')
listings = listings_detailed.drop(columns=['neighbourhood']).rename(columns={'neighbourhood_cleansed': 'neighbourhood'})  # we will only need neighbourhood_cleansed
reviews = reviews_detailed
# Parse the date columns so that the .dt accessors used later are available
for column in ['host_since', 'first_review', 'last_review']:
    listings[column] = pd.to_datetime(listings[column], format='%Y-%m-%d')

calendar.date = pd.to_datetime(calendar.date, format='%Y-%m-%d')

# Strip the currency symbol and thousands separators before casting to float
listings.price = listings.price.replace(r'[\$,]', '', regex=True).astype(float)
calendar.price = calendar.price.replace(r'[\$,]', '', regex=True).astype(float)
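
To see what the price-cleaning step does, here is a minimal sketch on a few made-up price strings. The values are hypothetical and only assume the "$1,250.00" style that the regular expression above targets.

```python
import pandas as pd

# Hypothetical raw price strings in the "$1,250.00" style
raw_prices = pd.Series(["$1,250.00", "$85.00", "$2,000.00"])

# Remove "$" and "," with a raw-string regex, then cast to float
clean_prices = raw_prices.replace(r"[\$,]", "", regex=True).astype(float)
print(clean_prices.tolist())
```

Writing the pattern as a raw string keeps the backslash intact and avoids Python's invalid-escape warning.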

Bar plot Number of Listings by Neighborhood

fig = px.histogram(listings, x="neighbourhood", category_orders={'neighbourhood': list(listings.neighbourhood.value_counts().index)}, title='Number of Listings by Neighbourhood')
fig.show()

Bar plot Number of Listings by Neighborhood

Preparing data for geospatial mapping

adam = gpd.read_file("neighbourhoods.geojson")
adam.head()
neighbourhood	neighbourhood_group	geometry
0	Bijlmer-Oost	None	MULTIPOLYGON Z (((4.99167 52.32444 43.06929, 4...
1	Noord-Oost	None	MULTIPOLYGON Z (((5.07916 52.38865 42.95663, 5...
2	Noord-West	None	MULTIPOLYGON Z (((4.93072 52.41161 42.91539, 4...
3	Oud-Noord	None	MULTIPOLYGON Z (((4.95242 52.38983 42.95411, 4...
4	IJburg - Zeeburgereiland	None	MULTIPOLYGON Z (((5.03906 52.35458 43.01664, 5...

gb = listings.neighbourhood.value_counts().reset_index()
gb.head()
index	neighbourhood
0	De Baarsjes - Oud-West	3515
1	De Pijp - Rivierenbuurt	2493
2	Centrum-West	2326
3	Centrum-Oost	1730
4	Westerpark	1490

Plotting Size by Neighborhood

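# Name both columns explicitly: the default names produced by value_counts().reset_index() differ between pandas 1.x and 2.x
gb = listings.neighbourhood.value_counts().rename_axis('neighbourhood').reset_index(name='count')
fig = px.choropleth_mapbox(gb, geojson=adam, color='count',
                       locations="neighbourhood", featureidkey="properties.neighbourhood",
                       center={"lat": 52.3676, "lon": 4.9041}, title='Size by Neighbourhood',
                       mapbox_style="carto-positron", zoom=10, opacity=0.5)
fig.show()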

Printing top 10 neighborhoods

top10_neighbourhoods = list(listings.neighbourhood.value_counts()[:10].index)
top10_neighbourhoods
['De Baarsjes - Oud-West',
 'De Pijp - Rivierenbuurt',
 'Centrum-West',
 'Centrum-Oost',
 'Westerpark',
 'Zuid',
 'Oud-Oost',
 'Bos en Lommer',
 'Oostelijk Havengebied - Indische Buurt',
 'Oud-Noord']

Animated Neighborhoods Yearly Growth

listings_reviewed = listings[listings.number_of_reviews > 0].copy()  # copy so that new columns can be added without SettingWithCopyWarning

listings_reviewed.loc[:, 'first_review_year'] = listings_reviewed['first_review'].dt.year
gb = listings_reviewed.groupby(by=['first_review_year', 'neighbourhood'], as_index=False).size()
gb.first_review_year = gb.first_review_year.astype(int)
fig = px.choropleth_mapbox(gb, geojson=adam, color='size',
                       locations="neighbourhood", featureidkey="properties.neighbourhood",
                       center={"lat": 52.3676, "lon": 4.9041}, title='Neighbourhoods Yearly Growth',
                       mapbox_style="carto-positron", zoom=10, opacity=0.5, animation_frame='first_review_year')
fig.show()

Animated Neighborhoods Yearly Growth 2009

Animated Neighborhoods Yearly Growth 2010

Animated Neighborhoods Yearly Growth 2012

Animated Neighborhoods Yearly Growth 2016

Animated Neighborhoods Yearly Growth 2018

Neighborhoods Cumulative Yearly Growth

res = listings_reviewed.copy()
for year in range(2009, 2023):
    listings_at_year = listings_reviewed[listings_reviewed.first_review_year == year]
    ls = [res]
    for future_year in range(year + 1, 2024):
        l = listings_at_year.copy()
        l.first_review_year = future_year
        ls.append(l)
    res = pd.concat(ls)
fig = px.scatter_mapbox(res.sort_values('first_review_year', ascending=True), lat='latitude', lon='longitude', center={"lat": 52.3676, "lon": 4.9041},
                        zoom=10, mapbox_style="carto-positron", animation_frame='first_review_year', opacity=0.25, title='Neighbourhoods Cumulative Yearly Growth')
fig.show()

Neighborhoods Cumulative Yearly Growth 2011

Neighborhoods Cumulative Yearly Growth 2014

Neighborhoods Cumulative Yearly Growth 2017

Neighborhoods Cumulative Yearly Growth 2020
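
The animation frames can be sanity-checked with a plain table of cumulative counts. The sketch below is not the plotting code above. It uses a tiny made-up frame with a first_review_year column and shows how a groupby followed by cumsum gives the number of listings first reviewed up to each year.

```python
import pandas as pd

# Hypothetical first-review years for four listings
toy = pd.DataFrame({"first_review_year": [2009, 2010, 2010, 2012]})

# New listings per year
new_per_year = toy.groupby("first_review_year").size()

# Fill years without new listings (2011 here), then accumulate
years = range(2009, 2013)
cumulative = new_per_year.reindex(years, fill_value=0).cumsum()
print(cumulative)
```

Each cumulative value should match the number of points drawn in the corresponding animation frame.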

Total Number of Reviews by Neighborhood (~Total Tourists Volume)

draw_choropleth(adam,'number_of_reviews', 'sum', title='Total Number of Reviews by Neighbourhood (~Total Tourists Volume)')

Total Number of Reviews by Neighborhood (~Total Tourists Volume)

Total Number of Reviews Per Month by Neighborhood (~Monthly Airbnb Guests Volume)

listings_reviewed['lifetime_in_months'] = ((listings_reviewed.last_review - listings_reviewed.first_review)/np.timedelta64(1, 'D'))/30 + 1/30
listings_reviewed['load'] = listings_reviewed['number_of_reviews'] / np.ceil(listings_reviewed['lifetime_in_months'])
draw_choropleth(adam,'load', 'sum', only_reviewed=True, title='Total Number of Reviews Per Month by Neighbourhood (~Monthly Airbnb Guests Volume)')

Total Number of Reviews Per Month by Neighborhood (~Monthly Airbnb Guests Volume)
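
As a worked example of the load formula, consider one hypothetical listing. The dates and review count below are invented for illustration. The lifetime is measured in 30-day months and rounded up, so a listing active for 73 days counts as 3 months.

```python
import math
from datetime import date

# Hypothetical listing: first and last review dates and total review count
first_review = date(2018, 1, 1)
last_review = date(2018, 3, 15)
number_of_reviews = 12

days_active = (last_review - first_review).days      # 73 days
lifetime_in_months = days_active / 30 + 1 / 30       # about 2.47 months
load = number_of_reviews / math.ceil(lifetime_in_months)
print(load)  # reviews per (rounded-up) month
```

The extra 1/30 keeps the lifetime above zero for listings whose first and last review fall on the same day, so the division is always defined.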

Median Number of Reviews Per Month by Neighborhood (~Listing Busyness)

draw_choropleth(adam,'load', 'median', only_reviewed=True, title='Median Number of Reviews Per Month by Neighbourhood (~Listing Busyness)')

Median Number of Reviews Per Month by Neighborhood (~Listing Busyness)

Median Price by Neighborhood

draw_choropleth(adam,'price', 'median', title='Median Price by Neighbourhood')

Violin Plot of Price Distribution by Neighborhood (top 10 areas)

listings_in_top10_neighbourhoods = listings[listings.neighbourhood.isin(top10_neighbourhoods)]
len(listings_in_top10_neighbourhoods)
16952
fig = px.violin(listings_in_top10_neighbourhoods, y="price", x="neighbourhood", log_y=False, range_y=[-10, 2000], points="all", box=True, title='Price Distribution by Neighbourhood',
               category_orders={'neighbourhood': list(listings_in_top10_neighbourhoods.groupby('neighbourhood')['price'].aggregate('median').reset_index().sort_values(by='price')['neighbourhood'])}
)
fig.show()

Violin Plot of Price Distribution by Neighborhood (top 10 areas)

Median Location Scores by Neighborhood

draw_choropleth(adam,'review_scores_location', 'median', title='Median Location Scores by Neighbourhood')

Violin Plot of Location Score Distribution by Neighborhood

fig = px.violin(listings, y="review_scores_location", x="neighbourhood", box=True, points="all", range_y=[1.8, 10.5], title='Location Score Distribution by Neighbourhood',
               category_orders={'neighbourhood': list(listings.groupby('neighbourhood')['review_scores_location'].aggregate('mean').reset_index().sort_values(by='review_scores_location')['neighbourhood'])}
)
fig.show()

Violin Plot of Location Score Distribution by Neighborhood

AMS Price Distribution histogram

fig = px.histogram(listings, x='price', nbins=1000, barmode='group', range_x=[0, 500], histnorm='probability', title='Price Distribution')
fig.show()

AMS Price-to-Probability Distribution histogram

fig = px.histogram(listings, x='price', nbins=100, barmode='group', range_x=[0, 500], histnorm='probability', title='Price Distribution')
fig.show()

AMS Price-to-Probability Distribution histogram

Rating by Number of Reviews scatter plot

listings_reviewed.loc[:, 'number_of_reviews_jittered'] = listings_reviewed.number_of_reviews + np.exp(np.random.randn(len(listings_reviewed)) / 10)
listings_reviewed.loc[:, 'review_scores_rating_jittered'] = listings_reviewed.review_scores_rating + np.random.randn(len(listings_reviewed)) / 10
fig = px.scatter(listings_reviewed[listings_reviewed.price < 1000], x='number_of_reviews_jittered', y='review_scores_rating_jittered', color='lifetime_in_months', log_x=True, opacity=0.25, marginal_x='box', marginal_y='histogram', title='Rating by Number of Reviews')
fig.show()

Rating by Number of Reviews scatter plot

Review Scores Correlations heatmap

aspect_scores_feats = ['review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']
scores_feats = ['review_scores_rating'] + aspect_scores_feats
px.imshow(listings[scores_feats].corr(), title='Review Scores Correlations')

Number of Reviews per Month by Price

listings_reviewed.loc[:, 'load_jittered'] = listings_reviewed.load + np.random.randn(len(listings_reviewed)) / 5
listings_reviewed.loc[:, 'price_jittered'] = listings_reviewed.price + np.random.randn(len(listings_reviewed)) * 2

px.histogram(listings_reviewed, x='price', y='load', histfunc='avg', range_x=[0, 300], range_y=[0, 4], nbins=2500, title='Number of Reviews per Month by Price').show()

Number of Reviews per Month by Price Histogram

Number of Reviews by Active Lifetime scatter plot

px.scatter(listings_reviewed, x='lifetime_in_months', y='number_of_reviews', color='price', range_color=[0, 300], range_y=[0, 600], opacity=0.5, title='Number of Reviews by Active Lifetime').show()

Number of Reviews by Active Lifetime scatter plot

Active Lifetime In Months histogram vs boxplot

fig = px.histogram(listings_reviewed, x='lifetime_in_months', nbins=int(listings_reviewed.lifetime_in_months.max()), barmode='group', marginal='box', title='Active Lifetime In Months')
fig.show()

Active Lifetime In Months histogram vs boxplot

Listings Lifetimes Scatter

fig = px.scatter(listings, x='first_review', y='last_review', marginal_x='histogram', marginal_y='histogram', title='Listings Lifetimes Scatter')
fig.show()

Listing Types

fig = px.histogram(listings, x='property_type', color='room_type', barmode='group', title='Listing Types')
fig.show()

Price by Capacity boxplots

fig = px.box(listings, x='accommodates', y='price', range_y=[0, 1000], range_x=[0, 9], title='Price by Capacity')
fig.show()

Project 1: NLP Wordcloud Images

Let’s generate a word cloud image of the aforementioned listings.
Preparing the input text data

listings['listing_name'] = listings.name.astype('string')
print(listings['listing_name'])
0                 Quiet Garden View Room & Super Fast WiFi
1                        Quiet apt near center, great view
2               100%Centre-Studio 1 Private Floor/Bathroom
3                      Lovely apt in City Centre (Jordaan)
4        Romantic, stylish B&B houseboat in canal district
                               ...                        
20025     Family House City + free Parking+garden (160 m2)
20026                    Home Sweet Home in Indische Buurt
20027               Amsterdam Cozy apartment nearby center
20028              Home Sweet Home for a Guest or a Couple
20029          Cosy two bedroom appartment near 'de Pijp'!
Name: listing_name, Length: 20000, dtype: string

txt=listings['listing_name'].str.cat(sep=' ')

Creating and generating the word cloud

# Create and generate a word cloud image:

wordcloud = WordCloud().generate(txt)
type(txt)
str
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Using stop words

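# Create the stopword list, starting from NLTK's English stop words.
# Note: this rebinds the name `stopwords`, which earlier referred to the nltk.corpus.stopwords import.
STOPWORDS = nltk.corpus.stopwords.words('english')
stopwords = set(STOPWORDS)
stopwords.update(["Amsterdam", "city", "Beautiful","centre",'center'])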

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(txt)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The word cloud image after removing stop words
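
A word cloud sizes each word by how often it occurs, so the effect of the stop-word list can be checked with a plain frequency count. The sketch below uses only the standard library. The three titles are abridged from the listing names printed earlier, and the small custom_stopwords set is an illustrative assumption, not the full NLTK list.

```python
from collections import Counter

# Listing titles abridged from the printed sample
titles = [
    "Quiet apt near center, great view",
    "Lovely apt in City Centre",
    "Amsterdam Cozy apartment nearby center",
]

# Illustrative stop-word set (lower-case)
custom_stopwords = {"amsterdam", "city", "centre", "center", "in", "near"}

# Lower-case each token and strip trailing punctuation
words = [
    word.strip(",.!").lower()
    for title in titles
    for word in title.split()
]

# Count the words that survive stop-word filtering
counts = Counter(word for word in words if word and word not in custom_stopwords)
print(counts.most_common(3))
```

Words that dominate this count are the ones that will appear largest in the image.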

Examining the Length of Reviews

reviews = reviews.dropna()
reviews.loc[:, 'length'] = reviews.comments.str.len()
fig = px.histogram(reviews[reviews.length != 0], x='length', nbins=1000, barmode='group', title='Length of Reviews')
fig.show()

Restricting Reviews to Fewer than 750 Characters

reviews = reviews[reviews.length < 750]
reviews_with_rating = reviews.join(listings[['id', 'review_scores_rating']].set_index('id'), on='listing_id', validate='m:1')
txt1=reviews_with_rating['comments'].str.cat(sep=' ')

Updating the word cloud image

#wordcloud = WordCloud().generate(txt1)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(txt1)
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The word cloud image after removing stop words and excluding reviews of 750 characters or more

Project 1: ML Regression of Review Scores

Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from pprint import pprint

def calc_prediction_quality(X_feats, y_feat):
    # Keep only rows where all features and the target are present
    df = listings[X_feats + [y_feat]].dropna(how='any')
    X_train, X_test, y_train, y_test = train_test_split(df[X_feats], df[y_feat], test_size=0.33, random_state=42)
    lr = LinearRegression().fit(X_train, y_train)
    predictions = lr.predict(X_test)
    print(f'LR R2: {r2_score(y_test, predictions)}')
    # Baseline: always predict the training-set mean
    print(f'constant model MSE: {mean_squared_error(y_test, [y_train.mean()] * len(y_test))}')
    print(f'LR MSE: {mean_squared_error(y_test, predictions)}')
    print(f'LR intercept: {lr.intercept_}')
    print('LR weights:')
    # Pair each coefficient with the feature it belongs to
    pprint(dict(zip(X_feats, lr.coef_)))

calc_prediction_quality(aspect_scores_feats, 'review_scores_rating')

LR R2: 0.6843609885371096
constant model MSE: 42.109935110705806
LR MSE: 13.289090701988293
LR intercept: 0.1273328699656986
LR weights:
{'review_scores_accuracy': 2.6342633776024216,
 'review_scores_checkin': 0.8067556404716549,
 'review_scores_cleanliness': 2.159631791937603,
 'review_scores_communication': 1.8040139905123578,
 'review_scores_location': 0.31715042347693845,
 'review_scores_value': 2.2077211691735097}
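
The printed R2 and the two MSE values are closely related: R2 is roughly one minus the ratio of the model MSE to the constant-model MSE. The short check below plugs in the numbers printed above. The agreement is only approximate because r2_score measures against the test-set mean, while the constant model here predicts the training-set mean.

```python
# Values copied from the printed output
constant_mse = 42.109935110705806
lr_mse = 13.289090701988293

# R2 is approximately 1 - MSE(model) / MSE(constant baseline)
approx_r2 = 1 - lr_mse / constant_mse
print(round(approx_r2, 4))
```

In other words, the linear model removes a little over two thirds of the squared error of the naive baseline.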

Random Forest (RF) Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from pprint import pprint

def calc_prediction_quality(X_feats, y_feat):
    df = listings[X_feats + [y_feat]].dropna(how='any')
    X_train, X_test, y_train, y_test = train_test_split(df[X_feats], df[y_feat], test_size=0.33, random_state=42)
    lr = RandomForestRegressor(n_estimators=1000,max_depth=22).fit(X_train, y_train)
    mse = mean_squared_error(y_test, lr.predict(X_test))
    from sklearn.metrics import r2_score
    r2score=r2_score(y_test, lr.predict(X_test))
    print(f'RF R2: {r2score}')
    print(f'RF constant model MSE: {mean_squared_error(y_test, [y_train.mean()] * len(y_test))}')
    print(f'RF MSE: {mse}')

calc_prediction_quality(aspect_scores_feats, 'review_scores_rating')
RF R2: 0.6564884792280963
RF constant model MSE: 42.109935110705806
RF MSE: 14.462584125956381

XGBoost Regressor

import xgboost
from xgboost import XGBRegressor
print(xgboost.__version__)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def calc_prediction_quality(X_feats, y_feat):
    df = listings[X_feats + [y_feat]].dropna(how='any')
    X_train, X_test, y_train, y_test = train_test_split(df[X_feats], df[y_feat], test_size=0.33, random_state=42)
    model = XGBRegressor(n_estimators=1000, max_depth=17, eta=0.1, subsample=0.7, colsample_bytree=0.8)
    lr = model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, lr.predict(X_test))
    from sklearn.metrics import r2_score
    r2score=r2_score(y_test, lr.predict(X_test))
    print(f'XGB R2: {r2score}')
    print(f'XGB constant model MSE: {mean_squared_error(y_test, [y_train.mean()] * len(y_test))}')
    print(f'XGB MSE: {mse}')

calc_prediction_quality(aspect_scores_feats, 'review_scores_rating')
2.0.3
XGB R2: 0.6217210860475758
XGB constant model MSE: 42.109935110705806
XGB MSE: 15.926367196706325

SVR Algorithm

from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def calc_prediction_quality(X_feats, y_feat):
    df = listings[X_feats + [y_feat]].dropna(how='any')
    X_train, X_test, y_train, y_test = train_test_split(df[X_feats], df[y_feat], test_size=0.4, random_state=42)
    model = SVR(C=28.0, epsilon=0.2)
    lr = model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, lr.predict(X_test))
    from sklearn.metrics import r2_score
    r2score=r2_score(y_test, lr.predict(X_test))
    print(f'SVR R2: {r2score}')
    print(f'SVR constant model MSE: {mean_squared_error(y_test, [y_train.mean()] * len(y_test))}')
    print(f'SVR MSE: {mse}')

calc_prediction_quality(aspect_scores_feats, 'review_scores_rating')
SVR R2: 0.6371308133892497
SVR constant model MSE: 42.77398907443572
SVR MSE: 15.518963670618406

Decision Tree (DT) regression

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def calc_prediction_quality(X_feats, y_feat):
    df = listings[X_feats + [y_feat]].dropna(how='any')
    X_train, X_test, y_train, y_test = train_test_split(df[X_feats], df[y_feat], test_size=0.4, random_state=42)
    model = DecisionTreeRegressor(max_depth=28)
    lr = model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, lr.predict(X_test))
    from sklearn.metrics import r2_score
    r2score=r2_score(y_test, lr.predict(X_test))
    print(f'DT R2: {r2score}')
    print(f'DT constant model MSE: {mean_squared_error(y_test, [y_train.mean()] * len(y_test))}')
    print(f'DT MSE: {mse}')

calc_prediction_quality(aspect_scores_feats, 'review_scores_rating')
DT R2: 0.6202272475226304
DT constant model MSE: 42.77398907443572
DT MSE: 16.2418848616904

Comparison of R2-score for 5 regression models

import numpy as np
import matplotlib.pyplot as plt 
 
  
# creating the dataset
data = {'LR':0.684, 'RF':0.656, 'XGB':0.621, 
        'SVR':0.637,'DT':0.620}
models = list(data.keys())
scores = list(data.values())

fig = plt.figure(figsize=(10, 5))

# creating the bar plot
plt.bar(models, scores, color='maroon',
        width=0.4)
 
plt.xlabel("Regressor")
plt.ylabel("R2-Score")
plt.title("R2-Score of 5 Regression Models")
plt.show()

Comparison of R2-score for 5 regression models

Project 2: Tuned RF Regression of Prices

Preparing the input data for Random Forest (RF) regression of AMS house prices in 2021

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
df = pd.read_csv('HousingPrices-Amsterdam-August-2021.csv')
df = df.dropna(axis = 0, inplace = False)
q1 = df.describe()['Price']['25%']
q3 = df.describe()['Price']['75%']
iqr = q3 - q1
max_price = q3 + 1.5 * iqr 
outliers = df[df['Price'] > max_price]  # strictly above the fence, matching the rows dropped below
outliers_count = outliers['Price'].count()
df_count = df['Price'].count()
print('Percentage removed: ' + str(round(outliers_count/df_count * 100, 2)) + '%')
Percentage removed: 7.72%
df = df[df['Price'] <= max_price].copy()  # copy so that new columns can be added safely
df['Zip No'] = df['Zip'].apply(lambda x:x.split()[0])
df['Letters'] = df['Zip'].apply(lambda x:x.split()[-1])
def word_separator(string):
    # Collect the leading alphabetic tokens of an address, i.e. the street name
    words = []
    for element in string.split():
        if not element.isalpha():
            break
        words.append(element)
    return ' '.join(words)
df['Street'] = df['Address'].apply(word_separator)
numerical = ['Price', 'Area', 'Room', 'Lon', 'Lat']
categorical = ['Address', 'Zip No', 'Letters', 'Street']
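
The outlier rule used above (upper fence = Q3 + 1.5 × IQR) is easy to verify on a handful of numbers. The prices below are made up for illustration. NumPy's default percentile interpolation is linear, the same default that pandas describe() uses.

```python
import numpy as np

# Hypothetical prices with one obvious outlier
prices = np.array([200_000, 300_000, 400_000, 500_000, 600_000, 2_000_000])

# Quartiles and interquartile range
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Rows above the upper fence are treated as outliers
upper_fence = q3 + 1.5 * iqr
kept = prices[prices <= upper_fence]
print(q1, q3, upper_fence, kept)
```

Here the fence lands at 950,000, so only the 2,000,000 listing is dropped.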

Applying Label Encoding to categorical data and dropping unused columns

from sklearn.preprocessing import LabelEncoder
for c in categorical:
    lbl = LabelEncoder() 
    lbl.fit(list(df[c].values)) 
    df[c] = lbl.transform(list(df[c].values))

df.drop(['Zip', 'Unnamed: 0', 'Address'], axis =1, inplace = True)
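
Label encoding maps each distinct category to an integer, using the sorted order of the unique values. The sketch below reproduces that mapping with NumPy alone on a few made-up street names, as an illustration of what the LabelEncoder loop above produces.

```python
import numpy as np

# Hypothetical street names
streets = np.array(["Prinsengracht", "Damrak", "Prinsengracht", "Overtoom"])

# Sorted unique classes and the integer code of each row
classes, codes = np.unique(streets, return_inverse=True)
print(classes.tolist())
print(codes.tolist())
```

The codes carry an arbitrary alphabetical order. Tree-based models such as Random Forest usually tolerate this, whereas linear models may read the codes as a meaningful ranking.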

Train/test data splitting and scaling

from sklearn.model_selection import train_test_split
X = df.drop('Price', axis =1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
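
The scaler is fitted on the training split only and then applied to the test split, so no information from the test data leaks into the scaling. The sketch below repeats the same arithmetic by hand with NumPy on made-up floor areas. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

# Hypothetical floor areas in the train and test splits
X_train = np.array([[50.0], [100.0], [150.0]])
X_test = np.array([[200.0]])

# Statistics come from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

The scaled test value is not forced into the training range, which is expected: it is simply expressed in units of the training distribution.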

Applying RandomizedSearchCV to RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

random_grid = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': [1.0, 'sqrt'],  # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

# Base estimator whose hyperparameters are searched
random_forest = RandomForestRegressor()
random_cv = RandomizedSearchCV(estimator = random_forest, param_distributions = random_grid, n_iter = 100, cv = 10, verbose = 2, n_jobs = -1)
random_cv.fit(X_train, y_train)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
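
The log line above follows directly from the search settings: 100 sampled candidates times 10 folds gives 1000 fits. The sketch below also counts how many combinations the full grid contains, which shows why a randomized search is used instead of an exhaustive one. It relies on the standard library only, and the single random draw at the end is purely illustrative.

```python
import random
from math import prod

# Same option counts as the grid above
random_grid = {
    "bootstrap": [True, False],
    "max_depth": list(range(10, 101, 10)) + [None],
    "max_features": ["sqrt", 1.0],  # two options, as in the grid above
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": list(range(200, 2001, 200)),
}

# Size of the exhaustive grid
total_combinations = prod(len(values) for values in random_grid.values())

# Work done by the randomized search
n_iter, cv_folds = 100, 10
total_fits = n_iter * cv_folds

# One illustrative candidate drawn at random
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in random_grid.items()}
print(total_combinations, total_fits, candidate)
```

An exhaustive search over all 3960 combinations with 10 folds would need 39,600 fits, roughly 40 times the cost of the randomized search.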


Getting the best hyperparameters

random_cv.best_params_ 

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 70,
 'bootstrap': False}

Training the tuned RF model

# Reuse the best hyperparameters reported above, including max_features and bootstrap
tuned_random_forest = RandomForestRegressor(n_estimators = 1600, max_depth = 70, min_samples_leaf = 1, min_samples_split = 2, max_features = 'sqrt', bootstrap = False)
# Fit and predict with the tuned model rather than the untuned base estimator
tuned_random_forest.fit(X_train, y_train)
predictions = tuned_random_forest.predict(X_test)
cv = cross_val_score(tuned_random_forest, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print("The Random Forest Regressor with tuned parameters has an RMSE of: " + str(abs(cv.mean())**0.5))
The Random Forest Regressor with tuned parameters has an RMSE of: 92683.47375648536

# Calculate testing scores
from sklearn.metrics import mean_absolute_error, r2_score
test_mae = mean_absolute_error(y_test, predictions)
test_mse = mean_squared_error(y_test, predictions)
test_rmse = test_mse ** 0.5  # RMSE is the square root of MSE; the squared=False argument is deprecated in recent scikit-learn
test_r2 = r2_score(y_test, predictions)
print(test_mae)
60217.95247058822
print(test_mse)
7459469611.399326
print(test_rmse)
86368.22107349048
print(test_r2)
0.8544722257795427
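
The reported test scores are internally consistent: the RMSE is the square root of the MSE. The check below plugs in the printed values.

```python
import math

# Values copied from the printed output
test_mse = 7459469611.399326
reported_rmse = 86368.22107349048

# RMSE is the square root of MSE
test_rmse = math.sqrt(test_mse)
print(test_rmse)
```

RMSE is expressed in the same units as the price, which makes it easier to interpret than MSE.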

Plotting test data vs tuned RF predictions

plt.scatter(y_test,predictions)
plt.title('Tuned Random Forest',fontsize=18)
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)

Finally, let’s make a scatter cross-plot of the test data versus the predictions and add the regression line:

# Generate data
x = y_test
y = predictions

# Initialize layout
fig, ax = plt.subplots(figsize = (9, 9))

# Add scatterplot
ax.scatter(x, y, s=60, alpha=0.7, edgecolors="k")

# Fit linear regression via least squares with numpy.polyfit
# It returns the slope (b) and the intercept (a), highest degree first
# deg=1 means linear fit (i.e. polynomial of degree 1)
b, a = np.polyfit(x, y, deg=1)

# Create sequence 
xseq = x

# Plot regression line
ax.plot(xseq, a + b * xseq, color="r", lw=4);
plt.xlabel('Test Data',fontsize=18)
plt.ylabel('Prediction',fontsize=18)
plt.title('Tuned Random Forest with Regression Line',fontsize=22)

Tuned Random Forest with Regression Line
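
numpy.polyfit returns coefficients from the highest degree down, so for deg=1 the first value is the slope and the second is the intercept. The sketch below confirms this on made-up points that lie exactly on a known line. It then evaluates the line only at the two end points, which is enough to draw a straight segment.

```python
import numpy as np

# Hypothetical actual prices and predictions lying on y = 0.9 x + 20000
actual = np.array([300_000.0, 450_000.0, 600_000.0])
predicted = 0.9 * actual + 20_000.0

# Coefficients are returned highest degree first: slope, then intercept
slope, intercept = np.polyfit(actual, predicted, deg=1)

# Two end points are sufficient to draw the fitted line
line_x = np.array([actual.min(), actual.max()])
line_y = intercept + slope * line_x
print(slope, intercept, line_y)
```

A slope below 1 on such a plot would suggest that the model under-predicts expensive homes and over-predicts cheap ones. A perfect model would give a slope of 1 and an intercept of 0.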

Conclusions

As data scientists, we can support the real estate and REIT sector by building models that analyze market trends and predict house prices.
In this study, we focused on Amsterdam (AMS) and examined short-term rental listings and house prices across its neighborhoods.
Project 1 reports a comprehensive Exploratory Data Analysis (EDA) of the summary information and metrics for Airbnb listings, as well as ML regression of review scores in AMS.
Project 2 uses features such as floor area, number of rooms and location to predict housing prices in and around AMS using the Kaggle dataset.
Through descriptive statistics, we gain insight into the central tendencies and distributions of our data, while correlation analysis helps us understand the relationships between different variables.
Our ML approach encompasses regression, model tuning and NLP techniques that can inform investment decisions, property management, and customer experience in AMS real estate & REIT.
Trained on historical data, these models recognize patterns among multiple variables: in Project 1, linear regression on the six aspect scores gave the best R2 (about 0.68) for the overall rating, and in Project 2 the tuned Random Forest reached a test R2 of about 0.85 with a test RMSE of about 86,368.
This case study illustrates the business value of data science applications in real estate, which can help companies obtain timely information on customer needs and interests, property valuation, and local market insights.
Real estate investors may benefit from such applications through a better balance of profit and risk, improved competitive advantage, automated operations, and increased employee efficiency.

Explore More

References

Embed Socials

A friendly reminder of how insane property prices are. Here is 400 years of data from Amsterdam. We're living through the biggest real estate bubble ever. pic.twitter.com/asP9FL4LDC
— ValuationBot (@ValuationBot) October 24, 2022

In this blog, we have discussed how to extract data from real estate websites. The objective is to scrape real estate data from Amsterdam with the help of Python.

>> https://t.co/5FdLzUycG6 #RealEstateWebsitesDataScraper #ScrapingPropertyDataPython #actowizsolutions #uk #usa pic.twitter.com/n8oGrbS3H7
— Actowiz Solutions (@actowizsolution) October 3, 2022

💼 #JobAdvert Market Information Associate (Real Estate), Amsterdam. Are you experienced in working with #RealEstate data and looking for a change? Take on a new challenge in the Netherlands as you join the diverse & growing Research & Market Information team. #Careers #DataJobs pic.twitter.com/sMkptjRfHt
— steve (@redtigerconsult) January 3, 2024

Amsterdam's Heerengracht offers a unique source of data for long-term price appreciation of high-quality residential real estate. Once adjusted for inflation, yes there is price appreciation but it pales in comparison to equities.

More: https://t.co/kvuixw8kfE pic.twitter.com/sJo7kcpEIu
— OneBagger (@onebeggar) November 20, 2020

Infographics

Real estate ML regression algorithm explained.

Supervised ML/AI linear regression of house prices: TensorFlow demo.

Real estate GCP ML/AI workflow deployed.

Real estate supervised ML/AI linear regression: USA house prices demo example

CA median house value, population and geospatial map
