In the following notebooks, we will go through the implementation of each of the steps in the Machine Learning Pipeline.
We will discuss:
We will use the house price dataset available on Kaggle.com. See below for more details.
===================================================================================================
The aim of the project is to build a machine learning model to predict the sale price of homes based on different explanatory variables describing aspects of residential houses.
Predicting house prices is useful to identify fruitful investments or to determine whether the price advertised for a house is over or under-estimated.
We aim to minimise the difference between the real price and the price estimated by our model. We will evaluate model performance with the:
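As an illustration of what "minimising the difference between the real price and the estimated price" can mean in practice, below is a minimal sketch using the root mean squared error. This is just one common regression metric used for illustration (the course specifies the exact metrics); it assumes scikit-learn and NumPy are available, and the price arrays are made-up placeholders.

# a minimal sketch (illustration only): quantify the difference between
# real and predicted sale prices with the root mean squared error
# the y_true / y_pred values below are made-up placeholders
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 150000, 320000])  # hypothetical real sale prices
y_pred = np.array([210000, 140000, 300000])  # hypothetical model estimates

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f'RMSE: {rmse:,.0f}')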
Instructions are also available in the lecture "Download Dataset" in section 1 of the course.
Visit the Kaggle Website.
Remember to log in.
Scroll down to the bottom of the page, click on the link 'train.csv', and then click the blue 'download' button towards the right of the screen to download the dataset.
Then download the file called 'test.csv' and save it in the same directory as the notebooks.
Note the following:
Let's go ahead and load the dataset.
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for the yeo-johnson transformation
import scipy.stats as stats

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)
# load dataset
data = pd.read_csv('train.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()
# drop id, it is just a number given to identify each house
data.drop('Id', axis=1, inplace=True)

data.shape
The house price dataset contains 1460 rows, that is, houses, and 80 columns, i.e., variables.
79 are predictive variables and 1 is the target variable: SalePrice
We will analyse the following:
The target variable
Variable types (categorical and numerical)
Missing data
Numerical variables
Categorical variables
Additional Reading Resources
Let's begin by exploring the target distribution.
# histogram to evaluate target distribution
data['SalePrice'].hist(bins=50, density=True)

plt.ylabel('Number of houses')
plt.xlabel('Sale Price')
plt.show()
We can see that the target is continuous, and the distribution is skewed towards the right.
We can improve the value spread with a mathematical transformation.
# let's transform the target using the logarithm
np.log(data['SalePrice']).hist(bins=50, density=True)

plt.ylabel('Number of houses')
plt.xlabel('Log of Sale Price')
plt.show()
Now the distribution looks more Gaussian.
Next, let's identify the categorical and numerical variables
# let's identify the categorical variables
# we will capture those of type *object*
cat_vars = [var for var in data.columns if data[var].dtype == 'O']

# MSSubClass is also categorical by definition, despite its numeric values
# (you can find the definitions of the variables in the data_description.txt
# file available on Kaggle, on the same website where you downloaded the data)

# let's add MSSubClass to the list of categorical variables
cat_vars = cat_vars + ['MSSubClass']

# number of categorical variables
len(cat_vars)
# cast all variables as categorical
data[cat_vars] = data[cat_vars].astype('O')
# now let's identify the numerical variables
num_vars = [
    var for var in data.columns if var not in cat_vars and var != 'SalePrice'
]

# number of numerical variables
len(num_vars)
Let's go ahead and find out which variables of the dataset contain missing values.
# make a list of the variables that contain missing values
vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]

# determine percentage of missing values (expressed as decimals)
# and display the result ordered by % of missing data
data[vars_with_na].isnull().mean().sort_values(ascending=False)
Our dataset contains a few variables with a large proportion of missing values (the 4 variables at the top of the list), and some other variables with a small percentage of missing observations.
This means that to train a machine learning model with this data set, we need to impute the missing data in these variables.
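As a preview, here is a minimal sketch of one simple imputation strategy, for illustration only (the imputation actually used is presumably covered in the feature engineering notebooks): fill missing categorical values with a dedicated 'Missing' label and missing numerical values with the median.

# a minimal sketch of a simple imputation strategy (illustration only)
tmp = data.copy()

for var in vars_with_na:
    if data[var].dtype == 'O':
        # categorical variable: add a dedicated 'Missing' label
        tmp[var] = tmp[var].fillna('Missing')
    else:
        # numerical variable: fill with the median
        tmp[var] = tmp[var].fillna(tmp[var].median())

# check that no missing values remain
tmp[vars_with_na].isnull().sum().sum()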
We can also visualize the percentage of missing values in the variables as follows:
# plot
data[vars_with_na].isnull().mean().sort_values(
    ascending=False).plot.bar(figsize=(10, 4))
plt.ylabel('Percentage of missing data')
plt.axhline(y=0.90, color='r', linestyle='-')
plt.axhline(y=0.80, color='g', linestyle='-')
plt.show()
# now we can determine which variables, from those with missing data,
# are numerical and which are categorical
cat_na = [var for var in cat_vars if var in vars_with_na]
num_na = [var for var in num_vars if var in vars_with_na]

print('Number of categorical variables with na: ', len(cat_na))
print('Number of numerical variables with na: ', len(num_na))
num_na
cat_na
Let's evaluate the price of the house in those observations where the information is missing. We will do this for each variable that shows missing data.
def analyse_na_value(df, var):

    # copy of the dataframe, so that we do not override the original data
    # see the link for more details about pandas.copy()
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
    df = df.copy()

    # let's make an interim variable that indicates 1 if the
    # observation was missing or 0 otherwise
    df[var] = np.where(df[var].isnull(), 1, 0)

    # let's compare the mean SalePrice in the observations where data is missing
    # vs the observations where data is available

    # determine the mean price in the groups 1 and 0,
    # and the standard deviation of the sale price,
    # and capture the results in a temporary dataset
    tmp = df.groupby(var)['SalePrice'].agg(['mean', 'std'])

    # plot into a bar graph
    tmp.plot(kind="barh", y="mean", legend=False,
             xerr="std", title="Sale Price", color='green')

    plt.show()
# let's run the function on each variable with missing data
for var in vars_with_na:
    analyse_na_value(data, var)
In some variables, the average Sale Price in houses where the information is missing, differs from the average Sale Price in houses where information exists. This suggests that data being missing could be a good predictor of Sale Price.
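Because missingness itself may carry information, one option for modelling is to add binary "missing indicator" features. Below is a rough sketch of this idea; it is an illustration only and is not used later in this notebook.

# a rough sketch (not used later in this notebook): add binary
# 'missing indicator' features for each variable with missing data
tmp = data.copy()

for var in vars_with_na:
    # 1 if the value was missing in the original data, 0 otherwise
    tmp[var + '_na'] = np.where(tmp[var].isnull(), 1, 0)

# peek at a few of the new indicator columns
tmp[[var + '_na' for var in vars_with_na[:5]]].head()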
Let's go ahead and find out what numerical variables we have in the dataset
print('Number of numerical variables: ', len(num_vars))

# visualise the numerical variables
data[num_vars].head()
We have 4 year variables in the dataset:
We generally don't use date variables in their raw format. Instead, we extract information from them. For example, we can capture the difference in years between the year the house was built and the year the house was sold.
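As a quick illustration of that derived feature, the sketch below computes the elapsed years between construction and sale (the same difference is computed inside a plotting function later in this notebook).

# a quick illustration of a derived 'elapsed years' feature
# (the same difference is computed inside a plotting function further below)
elapsed_years = data['YrSold'] - data['YearBuilt']
elapsed_years.describe()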
# list of variables that contain year information
year_vars = [var for var in num_vars if 'Yr' in var or 'Year' in var]

year_vars
# let's explore the values of these temporal variables
for var in year_vars:
    print(var, data[var].unique())
    print()
As expected, the values are years.
We can explore the evolution of the sale price with the years in which the house was sold:
# plot median sale price vs year in which it was sold
data.groupby('YrSold')['SalePrice'].median().plot()
plt.ylabel('Median House Price')
There has been a drop in house value over the years in which the houses were sold. That is unusual; in real life, house prices typically go up as years go by.
Let's explore a bit further.
Let's plot the price of sale vs year in which it was built
# plot median sale price vs year in which it was built
data.groupby('YearBuilt')['SalePrice'].median().plot()
plt.ylabel('Median House Price')
We can see that newly built / younger houses tend to be more expensive.
Could it be that lately older houses were sold? Let's have a look at that.
For this, we will capture the elapsed years between the Year variables and the year in which the house was sold:
def analyse_year_vars(df, var):
    df = df.copy()

    # capture difference between a year variable and year
    # in which the house was sold
    df[var] = df['YrSold'] - df[var]

    df.groupby('YrSold')[var].median().plot()
    plt.ylabel('Time from ' + var)
    plt.show()


for var in year_vars:
    if var != 'YrSold':
        analyse_year_vars(data, var)
From the plots, we see that towards 2010, the houses sold had older garages and had not been remodelled recently; that might explain why we see cheaper sale prices in recent years, at least in this dataset.
We can now plot instead the time since last remodelled, or time since built, and sale price, to see if there is a relationship.
def analyse_year_vars(df, var):
    df = df.copy()

    # capture difference between a year variable and year
    # in which the house was sold
    df[var] = df['YrSold'] - df[var]

    plt.scatter(df[var], df['SalePrice'])
    plt.ylabel('SalePrice')
    plt.xlabel(var)
    plt.show()


for var in year_vars:
    if var != 'YrSold':
        analyse_year_vars(data, var)
We see a tendency for the price to decrease as the house gets older. In other words, the longer the time between when the house was built or remodelled and the sale date, the lower the sale price.
This makes sense, because an older house is likely to look dated and may need repairs.
Let's go ahead and find which variables are discrete, i.e., show a finite number of values
# let's make a list of discrete variables
discrete_vars = [var for var in num_vars if len(
    data[var].unique()) < 20 and var not in year_vars]

print('Number of discrete variables: ', len(discrete_vars))
# let's visualise the discrete variables
data[discrete_vars].head()
These discrete variables tend to be qualifications (Qual) or grading scales (Cond), or refer to the number of rooms or units (FullBath, KitchenAbvGr, GarageCars).
We expect the sale price to increase with higher values of these variables.
Let's go ahead and analyse their contribution to the house price.
MoSold is the month in which the house was sold.
for var in discrete_vars:
    # make boxplot with Catplot
    sns.catplot(x=var, y='SalePrice', data=data, kind="box", height=4, aspect=1.5)
    # add data points to boxplot with stripplot
    sns.stripplot(x=var, y='SalePrice', data=data, jitter=0.1, alpha=0.3, color='k')
    plt.show()
For most discrete numerical variables, we see an increase in sale price with quality, overall condition, number of rooms, or surface.
For some variables, we don't see this tendency; those variables are most likely not good predictors of sale price.
Let's go ahead and find the distribution of the continuous variables. We will consider as continuous all numerical variables that are neither temporal nor discrete.
# make list of continuous variables
cont_vars = [
    var for var in num_vars if var not in discrete_vars + year_vars]

print('Number of continuous variables: ', len(cont_vars))
# let's visualise the continuous variables
data[cont_vars].head()
# let's plot histograms for all continuous variables
data[cont_vars].hist(bins=30, figsize=(15, 15))
plt.show()
The variables are not normally distributed, and a few of them, like 3SsnPorch, ScreenPorch and MiscVal, are extremely skewed.
Sometimes, transforming the variables to improve the value spread improves model performance. But it is unlikely that a transformation will dramatically change the distribution of the super skewed variables.
We can apply a Yeo-Johnson transformation to variables like LotFrontage, LotArea and BsmtUnfSF, and a binary transformation to variables like 3SsnPorch, ScreenPorch and MiscVal.
Let's go ahead and do that.
# first make a list with the super skewed variables
# for later
skewed = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]
# capture the remaining continuous variables
cont_vars = [
    'LotFrontage',
    'LotArea',
    'MasVnrArea',
    'BsmtFinSF1',
    'BsmtUnfSF',
    'TotalBsmtSF',
    '1stFlrSF',
    '2ndFlrSF',
    'GrLivArea',
    'GarageArea',
    'WoodDeckSF',
    'OpenPorchSF',
]
# Let's go ahead and analyse the distributions of the variables
# after applying a yeo-johnson transformation

# temporary copy of the data
tmp = data.copy()

for var in cont_vars:
    # transform the variable - yeo-johnson
    tmp[var], param = stats.yeojohnson(data[var])

# plot the histograms of the transformed variables
tmp[cont_vars].hist(bins=30, figsize=(15, 15))
plt.show()
For LotFrontage and MasVnrArea the transformation did not do an amazing job.
For the others, the values seem to be spread more evenly in the range.
Whether this helps improve the predictive power remains to be seen. To determine if this is the case, we should train a model with the original values and another with the transformed values, then compare model performance and feature importance. However, that is beyond the scope of this course.
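For reference, here is a rough sketch of what such a comparison could look like, using a simple linear model and cross-validation. This is a hypothetical illustration, not part of the course analysis; it assumes scikit-learn is available and reuses the tmp dataframe with the Yeo-Johnson transformed variables from the cell above.

# a rough sketch (hypothetical): fit the same simple model on the original
# and on the transformed version of a variable and compare cross-validated errors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def quick_score(df, var):
    # drop rows with missing values in this variable, for simplicity
    sub = df[[var, 'SalePrice']].dropna()
    scores = cross_val_score(
        LinearRegression(), sub[[var]], np.log(sub['SalePrice']),
        scoring='neg_mean_squared_error', cv=3)
    return -scores.mean()


# tmp holds the yeo-johnson transformed variables from the cell above
for var in ['LotArea', 'GrLivArea']:
    print(var,
          '- original:', round(quick_score(data, var), 4),
          '- transformed:', round(quick_score(tmp, var), 4))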
Here, we will do a quick visual exploration instead:
# let's plot the original or transformed variables
# vs sale price, and see if there is a relationship

for var in cont_vars:

    plt.figure(figsize=(12, 4))

    # plot the original variable vs sale price
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], np.log(data['SalePrice']))
    plt.ylabel('Sale Price')
    plt.xlabel('Original ' + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(tmp[var], np.log(tmp['SalePrice']))
    plt.ylabel('Sale Price')
    plt.xlabel('Transformed ' + var)

    plt.show()
By eye, the transformation seems to improve the relationship only for LotArea.
Let's try a different transformation now. Most variables contain the value 0, and thus we can't apply the logarithmic transformation, but we can certainly do that for the following variables:
["LotFrontage", "1stFlrSF", "GrLivArea"]
So let's do that and see if that changes the variable distribution and its relationship with the target.
# Let's go ahead and analyse the distributions of these variables
# after applying a logarithmic transformation
tmp = data.copy()

for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:
    # transform the variable with logarithm
    tmp[var] = np.log(data[var])

tmp[["LotFrontage", "1stFlrSF", "GrLivArea"]].hist(bins=30)
plt.show()
The distributions of the variables now look more Gaussian.
Let's go ahead and evaluate their relationship with the target.
# let's plot the original or transformed variables
# vs sale price, and see if there is a relationship

for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:

    plt.figure(figsize=(12, 4))

    # plot the original variable vs sale price
    plt.subplot(1, 2, 1)
    plt.scatter(data[var], np.log(data['SalePrice']))
    plt.ylabel('Sale Price')
    plt.xlabel('Original ' + var)

    # plot transformed variable vs sale price
    plt.subplot(1, 2, 2)
    plt.scatter(tmp[var], np.log(tmp['SalePrice']))
    plt.ylabel('Sale Price')
    plt.xlabel('Transformed ' + var)

    plt.show()
The transformed variables show a better spread of values, which may, in turn, help make better predictions.
Now let's return to the super skewed variables. Let's transform them into binary variables and see how predictive they are:
for var in skewed:

    tmp = data.copy()

    # map the variable values into 0 and 1
    tmp[var] = np.where(data[var] == 0, 0, 1)

    # determine mean sale price in the mapped values
    tmp = tmp.groupby(var)['SalePrice'].agg(['mean', 'std'])

    # plot into a bar graph
    tmp.plot(kind="barh", y="mean", legend=False,
             xerr="std", title="Sale Price", color='green')

    plt.show()
There seems to be a difference in sale price between the mapped values, but the confidence intervals overlap, so most likely this is not significant or predictive.
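If we wanted to formalise this visual impression, we could run a simple significance test between the two groups. Below is a rough sketch using Welch's t-test from scipy.stats; it is an illustration only, not part of the original analysis.

# a rough sketch (not part of the original analysis): t-test of SalePrice
# between the zero and non-zero groups of each super skewed variable
for var in skewed:
    zero_group = data.loc[data[var] == 0, 'SalePrice']
    nonzero_group = data.loc[data[var] != 0, 'SalePrice']

    # Welch's t-test, which does not assume equal variances
    result = stats.ttest_ind(zero_group, nonzero_group, equal_var=False)
    print(f'{var}: p-value = {result.pvalue:.4f}')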
Let's go ahead and analyse the categorical variables present in the dataset.
print('Number of categorical variables: ', len(cat_vars))
# let's visualise the values of the categorical variables
data[cat_vars].head()
Let's evaluate how many different categories are present in each of the variables.
# we count unique categories with pandas nunique()
# and then plot them in descending order
data[cat_vars].nunique().sort_values(ascending=False).plot.bar(figsize=(12, 5))
All the categorical variables show low cardinality: they have only a few different labels. That is good, as we won't need to tackle cardinality during our feature engineering lecture.
There are a number of variables that refer to the quality of some aspect of the house, for example the garage, the fence, or the kitchen. We will replace these categories with numbers that increase with the quality of the place or room.
The mappings can be obtained from the Kaggle Website. One example:
# re-map strings to numbers, which determine quality
qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond',
             ]

for var in qual_vars:
    data[var] = data[var].map(qual_mappings)
exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4, 'Missing': 0, 'NA': 0}

var = 'BsmtExposure'

data[var] = data[var].map(exposure_mappings)
finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

finish_vars = ['BsmtFinType1', 'BsmtFinType2']

for var in finish_vars:
    data[var] = data[var].map(finish_mappings)
garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

var = 'GarageFinish'

data[var] = data[var].map(garage_mappings)
fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

var = 'Fence'

data[var] = data[var].map(fence_mappings)
# capture all quality variables
qual_vars = qual_vars + finish_vars + ['BsmtExposure', 'GarageFinish', 'Fence']
# now let's plot the house mean sale price based on the quality of the
# various attributes

for var in qual_vars:
    # make boxplot with Catplot
    sns.catplot(x=var, y='SalePrice', data=data, kind="box", height=4, aspect=1.5)
    # add data points to boxplot with stripplot
    sns.stripplot(x=var, y='SalePrice', data=data, jitter=0.1, alpha=0.3, color='k')
    plt.show()
For most attributes, the increase in house price with the value of the variable is quite clear.
# capture the remaining categorical variables
# (those that we did not re-map)
cat_others = [
    var for var in cat_vars if var not in qual_vars
]

len(cat_others)
Let's go ahead and investigate now if there are labels that are present only in a small number of houses:
def analyse_rare_labels(df, var, rare_perc):
    df = df.copy()

    # determine the % of observations per category
    tmp = df.groupby(var)['SalePrice'].count() / len(df)

    # return categories that are rare
    return tmp[tmp < rare_perc]


# print categories that are present in less than
# 1 % of the observations
for var in cat_others:
    print(analyse_rare_labels(data, var, 0.01))
    print()
Some of the categorical variables show multiple labels that are present in less than 1% of the houses.
Labels that are under-represented in the dataset tend to cause over-fitting of machine learning models.
That is why we want to remove them.
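One common way to handle them is to group all rare labels into a single 'Rare' category. Below is a minimal sketch of that idea (the actual handling presumably happens in the feature engineering notebooks); the group_rare_labels helper and the choice of 'Neighborhood' as an example are illustrative.

# a minimal sketch: group labels present in less than 1% of the houses
# into a single 'Rare' category
def group_rare_labels(df, var, rare_perc=0.01):
    df = df.copy()

    # frequency of each category
    freq = df[var].value_counts(normalize=True)

    # labels that appear in at least rare_perc of the observations
    frequent = freq[freq >= rare_perc].index

    # replace all other labels with 'Rare'
    df[var] = np.where(df[var].isin(frequent), df[var], 'Rare')
    return df


# example with one of the remaining categorical variables
group_rare_labels(data, 'Neighborhood')['Neighborhood'].value_counts().tail()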
Finally, we want to explore the relationship between the categories of the different variables and the house sale price:
for var in cat_others:
    # make boxplot with Catplot
    sns.catplot(x=var, y='SalePrice', data=data, kind="box", height=4, aspect=1.5)
    # add data points to boxplot with stripplot
    sns.stripplot(x=var, y='SalePrice', data=data, jitter=0.1, alpha=0.3, color='k')
    plt.show()
Clearly, the categories give information on the SalePrice, as different categories show different median sale prices.
Disclaimer:
There is certainly more that can be done to understand the nature of this data, the relationship of these variables with the target, SalePrice, and the distribution of the variables themselves.
However, we hope that through this notebook we gave you a flavour of what data analysis looks like.