House Price Prediction:
Part 1: Data Exploration
- Part 1: Data Exploration
- Objective:
- Import Python Packages:
- Import & Clean Data:
- PCA Analysis of the data:
- References:
I completed the WQU Machine Learning course 3 months ago and wanted to explore some new challenges. As a result, I am exploring this Kaggle competition for leisure, following a website cited in the references.
Objective:
Predict house sale prices (the SalePrice column) from the property features in the provided data sets.
Import Python Packages:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import sklearn as sk
Import & Clean Data:
- Two data sets are provided: one for testing and the other for training.
- We import each of the CSV files into a pandas DataFrame and remove any unwanted details.
df_test = pd.read_csv('test.csv')
df_train = pd.read_csv('train.csv')
df_train.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
print(df_train.shape)
print(df_test.shape)
(1460, 81)
(1459, 80)
Visualise the data:
- The data contains so many NaN values that it may be wise not to simply drop the affected rows.
- Below we use a simple function to tally all the unique entries in each column, including the NaN values (a one-line pandas equivalent is sketched after this list).
- Since NaN is not a good key in a dictionary, we employ a workaround and count the missing entries under the string 'NaN'.
- We notice that for the following columns the 'NaN' values are too many, and we do not expect them to contribute significantly to the ML algorithm:
- Alley
- FireplaceQu
- PoolQC
- Fence
- MiscFeature
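- As an aside, pandas can produce the same NaN-inclusive tally for a single column in one line; a minimal sketch, equivalent to the unique_tally function defined below:
# value_counts with dropna=False keeps the NaN entries in the tally
print(df_train['Alley'].value_counts(dropna=False))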
def unique_tally(data):
    '''Returns a dictionary of the unique entries in a data column and their frequencies'''
    isnan = list(data.isnull())
    res = {}
    for i in range(len(data)):
        if isnan[i]:
            key_ = 'NaN'
        else:
            key_ = data.iloc[i]
        if key_ in res:
            res[key_] += 1
        else:
            res[key_] = 1
    return res
tallies = []
c = list(df_train.columns)
for col in c:
    tallies.append(unique_tally(df_train[col]))
indx = []
for k in range(len(tallies)):
    if 'NaN' in tallies[k]:
        indx.append(k)
len(indx)
plt.figure(figsize=(20, 30))
for i in range(len(indx)):
    plt.subplot(5, 4, i + 1)
    plt.bar(range(len(tallies[indx[i]])), list(tallies[indx[i]].values()), align='center', label=c[indx[i]])
    plt.xticks(range(len(tallies[indx[i]])), list(tallies[indx[i]].keys()), rotation=50)
    plt.legend()
# Analyse the training data in a similar way using pandas functions
df_train.isnull().sum().sort_values(ascending=False)
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
...
CentralAir 0
SaleCondition 0
Heating 0
TotalBsmtSF 0
Id 0
Length: 81, dtype: int64
# Analyse the test data in a similar way using Pandas functions
df_test.isnull().sum().sort_values(ascending=False)
PoolQC 1456
MiscFeature 1408
Alley 1352
Fence 1169
FireplaceQu 730
...
Electrical 0
CentralAir 0
HeatingQC 0
Foundation 0
Id 0
Length: 80, dtype: int64
- So we will drop the following columns, since more than 50% of their values are 'NaN' (checked numerically in the sketch after this list):
- PoolQC
- MiscFeature
- Alley
- Fence
- Also, we will drop the ‘Id’ column as it is irrelevant to the calculations
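- To make the 50% threshold explicit, a quick sketch of the per-column missing-value percentages using pandas:
# Percentage of missing values per column; only PoolQC, MiscFeature, Alley and Fence exceed 50% in the training set
print((df_train.isnull().mean() * 100).sort_values(ascending=False).head(6))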
df_train1 = df_train.drop(['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis = 1)
df_test1 = df_test.drop(['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis = 1)
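- A quick sanity check (a sketch) that only the five intended columns were removed:
# Expect (1460, 76) and (1459, 75): the original 81/80 columns minus the 5 dropped
print(df_train1.shape)
print(df_test1.shape)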
Replacing NaN Values:
- Clearly, not all ‘NaN’ values need to be discarded.
- The data columns have various data types, so we need to replace these missing values in a consistent manner.
- We do this for both the test and training data
- We will replace ‘NaN’ values depending on some conditions as follows:
- If the data in the column is numerical replace NaN with the mean
- If the data in the column is of string type, replace NaN with modal category
- We proceed as follows:
def replace_nan(df):
    '''Replace NaN values in place: column mean for numeric columns, modal value for object columns.'''
    col = 0
    c = list(df.columns)
    for i in df.dtypes:
        if i in [np.int64, np.float64]:
            df[c[col]] = df[c[col]].fillna(df[c[col]].mean())
        elif i == object:
            df[c[col]] = df[c[col]].fillna(df[c[col]].mode()[0])
        col += 1
replace_nan(df_train1)
replace_nan(df_test1)
sb.heatmap(df_train1.isnull(),yticklabels=False,cbar=False,cmap='coolwarm')
sb.heatmap(df_test1.isnull(),yticklabels=False,cbar=False,cmap='coolwarm')
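- As a numeric cross-check of the heatmaps above, a minimal sketch:
# Both totals should be zero if every NaN has been replaced
print(df_train1.isnull().sum().sum())
print(df_test1.isnull().sum().sum())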
Convert Categorical Data:
- All categorical data needs to be converted into numerical categories
- This converts the features into the numerical form that the ML algorithms expect.
def category_to_num(df):
    '''Takes in a column of data and determines how many unique values there are.
    Each value is assigned a unique natural number & the column is updated in place.
    Returns the categories.'''
    categs = sorted(list(df.unique()))
    for num in range(len(categs)):
        df.loc[df == categs[num]] = num
    return categs
ci = 0
col = list(df_train1.columns)
categ = {}
for dt in df_train1.dtypes:
    if dt == object:
        categs = category_to_num(df_train1[col[ci]])
        categ[col[ci]] = categs
    ci += 1
df_train1.head()
C:\Users\zmakumbe\.conda\envs\wqu_ml_fin\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value)
MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | 3 | 65.0 | 8450 | 1 | 3 | 3 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 8 | 4 | 208500 |
1 | 20 | 3 | 80.0 | 9600 | 1 | 3 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 8 | 4 | 181500 |
2 | 60 | 3 | 68.0 | 11250 | 1 | 0 | 3 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 8 | 4 | 223500 |
3 | 70 | 3 | 60.0 | 9550 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 8 | 0 | 140000 |
4 | 60 | 3 | 84.0 | 14260 | 1 | 0 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 8 | 4 | 250000 |
5 rows × 76 columns
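- The SettingWithCopyWarning above is raised because we assign into a column Series via .loc. One possible way to avoid it (a sketch with a hypothetical helper, not the approach used here) is to build the mapping once and write it back through the parent DataFrame with .map():
# Hypothetical alternative to category_to_num: map categories to integers via the
# parent DataFrame, which avoids writing into a slice
def category_to_num_map(frame, column):
    categs = sorted(frame[column].unique())
    mapping = {cat: num for num, cat in enumerate(categs)}
    frame[column] = frame[column].map(mapping)
    return categs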
ci = 0
col = list(df_test1.columns)
categ = {}
for dt in df_test1.dtypes:
    if dt == object:
        categs = category_to_num(df_test1[col[ci]])
        categ[col[ci]] = categs
    ci += 1
plt.figure(figsize=(10,5))
y = df_train.SalePrice
sb.set_style('whitegrid')
plt.subplot(121)
sb.distplot(y)
df_train['SalePrice_log'] = np.log(df_train.SalePrice)
y2 = df_train.SalePrice_log
plt.subplot(122)
sb.distplot(y2)
plt.show()
C:\Users\zmakumbe\.conda\envs\wqu_ml_fin\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Data Correlation:
- It is important to determine any interdependencies between the features, if they exist.
# Let's explore the correlations in our data set
plt.figure(figsize=(20,20))
sb.heatmap(df_train.corr())
<AxesSubplot:>
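- Since the full heatmap is dense, it also helps to list the features most strongly correlated with SalePrice; a minimal sketch using the same numeric-only correlation matrix as the heatmap:
# Ten features with the strongest absolute correlation to SalePrice
# (SalePrice itself and its log column are excluded)
corr = df_train.corr()
print(corr['SalePrice'].abs().sort_values(ascending=False).drop(['SalePrice', 'SalePrice_log']).head(10))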
Visualising the output data:
- Above, taking the SalePrice data (which will be our output variable), we plotted its distribution alongside that of its logarithm.
- We find that the raw data is skewed, but the log-transformed data has a much better distribution (quantified in the sketch below).
- Such transformations help us avoid having to remove outliers.
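- To quantify the improvement rather than judging it by eye, a short sketch using pandas' built-in skewness estimate:
# Skewness near 0 indicates a roughly symmetric distribution
print('Skew of SalePrice:     ', round(df_train['SalePrice'].skew(), 2))
print('Skew of log(SalePrice):', round(df_train['SalePrice_log'].skew(), 2))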
PCA Analysis of the data:
- Considering how many columns we have, as well as our hunch about the 'NaN' values, we expect some columns to be redundant.
- We conduct a Principal Component Analysis (PCA) in order to determine if a smaller set of the data can be used to determine the output
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
#Setting the input and output variables
x = df_train1.drop('SalePrice', axis=1)   # c still lists the columns dropped earlier, so select from df_train1 directly
y = df_train1['SalePrice']
#Splitting the data into training and testing data for a trial run
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
pca = PCA(n_components = 50)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
print('%-age of variance explained by the 50 principal components')
np.round(explained_variance*100,1)
%-age of variance explained by the 50 principal components
array([13.7, 5.6, 4.9, 4. , 3. , 2.8, 2.4, 2.3, 2.2, 2.1, 2. ,
2. , 1.9, 1.8, 1.7, 1.6, 1.6, 1.6, 1.5, 1.5, 1.5, 1.5,
1.4, 1.4, 1.4, 1.3, 1.3, 1.2, 1.2, 1.2, 1.2, 1.1, 1.1,
1.1, 1. , 1. , 1. , 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8,
0.8, 0.7, 0.7, 0.7, 0.7, 0.7])
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_) )
plt.xticks(np.arange(start=0, stop=len(pca.explained_variance_ratio_), step=1),rotation = 70)
plt.grid()
plt.show()
- After conducting the PCA we find that we need at least
- 35 principal components to explain at least 80% of the variation in the data,
- 40 principal components to explain at least 85% of the variation in the data, and
- 47 principal components to explain at least 90% of the variation in the data (these thresholds can also be read off programmatically, as sketched below).
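- A minimal sketch for reading these thresholds off the cumulative explained variance (it assumes each target is reached within the 50 retained components, which the output above shows it is):
# First component count whose cumulative explained variance reaches each target
cum_var = np.cumsum(pca.explained_variance_ratio_)
for target in (0.80, 0.85, 0.90):
    n_needed = int(np.argmax(cum_var >= target)) + 1
    print(f'{target:.0%} of the variance: {n_needed} components')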
# Create linear regression object
regr = LinearRegression()
# Fit
regr.fit(X_train, y_train)
# R^2 score on the held-out test split
regr.score(X_test, y_test)
0.6565305727301065
pca = PCA(n_components = 50)
regr_pca = LinearRegression()
# Fit: learn the PCA projection on the training split only,
# then apply the same projection to the test split
X_pca_train = pca.fit_transform(X_train)
X_pca_test = pca.transform(X_test)
regr_pca.fit(X_pca_train, y_train)
# Score the PCA-based model (regr_pca, not the earlier regr) on the held-out data
regr_pca.score(X_pca_test, y_test)
#cross_val_score(regr_pca, X_pca_train, y_train).mean()
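- The commented-out cross_val_score line hints at a more robust check than a single train/test split; a minimal sketch, reusing the PCA-transformed X_train from above:
from sklearn.model_selection import cross_val_score
# Mean and spread of 5-fold cross-validated R^2 on the PCA-transformed training data
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5)
print(round(cv_scores.mean(), 3), '+/-', round(cv_scores.std(), 3))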
References:
- https://www.educative.io/edpresso/how-to-check-if-a-key-exists-in-a-python-dictionary
- https://towardsdatascience.com/predicting-house-prices-with-machine-learning-62d5bcd0d68f