Part 1: Data Exploration

  1. Part 1: Data Exploration
  2. Objective:
  3. Import Python Packages:
  4. Import & Clean Data:
    1. Visualise the data
    2. Replacing NaN Values:
    3. Convert Categorical Data:
    4. Data Correlation:
    5. Visualising the output data
  5. PCA Analysis of the data:
  6. References:

I completed the WQU Machine Learning course 3 months ago and wanted to take on some new challenges. As a result, I am exploring this Kaggle competition for leisure, following the approach of a website cited in the references.

Objective:

Predict house prices

Import Python Packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import sklearn as sk

Import & Clean Data:

  • Two data sets are provided: one for testing and the other for training.
  • We import each of the CSV files into a pandas DataFrame and remove any unwanted details.
df_test = pd.read_csv('test.csv')
df_train = pd.read_csv('train.csv')
df_train.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

print(df_train.shape)
print(df_test.shape)
(1460, 81)
(1459, 80)
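The one-column difference between the two shapes is the SalePrice target, which is only present in the training file; a minimal check using the frames loaded above:

# Columns present in the training set but not in the test set (expected: {'SalePrice'})
print(set(df_train.columns) - set(df_test.columns))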

Visualise the data

  • The data contains many NaN values, so it may be unwise to simply drop the affected rows.
  • Below we use a simple function to tally the unique entries in each column that contains NaN values.
  • Since NaN is not a good key for a dictionary, we need a workaround for the possible NaN values.
  • We notice that for the following columns the ‘NaN’ values are so numerous that we do not expect them to contribute significantly to the ML algorithm:
    • Alley
    • FireplaceQu
    • PoolQC
    • Fence
    • MiscFeature
def unique_tally(data):
    '''Returns a dictionary of the unique entries in a data column and their frequencies'''
    isnan = list(data.isnull())
    res = {}
    for i in range(len(data)):
        if isnan[i]:
            key_ = 'NaN'
        else:
            key_ = data[i]
        if key_ in res:
            res[key_] +=1
        else:
            res[key_] = 1
    return res
tallies = []
c = list(df_train.columns)
for col in c:
    tallies.append(unique_tally(df_train[col]))
indx =[]
for k in range(len(tallies)):
    if 'NaN' in tallies[k]:
        indx.append(k)
len(indx)
plt.figure(figsize= (20,30))
for i in range(len(indx)):
    plt.subplot(5,4,i+1)
    plt.bar(range(len(tallies[indx[i]])), list(tallies[indx[i]].values()), align='center', label = c[indx[i]])
    plt.xticks(range(len(tallies[indx[i]])), list(tallies[indx[i]].keys()), rotation=50)
    plt.legend()

[Figure: bar charts of unique-value tallies for each training column containing NaN values]
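For reference, pandas can produce the same tallies without a custom function: value_counts(dropna=False) keeps NaN as its own category. A minimal alternative sketch, shown here for the Alley column:

# Per-column tally including NaN, equivalent to unique_tally above
df_train['Alley'].value_counts(dropna=False)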

# Analyse the training data in a similar way using Pandas functions
df_train.isnull().sum().sort_values(ascending=False)
PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
                 ... 
CentralAir          0
SaleCondition       0
Heating             0
TotalBsmtSF         0
Id                  0
Length: 81, dtype: int64
# Analyse the test data in a similar way using Pandas functions
df_test.isnull().sum().sort_values(ascending=False)
PoolQC         1456
MiscFeature    1408
Alley          1352
Fence          1169
FireplaceQu     730
               ... 
Electrical        0
CentralAir        0
HeatingQC         0
Foundation        0
Id                0
Length: 80, dtype: int64
  • So we will drop the following columns, since more than 50% of their values are ‘NaN’:
    • PoolQC
    • MiscFeature
    • Alley
    • Fence
  • Also, we will drop the ‘Id’ column as it is irrelevant to the calculations
df_train1 = df_train.drop(['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis = 1)
df_test1 = df_test.drop(['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis = 1)

Replacing NaN Values:

  • Clearly, not all ‘NaN’ values need to be discarded.
  • The data columns have various data types and we need to replace these missing values in a consistent manner
  • We do this for both the test and training data
  • We will replace ‘NaN’ values depending on some conditions as follows:
    • If the data in the column is numerical replace NaN with the mean
    • If the data in the column is of string type, replace NaN with modal category
  • We proceed as follows:
def replace_nan(df):
    col = 0
    c = list(df.columns)
    for i in df.dtypes:
        if i in [np.int64, np.float64]:
            df[c[col]]=df[c[col]].fillna(df[c[col]].mean())        
        elif i == object:
            df[c[col]]=df[c[col]].fillna(df[c[col]].mode()[0])
        col+=1
replace_nan(df_train1)
replace_nan(df_test1)
sb.heatmap(df_train1.isnull(),yticklabels=False,cbar=False,cmap='coolwarm')

[Figure: missing-value heatmap for df_train1 after imputation]

sb.heatmap(df_test1.isnull(),yticklabels=False,cbar=False,cmap='coolwarm')

[Figure: missing-value heatmap for df_test1 after imputation]
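As a numeric complement to the heatmaps, the total count of remaining missing values should now be zero in both frames; a quick check, assuming the imputation above has been run:

# Both totals should be 0 after replace_nan
print(df_train1.isnull().sum().sum())
print(df_test1.isnull().sum().sum())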

Convert Categorical Data:

  • All categorical data needs to be converted into numerical categories
  • This will enable the algorithms to understand the data
def category_to_num(df):
    '''Takes in a column of data and determines how many unique values there are.
        Each value is assigned a unique natural number & the column is updated in place.
        Returns the sorted categories.'''
    categs = sorted(list(df.unique()))
    for num in range(len(categs)):
        df.loc[df==categs[num]] = num
    return categs
ci = 0
col = list(df_train1.columns)
categ = {}
for dt in df_train1.dtypes:
    if dt == object:
        categs= category_to_num(df_train1[col[ci]])
        categ[col[ci]] = categs
    ci+=1
df_train1.head()
C:\Users\zmakumbe\.conda\envs\wqu_ml_fin\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 3 65.0 8450 1 3 3 0 4 0 ... 0 0 0 0 0 2 2008 8 4 208500
1 20 3 80.0 9600 1 3 3 0 2 0 ... 0 0 0 0 0 5 2007 8 4 181500
2 60 3 68.0 11250 1 0 3 0 4 0 ... 0 0 0 0 0 9 2008 8 4 223500
3 70 3 60.0 9550 1 0 3 0 0 0 ... 272 0 0 0 0 2 2006 8 0 140000
4 60 3 84.0 14260 1 0 3 0 2 0 ... 0 0 0 0 0 12 2008 8 4 250000

5 rows × 76 columns

ci = 0
col = list(df_test1.columns)
categ = {}
for dt in df_test1.dtypes:
    if dt == object:
        categs= category_to_num(df_test1[col[ci]])
        categ[col[ci]] = categs
    ci+=1
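Because category_to_num returns the sorted category list for each column, the categ dictionary built in the loops above lets us map the integer codes back to the original labels; a small usage sketch:

# Recover the original label behind a numeric code, e.g. code 3 in MSZoning
print(categ['MSZoning'][3])   # 'RL' for this data set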
plt.figure(figsize=(10,5))
y = df_train.SalePrice
sb.set_style('whitegrid')
plt.subplot(121)
sb.distplot(y)

df_train['SalePrice_log'] = np.log(df_train.SalePrice)
y2 = df_train.SalePrice_log
plt.subplot(122)
sb.distplot(y2)
plt.show()
C:\Users\zmakumbe\.conda\envs\wqu_ml_fin\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
C:\Users\zmakumbe\.conda\envs\wqu_ml_fin\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

[Figure: distributions of SalePrice (left) and log-transformed SalePrice (right)]

Data Correlation:

  • It is important to identify any strong interdependencies between the features, and between the features and the sale price.
# Let's explore the correlations in our data set
plt.figure(figsize=(20,20))
sb.heatmap(df_train.corr())
<AxesSubplot:>

[Figure: correlation heatmap of the training data]
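Beyond the heatmap, it is useful to rank the numeric features by their absolute correlation with the target; a quick sketch (newer pandas versions may require corr(numeric_only=True) here):

# Ten columns most correlated (in absolute value) with SalePrice;
# SalePrice itself (and its log) will trivially top the list
corr_with_target = df_train.corr()['SalePrice'].abs().sort_values(ascending=False)
print(corr_with_target.head(10))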

Visualising the output data

  • Next, taking the SalePrice data (which will be our output variable) we plotted its distribution above
  • From those plots we find that the raw data is right-skewed, but the log-transformed data has a much more symmetric distribution (quantified in the quick check below)
  • Such transformations help us avoid having to remove outliers.
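The improvement can be quantified directly with pandas' skew(); a small check, assuming the SalePrice_log column created above:

# Skewness before and after the log transform (values near 0 indicate symmetry)
print(df_train['SalePrice'].skew())       # roughly 1.9: strongly right-skewed
print(df_train['SalePrice_log'].skew())   # roughly 0.1: close to symmetric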

PCA Analysis of the data:

  • Considering how many columns we have, as well as the large number of ‘NaN’ values observed earlier, we expect some columns to be redundant
  • We conduct a Principal Component Analysis (PCA) to determine whether a smaller set of derived components can capture most of the variation in the data
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler 
#Setting the input and output variables
x = df_train1.drop('SalePrice', axis = 1)
y = df_train1['SalePrice']

#Splitting the data into training and testing data for a trial run
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
sc = StandardScaler()

X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test) 
# Applying PCA function on training 
# and testing set of X component 
  
pca = PCA(n_components = 50) 
  
X_train = pca.fit_transform(X_train) 
X_test = pca.transform(X_test) 
  
explained_variance = pca.explained_variance_ratio_
print('%-age of variance explained by the 50 principal components')
np.round(explained_variance*100,1)
%-age of variance explained by the 50 principal components





array([13.7,  5.6,  4.9,  4. ,  3. ,  2.8,  2.4,  2.3,  2.2,  2.1,  2. ,
        2. ,  1.9,  1.8,  1.7,  1.6,  1.6,  1.6,  1.5,  1.5,  1.5,  1.5,
        1.4,  1.4,  1.4,  1.3,  1.3,  1.2,  1.2,  1.2,  1.2,  1.1,  1.1,
        1.1,  1. ,  1. ,  1. ,  0.9,  0.9,  0.9,  0.9,  0.8,  0.8,  0.8,
        0.8,  0.7,  0.7,  0.7,  0.7,  0.7])
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_) )
plt.xticks(np.arange(start=0, stop=len(pca.explained_variance_ratio_), step=1),rotation = 70)
plt.grid()
plt.show()

[Figure: cumulative explained variance against the number of principal components]

  • After conducting the PCA we find (see the quick check after this list) that we need roughly
    • 35 principal components to explain at least 80% of the variation in the data,
    • 40 principal components to explain at least 85% of the variation in the data, and
    • 47 principal components to explain at least 90% of the variation in the data
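These counts can also be read off programmatically from the cumulative explained-variance curve; a small sketch using the fitted pca object from above (the exact counts depend on the unrounded ratios):

# Smallest number of components whose cumulative explained variance reaches each threshold
cum_var = np.cumsum(pca.explained_variance_ratio_)
for threshold in (0.80, 0.85, 0.90):
    n_needed = int(np.searchsorted(cum_var, threshold)) + 1
    print('{:.0%} of the variance: {} components'.format(threshold, n_needed))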
# Create linear regression object 
regr = LinearRegression() 
# Fit 
regr.fit(X_train, y_train) 
# R^2 score on the held-out test split
regr.score(X_test, y_test)
0.6565305727301065
pca = PCA(n_components = 50) 

regr_pca = LinearRegression() 

# Fit 
X_pca_train = pca.fit_transform(X_train) 
X_pca_test = pca.fit_transform(X_test)   # note: this re-fits the PCA on the test split rather than reusing the training projection

regr_pca.fit(X_pca_train, y_train) 
regr.score(X_pca_test, y_test)           # note: this scores the earlier regr model, not regr_pca
#cross_val_score(regr_pca, X_pca_train, y_train).mean()
-1.0756078955375763
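The commented-out cross_val_score line hints at a cleaner way to evaluate this setup: wrapping the scaler, PCA, and regression in a single scikit-learn Pipeline keeps the PCA fitted on training folds only (re-fitting it on the test split, as above, is likely why the score comes out negative) and pairs the right estimator with the score. A hedged sketch, not the code used above:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Scale, project onto 50 principal components, then fit a linear regression,
# evaluated with 5-fold cross-validation on the raw training features
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=50), LinearRegression())
scores = cross_val_score(pca_pipeline, x, y, cv=5)
print(scores.mean())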

References:

  1. https://www.educative.io/edpresso/how-to-check-if-a-key-exists-in-a-python-dictionary
  2. https://towardsdatascience.com/predicting-house-prices-with-machine-learning-62d5bcd0d68f