Analysis on Housing Data

This project utilized an existing dataset ( Ames Housing Data @ Kaggle ) to predict the residential house prices. Through this project, algorithm was developed to reliably estimate the value of residential houses based on fixed features. Characteristics of the houses that the company could cost-effectively change/renovate were identified. Dataset was cleaned and analyzed thoroughly with comprehensive feature engineering. Model was then trained on pre-2010 data and tested its performance using 2010 housing data. The final model performance was characterized and the key features affecting the price were assessed and compared.

The project was splitted into 2 main parts

Part 1 : Estimating the value of homes from fixed characteristics

Fixed characteristics refer to features that would involve major construction of the house

Part 2 : Determine any value of changeable property characteristics unexplained by the fixed ones.

The effects in dollars of the renovate-able features were evaluated. By using the model obtained in Part 1, a review of its appropriateness to be used to assist the decision making if to buy/invest in the property was assessed. The variance in price remaining explainable by those features was therefore used as an indicator to justify such investment potentials.

Part 1 :

Estimating the Value of Houses from Fixed Characteristics

1.1 Data Preparation & Cleaning

Filter raw dataset to have only residential related data

residential = house[house['MSZoning'] !='C (all)']

// quick graphical view to check columns with missing values
fig = plt.figure(figsize=(20,10))
sns.heatmap(residential.isnull(), cmap='viridis', cbar=False, yticklabels=False)

Percentage of missing values for affected columns was calculated and tabulated as below:

Columns with >45% missing values would be dropped in view of any replacement values would be assumption that impacted the accuracy significantly

Statistic of the remaining columns was reviewed in order to decide the best estimated values for those Na values. The review outcomes lead to the following decisions:

for LotFrontage : mean and median were both at ~same value, the missing value to be replaced using median value
for MasVnrArea : since there was only 8 Na and 50% of data is at 0, missing value to be replaced with median value of 0
for MasVnrType : missing values were found corresponding to MasVnrArea. Since majority of data is None and MasVrnArea at 0, missing value to be replaced as None type
for Bsmt related columns : data dictionary showed NA as No Basement. Further confirmation showed corresponding numerical columns with with Bsmt 0, missing value to be replaced with NA type
for Electrical : since majority was under SBrkr, missing value to be replaced as SBrkr. This wouldn’t affect the overall distribution
for Garage related columns : data dictionary showed NA as No Garage. Further confirmation showed corresponding numerical columns with Garage 0 or Na, missing value to be replaced with NA type and GarageYrBlt as 0

After imputation had been done, verification was carried out to check if there was any escapee during the replacement process

// final verification if anymore missing values

fig = plt.figure(figsize=(20,10))
sns.heatmap(residential.isnull(),cbar=False, yticklabels=False, cmap='viridis')

Result showed no more missing value. Data set was clean for further analysis.

1.2 : Identify fixed features for analysis

Fixed features refer to those features of the house that would involve major construction. The final fixed features were screened as below :

fixed_features = data_fix.columns
fixed_features

1.3 : Exploratory Data Analysis

// plot heatmap to check correlation

data_corr = data_fix.corr()
// Set the default matplotlib figure size:
fig, ax = plt.subplots(figsize=(15, 10))
// Generate a mask for the upper triangle (taken from seaborn example gallery)
mask = np.zeros_like(data_corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
// Plot the heatmap with seaborn.
// Assign the matplotlib axis the function returns. This will let us resize the labels.
ax = sns.heatmap(data_corr, mask=mask, ax=ax, cmap='Blues')
// Resize the labels.
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14)
plt.show()
// sns.heatmap(residential.corr(),cmap='Blues')

Visualization of the top 5 highly correlated variables :

1.4 : Modeling

// convert to matrices to prepare training data set with Patsy function

import patsy
y_fix, X_fix = patsy.dmatrices(formula, data=data_fix, return_type='dataframe')

// to confirm all categorical columns have been transformed with Patcy function above

X_fix.head()

	MSZoning[T.RL]	Street[T.Pave]	LotShape[T.Reg]	LandContour[T.Lvl]	LotConfig[T.FR2]	LotConfig[T.Inside]	Neighborhood[T.CollgCr]	Neighborhood[T.Crawfor]	Neighborhood[T.NoRidge]	Neighborhood[T.Veenker]	MasVnrType[T.BrkFace]	MasVnrType[T.None]	Foundation[T.CBlock]	Foundation[T.PConc]	BsmtExposure[T.Gd]	BsmtExposure[T.Mn]	BsmtExposure[T.No]	Functional[T.Typ]	GarageType[T.Attchd]	GarageType[T.Detchd]	GarageFinish[T.RFn]	GarageFinish[T.Unf]	PavedDrive[T.Y]	SaleType[T.WD]	MSSubClass	LotFrontage	LotArea	YearBuilt	YearRemodAdd	MasVnrArea	BsmtUnfSF	TotalBsmtSF	FirstFlrSF	SecondFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	MoSold	YrSold	SalePrice
0	1.0	1.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	60.0	65.0	8450.0	2003.0	2003.0	196.0	150.0	856.0	856.0	854.0	1710.0	1.0	0.0	2.0	1.0	3.0	1.0	8.0	0.0	2003.0	2.0	548.0	0.0	61.0	0.0	2.0	2008.0	208500.0
1	1.0	1.0	1.0	1.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	20.0	80.0	9600.0	1976.0	1976.0	0.0	284.0	1262.0	1262.0	0.0	1262.0	0.0	1.0	2.0	0.0	3.0	1.0	6.0	1.0	1976.0	2.0	460.0	298.0	0.0	0.0	5.0	2007.0	181500.0
2	1.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	60.0	68.0	11250.0	2001.0	2002.0	162.0	434.0	920.0	920.0	866.0	1786.0	1.0	0.0	2.0	1.0	3.0	1.0	6.0	1.0	2001.0	2.0	608.0	0.0	42.0	0.0	9.0	2008.0	223500.0
3	1.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	1.0	70.0	60.0	9550.0	1915.0	1970.0	0.0	540.0	756.0	961.0	756.0	1717.0	1.0	0.0	1.0	0.0	3.0	1.0	7.0	1.0	1998.0	3.0	642.0	0.0	35.0	272.0	2.0	2006.0	140000.0
4	1.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	60.0	84.0	14260.0	2000.0	2000.0	350.0	490.0	1145.0	1145.0	1053.0	2198.0	1.0	0.0	2.0	1.0	4.0	1.0	9.0	1.0	2000.0	3.0	836.0	192.0	84.0	0.0	12.0	2008.0	250000.0

// Split data to train set (before 2010) and test set ( after 2010)

X_fix_train = X_fix[X_fix['YrSold']<2010].drop('SalePrice', axis=1)
X_fix_test = X_fix[X_fix['YrSold']>=2010].drop('SalePrice', axis=1)

y_fix_train = X_fix[X_fix['YrSold']<2010]['SalePrice']
y_fix_test = X_fix[X_fix['YrSold']>=2010]['SalePrice']

// standardize the X variables:

Xs_fix_train = ss.fit_transform(X_fix_train)
Xs_fix_test = ss.fit_transform(X_fix_test)

// modeling with Lasso:

model_lasso.fit(Xs_fix_train, y_fix_train)
predicted_y_fix = model_lasso.predict(Xs_fix_test)

coefficient_fix = model_lasso.coef_

score_fix = model_lasso.score(Xs_fix_test, y_fix_test)
score_fix

'''
0.8803540439252586
'''

Visualization of residual/gap of predicted vs actual sales price and their coefficient differences

residual_fix_data = pd.DataFrame(list(zip(y_fix_test, predicted_y_fix)), columns=['Actual SalePrice', 'Predicted SalePrice'])

sns.lmplot(x='Actual SalePrice', y='Predicted SalePrice', data=residual_fix_data)
plt.title('Prediction of SalePrice with Fix features only')
plt.show()

sns.residplot(x='Actual SalePrice', y='Predicted SalePrice', data=residual_fix_data)
plt.title('Residual Plot of SalePrice with Fix features only')
plt.show()

coefficient_fix_df = pd.DataFrame(list(zip(X_fix_test.columns, coefficient_fix)),columns=['Features','Coefficient'])
coefficient_fix_df.sort_values('Coefficient', ascending=True).plot(x='Features', y='Coefficient', kind='barh', figsize=(15,30))
plt.show()

Partial view of the coeefficient graph

no_of_variables_in_modeling2 = len(coefficient_fix_df[coefficient_fix_df.values != 0])
no_of_variables_in_modeling2

'''
186
'''

MAE_2 = mean_absolute_error(y_fix_test, predicted_y_fix)
MSE_2 = mean_squared_error(y_fix_test, predicted_y_fix)
RMSE_2 = np.sqrt(mean_squared_error(y_fix_test, predicted_y_fix))

print('MAE\t = {}'.format(MAE_2))
print('MSE\t = {}'.format(MSE_2))
print('RMSE\t = {}'.format(RMSE_2))

Part 2 :

Determine any value of changeable property characteristics unexplained by the fixed ones.

Workflow in P1 was repeated, but for part 2 all variables ( fix + nonfix ) were included into calculation.

For modeling, Grid Search on Lasso was used to assist finding the best parameters for price prediction.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

alphas = np.logspace(-5, 5, 20)

model_gs = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas), cv=10, scoring='r2')
model_gs.fit(Xs_mix_train, y_mix_train)

model_gs.best_estimator_.score(Xs_mix_test, y_mix_test)

'''
0.886990270681429
'''

Model score was found very close to earlier score obtained in Part 1, indicating there could be little/no difference caused by renovation effect. Model using LassoCV was finalized and continued with the calculation on cost of renovation effects.

2.1 The effect in dollars of SalePrice with renovate-able features

Relevant information was tabulated to summarize the effect in dollars on SalePrice with renovatable features

cost_effect_comparison = merge_table_mix_fix[['Actual SalePrice_Mix','Predicted SalePrice_Mix','Predicted SalePrice_Fix']].copy()

// add new column to calculate cost of renovation effect:
cost_effect_comparison['Cost of Renovation Effect'] = merge_table_mix_fix['Predicted SalePrice_Mix'] - merge_table_mix_fix['Predicted SalePrice_Fix']

// add new column to calculate the % increase in predicted SalePrice factoring renovate-able features:
cost_effect_comparison['% of Increase due to Renovation'] = cost_effect_comparison['Cost of Renovation Effect']/cost_effect_comparison['Predicted SalePrice_Mix']*100

cost_effect_comparison.sort_values('% of Increase due to Renovation', ascending=False)

// cost of Reno effect = Predicted SalePrice_Mix - Predicted SalePrice_Fix
// Impact of reno effect is presented through % of Increase in SalePrice

Partial view of the tabulated cost of renovation effects :

	Actual SalePrice_Mix	Predicted SalePrice_Mix	Predicted SalePrice_Fix	Cost of Renovation Effect	% of Increase due to Renovation
84	55000.0	61664.957714	42891.117349	18773.840365	30.44
138	164000.0	237093.364080	179208.778886	57884.585194	24.41
77	88000.0	120260.063071	93523.405892	26736.657180	22.23
120	100000.0	122506.492606	96293.063784	26213.428821	21.40
99	155000.0	156570.449387	135831.029298	20739.420089	13.25
147	139000.0	138312.341658	120976.916906	17335.424753	12.53

// Visualize effect on SalePrice with renovate-able features
// Effect on each individual listing could therefore be identified visually

fig = plt.figure(figsize=(20,10))
sns.barplot(x=cost_effect_comparison.index, y=cost_effect_comparison['% of Increase due to Renovation'])

// distribution of the effect of renovate-able features on SalePrice:

fig = plt.figure(figsize=(15,7))
sns.distplot( cost_effect_comparison['% of Increase due to Renovation'])

Outlier to the left( negative effect of renovate-able feature ) could be due to odd/non prime location in which original house condition is also possibly bad causing high renovation cost needed

// Check outlier unit to see if it is tally with the assumption 

residential[residential.index == 160]

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	FirstFlrSF	SecondFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	ThirdSsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
160	161	20	RL	70.0	11120	Pave	IR1	Lvl	AllPub	CulDSac	Gtl	Veenker	Norm	Norm	1Fam	1Story	6	6	1984	1984	Gable	CompShg	Plywood	Plywood	None	0.0	TA	TA	PConc	Gd	TA	No	BLQ	660	Unf	0	572	1232	GasA	TA	Y	SBrkr	1232	0	0	1232	0	0	2	0	3	1	TA	6	Typ	0	Attchd	1984.0	Unf	2	516	TA	TA	Y	0	0	0	0	0	0	0	6	2008	WD	Normal	162500

From data dictionary, it was found that :

20 : 1-STORY 1946 & NEWER ALL STYLES => old house
RL : Residential Low Density => non prime location
IR1 : Slightly irregular => irregular lot size
Lvl : Near Flat/Level
CulDSac : Cul-de-sac
Gtl : Gentle slope
1Fam : Single-family Detached
OverallQual : 6 Above Average => quality is still within the mean quality level of majority

The closest explanation to the outlier point in which the unit had low predicted saleprice even after factoring renovate-able features was due to the house physical condition of being ancient old house with irregular size situated at non-prime location.

2.2 Use of Coefficients for Decision Making : To buy / Not to Buy

For ease of understanding, the coefficients were tabulated and visualized through graphical view

// visualize the coefficient :

coefficient_mix_df = pd.DataFrame(list(zip(X_mix_test.columns, coefficient_mix)),columns=['Features','Coefficient'])
coefficient_mix_df.sort_values('Coefficient', ascending=False)

Partial view of the tabulated coefficients as below :

// graphical view of coefficient differences :

// exclude features with 0 coeeficient value to be displayed on the chart
coefficient_plot = coefficient_mix_df[coefficient_mix_df['Coefficient']!=0]

coefficient_plot.sort_values('Coefficient', ascending=True).plot(x='Features', y='Coefficient', kind='barh', figsize=(15,30))
plt.show()

Partial view of the graphical representation of coefficients as below :

Top 3 highest coefficients could be used as estimators for higher achievable predicted saleprice

GrLivArea: Above grade (ground) living area square feet
OverallQual: Rates the overall material and finish of the house
Neighborhood: Northridge Heights

2.3 Variance in Price Remaining Explainable by Features

[ Which renovate-able features are effective in increasing predicted SalePrice ]

To further identify & finalize real renovatable_features excluding physical conditions:

renovatable_features = ['Condition1[T.Feedr]','Condition1[T.Norm]','Condition1[T.PosA]','Condition1[T.PosN]','Condition1[T.RRAe]','Condition1[T.RRAn]','Condition1[T.RRNe]','Condition1[T.RRNn]','Condition2[T.Feedr]','Condition2[T.Norm]','Condition2[T.PosA]','Condition2[T.PosN]','Condition2[T.RRAe]','Condition2[T.RRAn]','Condition2[T.RRNn]','HouseStyle[T.1.5Unf]','HouseStyle[T.1Story]','HouseStyle[T.2.5Fin]','HouseStyle[T.2.5Unf]','HouseStyle[T.2Story]','HouseStyle[T.SFoyer]','HouseStyle[T.SLvl]','RoofStyle[T.Gable]','RoofStyle[T.Gambrel]','RoofStyle[T.Hip]','RoofStyle[T.Mansard]','RoofStyle[T.Shed]','RoofMatl[T.CompShg]','RoofMatl[T.Membran]','RoofMatl[T.Metal]','RoofMatl[T.Roll]','RoofMatl[T.Tar&Grv]','RoofMatl[T.WdShake]','RoofMatl[T.WdShngl]','Exterior1st[T.AsphShn]','Exterior1st[T.BrkComm]','Exterior1st[T.BrkFace]','Exterior1st[T.CBlock]','Exterior1st[T.CemntBd]','Exterior1st[T.HdBoard]','Exterior1st[T.ImStucc]','Exterior1st[T.MetalSd]','Exterior1st[T.Plywood]','Exterior1st[T.Stone]','Exterior1st[T.Stucco]','Exterior1st[T.VinylSd]','Exterior1st[T.WdSdng]','Exterior1st[T.WdShing]','Exterior2nd[T.AsphShn]','Exterior2nd[T.BrkCmn]','Exterior2nd[T.BrkFace]','Exterior2nd[T.CBlock]','Exterior2nd[T.CmentBd]','Exterior2nd[T.HdBoard]','Exterior2nd[T.ImStucc]','Exterior2nd[T.MetalSd]','Exterior2nd[T.Other]','Exterior2nd[T.Plywood]','Exterior2nd[T.Stone]','Exterior2nd[T.Stucco]','Exterior2nd[T.VinylSd]','Exterior2nd[T.WdSdng]','Exterior2nd[T.WdShng]','MasVnrType[T.BrkFace]','MasVnrType[T.None]','MasVnrType[T.Stone]','ExterQual[T.Fa]','ExterQual[T.Gd]','ExterQual[T.TA]','ExterCond[T.Fa]','ExterCond[T.Gd]','ExterCond[T.Po]','ExterCond[T.TA]','BsmtQual[T.Fa]','BsmtQual[T.Gd]','BsmtQual[T.NA]','BsmtQual[T.TA]','BsmtCond[T.Gd]','BsmtCond[T.NA]','BsmtCond[T.Po]','BsmtCond[T.TA]','BsmtExposure[T.Gd]','BsmtExposure[T.Mn]','BsmtExposure[T.NA]','BsmtExposure[T.No]','BsmtFinType1[T.BLQ]','BsmtFinType1[T.GLQ]','BsmtFinType1[T.LwQ]','BsmtFinType1[T.NA]','BsmtFinType1[T.Rec]','BsmtFinType1[T.Unf]','BsmtFinType2[T.BLQ]','BsmtFinType2[T.GLQ]','BsmtFinType2[T.LwQ]','BsmtFinType2[T.NA]','BsmtFinType2[T.Rec]','BsmtFinType2[T.Unf]','Heating[T.GasA]','Heating[T.GasW]','Heating[T.Grav]','Heating[T.OthW]','Heating[T.Wall]','HeatingQC[T.Fa]','HeatingQC[T.Gd]','HeatingQC[T.Po]','HeatingQC[T.TA]','CentralAir[T.Y]','Electrical[T.FuseF]','Electrical[T.FuseP]','Electrical[T.Mix]','Electrical[T.SBrkr]','KitchenQual[T.Fa]','KitchenQual[T.Gd]','KitchenQual[T.TA]','GarageFinish[T.NA]','GarageFinish[T.RFn]','GarageFinish[T.Unf]','GarageQual[T.Fa]','GarageQual[T.Gd]','GarageQual[T.NA]','GarageQual[T.Po]','GarageQual[T.TA]','GarageCond[T.Fa]','GarageCond[T.Gd]','GarageCond[T.NA]','GarageCond[T.Po]','GarageCond[T.TA]','SaleCondition[T.AdjLand]','SaleCondition[T.Alloca]','SaleCondition[T.Family]','SaleCondition[T.Normal]','SaleCondition[T.Partial]','OverallQual','OverallCond','Id']

df_renovatable_test = X_mix_test_full[renovatable_features]
X_renovateble = df_renovatable_test.drop(['SalePrice','Id'], axis=1)
Xs_renovateble = ss.fit_transform(X_renovateble)
model_lasso.fit(Xs_renovateble, df_renovatable_test['SalePrice'])
renovateable_coef = model_lasso.coef_

After matching the right index corresponding to the coefficients, the effect of the variance was then again tabulated and visualized through graphical view

// visualize the coefficient :

renovateable_coef_df = pd.DataFrame(list(zip(X_renovateble.columns, renovateable_coef)),columns=['Features','Coefficient'])
renovateable_coef_df.sort_values('Coefficient', ascending=False)

Partial view of the tabulated coefficients as below :

// graphical view of coefficient differences :

// exclude features with 0 coefficient value to be displayed on the chart
coefficient_plot_renovateable_postive = renovateable_coef_df[renovateable_coef_df['Coefficient']!=0]

coefficient_plot_renovateable_postive.sort_values('Coefficient', ascending=True).plot(x='Features', y='Coefficient', kind='barh', figsize=(15,10))
plt.show()

Graph above showed the Top 3 estimators of renovatable features were :

OverallQual: Rates the overall material and finish of the house
BsmtExposure: Gd Good Exposure
SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)

reno_feature_positive_increase_SP['OverallQual'].value_counts()

'''
6.0    30
5.0    18
7.0    17
8.0    12
9.0     4
4.0     3
Name: OverallQual, dtype: int64
'''

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
ax0, ax1 = axes.flatten()

sns.distplot(reno_feature_positive_increase_SP['OverallQual'], ax=ax0)
sns.distplot(reno_feature_positive_increase_SP['OverallCond'], ax=ax1)

As data at this point already filtered down to only those with positive increase in SalePrice due to renovation, The plot above showed that :

Overall Qual shall be 6 and above
Overall Condition shall be 5 and above

A sub-dataframe was formed in order to compare renovatable features vs total no. of units with positive increase in price due to that feature

count_plot_data = reno_feature_positive_increase_SP[renovatable_features].drop(['OverallQual','OverallCond','Id', 'SalePrice'], axis=1)

key = []
count_sum_positive_renoEffect = []
for col in count_plot_data.columns:
    key.append(col)
    count_sum_positive_renoEffect.append(count_plot_data[col].values.sum())

count_plot_df = pd.DataFrame()
count_plot_df['Feature'] = key
count_plot_df['Total_unit_positive_reno_effect'] = count_sum_positive_renoEffect

count_plot_df.sort_values(['Total_unit_positive_reno_effect'], ascending=False)

Partial view of the tabulated sub-dataframe as below :

effective_reno_features_positive_predictedSP = count_plot_df[count_plot_df['Total_unit_positive_reno_effect']!=0]
effective_reno_features_positive_predictedSP.sort_values('Total_unit_positive_reno_effect', ascending=True).plot(x='Feature', y='Total_unit_positive_reno_effect', kind='barh', figsize=(15,50))
plt.show()

Partial view of the renovatable features as below :

Top 3 estimators of renovatable features were found as :

1) OverallQual: Rates the overall material and finish of the house

[OverallQual shall be 6 and above ]
[OverallCondition shall be 5 and above]

2) BsmtExposure: Refers to walkout or garden level walls

[Good Exposure]

3) SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)

[Normal Sale]

Others renovatable features to consider :

Heating : Gas forced warm air furnace
Condition2 : Normal proximity to various conditions (if more than one is present)
RoofMatl: Roof material Standard (Composite) Shingle

2.4 Model Robustness :

Should it be used to evaluate which properties to buy and fix up?

Key tables for both Fix features and Mix features were reviewed again to have an overview for commentary purpose

Mean_Prediction_Gap_Mix = np.mean(merge_table_mix_fix['Absolute Prediction Gap in $_Mix'])
Mean_Prediction_Gap_Fix = np.mean(merge_table_mix_fix['Absolute Prediction Gap in $_Fix'])
Mean_Cost_of_reno_effect = np.mean([abs(v) for v in cost_effect_comparison['Cost of Renovation Effect']])

print('Mean_Prediction_Gap_Mix \t:{}'.format(Mean_Prediction_Gap_Mix))
print('Mean_Prediction_Gap_Fix \t:{}\n'.format(Mean_Prediction_Gap_Fix))
print('Absolute value of renovation effect [Mean_Cost_of_reno_effect]')
print(Mean_Cost_of_reno_effect)

On average, the renovation didn’t really add into higher saleprice justified based on fix features. Fix features were found still the key factors determining the SalePrice

print('Model_used \t\t\t: LassoCV')
print('Score_achived \t\t\t: {}'.format(score_fix))
print('No_of_variables_involved \t: {}'.format(no_of_variables_in_modeling2))

Based on the score, it was considered moderate enough to be used to assist decision making on investment potential. However, there were 186 variables needed in order to achive this score. Data completeness, data quality and data integrity were therefore very important and critical for model performance and accuracy. It was thus advisable to use it as reference and continue monitoring, timely adjustment would still be needed.

Project Summary :

Part 1 : Home Value Prediction based on Fixed Characteristics

'2nd model_all fix features :\nscore = {}\t no_of_feature = {}\t MAE:{}\t MSE:{}\t RMSE: {}'.format(score_fix, no_of_variables_in_modeling2, MAE_2, MSE_2, RMSE_2))
print('Top 2 estimators :\n1. GrLivArea: Above grade (ground) living area square feet; 2. Neighborhood[T.NridgHt]: Northridge Heights')

The model built for home value prediction was using Lasso CV with score of 0.88. It involved 186 feautes with various mean errors as shown above. Top 2 feature estimators were found as Ground Living Area and Neighborhood at Northridge Heights

Part 2 : Validation on value of changeable property characteristics unexplained by the fixed ones.

Top 3 highest coefficients could be used as estimators for higher achievable predicted saleprice as below

GrLivArea: Above grade (ground) living area square feet
OverallQual: Rates the overall material and finish of the house
Neighborhood: Northridge Heights

However, for cost of effects from renovatable features, Tope 3 estimators were found to be:

OverallQual: Rates the overall material and finish of the house
- in which [OverallQual shall be 6 and above ] and [OverallCondition shall be 5 and above]
BsmtExposure: Refers to walkout or garden level walls
- in which condition should be rated as Gd ( Good Exposure )
SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)
- in which condition to be at least as Normal Sale type

Others renovatable features to consider including the followings :

Heating : To be with Gas forced warm air furnace
Condition2 : Normal proximity to various conditions (if more than one is present)
RoofMatl: To have standard ( composite ) Shingle roof material

For return of investment on renovatable features, the difference between fix features vs renovatable features were found as below :

The renovation didn’t really add into higher saleprice justified based on fix features. Fix features were found still the key factors determining the SalePrice

As for overall model performance :

it was considered moderate enough to be used to assist decision making on investment potential. However, there were 186 variables needed in order to achive this score. Data completeness, data quality and data integrity were therefore very important and critical for model performance and accuracy. It was thus advisable to use it as reference and continue monitoring, timely adjustment would still be needed.