This project utilized an existing dataset ( Ames Housing Data @ Kaggle ) to predict the residential house prices. Through this project, algorithm was developed to reliably estimate the value of residential houses based on fixed features. Characteristics of the houses that the company could cost-effectively change/renovate were identified. Dataset was cleaned and analyzed thoroughly with comprehensive feature engineering. Model was then trained on pre-2010 data and tested its performance using 2010 housing data. The final model performance was characterized and the key features affecting the price were assessed and compared.
The project was splitted into 2 main parts
Part 1 : Estimating the value of homes from fixed characteristics
- Fixed characteristics refer to features that would involve major construction of the house
Part 2 : Determine any value of changeable property characteristics unexplained by the fixed ones.
- The effects in dollars of the renovate-able features were evaluated. By using the model obtained in Part 1, a review of its appropriateness to be used to assist the decision making if to buy/invest in the property was assessed. The variance in price remaining explainable by those features was therefore used as an indicator to justify such investment potentials.
Part 1 :
Estimating the Value of Houses from Fixed Characteristics
1.1 Data Preparation & Cleaning
- Filter raw dataset to have only residential related data
Percentage of missing values for affected columns was calculated and tabulated as below:
Columns with >45% missing values would be dropped in view of any replacement values would be assumption that impacted the accuracy significantly
Statistic of the remaining columns was reviewed in order to decide the best estimated values for those Na values. The review outcomes lead to the following decisions:
- for LotFrontage : mean and median were both at ~same value, the missing value to be replaced using median value
- for MasVnrArea : since there was only 8 Na and 50% of data is at 0, missing value to be replaced with median value of 0
- for MasVnrType : missing values were found corresponding to MasVnrArea. Since majority of data is None and MasVrnArea at 0, missing value to be replaced as None type
- for Bsmt related columns : data dictionary showed NA as No Basement. Further confirmation showed corresponding numerical columns with with Bsmt 0, missing value to be replaced with NA type
- for Electrical : since majority was under SBrkr, missing value to be replaced as SBrkr. This wouldn’t affect the overall distribution
- for Garage related columns : data dictionary showed NA as No Garage. Further confirmation showed corresponding numerical columns with Garage 0 or Na, missing value to be replaced with NA type and GarageYrBlt as 0
After imputation had been done, verification was carried out to check if there was any escapee during the replacement process
Result showed no more missing value. Data set was clean for further analysis.
1.2 : Identify fixed features for analysis
Fixed features refer to those features of the house that would involve major construction. The final fixed features were screened as below :
1.3 : Exploratory Data Analysis
Visualization of the top 5 highly correlated variables :
1.4 : Modeling
MSZoning[T.RH] | MSZoning[T.RL] | MSZoning[T.RM] | Street[T.Pave] | LotShape[T.IR2] | LotShape[T.IR3] | LotShape[T.Reg] | LandContour[T.HLS] | LandContour[T.Low] | LandContour[T.Lvl] | Utilities[T.NoSeWa] | LotConfig[T.CulDSac] | LotConfig[T.FR2] | LotConfig[T.FR3] | LotConfig[T.Inside] | LandSlope[T.Mod] | LandSlope[T.Sev] | Neighborhood[T.Blueste] | Neighborhood[T.BrDale] | Neighborhood[T.BrkSide] | Neighborhood[T.ClearCr] | Neighborhood[T.CollgCr] | Neighborhood[T.Crawfor] | Neighborhood[T.Edwards] | Neighborhood[T.Gilbert] | Neighborhood[T.IDOTRR] | Neighborhood[T.MeadowV] | Neighborhood[T.Mitchel] | Neighborhood[T.NAmes] | Neighborhood[T.NPkVill] | Neighborhood[T.NWAmes] | Neighborhood[T.NoRidge] | Neighborhood[T.NridgHt] | Neighborhood[T.OldTown] | Neighborhood[T.SWISU] | Neighborhood[T.Sawyer] | Neighborhood[T.SawyerW] | Neighborhood[T.Somerst] | Neighborhood[T.StoneBr] | Neighborhood[T.Timber] | Neighborhood[T.Veenker] | BldgType[T.2fmCon] | BldgType[T.Duplex] | BldgType[T.Twnhs] | BldgType[T.TwnhsE] | MasVnrType[T.BrkFace] | MasVnrType[T.None] | MasVnrType[T.Stone] | Foundation[T.CBlock] | Foundation[T.PConc] | Foundation[T.Slab] | Foundation[T.Stone] | Foundation[T.Wood] | BsmtExposure[T.Gd] | BsmtExposure[T.Mn] | BsmtExposure[T.NA] | BsmtExposure[T.No] | Functional[T.Maj2] | Functional[T.Min1] | Functional[T.Min2] | Functional[T.Mod] | Functional[T.Sev] | Functional[T.Typ] | GarageType[T.Attchd] | GarageType[T.Basment] | GarageType[T.BuiltIn] | GarageType[T.CarPort] | GarageType[T.Detchd] | GarageType[T.NA] | GarageFinish[T.NA] | GarageFinish[T.RFn] | GarageFinish[T.Unf] | PavedDrive[T.P] | PavedDrive[T.Y] | SaleType[T.CWD] | SaleType[T.Con] | SaleType[T.ConLD] | SaleType[T.ConLI] | SaleType[T.ConLw] | SaleType[T.New] | SaleType[T.Oth] | SaleType[T.WD] | MSSubClass | LotFrontage | LotArea | YearBuilt | YearRemodAdd | MasVnrArea | BsmtUnfSF | TotalBsmtSF | FirstFlrSF | SecondFlrSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | ThirdSsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | 65.0 | 8450.0 | 2003.0 | 2003.0 | 196.0 | 150.0 | 856.0 | 856.0 | 854.0 | 1710.0 | 1.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1.0 | 8.0 | 0.0 | 2003.0 | 2.0 | 548.0 | 0.0 | 61.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2008.0 | 208500.0 |
1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 20.0 | 80.0 | 9600.0 | 1976.0 | 1976.0 | 0.0 | 284.0 | 1262.0 | 1262.0 | 0.0 | 1262.0 | 0.0 | 1.0 | 2.0 | 0.0 | 3.0 | 1.0 | 6.0 | 1.0 | 1976.0 | 2.0 | 460.0 | 298.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 181500.0 |
2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | 68.0 | 11250.0 | 2001.0 | 2002.0 | 162.0 | 434.0 | 920.0 | 920.0 | 866.0 | 1786.0 | 1.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1.0 | 6.0 | 1.0 | 2001.0 | 2.0 | 608.0 | 0.0 | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 2008.0 | 223500.0 |
3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 70.0 | 60.0 | 9550.0 | 1915.0 | 1970.0 | 0.0 | 540.0 | 756.0 | 961.0 | 756.0 | 1717.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 7.0 | 1.0 | 1998.0 | 3.0 | 642.0 | 0.0 | 35.0 | 272.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2006.0 | 140000.0 |
4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | 84.0 | 14260.0 | 2000.0 | 2000.0 | 350.0 | 490.0 | 1145.0 | 1145.0 | 1053.0 | 2198.0 | 1.0 | 0.0 | 2.0 | 1.0 | 4.0 | 1.0 | 9.0 | 1.0 | 2000.0 | 3.0 | 836.0 | 192.0 | 84.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 2008.0 | 250000.0 |
Visualization of residual/gap of predicted vs actual sales price and their coefficient differences
Partial view of the coeefficient graph
Part 2 :
Determine any value of changeable property characteristics unexplained by the fixed ones.
Workflow in P1 was repeated, but for part 2 all variables ( fix + nonfix ) were included into calculation.
For modeling, Grid Search on Lasso was used to assist finding the best parameters for price prediction.
Model score was found very close to earlier score obtained in Part 1, indicating there could be little/no difference caused by renovation effect. Model using LassoCV was finalized and continued with the calculation on cost of renovation effects.
2.1 The effect in dollars of SalePrice with renovate-able features
Relevant information was tabulated to summarize the effect in dollars on SalePrice with renovatable features
Partial view of the tabulated cost of renovation effects :
Actual SalePrice_Mix | Predicted SalePrice_Mix | Predicted SalePrice_Fix | Cost of Renovation Effect | % of Increase due to Renovation | |
---|---|---|---|---|---|
84 | 55000.0 | 61664.957714 | 42891.117349 | 18773.840365 | 30.44 |
138 | 164000.0 | 237093.364080 | 179208.778886 | 57884.585194 | 24.41 |
77 | 88000.0 | 120260.063071 | 93523.405892 | 26736.657180 | 22.23 |
120 | 100000.0 | 122506.492606 | 96293.063784 | 26213.428821 | 21.40 |
99 | 155000.0 | 156570.449387 | 135831.029298 | 20739.420089 | 13.25 |
147 | 139000.0 | 138312.341658 | 120976.916906 | 17335.424753 | 12.53 |
Outlier to the left( negative effect of renovate-able feature ) could be due to odd/non prime location in which original house condition is also possibly bad causing high renovation cost needed
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | FirstFlrSF | SecondFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | ThirdSsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
160 | 161 | 20 | RL | 70.0 | 11120 | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | Veenker | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1984 | 1984 | Gable | CompShg | Plywood | Plywood | None | 0.0 | TA | TA | PConc | Gd | TA | No | BLQ | 660 | Unf | 0 | 572 | 1232 | GasA | TA | Y | SBrkr | 1232 | 0 | 0 | 1232 | 0 | 0 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 0 | Attchd | 1984.0 | Unf | 2 | 516 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2008 | WD | Normal | 162500 |
From data dictionary, it was found that :
- 20 : 1-STORY 1946 & NEWER ALL STYLES => old house
- RL : Residential Low Density => non prime location
- IR1 : Slightly irregular => irregular lot size
- Lvl : Near Flat/Level
- CulDSac : Cul-de-sac
- Gtl : Gentle slope
- 1Fam : Single-family Detached
- OverallQual : 6 Above Average => quality is still within the mean quality level of majority
The closest explanation to the outlier point in which the unit had low predicted saleprice even after factoring renovate-able features was due to the house physical condition of being ancient old house with irregular size situated at non-prime location.
2.2 Use of Coefficients for Decision Making : To buy / Not to Buy
For ease of understanding, the coefficients were tabulated and visualized through graphical view
Partial view of the tabulated coefficients as below :
Partial view of the graphical representation of coefficients as below :
Top 3 highest coefficients could be used as estimators for higher achievable predicted saleprice
- GrLivArea: Above grade (ground) living area square feet
- OverallQual: Rates the overall material and finish of the house
- Neighborhood: Northridge Heights
2.3 Variance in Price Remaining Explainable by Features
[ Which renovate-able features are effective in increasing predicted SalePrice ]
To further identify & finalize real renovatable_features excluding physical conditions:
After matching the right index corresponding to the coefficients, the effect of the variance was then again tabulated and visualized through graphical view
Partial view of the tabulated coefficients as below :
Graph above showed the Top 3 estimators of renovatable features were :
- OverallQual: Rates the overall material and finish of the house
- BsmtExposure: Gd Good Exposure
- SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)
As data at this point already filtered down to only those with positive increase in SalePrice due to renovation, The plot above showed that :
- Overall Qual shall be 6 and above
- Overall Condition shall be 5 and above
A sub-dataframe was formed in order to compare renovatable features vs total no. of units with positive increase in price due to that feature
Partial view of the tabulated sub-dataframe as below :
Partial view of the renovatable features as below :
Top 3 estimators of renovatable features were found as :
1) OverallQual: Rates the overall material and finish of the house
- [OverallQual shall be 6 and above ]
- [OverallCondition shall be 5 and above]
2) BsmtExposure: Refers to walkout or garden level walls
- [Good Exposure]
3) SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)
- [Normal Sale]
Others renovatable features to consider :
- Heating : Gas forced warm air furnace
- Condition2 : Normal proximity to various conditions (if more than one is present)
- RoofMatl: Roof material Standard (Composite) Shingle
2.4 Model Robustness :
Should it be used to evaluate which properties to buy and fix up?
Key tables for both Fix features and Mix features were reviewed again to have an overview for commentary purpose
On average, the renovation didn’t really add into higher saleprice justified based on fix features. Fix features were found still the key factors determining the SalePrice
Based on the score, it was considered moderate enough to be used to assist decision making on investment potential. However, there were 186 variables needed in order to achive this score. Data completeness, data quality and data integrity were therefore very important and critical for model performance and accuracy. It was thus advisable to use it as reference and continue monitoring, timely adjustment would still be needed.
Project Summary :
Part 1 : Home Value Prediction based on Fixed Characteristics
The model built for home value prediction was using Lasso CV with score of 0.88. It involved 186 feautes with various mean errors as shown above. Top 2 feature estimators were found as Ground Living Area and Neighborhood at Northridge Heights
Part 2 : Validation on value of changeable property characteristics unexplained by the fixed ones.
Top 3 highest coefficients could be used as estimators for higher achievable predicted saleprice as below
- GrLivArea: Above grade (ground) living area square feet
- OverallQual: Rates the overall material and finish of the house
- Neighborhood: Northridge Heights
However, for cost of effects from renovatable features, Tope 3 estimators were found to be:
- OverallQual: Rates the overall material and finish of the house
- in which [OverallQual shall be 6 and above ] and [OverallCondition shall be 5 and above]
- BsmtExposure: Refers to walkout or garden level walls
- in which condition should be rated as Gd ( Good Exposure )
- SaleCondition: Partial Home was not completed when last assessed (associated with New Homes)
- in which condition to be at least as Normal Sale type
Others renovatable features to consider including the followings :
- Heating : To be with Gas forced warm air furnace
- Condition2 : Normal proximity to various conditions (if more than one is present)
- RoofMatl: To have standard ( composite ) Shingle roof material
For return of investment on renovatable features, the difference between fix features vs renovatable features were found as below :
The renovation didn’t really add into higher saleprice justified based on fix features. Fix features were found still the key factors determining the SalePrice
As for overall model performance :
it was considered moderate enough to be used to assist decision making on investment potential. However, there were 186 variables needed in order to achive this score. Data completeness, data quality and data integrity were therefore very important and critical for model performance and accuracy. It was thus advisable to use it as reference and continue monitoring, timely adjustment would still be needed.