The purpose of this project is to develop a model for the Sale Price of a home in Ames, Iowa. We will use the other variables in the data set to help us predict housing prices.
The dataset used in the analysis can be downloaded from CRAN.
In the analysis, we will touch on concepts such as exploratory data analysis, data preprocessing, model selection, and model diagnostics.
The analysis was done in R; you will need the following packages to run the code.
1.) MASS
2.) ggplot2
3.) Sleuth2
# Install (once) and load the required packages:
install.packages(c("MASS", "ggplot2", "Sleuth2"))
library(MASS)
library(ggplot2)
library(Sleuth2)
There are a lot of variables in this data set. One thing I always like to do is look at the structure and summary of the data. Doing this lets me see how many missing (NA) values are in the data set and the data types I will be working with.
# Execute Summary and Structure of Data:
summary(data)
str(data)
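Beyond `summary()` and `str()`, a direct way to count the missing values per column is `colSums(is.na(...))`. A minimal sketch on a made-up data frame (`demo` is not part of the Ames data; with the real data you would pass the Ames data frame instead):

```r
# Count missing (NA) values per column.
# `demo` is a tiny made-up data frame standing in for the Ames data.
demo <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

na.counts <- colSums(is.na(demo))  # one NA count per column
print(na.counts)
```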
It's good practice to plot each of our independent variables against our dependent variable SalePrice so we can see whether there is any correlation between the two. This can also help us eliminate variables right away if we see no correlation at all.
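As a quick screen before plotting, we can also compute the correlation of each numeric predictor with SalePrice. The sketch below uses simulated stand-in data; the column names `Gr.Liv.Area` and `Yr.Sold` merely mimic the Ames naming conventions:

```r
# Simulated stand-in for the Ames data (column names are assumptions).
set.seed(1)
demo <- data.frame(
  Gr.Liv.Area = runif(200, 500, 3000),                 # strong predictor
  Yr.Sold     = sample(2006:2010, 200, replace = TRUE) # pure noise here
)
demo$SalePrice <- 60 * demo$Gr.Liv.Area + rnorm(200, sd = 10000)

# Correlation of every predictor with SalePrice
cors <- sapply(setdiff(names(demo), "SalePrice"),
               function(v) cor(demo[[v]], demo$SalePrice))
print(round(cors, 2))

# Scatter plot of the strongest candidate against SalePrice
plot(demo$Gr.Liv.Area, demo$SalePrice,
     xlab = "Gr.Liv.Area", ylab = "SalePrice")
```

A predictor with near-zero correlation (like `Yr.Sold` in this simulation) is a candidate for early removal.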
We want to split our data into train and test sets; for more information on this, please refer to Train/Test_Split.
### Split Training Set 80/20
set.seed(123)  # assumed seed so the split is reproducible
train <- sample(2258, 1800)  # 1800 of 2258 rows (~80%) for training
test <- c(1:2258)[-train]    # remaining 458 rows for testing
There are many different strategies one can utilize when trying to determine the best predictors for our dependent variable SalePrice. You could use:
1.) forward stepwise regression
2.) best subset
3.) backwards elimination
and many more.
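As a rough illustration of one of these strategies, backward elimination can be sketched as a loop that refits the model and drops the least-significant predictor. The data and variable names below are simulated for illustration, not the Ames columns:

```r
# Sketch of backward elimination by p-value on simulated data:
# refit, and drop the least-significant predictor while any
# coefficient p-value exceeds 0.05.
set.seed(10)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(300)  # x3 carries no signal

predictors <- c("x1", "x2", "x3")
repeat {
  fit <- lm(reformulate(predictors, "y"), data = d)
  cm <- summary(fit)$coefficients
  pvals <- cm[rownames(cm) != "(Intercept)", "Pr(>|t|)"]
  if (all(pvals <= 0.05)) break
  # drop the predictor with the largest p-value and refit
  predictors <- setdiff(predictors, names(which.max(pvals)))
}
print(predictors)  # the noise variable is typically eliminated
```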
For this specific demonstration, I'll look at the p-value for each coefficient. If the p-value is greater than 0.05, I will remove the variable from the model and rerun it, repeating until all remaining variables are statistically significant. After executing this process, my final model with continuous variables only is the following:
Some diagnostic plots and checks we can look at are the residuals vs. fitted plot, normality checks on the model, and the Shapiro-Wilk test. For this project, I will not go into the breakdown of each of these diagnostic checks, but I will produce a future project going more in depth on this topic.
For now, I will say that we want the variance in our residuals vs. fitted plot to be constant. We can see here that the variance is constantly changing. One way to try to fix this is to use a Box-Cox transformation on our data.
# Box-Cox (from MASS): find a variance-stabilizing power transformation for SalePrice
boxcox(SalePrice~Overall.Qual + Year.Built + Year.Remod.Add + BsmtFin.SF.1 + Total.Bsmt.SF + X1st.Flr.SF + Gr.Liv.Area + TotRms.AbvGrd +Garage.Yr.Blt + Wood.Deck.SF, data = num.ames)
The output below shows that our lambda value is close to zero. Therefore, we will take the log transformation of our dependent variable SalePrice.
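To see why the log helps, here is a sketch on simulated data with multiplicative errors: the raw fit's residual spread grows with the fitted values, while the log-scale fit's does not. All names and numbers below are made up for illustration:

```r
# Simulated prices with multiplicative (log-normal) errors, so the
# raw-scale fit is heteroscedastic and the log-scale fit is not.
set.seed(7)
area  <- runif(300, 500, 3000)
price <- 2000 * exp(6e-4 * area + rnorm(300, sd = 0.2))

fit.raw <- lm(price ~ area)       # variance grows with fitted values
fit.log <- lm(log(price) ~ area)  # variance roughly constant

# Compare how strongly residual magnitude tracks the fitted values
cor(abs(resid(fit.raw)), fitted(fit.raw))  # noticeably positive
cor(abs(resid(fit.log)), fitted(fit.log))  # near zero
```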
While we can still see clusters of data points in some portions of the output, we can see that the variance of our model looks much better after taking the log transformation.
Using the anova function in R, I will add one categorical variable at a time to the numeric-only model until all of the variables remaining in my model are statistically significant.
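This comparison can be sketched as a nested-model F-test: fit the numeric-only model, add one categorical variable, and pass both fits to `anova()`. The data below are simulated, and `neigh` is a made-up three-level factor rather than a real Ames column:

```r
# Simulated log-price data where a categorical variable has a real effect.
set.seed(3)
d <- data.frame(
  area  = runif(300, 500, 3000),
  neigh = factor(sample(c("A", "B", "C"), 300, replace = TRUE))
)
shift <- c(A = 0, B = 0.3, C = -0.2)  # per-level effect (made up)
d$logprice <- 11 + 5e-4 * d$area + shift[as.character(d$neigh)] +
  rnorm(300, sd = 0.1)

fit.num <- lm(logprice ~ area, data = d)          # numeric-only model
fit.cat <- lm(logprice ~ area + neigh, data = d)  # + categorical term

# F-test: does adding neigh significantly improve the fit?
anova(fit.num, fit.cat)
```

A small p-value in the `Pr(>F)` column means the categorical variable stays in the model; otherwise it is dropped and the next candidate is tried.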
After running the anova function, the following is my final model and predictors:
The variance in our residuals vs fitted plot looks consistent in our final model.
We can see some skewness in our normal Q-Q plot, but overall our model looks good when testing normality.
The final model yields the following lower and upper bounds for the housing price prediction:
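Interval predictions like these come from `predict()` with `interval = "prediction"`; since the final model is on the log scale, the bounds are exponentiated back to dollars. A sketch on simulated data (the real model uses the Ames predictors listed above):

```r
# Fit a log-scale model on simulated data, then produce a 95% prediction
# interval for a new home and convert it back to the price scale.
set.seed(5)
area     <- runif(300, 500, 3000)
logprice <- 11 + 5e-4 * area + rnorm(300, sd = 0.15)
fit <- lm(logprice ~ area)

new.home <- data.frame(area = 1500)  # hypothetical new observation
ci <- predict(fit, newdata = new.home,
              interval = "prediction", level = 0.95)
exp(ci)  # lower and upper bounds in dollars
```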