The objective of this project is to use several classification methods to build a model that predicts the variable "RainTomorrow", which equals "Yes" if it rained the next day and "No" if it did not.
You can get the dataset used in the analysis by downloading it from my GitHub website.
The analysis was done in R; you will need the following packages to run the code.
- library(ROCR)
- library(pROC)
- library(rpart)
- library(rpart.plot)
- library(lattice)
- library(naivebayes)
- library(nnet)
- library(NeuralNetTools)
If any of these are not installed, install them first with install.packages(), for example:
- install.packages("ROCR")
- install.packages("rpart")
I'm going to build four separate classification models using the following methods:
- Logistic Regression
- Decision Tree Classification
- Naive Bayes Classification
- Neural Network Classification
The first model I will build is a logistic regression model. I will use forward stepwise regression to determine which features are most important for the model. The final model will be chosen based on AIC: we want the model with the lowest AIC value. For more information on forward stepwise selection and AIC, see the following link: Stepwise Regression
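The forward stepwise search can be sketched with R's built-in step() function, which adds one predictor at a time and keeps the move that lowers AIC the most. (The data frame name `weather` and the use of all remaining columns as candidate predictors are assumptions; substitute your own training data.)

```r
# Forward stepwise logistic regression by AIC -- a minimal sketch.
# `weather` is an assumed training data frame with RainTomorrow as a factor.
null_model <- glm(RainTomorrow ~ 1, data = weather, family = binomial)
full_model <- glm(RainTomorrow ~ ., data = weather, family = binomial)

# step() adds terms from `lower` toward `upper`, minimizing AIC at each step.
step_model <- step(null_model,
                   scope = list(lower = null_model, upper = full_model),
                   direction = "forward")
summary(step_model)  # the optimal model reported below
```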
The following is the optimal model output for the logistic regression model:
Optimal Model Based On Best AIC:
Logistic Regression Model Output
Confusion Matrix:
ROC Plot:
False Positive Rate Plot:
False Negative Rate Plot:
Calculating:
- False Positive Rate
- False Negative Rate
- Overall Error Rate
- Sensitivity
- Specificity
- FPR = 397/(1420+397)*100
- False Positive Rate ≈ 22%
- FNR = 109/(109+452)*100
- False Negative Rate ≈ 19%
- Total = 1420+109+397+452
- Overall_Error_Rate = ((397+109)/Total)*100
- Overall_Error_Rate ≈ 21%
Sensitivity = 1 - 0.19 = 0.81, so about 81% of the time these weather stations are correctly predicting rain tomorrow.
Specificity = 1 - 0.22 = 0.78, so about 78% of the time these weather stations are correctly predicting when it will not rain tomorrow.
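The metrics can be recomputed directly from the confusion matrix cell counts shown above (TN = 1420, FP = 397, FN = 109, TP = 452), which is a useful sanity check on the hand arithmetic:

```r
# Error metrics from the logistic model's confusion matrix cell counts.
tn <- 1420; fp <- 397; fn <- 109; tp <- 452

fpr  <- fp / (fp + tn)                   # false positive rate
fnr  <- fn / (fn + tp)                   # false negative rate
err  <- (fp + fn) / (tn + fp + fn + tp)  # overall error rate
sens <- 1 - fnr                          # sensitivity (true positive rate)
spec <- 1 - fpr                          # specificity (true negative rate)

round(c(FPR = fpr, FNR = fnr, Error = err,
        Sensitivity = sens, Specificity = spec), 3)
```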
ROC Curve and Area Under Curve:
Based on the above plots, I believe a good cutoff point is 0.02, as the curves start to flatten around that value.
Next we will check the AUC (area under the curve) for our model. We want the ROC curve to hug the top-left corner as closely as possible. A good rule of thumb is that an AUC above 0.7 indicates the model is good enough. Our AUC is 88%, so we have a good enough model.
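The ROC curve and AUC can be produced with the pROC package from the package list above. (The object names `test` and `pred_prob` are assumptions; `pred_prob` holds the model's predicted probabilities on the test set.)

```r
library(pROC)

# ROC curve and AUC for the logistic model -- a sketch on assumed objects.
roc_obj <- roc(test$RainTomorrow, pred_prob)
plot(roc_obj)   # a curve hugging the top-left corner indicates a better model
auc(roc_obj)    # rule of thumb used here: AUC > 0.7 is good enough
```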
Now I will build the second classification model using decision trees.
First Tree:
Model CP Error Output:
The column labeled “xerror” is the cross-validation error for different subsets of the tree. What we are looking for is a low cross validation error. The column labeled “xstd” gives the standard errors of the cross validation errors. This column can be used to get an idea of how much the values in the “xerror” column could reasonably vary.
Looking at the xerror column, we can see that xerror starts to level out at the third value, 0.76981, which is not much worse than the minimum of 0.70513. Thus the tree corresponding to the third row gives us nearly the lowest cross-validation error with the least complicated model, using a cp of 0.017337. We will use this cp to prune our tree, as we do not want to overfit the model.
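The fit-inspect-prune sequence looks roughly like this with rpart (the training data frame name `train` is an assumption; the cp value is the one chosen from the table above):

```r
library(rpart)
library(rpart.plot)

# Fit the full tree, inspect the cross-validation (cp) table, then prune.
tree_fit <- rpart(RainTomorrow ~ ., data = train, method = "class")
printcp(tree_fit)   # prints the CP / xerror / xstd table discussed above

# Prune at the cp chosen from the third row of the table.
pruned <- prune(tree_fit, cp = 0.017337)
rpart.plot(pruned)
```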
Tree Model 1 Summary Statistics:
Prune Tree:
Error Rate Prune Tree:
Now we will look at the cutoff points for our predictions, then compute the error metrics using the confusion matrix:
ROC Plot
False Positive Rate Plot:
False Negative Rate Plot:
- fpr = 384/(1474+384)
- False_Positive_Rate = 21%
- fnr = 165/(165+355)
- False_Negative_Rate = 32%
- Overall_Error_Rate = (384+165)/2378
- Overall_Error_Rate = 23%
- Sensitivity = 1 - False_Negative_Rate
- Sensitivity = 68%
- Specificity = 1 - False_Positive_Rate
- Specificity = 79%
Sensitivity is 68%, so roughly 68% of the time these weather stations are correctly predicting rain tomorrow.
Specificity is 79%, so roughly 79% of the time these weather stations are correctly predicting no rain tomorrow.
Now I will build a naive Bayes classification model and compute the error rates.
Preliminary Analysis: I will plot histograms of the continuous variables to see how each one relates to the outcome. That way I can eliminate variables that I don't believe will help the model. We want the "Yes" and "No" levels of a variable's histogram to look different; that is a good indication the variable should be included in the model.
Below is a sample of some of the variables plotted:
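These class-conditional histograms can be drawn with the lattice package from the list above. (The data frame name `weather` and the variable `Humidity3pm` are assumptions; substitute any continuous variable from the dataset.)

```r
library(lattice)

# One panel per RainTomorrow level -- if the "Yes" and "No" panels look
# different, the variable is a promising candidate for the model.
histogram(~ Humidity3pm | RainTomorrow, data = weather)
```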
Variables to keep: Looking at the Pearson's chi-squared tests for our categorical variables, we want to keep all three of them, WindGustDir, WindDir3pm, and RainToday, since their p-values are less than 0.05.
Model Output Plots:
After looking at the printout of our plots, I have decided to remove WindGustDir and WindDir3pm, as their "Yes" and "No" distributions are not that different, so they would not make much of an impact on the model.
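Dropping those two columns and fitting the model with the naivebayes package looks roughly like this (the `train`/`test` data frame names and the 0.5 cutoff are assumptions):

```r
library(naivebayes)

# Drop the two wind-direction variables rejected above, keep everything else.
train_nb <- subset(train, select = -c(WindGustDir, WindDir3pm))

nb_fit  <- naive_bayes(RainTomorrow ~ ., data = train_nb)
nb_prob <- predict(nb_fit, newdata = test, type = "prob")[, "Yes"]

# Confusion matrix at an assumed 0.5 cutoff.
table(Predicted = nb_prob > 0.5, Actual = test$RainTomorrow)
```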
Error Rate Plots and ROC:
- fpr = 405/(1453+405)
- False_Positive_Rate = 22%
- fnr = 107/(107+413)
- False_Negative_Rate = 21%
- Overall_Error_Rate = (405+107)/2378
- Overall_Error_Rate = 22%
- Sensitivity = 1 - False_Negative_Rate
- Sensitivity = 79%
- Specificity = 1 - False_Positive_Rate
- Specificity = 78%
Sensitivity is 79%, so roughly 79% of the time these weather stations are correctly predicting rain tomorrow.
Specificity is 78%, so roughly 78% of the time these weather stations are correctly predicting no rain tomorrow.
My final model will be a neural network.
False Positive Rate:
ROC Plot:
Based on the above plots, the cutoff rate looks to be around 0.3.
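Fitting the network with the nnet package and applying that cutoff looks roughly like this (the `train`/`test` names and the `size`/`decay` tuning values are assumptions):

```r
library(nnet)
library(NeuralNetTools)

# Single-hidden-layer network; size and decay are assumed values to tune.
set.seed(1)
nn_fit <- nnet(RainTomorrow ~ ., data = train,
               size = 5, decay = 0.01, maxit = 500)
plotnet(nn_fit)   # network diagram from NeuralNetTools

# Predicted probabilities, classified at the 0.3 cutoff from the plots above.
nn_prob <- predict(nn_fit, newdata = test, type = "raw")
table(Predicted = nn_prob > 0.3, Actual = test$RainTomorrow)
```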
Confusion Matrix:
- fpr = 262/(1564+262)
- False_Positive_Rate = 14%
- fnr = 162/(162+390)
- False_Negative_Rate = 30%
- Overall_Error_Rate = (262+162)/2378
- Overall_Error_Rate = 18%
- Sensitivity = 1 - False_Negative_Rate
- Sensitivity = 1 - 30%
- Sensitivity = 70%
- Specificity = 1 - False_Positive_Rate
- Specificity = 1 - 14%
- Specificity = 86%
Sensitivity is 70%, so roughly 70% of the time these weather stations are correctly predicting rain tomorrow.
Specificity is 86%, so roughly 86% of the time these weather stations are correctly predicting no rain tomorrow.
After building the four classification models, the neural network performed best, with the lowest overall error rate at 18%. Its sensitivity is 70% and specificity is 86%, which are strong prediction results.