Develop a way to predict the variable RainTomorrow which equals Yes if it rained the next day and No if it did not rain the next day. We will use logistic regression, decision tree classification, naive bayes classification, and neural networks to build four separate models. We will use the confusion matrix as our decider on which model to use.

NavarroAlexKU/Predicting-Rain-tomrrow

Predicting Rain Tomorrow Using Classification Modeling

Using various classification methods, the objective of the project is to develop a model that will predict the variable "RainTomorrow" which equals "Yes" if it rained the next day and "No" if it did not rain the next day.

ScreenShot

Authors

🔗 Social Media Links

linkedin

Documentation

You can get the dataset used in the analysis by downloading it from my GitHub website.

Data

Installation & Packages:

The analysis was done in R. You will need the following packages to run the code:

  • library(ROCR)
  • library(pROC)
  • library(rpart)
  • library(rpart.plot)
  • library(lattice)
  • library(naivebayes)
  • library(nnet)
  • library(NeuralNetTools)
install.packages("ROCR")

install.packages("rpart")

Modeling:

I'm going to build four separate classification models using the following methods:

  • Logistic Regression
  • Decision Tree Classification
  • Naive Bayes Classification
  • Neural Network Classification

Logistic Regression:

The first model I will build is a logistic regression model. I'm going to use forward stepwise regression to determine the features most important for my model. Forward selection starts from an intercept-only model and adds one predictor at a time, each time choosing the addition that lowers the AIC the most, until no addition lowers it further. The final model is the one with the lowest AIC value. For more information on forward stepwise regression and AIC, please see the following link. Stepwise Regression
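As a rough illustration of forward selection by AIC, here is a minimal sketch using base R's `glm()` and `step()`. It runs on simulated data, not the project's dataset, and the column names `humidity`, `pressure`, and `noise` are made up for the example; the author's actual code may differ.

```r
# Forward stepwise selection by AIC on simulated data (not the project's data).
set.seed(42)
n <- 500
toy <- data.frame(
  humidity = rnorm(n, mean = 50, sd = 15),    # hypothetical predictor
  pressure = rnorm(n, mean = 1015, sd = 7),   # hypothetical predictor
  noise    = rnorm(n)                         # unrelated to the outcome by construction
)

# Simulate the outcome so that humidity and pressure matter and noise does not.
p_rain <- plogis(-1 + 0.08 * (toy$humidity - 50) - 0.10 * (toy$pressure - 1015))
toy$RainTomorrow <- factor(ifelse(runif(n) < p_rain, "Yes", "No"),
                           levels = c("No", "Yes"))

# Start from the intercept-only model and let step() add predictors one at a time.
null_model <- glm(RainTomorrow ~ 1, data = toy, family = binomial)
step_model <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ humidity + pressure + noise),
  direction = "forward",
  trace     = 0                               # suppress the per-step log
)

formula(step_model)   # predictors kept by the search
AIC(null_model)       # AIC of the starting model
AIC(step_model)       # AIC of the selected model
```

Because forward selection only accepts a predictor when it lowers the AIC, the selected model's AIC can never be higher than that of the starting model.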

The following is the optimal model output for the logistic regression model:

Optimal Model Based On Best AIC:

App Screenshot

Logistic Regression Model Output

App Screenshot

Confusion Matrix:

App Screenshot

ROC Plot:

App Screenshot

False Positive Plot:

App Screenshot

False Negative Plot:

App Screenshot

App Screenshot

Calculating:

  • False Positive Rate
  • False Negative Rate
  • Overall Error Rate
  • Sensitivity
  • Specificity

FPR = 397/(1420+397)*100

  • False Positive Rate = 22%

FNR = 109/(109+452)*100

  • False Negative Rate = 19%

Total = 1420+109+397+452 = 2378

Overall_Error_Rate = ((397+109)/Total)*100

  • Overall_Error_Rate = 21%

Sensitivity = 1 - 0.19 = 0.81, so the model correctly predicts rain on about 81% of the days when it actually rained the next day.

Specificity = 1 - 0.22 = 0.78, so the model correctly predicts no rain on about 78% of the days when it did not rain the next day.
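The same calculations can be wrapped in a small helper so they are easy to repeat for each model. This is a sketch, not the author's code: the function name `confusion_metrics` is hypothetical, and the assignment of the four counts to cells (1420 true negatives, 397 false positives, 109 false negatives, 452 true positives) is inferred from the formulas above.

```r
# Compute error metrics from the four cells of a 2x2 confusion matrix.
confusion_metrics <- function(tn, fp, fn, tp) {
  total <- tn + fp + fn + tp
  fpr <- fp / (fp + tn)   # share of actual "No" days predicted as "Yes"
  fnr <- fn / (fn + tp)   # share of actual "Yes" days predicted as "No"
  c(
    false_positive_rate = fpr,
    false_negative_rate = fnr,
    overall_error_rate  = (fp + fn) / total,
    sensitivity         = 1 - fnr,
    specificity         = 1 - fpr
  )
}

# Counts taken from the logistic regression formulas above (cell mapping assumed).
logit_metrics <- confusion_metrics(tn = 1420, fp = 397, fn = 109, tp = 452)
round(100 * logit_metrics, 1)   # metrics as percentages
```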

ROC Curve and Area Under Curve:

App Screenshot

Based on the above plots, I believe that a good cutoff point is 0.02, as the curves start to flatten around 0.02.
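To make the idea of a cutoff concrete, the sketch below computes the false positive and false negative rates over a grid of cutoffs, which is the information the plots above display. It uses simulated probabilities and base R only; the helper name `error_rates_at` is hypothetical, and the project itself appears to rely on ROCR for these plots.

```r
# Sweep classification cutoffs and record the error rates at each one.
set.seed(1)
actual <- rbinom(200, size = 1, prob = 0.25)   # 1 = rain, 0 = no rain (simulated)
prob   <- plogis(rnorm(200, mean = ifelse(actual == 1, 0.5, -1.5)))   # simulated scores

error_rates_at <- function(cutoff, prob, actual) {
  predicted <- as.integer(prob >= cutoff)      # predict rain when prob >= cutoff
  c(
    cutoff = cutoff,
    fpr    = sum(predicted == 1 & actual == 0) / sum(actual == 0),
    fnr    = sum(predicted == 0 & actual == 1) / sum(actual == 1)
  )
}

cutoffs <- seq(0, 1, by = 0.05)
rates <- as.data.frame(t(sapply(cutoffs, error_rates_at, prob = prob, actual = actual)))
head(rates)
```

Raising the cutoff can only lower the false positive rate and raise the false negative rate, so choosing a cutoff is a trade-off between the two kinds of error.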

Next we will check the AUC (area under the ROC curve) for our model. We want the ROC curve to sit as close as possible to the upper-left corner. A good rule of thumb is that an AUC above 0.7 means the model is good enough. Our AUC is 0.88, so we have a good enough model.

App Screenshot
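For intuition about what the AUC measures, it equals the probability that a randomly chosen rainy day receives a higher predicted score than a randomly chosen dry day. The base R sketch below computes it from ranks; the function name `auc_rank` is hypothetical, and this is an illustration rather than the pROC or ROCR call used in the analysis.

```r
# AUC via the rank-sum (Mann-Whitney) identity, using base R only.
auc_rank <- function(score, actual) {
  pos   <- actual == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)   # tied scores receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny example: 3 of the 4 (rainy, dry) pairs are ordered correctly.
auc_rank(score = c(0.1, 0.4, 0.35, 0.8), actual = c(0, 0, 1, 1))   # 0.75
```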

Decision Tree Classifier:

Now I will build the second classification model using decision trees.

First Tree:

App Screenshot

Model CP Error Output:

App Screenshot

The column labeled “xerror” is the cross-validation error for different subsets of the tree. What we are looking for is a low cross validation error. The column labeled “xstd” gives the standard errors of the cross validation errors. This column can be used to get an idea of how much the values in the “xerror” column could reasonably vary.

Looking at the xerror column, we can see that xerror starts to level out at the third value, 0.76981, which is not much different from the 0.70513 value. Thus, the tree corresponding to the third row gives us nearly the lowest cross-validation error with the least complicated model, using a cp of 0.017337. We will use this cp to prune our tree, as we do not want to overfit the model.
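The pruning step can be sketched with `rpart` as follows. Note the swap: instead of reading the table by eye as done above, this example applies the common "one standard error" heuristic (the smallest tree whose xerror is within one xstd of the minimum), which is an assumption of this sketch and not necessarily the author's procedure. The data are simulated and the column names are made up.

```r
library(rpart)

# Simulated data standing in for the weather dataset.
set.seed(7)
n <- 600
toy <- data.frame(
  humidity = runif(n, min = 10, max = 100),   # hypothetical predictor
  pressure = rnorm(n, mean = 1015, sd = 7),   # hypothetical predictor
  noise    = rnorm(n)
)
p_rain <- plogis(-1 + 0.06 * (toy$humidity - 55) - 0.08 * (toy$pressure - 1015))
toy$RainTomorrow <- factor(ifelse(runif(n) < p_rain, "Yes", "No"),
                           levels = c("No", "Yes"))

# Grow a deliberately large tree, then inspect its complexity table.
full_tree <- rpart(RainTomorrow ~ ., data = toy, method = "class",
                   control = rpart.control(cp = 0.001))
cp_table <- full_tree$cptable   # columns: CP, nsplit, rel error, xerror, xstd

# One-standard-error rule: first (simplest) row within one xstd of the minimum.
best      <- which.min(cp_table[, "xerror"])
threshold <- cp_table[best, "xerror"] + cp_table[best, "xstd"]
chosen    <- which(cp_table[, "xerror"] <= threshold)[1]
chosen_cp <- cp_table[chosen, "CP"]

pruned_tree <- prune(full_tree, cp = chosen_cp)
```

Rows of the cp table are ordered from the simplest tree to the most complex, so taking the first row under the threshold favors the least complicated model.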

Tree Model 1 Summary Statistics:

App Screenshot

Prune Tree:

App Screenshot

Error Rate Prune Tree:

App Screenshot

Now we will look at the cutoff points for our predictions, then compute the error metrics using the confusion matrix:

ROC Plot

App Screenshot

False Positive Rate Plot:

App Screenshot

False Negative Rate Plot:

App Screenshot

Confusion Matrix and Calculations:

App Screenshot

App Screenshot

  • fpr = 384/(1474+384)

  • False_Positive_Rate = 21%

  • fnr = 165/(165+355)

  • False_Negative_Rate = 32%

  • Overall_Error_Rate = (384+165)/2378

  • Overall_Error_Rate = 23%

  • Sensitivity = 1 - False_Negative_Rate

  • Sensitivity = 68%

  • Specificity = 1 - False_Positive_Rate

  • Specificity = 79%

Sensitivity is 68%, so the model correctly predicts rain on roughly 68% of the days when it actually rained the next day.

Specificity is 79%, so the model correctly predicts no rain on roughly 79% of the days when it did not rain the next day.

NaiveBayes:

Now I will create a model using naive Bayes as my classification method and compute the error rates.

Preliminary Analysis: I will plot histograms of the continuous variables, split by the Yes and No levels of RainTomorrow, to see whether each variable is related to the outcome. That way I can eliminate variables that I don't believe will help my model. We want the Yes and No histograms to look different; this is a good indication that we should include the variable in our model.

Below is a sample of some of the variables plotted:

App Screenshot

Variables to keep: Looking at the Pearson's Chi-squared tests for our categorical variables, we want to keep all three of them (WindGustDir, WindDir3pm, and RainToday) because their p-values are less than 0.05.

App Screenshot

App Screenshot
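The screening test for a categorical variable can be sketched with base R's `chisq.test()`. The data below are simulated so that rain today makes rain tomorrow more likely; only the variable names RainToday and RainTomorrow come from the project, and the probabilities are invented for the example.

```r
# Pearson's Chi-squared test of independence between two categorical variables.
set.seed(3)
n <- 1000
rain_today <- factor(ifelse(runif(n) < 0.25, "Yes", "No"), levels = c("No", "Yes"))

# Simulated dependence: rain tomorrow is more likely after a rainy day.
p_tomorrow    <- ifelse(rain_today == "Yes", 0.50, 0.15)
rain_tomorrow <- factor(ifelse(runif(n) < p_tomorrow, "Yes", "No"),
                        levels = c("No", "Yes"))

counts      <- table(RainToday = rain_today, RainTomorrow = rain_tomorrow)
chisq_result <- chisq.test(counts)

chisq_result$p.value                          # small p-value suggests dependence
keep_variable <- chisq_result$p.value < 0.05  # screening rule used in the text
```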

Model Output Plots:

App Screenshot

App Screenshot

After looking at the printout of our plots, I have decided to remove WindGustDir and WindDir3pm, as their “Yes” and “No” distributions are not much different and they would not have much impact on our model.

Error Rate Plots and ROC:

App Screenshot

App Screenshot

App Screenshot

  • fpr = 405/(1453+405)

  • False_Positive_Rate = 22%

  • fnr = 107/(107+413)

  • False_Negative_Rate = 21%

  • Overall_Error_Rate = (405+107)/2378

  • Overall_Error_Rate = 22%

  • Sensitivity = 1 - False_Negative_Rate

  • Sensitivity = 79%

  • Specificity = 1 - False_Positive_Rate

  • Specificity = 78%

Sensitivity is 79%, so the model correctly predicts rain on roughly 79% of the days when it actually rained the next day.

Specificity is 78%, so the model correctly predicts no rain on roughly 78% of the days when it did not rain the next day.

Neural Network:

My final model will be a neural network.

False Positive Rate:

App Screenshot

ROC Plot:

App Screenshot

Based on the above plots, the cutoff rate looks to be around 0.3.
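As a hedged sketch of how a cutoff of 0.3 turns network outputs into class predictions, the example below fits a small `nnet` model on simulated data and tabulates the result. The predictors, the network size (`size = 3`), the weight decay, and the train/test split are all assumptions made for illustration; the source does not state the settings actually used.

```r
library(nnet)

# Simulated data with two hypothetical predictors already scaled to [0, 1].
set.seed(11)
n <- 800
toy <- data.frame(humidity = runif(n), pressure = runif(n))
p_rain <- plogis(-2 + 4 * toy$humidity - 2 * toy$pressure)
toy$RainTomorrow <- factor(ifelse(runif(n) < p_rain, "Yes", "No"),
                           levels = c("No", "Yes"))

# Hold out 200 rows for evaluation.
train_rows <- sample(n, size = 600)
train    <- toy[train_rows, ]
test_set <- toy[-train_rows, ]

# Single hidden layer; for a two-level factor, nnet fits one logistic output.
fit <- nnet(RainTomorrow ~ humidity + pressure, data = train,
            size = 3, decay = 0.01, maxit = 200, trace = FALSE)

# Predicted probability of "Yes", then classify with the 0.3 cutoff.
prob_yes  <- as.vector(predict(fit, newdata = test_set, type = "raw"))
cutoff    <- 0.3
predicted <- factor(ifelse(prob_yes >= cutoff, "Yes", "No"), levels = c("No", "Yes"))

confusion <- table(Actual = test_set$RainTomorrow, Predicted = predicted)
confusion
```

A cutoff below 0.5 labels more days as "Yes", which trades a higher false positive rate for a lower false negative rate.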

Confusion Matrix:

App Screenshot

  • fpr = 262/(1564+262)

  • False_Positive_Rate = 14%

  • fnr = 162/(162+390)

  • False_Negative_Rate = 30%

  • Overall_Error_Rate = (262+162)/2378

  • Overall_Error_Rate = 18%

  • Sensitivity = 1 - False_Negative_Rate

  • Sensitivity = 1-30%

  • Sensitivity = 70%

  • Specificity = 1 - False_Positive_Rate

  • Specificity = 1 - 14%

  • Specificity = 86%

Sensitivity is 70%, so the model correctly predicts rain on roughly 70% of the days when it actually rained the next day.

Specificity is 86%, so the model correctly predicts no rain on roughly 86% of the days when it did not rain the next day.

Final Output:

After building the four classification models, the neural network model performed best, with the lowest overall error rate at 18%. Its sensitivity is 70% and its specificity is 86%, the highest specificity of the four models.
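The comparison can be reproduced from the confusion-matrix counts quoted in each section. The sketch below collects them in one data frame and picks the model with the lowest overall error rate; the assignment of each count to a cell (true negative, false positive, false negative, true positive) is inferred from the formulas in the text and should be treated as an assumption.

```r
# Side-by-side comparison of the four models from their confusion-matrix counts.
models <- data.frame(
  model = c("Logistic regression", "Decision tree", "Naive Bayes", "Neural network"),
  tn = c(1420, 1474, 1453, 1564),   # actual No,  predicted No
  fp = c(397,  384,  405,  262),    # actual No,  predicted Yes
  fn = c(109,  165,  107,  162),    # actual Yes, predicted No
  tp = c(452,  355,  413,  390)     # actual Yes, predicted Yes
)

total <- models$tn + models$fp + models$fn + models$tp
models$error_rate  <- (models$fp + models$fn) / total
models$sensitivity <- models$tp / (models$tp + models$fn)
models$specificity <- models$tn / (models$tn + models$fp)

# Model with the lowest overall error rate.
best_model <- models$model[which.min(models$error_rate)]
best_model
```

Overall error rate is only one possible criterion; if missing a rainy day is costlier than a false alarm, ranking by sensitivity instead would favor a different model.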
