Simon Monk
October 2, 2016
Load the required R packages and set the seed to ensure reproducibility of the analysis.
library(caret)
library(rattle)
library(rpart)
library(randomForest)
library(doParallel)
set.seed(213)
Download and store the pre-processed training and test data sets.
trainingDataSet <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
validationDataSet <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))
Partition the training set into a training set and a test set in order to estimate the out-of-sample error.
inTrainPartition <- createDataPartition(y=trainingDataSet$classe, p=0.6, list=FALSE)
training <- trainingDataSet[inTrainPartition, ]
testing <- trainingDataSet[-inTrainPartition, ]
Replace NA values with zero in the training, testing, and validation data sets.
training[is.na(training)] <- 0
testing[is.na(testing)] <- 0
validationDataSet[is.na(validationDataSet)] <- 0
Identify variables with near-zero variance using the nearZeroVar function on the training data set, then remove these variables from each of the three data sets.
training.nearZero <- nearZeroVar(training, saveMetrics=TRUE)
myTrainingSet <- training[, !training.nearZero$nzv]
myTestingSet <- testing[, !training.nearZero$nzv]
myValidationSet <- validationDataSet[, !training.nearZero$nzv]
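As a quick sanity check (a small addition, not part of the original output), the number of columns flagged by the filter can be counted directly:

sum(training.nearZero$nzv)   # how many near-zero-variance columns were dropped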
Remove the index (first) column from each of the data sets, as it may confuse our model.
myTrainingSet <- myTrainingSet[, -1]
myTestingSet <- myTestingSet[, -1]
myValidationSet <- myValidationSet[, -1]
We now have 58 of the original 160 variables left in each of our datasets:
dim(myTrainingSet)
## [1] 11776 58
dim(myTestingSet)
## [1] 7846 58
dim(myValidationSet)
## [1] 20 58
Let's try fitting and plotting a classification tree model:
mod.ClassificationTree <- train(classe ~ ., data=myTrainingSet, method="rpart")
fancyRpartPlot(mod.ClassificationTree$finalModel)
Let's estimate our out-of-sample accuracy by predicting with the classification tree on the test data set and comparing the predictions against the actual test set classifications. This gives an estimated out-of-sample accuracy of approximately 50%.
pred.classificationTree <- predict(mod.ClassificationTree, myTestingSet)
confusionMatrix(pred.classificationTree, myTestingSet$classe)$overall[1]
## Accuracy
## 0.4963038
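Equivalently, the estimated out-of-sample error is one minus this accuracy (a small sketch using the objects defined above; the figure simply restates the output):

tree.accuracy <- confusionMatrix(pred.classificationTree, myTestingSet$classe)$overall[1]
1 - tree.accuracy   # estimated out-of-sample error, roughly 0.50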
With such a low estimate of accuracy, let's try a new model: Random Forest. First, let's build the model on the training data. I've included the registerDoParallel() function to register a parallel backend and improve performance:
registerDoParallel()  # register a parallel backend so train() can use multiple cores
mod.rf <- train(classe ~ ., data=myTrainingSet, method="rf", ntree = 10)
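If more explicit control over the parallel workers is preferred, an equivalent approach is to create and stop a cluster around the fit (a sketch only; the worker count of four is an assumption, not part of the original analysis):

cl <- makeCluster(4)            # start four worker processes (assumed count)
registerDoParallel(cl)          # point caret's foreach loops at the cluster
mod.rf <- train(classe ~ ., data=myTrainingSet, method="rf", ntree = 10)
stopCluster(cl)                 # release the workers once the model is fit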
Our new Random Forest model gives us an estimated out-of-sample accuracy of approximately 99.8%:
pred.rf <- predict(mod.rf, myTestingSet)
confusionMatrix(pred.rf, myTestingSet$classe)$overall[1]
## Accuracy
## 0.9979607
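The corresponding estimated out-of-sample error is again one minus the accuracy (a sketch restating the output above):

rf.accuracy <- confusionMatrix(pred.rf, myTestingSet$classe)$overall[1]
1 - rf.accuracy   # estimated out-of-sample error, roughly 0.002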
With such a high estimated accuracy, let's predict on our validation set. The predictions are shown below:
pred.rf.validation <- predict(mod.rf, myValidationSet)
pred.rf.validation
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E