Practical Machine Learning

Simon Monk
October 2, 2016

Load the required R packages and set the seed to ensure reproducibility of the work.

library(caret)
library(rattle)
library(rpart)
library(randomForest)
library(doParallel)
set.seed(213)

Download and store the training and validation data sets.

  trainingDataSet <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
  validationDataSet <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
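If the analysis is re-run often, it can be convenient to cache the CSVs locally rather than downloading them each time. This is a small optional sketch using base R's download.file and file.exists; the local file names are my own choice, not part of the original analysis:

  # Download each CSV only if it is not already on disk (optional caching step)
  trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  validationUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  if (!file.exists("pml-training.csv")) download.file(trainingUrl, "pml-training.csv")
  if (!file.exists("pml-testing.csv")) download.file(validationUrl, "pml-testing.csv")
  trainingDataSet <- read.csv("pml-training.csv")
  validationDataSet <- read.csv("pml-testing.csv")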

Partition the training set into a training set and a test set so we can estimate the out-of-sample error.

  inTrainPartition <- createDataPartition(y=trainingDataSet$classe, p=0.6, list=FALSE)
  training <- trainingDataSet[inTrainPartition, ]
  testing <- trainingDataSet[-inTrainPartition, ]

Replace NA values with zero in the training, testing, and validation data sets.

  training[is.na(training)] <- 0
  testing[is.na(testing)] <- 0
  validationDataSet[is.na(validationDataSet)] <- 0
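Before zero-filling, it is worth knowing how widespread the missing values are. Here is a quick sanity check (my own addition, run on the raw, un-filled training data) that counts NAs per column:

  # Distribution of NA counts per column in the raw training data;
  # many of the summary-statistic columns are almost entirely NA
  naCounts <- colSums(is.na(trainingDataSet))
  table(naCounts)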

Identify variables with near zero variance using the nearZeroVar function on the training data set. Remove these identified variables from each of the three data sets.

  training.nearZero <- nearZeroVar(training, saveMetrics=TRUE)
  myTrainingSet <- training[!training.nearZero$nzv]
  myTestingSet <- testing[!training.nearZero$nzv]
  myValidationSet <- validationDataSet[!training.nearZero$nzv]

Remove the index (first) column from each of the data sets, as it may confuse our model.

  myTrainingSet <- myTrainingSet[c(-1)]
  myTestingSet <- myTestingSet[c(-1)]
  myValidationSet <- myValidationSet[c(-1)]

We now have 58 of the original 160 variables left in each of our datasets:

  dim(myTrainingSet)
## [1] 11776    58
  dim(myTestingSet)
## [1] 7846   58
  dim(myValidationSet)
## [1] 20 58
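As a quick sanity check (my own addition), we can confirm that the training and testing partitions still share the same columns after the near-zero-variance filtering and index removal:

  # Both partitions were filtered with the same column mask, so names should match
  identical(names(myTrainingSet), names(myTestingSet))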

Let's try fitting and plotting a classification tree model:

  mod.ClassificationTree <- train(classe ~ ., data=myTrainingSet, method="rpart")
  fancyRpartPlot(mod.ClassificationTree$finalModel)

(Figure: classification tree produced by fancyRpartPlot)

Let's estimate our out-of-sample accuracy by predicting with the classification tree on the test data set and comparing the predictions against the actual test set classifications. This gives us an estimated out-of-sample accuracy of approximately 50%.

  pred.classificationTree <- predict(mod.ClassificationTree, myTestingSet)
  confusionMatrix(pred.classificationTree, myTestingSet$classe)$overall[1]
##  Accuracy 
## 0.4963038
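Equivalently, the estimated out-of-sample error rate is one minus this accuracy, i.e. roughly 50%. A one-line addition of my own to make that explicit:

  # Estimated out-of-sample error rate for the classification tree
  1 - confusionMatrix(pred.classificationTree, myTestingSet$classe)$overall[1]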

With such a low estimate of accuracy, let's try a new model: a random forest. First, let's build the model on the training data. I've included the registerDoParallel() function to improve performance:

  registerDoParallel()
  mod.rf <- train(classe ~ ., data=myTrainingSet, method="rf", ntree = 10)

Our new Random Forest model gives us an estimated out-of-sample accuracy of approximately 99.8%:

  pred.rf <- predict(mod.rf, myTestingSet)
  confusionMatrix(pred.rf, myTestingSet$classe)$overall[1]
##  Accuracy 
## 0.9979607
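To get a feel for which predictors drive the random forest, we could also inspect caret's variable importance ranking. This is a quick sketch I've added; the exact ranking will vary with the fitted model:

  # Rank predictors by the importance computed from the fitted random forest
  varImp(mod.rf)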

With such a high estimated accuracy, let's predict on our validation set. See the predictions below:

  pred.rf.validation <- predict(mod.rf, myValidationSet)
  pred.rf.validation
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
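For readability, we might pair each prediction with its position in the validation set. This is a small presentation tweak of my own, not part of the original write-up:

  # Tabulate predictions alongside their row positions in the validation set
  data.frame(problem = seq_along(pred.rf.validation), prediction = pred.rf.validation)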