---
title: "Quantified Self Project - An Exercise in Machine Learning"
author: "Len Greski"
date: "December 3, 2015"
output:
  html_document:
    keep_md: yes
---
```{r ref.label="dataDownload", echo=FALSE}
# download data for analysis if necessary
```
```{r ref.label="readData", echo=FALSE}
# read the two data files so we can reference them in code
```
## Executive Summary
Classifying data from the [Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201) study to predict exercise quality for unknown observations resulted in a 100% accuracy rate using a random forest technique. Key findings included:

* Fully 62.5% of the variables in the dataset (100 of 160) were unusable due to high rates of missing values,
* Of the remaining 60 variables, the top <N> explained <x%> of the variability in the exercise quality variable, `classe`, and
* A random forest model with <list parameters here> correctly identified 20 out of 20 unknown test cases.
## Background
There is an explosion of data being generated by personal devices, ranging from smartphones to "wearable" computers and fitness trackers such as the *Fitbit, Jawbone Up, Moto 360, Nike Fuelband, Samsung Gear Fit* and most recently the *Apple Watch*. Scientists are using this data to form an emerging category of research: Human Activity Recognition (HAR).
While most of the research in HAR is focused on identifying specific types of activities given a set of measurements from a smart device, relatively little attention has been paid to the quality of exercise as measured by these devices. As such, Wallace Ugulino, Eduardo Velloso, and Hugo Fuks developed a study to see whether they could classify the quality of exercises performed by a set of six individuals.
Our goal for this analysis is to use the Weight Lifting Exercises Dataset that was the subject of the research paper [Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201), which was presented at the 4th Augmented Human (AH) International Conference in 2013. Details about the methodology for specifying correct execution of an exercise and tracking it may be found in the paper linked above.
## Exploratory Data Analysis / Feature Selection
Per the research team:
> Six young health participants were asked to perform one set of 10 repititions (sic) of the
> Unilateral Dumbell Biceps Curl in five different fashions: exactly according to the
> specification (Class A), throwing elbows to the front (Class B), lifting the dumbbell
> only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the
> hips to the front (Class E).
The independent variables comprise 153 measurements collected from four sensors: one each on the belt, arm, forearm, and dumbbell.
The dependent variable, `classe`, is a categorical variable, with 16% to 28% of the observations in a given category, as illustrated below.
```{r depVar, echo=FALSE}
Counts <- table(training$classe)
Percentages <- Counts / sum(Counts)
aFrame <- rbind(Counts,Percentages)
kable(aFrame)
countsByname <- table(training$classe,training$user_name)
```
Category A represents the exercises that were completed according to specification, about 28% of the total number of exercises measured across the six participants in the study. Exercise quality varies significantly within and between persons, as illustrated in the following barplot.
```{r, echo=FALSE}
barplot(countsByname,
xlab = "Subject",
ylab = "Frequency",
legend=rownames(countsByname),
main="Exercise Quality by Subject",
beside=TRUE
)
```
A successful classification model must not only predict whether the exercise was completed correctly (classe A vs. B through E), but also correctly identify the type of error made when the exercise was completed incorrectly. For the purposes of our assignment, our machine learning algorithm must predict the values of 20 unknown observations. The probability of achieving 20 out of 20 correct predictions is $p^{20}$, and since $0.95^{20} = 0.36$, we'll need a model with accuracy well above 95% to have a reasonable chance of classifying all 20 observations correctly. At 99% accuracy, we have approximately a 0.82 probability of 20 out of 20 matches.
A run of summary statistics on the training dataset shows that 100 of the 160 variables are missing for virtually all of the observations. We will eliminate these from the analysis because there is no way to devise a meaningful missing value imputation strategy for them. We will also remove the date and time variables (`raw_timestamp_part_1`, `raw_timestamp_part_2`, and `cvtd_timestamp`) as well as `new_window`, because `new_window` is distributed as 2% "yes" and 98% "no" and is therefore unlikely to be useful for classifying exercises into quality levels. We retain the factor variable representing each participant's name in the model, to see whether accounting for between-person variability in exercise quality adds any predictive value.
All of the remaining numeric variables have no missing values, so imputation of missing values is not required in order to increase the number of features included in the analysis.
## Cross-Validation & OOB Estimation
To balance predictive power against a manageable model-building time, we will use k-fold cross validation as our method for estimating the out of sample error. We will use 5 folds, meaning that each classification algorithm splits the training data into five subsamples and fits five models, each time holding out one subsample while training on the remaining four. The results are then aggregated to create an overall estimate of the out of sample error.
## Model 1: Linear Discriminant Analysis
We begin the predictive modeling exercise with a simple classification model based on linear discriminant analysis. We chose this approach because it is a relatively simple model that can serve as a baseline for prediction accuracy.
```{r ref.label="buildModel1", echo=FALSE}
# run LDA model
```
The model has an overall accuracy of 77%, with the highest sensitivity being 0.84 for classifying an exercise as class A when it is indeed class A. The model performs worst on class B, with only 71% sensitivity. The confusion matrix illustrates that a classification model based on linear discriminant analysis does not have sufficient accuracy for us to expect perfect or near-perfect classification of our unknown validation cases.
## Model 2: Random Forest
The random forest technique generates many decision trees and aggregates their predictions to produce a final result. Random forests have a high degree of predictive power, and can be tuned through a variety of parameters, including a range of resampling choices from k-fold cross validation to bootstrapping. As we did with the linear discriminant analysis, we use k-fold cross validation with five folds.
```{r ref.label="useParallel", echo=FALSE}
# turn on parallel processing
```
```{r ref.label="buildModel2", echo=FALSE}
# run randomForest model
```
```{r ref.label = "termParallel", echo = FALSE}
# stop parallel processing
```
The random forest model is extremely powerful, correctly classifying all cases in our training data set. The algorithm produces optimal results when 30 randomly selected predictors (the `mtry` tuning parameter) are considered at each split, reaching a maximum accuracy of 0.994 as illustrated by the following chart.
```{r plotRFAccuracy, echo=FALSE}
plot(modFit2,
main="Accuracy by Predictor Count")
```
The final model selected by the algorithm quickly minimizes the error term, stabilizing below 0.02 after approximately 50 trees. Trees added beyond 50 do not appear to meaningfully reduce the error. There is also little variability in the error term across the outcome classes, as illustrated by the following plot.
```{r plotErr, echo=FALSE}
plot(modFit2$finalModel,main="Error by Number of Trees: Random Forest Model")
```
The relative importance of the variables is illustrated by the following variable importance plot. The six most important variables are `num_window`, `roll_belt`, `yaw_belt`, `roll_forearm`, `magnet_dumbbell_x`, and `pitch_belt`, each of which decreases the mean node impurity by at least 600 (using the summed and normalized Gini coefficient), whereas the remaining variables decrease node impurity by 350 or less. See [Dinsdale and Edwards - 2015](https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html) for additional background on the Gini coefficient in the randomForest variable importance measure.
```{r varImp, echo=FALSE}
varImpPlot(modFit2$finalModel,
main="Variable Importance Plot: Random Forest",
type=2)
```
### Expected Out of Sample Error
Given the accuracy estimated during resampling, we expect the out of sample error rate to be less than 1%, giving us approximately a 0.87 probability that we will correctly classify all 20 of the validation cases.
## Results
The results for our random forest were excellent, with an OOB estimate of the error rate of 0.55%. When we apply the model to the test data set that we held out of the model building steps, we find that the model accurately predicts 99.45% of the test cases, incorrectly classifying 43 of the 7,846 observations.
Finally, our accuracy at predicting the 20 cases in the validation data set was 100%. All in all, a good effort for our first attempt at a random forest.
## Appendix
```{r dataDownload, echo=FALSE,eval = FALSE}
theFiles <- c("pml-testing.csv","pml-training.csv")
theDirectory <- "./data/"
dlMethod <- "curl"
if(substr(Sys.getenv("OS"),1,7) == "Windows") dlMethod <- "wininet"
if(!dir.exists(theDirectory)) dir.create(theDirectory)
for (i in 1:length(theFiles)) {
    aFile <- paste(theDirectory,theFiles[i],sep="")
    if (!file.exists(aFile)) {
        url <- paste("https://d396qusza40orc.cloudfront.net/predmachlearn/",
                     theFiles[i],
                     sep="")
        download.file(url,destfile=aFile,
                      method=dlMethod,
                      mode="w") # use mode "w" for text
    }
}
```
```{r readData, echo=TRUE, eval = FALSE}
pkgs <- c("lattice","MASS","ggplot2","grid","readr","knitr","caret","YaleToolkit")
notInstalled <- pkgs[!(pkgs %in% installed.packages())]
if(sum(!(pkgs %in% installed.packages())) > 0) {
    for(i in notInstalled) install.packages(i)
}
library(lattice)
library(MASS)
library(ggplot2)
library(grid)
library(readr)
library(knitr)
library(caret)
library(YaleToolkit)
# col_types codes for read_csv: c = character, n = numeric (160 columns total)
string40 <- "ncnnccnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn"
string80 <- "nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn"
string120 <- "nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn"
string160 <- "nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnc"
colString <- paste(string40,string80,string120,string160,sep="")
validation <- readr::read_csv("./data/pml-testing.csv",
                              col_names=TRUE,
                              col_types=colString)
originalData <- readr::read_csv("./data/pml-training.csv",
                                col_names=TRUE,
                                col_types=colString)
# fix missing column name for "observation / row number"
theColNames <- colnames(originalData)
theColNames[1] <- "obs"
colnames(originalData) <- theColNames
originalData$classe <- as.factor(originalData$classe)
valResult <- whatis(originalData)
# retain all columns with fewer than 50 missing values
theNames <- as.character(valResult[valResult$missing < 50 & valResult$variable.name != "obs",1])
originalSubset <- originalData[,theNames]
# remove date variables and binary window
originalSubset <- originalSubset[c(-2,-3,-4,-5)]
# valSubset <- whatis(originalSubset)
set.seed(102134)
trainIndex <- createDataPartition(originalSubset$classe,p=.60,list=FALSE)
training <- originalSubset[trainIndex,]
testing <- originalSubset[-trainIndex,]
```
```{r useParallel, echo=TRUE,eval = FALSE}
library(iterators)
library(parallel)
library(foreach)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # leave one core free for the operating system
registerDoParallel(cluster)
```
```{r buildModel1, echo=TRUE, cache=TRUE,eval = FALSE}
yvars <- training[,55]
xvars <- training[,-55]
intervalStart <- Sys.time()
mod1Control <- trainControl(method="cv",number=5,allowParallel=TRUE)
# modFit1 <- train(x=xvars,y=yvars,method="rpart",trControl=mod1Control)
modFit1 <- train(classe ~ .,data=training,method="lda",trControl=mod1Control)
# Model 1
intervalEnd <- Sys.time()
paste("Train model1 took: ",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
pred1 <- predict(modFit1,training)
confusionMatrix(pred1,training$classe)
# predicted_test <- predict(modFit1,testing)
# confusionMatrix(predicted_test,testing$classe)
# predicted_validation <- predict(modFit,validation)
```
```{r buildModel2, echo=TRUE, cache=TRUE,eval = FALSE}
library(randomForest)
intervalStart <- Sys.time()
mod2Control <- trainControl(method="boot",number=25,allowParallel=TRUE)
modFit2 <- train(classe ~ .,data=training,method="rf",trControl=mod2Control)
intervalEnd <- Sys.time()
print(modFit2)
paste("Train model2 took: ",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
pred2 <- predict(modFit2,training)
confusionMatrix(pred2,training$classe)
predicted_test <- predict(modFit2,testing)
confusionMatrix(predicted_test,testing$classe)
```
```{r writeFiles, echo=TRUE,eval = FALSE}
# generate predictions on validation data set
predicted_validation <- predict(modFit2,validation)
# compare to correct answers as validated by submitting the individual files to Coursera for
# part 2 of the assignment
answers <- c("B" ,"A","B","A", "A","E", "D", "B", "A", "A",
"B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
results <- data.frame(answers,predicted_validation)
which(as.character(results$answers) != as.character(results$predicted_validation))
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        # use sep="" so the file name contains no embedded spaces
        filename = paste("./data/problem_id_",i,".txt",sep="")
        write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
    }
}
predicted_chars <- as.character(predicted_validation)
pml_write_files(predicted_chars)
```
```{r termParallel,echo = FALSE,eval = FALSE}
stopCluster(cluster)
registerDoSEQ()
```
```{r sessionData,echo = FALSE, eval = TRUE}
sessionInfo()
```
# References
1. Dinsdale, L. and Edwards, R. -- [Random Forests Webpage](https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html), retrieved from the _Metagenomics. Statistics._ website on December 19, 2015.
2. Velloso, E. et al. (2013) -- [Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201), Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13), Stuttgart, Germany, ACM SIGCHI, 2013.