---
title: "Practical Machine Learning Project"
author: "VC2015"
date: "June 20, 2015"
output: html_document
---
## Data
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
## Synopsis
In what follows we try to predict the outcome variable (a categorical variable called "classe") from 53 raw predictors, using popular machine learning algorithms.

Initially, we considered 4 algorithms:

1. a basic classification tree,
2. multinomial logistic regression with a lasso penalty,
3. a support vector machine with a linear kernel,
4. a random forest (an ensemble of classification trees).

Algorithms 2 and 3 failed to compute on the given data, so we eliminated them on purely computational grounds. The code for algorithm 2 is given below, and a sketch of what the algorithm 3 call might look like is given just below this section; the reader can try them, but algorithm 2 failed to produce results after two hours on a MacBook Pro. This left us with little to choose from: the basic classification tree and the random forest. The random forest easily outperformed the basic classification tree, so we ended up picking the random forest.

When judging the performance of the two competing algorithms, we used training and validation samples obtained from a 60/40 split of the original training data. We also have an additional hold-out sample of 20 observations, but that is far too small to serve as a real validation sample, so we treated that last bit of data as a mere curiosity.
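
For reference, here is a minimal sketch of what the linear SVM attempt (algorithm 3) might look like. This is an illustration using caret's "svmLinear" method (backed by the kernlab package), not the exact code we ran, and it assumes the `training` data frame constructed later in this document; the chunk is not evaluated because the fit did not finish in a reasonable time on our data.
```{r, eval=FALSE}
# Illustrative sketch only, not the exact code we ran: a linear-kernel SVM
# fitted via caret's "svmLinear" method (kernlab). Assumes the `training`
# data frame created in the splitting section below.
library(kernlab)
svmFit <- train(classe ~ ., data = training, method = "svmLinear",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 5))
```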
## Load Packages
We first load the required packages.
```{r}
library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(randomForest)
```
## Read Data
We read in the data from the web.
```{r, cache=TRUE}
train.site <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.site <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Treat empty strings and spreadsheet division errors ("#DIV/0!") as NA.
train <- read.csv(url(train.site), na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv(url(test.site), na.strings = c("NA", "#DIV/0!", ""))
```
## Look at data
We then look at the basic structure of the data.
```{r}
dim(train)
dim(test)
#head(train)
str(train)
#head(test)
str(test)
```
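
Since "classe" is the outcome we are trying to predict, it is also worth a quick look at its distribution; this check is our addition.
```{r}
# Distribution of the outcome variable across its five levels.
table(train$classe)
```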
## Drop predictors that are NAs
We drop every predictor column that contains missing values (in this data set, such columns are almost entirely missing, so little information is lost).
```{r}
train <- train[, colSums(is.na(train)) == 0]
dim(train)
test <- test[, colSums(is.na(test)) == 0]
dim(test)
```
## Delete some variables that are not highly relevant (time stamps etc.)
We now drop variables that do not seem relevant for external prediction (the row index and the time-stamp and window variables in columns 3-7), but we keep "user_name", since it could be relevant to prediction for the test data we are given.
```{r}
# Drop the row index (column 1) and the time-stamp/window variables
# (columns 3-7); keep "user_name" (column 2).
train <- train[, -c(1, 3:7)]
test <- test[, -c(1, 3:7)]
```
## Look at how many NAs we have left
```{r}
colSums(is.na(train))
colSums(is.na(test))
```
This simple data cleaning has produced a complete data set with no missing values.
## Split train into training and validation samples
Since the hold-out sample of 20 observations is far too small, we split the training sample into a training sample and a validation/testing sample, using a 60/40 split.
```{r}
intrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
training <- train[intrain, ]
testing <- train[-intrain, ]
dim(training)
dim(testing)
```
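
createDataPartition samples within each level of the outcome, so the split should preserve the class proportions; the quick check below is our addition.
```{r}
# Stratified split: class proportions should be nearly identical in the
# full training data and in both subsamples.
round(rbind(full     = prop.table(table(train$classe)),
            training = prop.table(table(training$classe)),
            testing  = prop.table(table(testing$classe))), 3)
```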
## Try a basic classification tree model
Here we fit a basic classification tree and plot the resulting tree. We also look at the confusion matrix, which summarizes the performance of the model on the validation (testing) data. The model predicts correctly in about 49% of cases, with a 95% confidence interval of (0.4843, 0.5065); this is well above the roughly 20% accuracy of randomly guessing among the 5 levels of the "classe" outcome.
```{r,cache=TRUE}
regtreeFit <- train(classe ~ ., data = training,
                    preProcess = c("center", "scale"),
                    method = "rpart", metric = "Accuracy")
print(regtreeFit$finalModel)
fancyRpartPlot(regtreeFit$finalModel, cex = .5, under.cex = 1, shadow.offset = 0)
```
```{r}
regtreepredict <- predict(regtreeFit, testing)
confusionMatrix(testing$classe, regtreepredict)
```
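
Rather than reading the accuracy off the printed output, one can extract it and its 95% confidence interval directly from the confusionMatrix object; this snippet is our addition.
```{r}
# Store the confusion matrix and pull out the overall accuracy with its
# 95% confidence bounds.
regtreeCM <- confusionMatrix(testing$classe, regtreepredict)
regtreeCM$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
```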
## Next we try the multinomial logistic model with lasso penalization
```{r, eval=FALSE}
# Not evaluated: this fit did not finish in a reasonable amount of time.
logitModel <- train(as.factor(classe) ~ ., data = training, method = "glmnet",
                    tuneGrid = expand.grid(.alpha = 1, .lambda = 20),
                    family = "multinomial")
plot(logitModel)
coef(logitModel$finalModel, s = logitModel$bestTune$.lambda)
```
This failed to compute in a reasonable amount of time, so we abandoned it and had to force-kill the R process. We wish the glmnet package were more computationally efficient on problems of this size.
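
As an aside, a possibly faster route, untested by us and shown only as a sketch, would be to call glmnet directly on a model matrix with a single fixed lambda, bypassing caret's resampling loop entirely. The lambda value below is an arbitrary placeholder.
```{r, eval=FALSE}
# Untested sketch: fit the lasso-penalized multinomial model directly with
# glmnet, skipping caret's resampling. The lambda value is arbitrary.
library(glmnet)
x <- model.matrix(classe ~ ., data = training)[, -1]  # drop intercept column
y <- training$classe
logitFit <- glmnet(x, y, family = "multinomial", alpha = 1, lambda = 0.01)
```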
## We next try a random forest, an ensemble of classification trees
A random forest is an ensemble of classification trees, each fitted on a different bootstrap sample of the data with a random subset of predictors considered at each split, which allows it to combine many basic trees into a single, much more accurate predictor.
```{r, cache=TRUE}
set.seed(1)
forestFit <- train(classe ~ ., method = "rf", data = training,
                   trControl = trainControl(method = "cv", number = 5))
```
```{r,cache=TRUE}
print(forestFit)
forestpredict <- predict(forestFit, testing)
confusionMatrix(testing$classe, forestpredict)
```
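
It is also instructive to see which predictors the forest relies on most; this check is our addition, using caret's varImp function.
```{r}
# Relative variable importance as computed by caret for the fitted forest.
varImp(forestFit)
```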
We see that the random forest achieves about 99% out-of-sample prediction accuracy, with a 95% confidence interval of (0.9887, 0.993), which is extremely impressive performance. It also dominates the simple classification tree on many other measures, such as specificity and sensitivity. Thus, we have little choice but to pick the random forest as the winning algorithm.
## Predicting the test cases
Now we turn to the 20 test cases provided to us and build predictions for them.
```{r}
forestpredict.final<-predict(forestFit,test)
print(forestpredict.final)
```
Finally, we write out the answers for the automatic grading.
```{r}
# Write each prediction to its own text file, as required by the grader.
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(forestpredict.final)
```
## Conclusion
We tried to predict the outcome variable (a categorical variable called "classe") from 53 raw predictors, using popular machine learning algorithms.

We considered 4 algorithms:

1. a basic classification tree,
2. multinomial logistic regression with a lasso penalty,
3. a support vector machine with a linear kernel,
4. a random forest (an ensemble of classification trees).

We eliminated algorithms 2 and 3 on computational grounds. We chose algorithm 4 based on its far superior out-of-sample prediction performance, using the validation sample to make the final choice between the two remaining algorithms.