---
title: "Practical Machine Learning Project"
author: "VC2015"
date: "June 20, 2015"
output: html_document
---
## Data
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
## Synopsis
In what follows we try to predict the outcome variable (a categorical variable called "classe") from 53 raw predictors, using popular machine learning algorithms.

Initially, we considered 4 algorithms:

1. a basic classification tree,
2. multinomial logistic regression with a lasso penalty,
3. a support vector machine with a linear kernel,
4. a random forest (an ensemble of classification trees).

Algorithms 2 and 3 failed to compute on the given data, so we eliminated them on purely computational grounds. The code for algorithm 2 is given below, and a sketch of what the algorithm 3 call might look like is given just below this section; the reader can try them, but algorithm 2 failed to produce results after two hours on a MacBook Pro. This left us with little to choose from: the basic classification tree and the random forest. The random forest easily outperformed the basic classification tree, so we ended up picking the random forest.

When judging the performance of the two competing algorithms, we used training and validation samples obtained from a 60/40 split of the original training data. We also have an additional hold-out sample of 20 observations, but that is far too small to serve as a real validation sample, so we treated that last bit of data as a mere curiosity.
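
For reference, here is a minimal sketch of what the linear SVM attempt (algorithm 3) might look like. This is an illustration using caret's "svmLinear" method (backed by the kernlab package), not the exact code we ran, and it assumes the `training` data frame constructed later in this document; the chunk is not evaluated because the fit did not finish in a reasonable time on our data.
```{r, eval=FALSE}
# Illustrative sketch only, not the exact code we ran: a linear-kernel SVM
# fitted via caret's "svmLinear" method (kernlab). Assumes the `training`
# data frame created in the splitting section below.
library(kernlab)
svmFit <- train(classe ~ ., data = training, method = "svmLinear",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 5))
```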
## Load Packages
We first load the required packages.
```{r}
library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(randomForest)
```
## Read Data
We read in the data from the web.
```{r, cache=TRUE}
train.site <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.site <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Treat empty strings and spreadsheet division errors ("#DIV/0!") as NA.
train <- read.csv(url(train.site), na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv(url(test.site), na.strings = c("NA", "#DIV/0!", ""))
```
## Look at data
We then look at the basic structure of the data.
```{r}
dim(train)
dim(test)
#head(train)
str(train)
#head(test)
str(test)
```
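
Since "classe" is the outcome we are trying to predict, it is also worth a quick look at its distribution; this check is our addition.
```{r}
# Distribution of the outcome variable across its five levels.
table(train$classe)
```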
## Drop predictors that are NAs
We drop every predictor column that contains missing values (in this data set, such columns are almost entirely missing, so little information is lost).
```{r}
train <- train[, colSums(is.na(train)) == 0]
dim(train)
test <- test[, colSums(is.na(test)) == 0]
dim(test)
```
## Delete some variables that are not highly relevant (time stamps etc.)
We now drop variables that do not seem relevant for external prediction (the row index and the time-stamp and window variables in columns 3-7), but we keep "user_name", since it could be relevant to prediction for the test data we are given.
```{r}
# Drop the row index (column 1) and the time-stamp/window variables
# (columns 3-7); keep "user_name" (column 2).
train <- train[, -c(1, 3:7)]
test <- test[, -c(1, 3:7)]
```
## Look at how many NAs we have left
```{r}
colSums(is.na(train))
colSums(is.na(test))
```
This simple data cleaning has produced a complete data set with no missing values.
## Split train into training and validation samples
Since the hold-out sample of 20 observations is far too small, we split the training sample into a training sample and a validation/testing sample, using a 60/40 split.
```{r}
intrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
training <- train[intrain, ]
testing <- train[-intrain, ]
dim(training)
dim(testing)
```
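
createDataPartition samples within each level of the outcome, so the split should preserve the class proportions; the quick check below is our addition.
```{r}
# Stratified split: class proportions should be nearly identical in the
# full training data and in both subsamples.
round(rbind(full     = prop.table(table(train$classe)),
            training = prop.table(table(training$classe)),
            testing  = prop.table(table(testing$classe))), 3)
```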
## Try a basic classification tree model
Here we fit a basic classification tree and plot the resulting tree. We also look at the confusion matrix, which summarizes the performance of the model on the validation (testing) data. The model predicts correctly in about 49% of cases, with a 95% confidence interval of (0.4843, 0.5065); this is well above the roughly 20% accuracy of randomly guessing among the 5 levels of the "classe" outcome.
```{r,cache=TRUE}
regtreeFit <- train(classe ~ ., data = training,
                    preProcess = c("center", "scale"),
                    method = "rpart", metric = "Accuracy")
print(regtreeFit$finalModel)
fancyRpartPlot(regtreeFit$finalModel, cex = .5, under.cex = 1, shadow.offset = 0)
```
```{r}
regtreepredict <- predict(regtreeFit, testing)
confusionMatrix(testing$classe, regtreepredict)
```
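
Rather than reading the accuracy off the printed output, one can extract it and its 95% confidence interval directly from the confusionMatrix object; this snippet is our addition.
```{r}
# Store the confusion matrix and pull out the overall accuracy with its
# 95% confidence bounds.
regtreeCM <- confusionMatrix(testing$classe, regtreepredict)
regtreeCM$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
```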
## Next we try the multinomial logistic model with lasso penalization
```{r, eval=FALSE}
# Not evaluated: this fit did not finish in a reasonable amount of time.
logitModel <- train(as.factor(classe) ~ ., data = training, method = "glmnet",
                    tuneGrid = expand.grid(.alpha = 1, .lambda = 20),
                    family = "multinomial")
plot(logitModel)
coef(logitModel$finalModel, s = logitModel$bestTune$.lambda)
```
This failed to compute in a reasonable amount of time, so we abandoned it and had to force-kill the R process. We wish the glmnet package were more computationally efficient on problems of this size.
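
As an aside, a possibly faster route, untested by us and shown only as a sketch, would be to call glmnet directly on a model matrix with a single fixed lambda, bypassing caret's resampling loop entirely. The lambda value below is an arbitrary placeholder.
```{r, eval=FALSE}
# Untested sketch: fit the lasso-penalized multinomial model directly with
# glmnet, skipping caret's resampling. The lambda value is arbitrary.
library(glmnet)
x <- model.matrix(classe ~ ., data = training)[, -1]  # drop intercept column
y <- training$classe
logitFit <- glmnet(x, y, family = "multinomial", alpha = 1, lambda = 0.01)
```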
## We next try a random forest, an ensemble of classification trees
A random forest is an ensemble of classification trees, each fitted on a different bootstrap sample of the data with a random subset of predictors considered at each split, which allows it to combine many basic trees into a single, much more accurate predictor.
```{r, cache=TRUE}
set.seed(1)
forestFit <- train(classe ~ ., method = "rf", data = training,
                   trControl = trainControl(method = "cv", number = 5))
```
```{r,cache=TRUE}
print(forestFit)
forestpredict <- predict(forestFit, testing)
confusionMatrix(testing$classe, forestpredict)
```
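
It is also instructive to see which predictors the forest relies on most; this check is our addition, using caret's varImp function.
```{r}
# Relative variable importance as computed by caret for the fitted forest.
varImp(forestFit)
```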
We see that the random forest achieves about 99% out-of-sample prediction accuracy, with a 95% confidence interval of (0.9887, 0.993), which is extremely impressive performance. It also dominates the simple classification tree on many other measures, such as specificity and sensitivity. Thus, we have little choice but to pick the random forest as the winning algorithm.
## Predicting the test cases
Now we turn to the 20 test cases provided to us and build predictions for them.
```{r}
forestpredict.final<-predict(forestFit,test)
print(forestpredict.final)
```
Finally, we write out the answers for the automatic grading.
```{r}
# Write each prediction to its own text file, as required by the grader.
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(forestpredict.final)
```
## Conclusion
We tried to predict the outcome variable (a categorical variable called "classe") from 53 raw predictors, using popular machine learning algorithms.

We considered 4 algorithms:

1. a basic classification tree,
2. multinomial logistic regression with a lasso penalty,
3. a support vector machine with a linear kernel,
4. a random forest (an ensemble of classification trees).

We eliminated algorithms 2 and 3 on computational grounds. We chose algorithm 4 based on its far superior out-of-sample prediction performance, using the validation sample to make the final choice between the two remaining algorithms.