-
Notifications
You must be signed in to change notification settings - Fork 0
/
Prediction Analysis Writeup.html
90 lines (81 loc) · 6.37 KB
/
Prediction Analysis Writeup.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
<!DOCTYPE html>
<!-- saved from url=(0070)file:///C:/Users/qxshower/Desktop/Prediction%20Analysis%20Writeup.html -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body>
<h1>Prediction Analysis Process Writeup</h1>
<h2>1. Load package and Set seed</h2>
<p><span style="color: #0000ff;">library(caret)</span></p>
<p><span style="color: #0000ff;">set.seed(12345)</span></p>
<h2>2. Read in Data </h2>
<p>NOTE: treat blank and invalid data (e.g., #DIV/0!) as <strong>na</strong></p>
<p><span style="color: #0000ff;">url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"</span></p>
<p><span style="color: #0000ff;">url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"</span></p>
<p><span style="color: #0000ff;">data_train <- read.csv(url_train, na.strings = c("", "NA", "#DIV/0!"))</span></p>
<p><span style="color: #0000ff;"> data_test <- read.csv(url_test, na.strings = c("", "NA", "#DIV/0!"))</span></p>
<h2>3. Clean Data</h2>
<h3><span style="color: #000000;">1) delete columns that contains NA data</span></h3>
<p><span style="color: #0000ff;">training<-data_train[,colSums(is.na(data_train)) == 0]</span></p>
<p><span style="color: #0000ff;">testing<-data_test[,colSums(is.na(data_test)) == 0]</span></p>
<h3><strong>2) delete columns (1 to 7) that are not related to the prediction</strong></h3>
<p><span style="color: #0000ff;">colnames(training)</span></p>
<table border="2" bordercolor="grey">
<tbody>
<tr>
<td>[1] "X" "user_name" "raw_timestamp_part_1"<br> [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window" <br> [7] "num_window" "roll_belt" "pitch_belt" <br>[10] "yaw_belt" "total_accel_belt" "gyros_belt_x" <br>[13] "gyros_belt_y" "gyros_belt_z" "accel_belt_x" <br>[16] "accel_belt_y" "accel_belt_z" "magnet_belt_x" <br>[19] "magnet_belt_y" "magnet_belt_z" "roll_arm" <br>[22] "pitch_arm" "yaw_arm" "total_accel_arm" <br>[25] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z" <br>[28] "accel_arm_x" "accel_arm_y" "accel_arm_z" <br>[31] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z" <br>[34] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell" <br>[37] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y" <br>[40] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y" <br>[43] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y" <br>[46] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm" <br>[49] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x" <br>[52] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x" <br>[55] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x" <br>[58] "magnet_forearm_y" "magnet_forearm_z" "classe"</td>
</tr>
</tbody>
</table>
<p><span style="color: #0000ff;">training<-training[,-c(1:7)]</span></p>
<p><span style="color: #0000ff;">testing <-testing[,-c(1:7)]</span></p>
<h2><span style="color: #0000ff;"><span style="color: #000000;">4. Split Data into Train and Validation Sets</span></span></h2>
<p><span style="color: #0000ff;">inTrain <- createDataPartition(y=training$classe, p=0.75, list=FALSE)</span></p>
<p><span style="color: #0000ff;">InTraining <- training[inTrain, ]</span></p>
<p><span style="color: #0000ff;">InTesting <- training[-inTrain, ]</span></p>
<h2>5. Train Different Models </h2>
<h3>1) Decision Tree</h3>
<p><span style="color: #0000ff;">model_dt <- train(classe ~ ., data=InTraining, method="rpart")</span></p>
<p><span style="color: #0000ff;">predict_dt <- predict(model_dt, InTesting)</span></p>
<p><span style="color: #0000ff;">confusionMatrix(as.factor(InTesting$classe), predict_dt)</span></p>
<table border="2" bordercolor="grey">
<tbody>
<tr>
<td>
<p>Confusion Matrix and Statistics</p>
<p>Reference<br>Prediction A B C D E<br> A 1252 30 90 0 23<br> B 396 317 236 0 0<br> C 434 24 397 0 0<br> D 343 151 310 0 0<br> E 114 132 229 0 426</p>
<p>Overall Statistics<br> <br> Accuracy : 0.4878 <br> 95% CI : (0.4737, 0.5019)<br> No Information Rate : 0.5177 <br> P-Value [Acc > NIR] : 1 <br> <br> Kappa : 0.3306 <br> <br> Mcnemar's Test P-Value : NA</p>
<p>Statistics by Class:</p>
<p>Class: A Class: B Class: C Class: D Class: E<br>Sensitivity 0.4931 0.48471 0.31458 NA 0.94878<br>Specificity 0.9395 0.85129 0.87424 0.8361 0.89338<br>Pos Pred Value 0.8975 0.33404 0.46433 NA 0.47281<br>Neg Pred Value 0.6332 0.91479 0.78637 NA 0.99425<br>Prevalence 0.5177 0.13336 0.25734 0.0000 0.09156<br>Detection Rate 0.2553 0.06464 0.08095 0.0000 0.08687<br>Detection Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373<br>Balanced Accuracy 0.7163 0.66800 0.59441 NA 0.92108</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
<h3>2) Random Forest</h3>
<p><span style="color: #0000ff;">model_rf <- train(classe ~ ., data=InTraining, method="rf")</span></p>
<p><span style="color: #0000ff;">predict_rf <- predict(model_rf, InTesting)</span></p>
<p><span style="color: #0000ff;">confusionMatrix(as.factor(InTesting$classe), predict_rf)</span></p>
<table border="2" bordercolor="grey">
<tbody>
<tr>
<td>
<p>Confusion Matrix and Statistics</p>
<p>Reference<br>Prediction A B C D E<br> A 1395 0 0 0 0<br> B 1 947 1 0 0<br> C 0 6 849 0 0<br> D 0 0 14 785 5<br> E 0 0 0 0 901</p>
<p>Overall Statistics<br> <br> Accuracy : 0.9945 <br> 95% CI : (0.992, 0.9964)<br> No Information Rate : 0.2847 <br> P-Value [Acc > NIR] : < 2.2e-16 <br> <br> Kappa : 0.993 <br> <br> Mcnemar's Test P-Value : NA</p>
<p>Statistics by Class:</p>
<p>Class: A Class: B Class: C Class: D Class: E<br>Sensitivity 0.9993 0.9937 0.9826 1.0000 0.9945<br>Specificity 1.0000 0.9995 0.9985 0.9954 1.0000<br>Pos Pred Value 1.0000 0.9979 0.9930 0.9764 1.0000<br>Neg Pred Value 0.9997 0.9985 0.9963 1.0000 0.9988<br>Prevalence 0.2847 0.1943 0.1762 0.1601 0.1847<br>Detection Rate 0.2845 0.1931 0.1731 0.1601 0.1837<br>Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837<br>Balanced Accuracy 0.9996 0.9966 0.9906 0.9977 0.9972</p>
</td>
</tr>
</tbody>
</table>
<p><strong>Validation results: though Random Forest took a much longer time to train, it results in a much higher accuracy of perdiction, so I will use this model to predict the testing data set.</strong></p>
<h2><strong>6. Prediction on testing data</strong></h2>
<p><span style="color: #0000ff;"> predict(model_rf, testing)</span></p>
<table border="2" bordercolor="grey">
<tbody>
<tr>
<td style="width: 311px;">[1] B A B A A E D B A A B C B A E E A B B B<br>Levels: A B C D E</td>
</tr>
</tbody>
</table>
<p><br><br></p>
<p> </p>
</body></html>