---
title: "Classification"
author: "Nirzaree"
date: "09/07/2020"
output:
  html_document:
    fig_caption: true
    toc: true
    toc_float: true
    toc_collapsed: true
    toc_depth: 3
---
## Goal: Understand classification concepts with examples
## Theory
* What is classification?
  A technique for predicting a categorical output variable from input features.
* What types of classification problems exist?
    + Binary
    + Multiclass
* (A few) Classification algorithms
    + Decision Trees
    + Linear Discriminant Analysis (LDA)
    + Random Forest
    + Support Vector Machine (SVM)
    + k-Nearest Neighbors (kNN)
    + Naive Bayes
* Assumptions of classification
  None in general, though individual algorithms bring their own (e.g. Naive Bayes assumes conditionally independent features).
* Preprocessing for classification
* Feature Engineering for classification
* Tuning a classification model
* Metrics for evaluating a classification model (see the sketch after this list)
    + Binary classifier:
        + Confusion matrix
        + AUC-ROC curve
        + Precision, Recall
        + Sensitivity, Specificity
    + Multiclass classifier:
        + Per-class (one-vs-rest) versions of the binary metrics, typically macro-averaged, plus overall accuracy
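As a concrete reference for the binary metrics above, here is a minimal sketch in base R. The counts are made up purely for illustration; none of the names below come from the datasets used later.

```{r binaryMetricsSketch}
# Hypothetical 2x2 confusion matrix; counts are made up for illustration.
# Rows = predicted class, columns = actual class.
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("pos","neg"),
                             actual = c("pos","neg")))
TP <- cm["pos","pos"]; FP <- cm["pos","neg"]
FN <- cm["neg","pos"]; TN <- cm["neg","neg"]
precision   <- TP/(TP + FP) #of all predicted positives, how many are correct
recall      <- TP/(TP + FN) #a.k.a. sensitivity: of all actual positives, how many are found
specificity <- TN/(TN + FP) #of all actual negatives, how many are found
accuracy    <- (TP + TN)/sum(cm)
```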
We now look at three datasets: Iris, Titanic, and Sonar.
## Iris
```
Dataset description: Edgar Anderson's Iris Data
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
Task: Classification of iris species from the sepal & petal length and width features.
```
```{r setup,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(data.table)
library(caret)
source(paste0(Sys.getenv("Projects"),"/MLWithR/fMultipleClassificationModels.R"))
```
### 1. Load Data
```{r loaddata1,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#iris is already loaded
data(iris)
```
```{r basicSummary,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
summary(iris)
```
### 2. Feature Exploration
A few different ways to plot the features:
i) pair plot
ii) box plot
iii) density plot
```{r basicPlots,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
featurePlot(iris[,1:4],iris[,5],"pairs")
featurePlot(iris[,1:4],iris[,5],"box")
# featurePlot(iris[,1:4],iris[,5],"ellipse")
featurePlot(iris[,1:4],iris[,5],"density")
```
**Observation**: Petal width & length separate the species well. Sepal length is mediocre and sepal width is a poor discriminator.
### 3. Build model: a) Train-test split
```{r Classifiermodel_splitData,include=FALSE,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#split data into train & test (createDataPartition stratifies on the class,
#preserving class proportions in both splits)
set.seed(10) #for reproducibility
trainingIndex <- createDataPartition(iris[,5],p = 0.8,list = FALSE)
dtTrain <- iris[trainingIndex,]
dtTest <- iris[-trainingIndex,]
```
### 3. Build model: b) Training config:
method = 10-fold cross-validation (todo: how many repeats?)
metric = accuracy
```{r Classifiermodel_config,include=FALSE,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
control <- trainControl(method = "cv",
number = 10)
metric <- "Accuracy"
```
### 3. Build model: c) Build multiple models
1. Linear Classifier
2. LDA
3. CART (Decision tree)
4. kNN
5. SVM
6. RF
7. Naive Bayes
We fit these models with the caret package's train function, via a helper that loops over them (a sketch of the helper follows the chunk below).
```{r Classifiermodel_tryModels,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
AllModels <- fMultipleClassificationModels(TrainingData = dtTrain[,1:4],
TrainingLabels = dtTrain[,5],
TrainingControl = control,
EvalMetric = metric,
TrainingModels = c("vglmAdjCat","lda","rpart","knn","svmRadial","rf","nb")
)
```
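fMultipleClassificationModels is sourced from a local helper file, so its exact contents are not shown here. A minimal sketch of what such a helper plausibly looks like (an assumption, not the actual sourced code) is a loop of caret::train over the requested methods, resetting the seed before each fit so that all models see the same folds:

```{r fMultipleModelsSketch, eval = FALSE}
#sketch only: hypothetical re-implementation of the sourced helper
fMultipleClassificationModels <- function(TrainingData, TrainingLabels,
                                          TrainingControl, EvalMetric,
                                          TrainingModels) {
  lapply(TrainingModels, function(lModel) {
    set.seed(7) #same seed => same CV folds for every model, as resamples() expects
    train(x = TrainingData, y = TrainingLabels,
          method = lModel, metric = EvalMetric,
          trControl = TrainingControl)
  })
}
```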
### 4. Identify the best model
```{r CheckModelAccuracy,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
ClassificationResults <- resamples(list(LinClassifier = AllModels[[1]],
LDA = AllModels[[2]],
CART = AllModels[[3]],
knn = AllModels[[4]],
SVM = AllModels[[5]],
RF = AllModels[[6]],
NB = AllModels[[7]]))
summary(ClassificationResults)
dotplot(ClassificationResults)
```
**Observation:** Linear classifier has the highest accuracy overall.
### 5. Investigate the best model
```{r InvestigateTheWinningModel,include=FALSE,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
fit.linClassifier = AllModels[[1]]
print(fit.linClassifier)
print(fit.linClassifier$finalModel)
varImp(fit.linClassifier)
```
**Observation:** Variable importance in the model matches our intuition from the feature plots.
### 6. Predict on test data
```{r predictOnValidationSet,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
dtTest <- data.table(dtTest)
dtTest[,predictedClass := predict(fit.linClassifier,newdata = dtTest)]
confusionMatrix(dtTest[,predictedClass],dtTest[,Species])
```
**Observation:**
* 1 virginica sample is classified as versicolor by the model, which reduces sensitivity for the virginica class and slightly reduces specificity for the versicolor class.
* Balanced accuracy per class: Setosa (1), Versicolor (0.975), Virginica (0.95)
* Overall accuracy of the model: 29/30 ≈ 0.967
**Summary:**

* This is a beautiful dataset with respect to:
    1. Completely balanced classes
    2. All features having the same range and units

    which eliminates a lot of preprocessing.
* We tried several classification models after splitting the dataset 80:20 into train and test, with 10-fold cross-validation on the training set.
* The linear classifier had the highest accuracy among all models, so we used it to make predictions on the held-out 20% of the data.
* Prediction accuracy on the held-out 20% is ~97% (only 1 sample misclassified).
* Variable importance reflects what the feature plots suggested: petal width & length are the most important features, followed by sepal length, then sepal width. We looked at variable importance for the linear classifier model.
We now look at another dataset: Titanic.
## Titanic
```
Dataset description: Survival of passengers on the Titanic
This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.
Format
A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:
No  Name      Levels
1   Class     1st, 2nd, 3rd, Crew
2   Sex       Male, Female
3   Age       Child, Adult
4   Survived  No, Yes
Task: Binary classification
```
```{r loadData2,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(rattle)
library(rpart)
library(rpart.plot)
# library(qwraps2)
# options(qwraps2_markup = "markdown")
mosaicplot(Titanic)
```
**High level summary of the dataset**
1. Sex: Male, Female
2. Class: 1st, 2nd, 3rd, Crew
3. Age: Adult, Child

Survival patterns by these factors, from the mosaic plot:

1. Higher survival rates:
    + 1st class, female, any age (nearly all survived)
    + 2nd class, female, adult (majority survived)
    + 2nd class, female, child (all survived)
2. Lower survival rates:
    + 3rd class, male, any age
    + 2nd class, male, adult
```{r preprocessing,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 8, fig.height = 4}
temp <- data.table(Titanic) #flattens the 4-d table into Class/Sex/Age/Survived + a frequency column N
dtTitanic <- temp[rep(seq_len(nrow(temp)),temp$N),] #expand to one row per passenger; 0-frequency rows drop out
dtTitanic[,N := NULL]
#convert the categorical variables (character after the conversion above) into factors
dtTitanic[,Class := as.factor(Class)]
dtTitanic[,Sex := as.factor(Sex)]
dtTitanic[,Survived := as.factor(Survived)]
dtTitanic[,Age := as.factor(Age)]
summary(dtTitanic)
```
### 3. Build model: a) Train-test split
```{r Classifiermodel_splitData2,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#split data into train & test
set.seed(10) #for reproducibility
trainingIndex <- sample(seq_len(nrow(dtTitanic)),0.8*nrow(dtTitanic))
dtTitanicTrain <- dtTitanic[trainingIndex,]
dtTitanicTest <- dtTitanic[-trainingIndex,]
```
### 3. Build model: b) Training config:
method = 10-fold cross-validation (todo: how many repeats?)
metric = accuracy
```{r Classifiermodel_config2,include=FALSE,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
control <- trainControl(method = "cv",
                        number = 10)
metric <- "Accuracy" #caret expects "Accuracy" (case-sensitive)
```
### 3. Build model: c) Decision tree
Distance-based models (e.g. kNN) are out, since all predictors are categorical.
We try a decision tree, tuning it over 2 cp values, 0.005 and 0.01 (the rpart default), and inspect the tree logic.
```{r TitanicModel1,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# lFormula <- paste0(names(dtTitanic)[4],"~",paste(names(dtTitanic)[1:3],collapse = "+"))
Titanic.cart <- train(Survived~Class+Sex+Age,
data = dtTitanicTrain,
method = "rpart",
trControl = control,
tuneGrid = expand.grid(cp = c(0.005,0.01)),
metric = metric)
fancyRpartPlot(Titanic.cart$finalModel)
print(Titanic.cart)
```
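To see how cp trades tree size against cross-validated error, the cp table of the underlying rpart fit can be inspected directly (a quick sketch using rpart's own helpers):

```{r TitanicCpSketch, eval = FALSE}
#cp table of the final rpart model: number of splits vs cross-validated error
printcp(Titanic.cart$finalModel)
plotcp(Titanic.cart$finalModel) #visual aid for picking cp
```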
**Observation:**
The tree logic looks sensible.
With only 3 coarse categorical predictors, this is close to the maximum information that can be extracted from the dataset.
Cross-validated accuracy of the tree on the training data with cp = 0.005: ~78%
### 4. Predict on test data
```{r Titanicpredict1,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
dtTitanicTest[,"predictedClass"] <- predict(Titanic.cart,
dtTitanicTest)
confusionMatrix(dtTitanicTest[,predictedClass],dtTitanicTest[,Survived])
```
**Observations:**
Accuracy of the model is decent at ~82%, but it classifies the survived class quite poorly. The accuracy number stays high because of class imbalance (~2/3 of passengers did not survive). Balanced accuracy is 72%.
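The imbalance is easy to verify directly from the class distribution:

```{r TitanicClassBalance, eval = FALSE}
#share of each Survived level in the full dataset; roughly 2/3 "No"
prop.table(table(dtTitanic$Survived))
```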
## Sonar Dataset
### 1. Load Data
```{r loadSonar,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(mlbench)
data("Sonar")
dtSonar <- data.table(Sonar)
rm(Sonar)
```
```
Dataset description: This is the data set used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network [1].
The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.
The label associated with each record contains the letter "R" if the object is a rock and "M" if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.
```
* References:
    1. https://www.kaggle.com/edhenrivi/introduction-classification-sonar-dataset (polynomial features)
    2. https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/ (Keras binary classification: vanilla -> scaling of features -> tuning of the network)
    3. https://www.simonwenkel.com/2018/08/23/revisiting_ml_sonar_mines_vs_rocks.html
    4. http://people.duke.edu/~ccc14/sta-663-2016/05_Machine_Learning.html#Resources (no NN; PCA + standard classification approaches)
### 2. Feature Exploration
```{r featureanalysisSonar,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# boxplot(dtSonar[,1:60],
# xlab = "Frequency bin",
# ylab = "Power spectral density (normalized)",
# main = "Boxplots of all frequency bins")
#wide to long for colored boxplots
#ref: https://stackoverflow.com/questions/14604439/plot-multiple-boxplot-in-one-graph
dtSonarTall <- melt(dtSonar,id.vars = c("Class"))
ggplot(data = dtSonarTall,aes(x = variable, y = value)) + geom_boxplot(aes(fill = Class))
summary(dtSonar$Class)
```
### 3. Preprocessing
```{r preprocessingSonar,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
# dtSonar[, Rock := ifelse(Class == "R",1,0)]
# dtSonar[, Mine := ifelse(Class == "M",1,0)]
#encode the class as a 0/1 factor (mine = 1, rock = 0): keras later needs the numeric form, caret the factor form
dtSonar[,ClassNumeric := ifelse(Class == "M",1,0)]
dtSonar[,ClassNumeric := as.factor(ClassNumeric)]
```
### 3. Build model: a) Train-test split
```{r SonarSplitData,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#split data into test train
set.seed(10)
trainingIndex <- sample(seq_len(nrow(dtSonar)),0.8*nrow(dtSonar))
dtSonarTrain <- as.matrix(dtSonar[trainingIndex,1:60])
dtSonarTest <- as.matrix(dtSonar[-trainingIndex,1:60])
SonarTrainLabels <- dtSonar[trainingIndex,ClassNumeric]
SonarTestLabels <- dtSonar[-trainingIndex,ClassNumeric]
SonarTrainLabelsNN <- as.numeric(as.matrix(SonarTrainLabels))
SonarTestLabelsNN <- as.numeric(as.matrix(SonarTestLabels))
```
### 3. Build model: b) Training config:
method = 10-fold cross-validation (todo: how many repeats?)
metric = accuracy
```{r SonarClassifiermodel_config,include=FALSE,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
control <- trainControl(method = "cv",
number = 10)
metric <- "accuracy"
```
### 3. Build model: c) Build multiple models
1. Linear Classifier
2. LDA
3. CART (Decision tree)
4. kNN
5. SVM (Linear)
6. SVM (Radial)
7. RF
8. Naive Bayes
We fit these models with the caret package's train function, via the same helper as before.
```{r SonarClassifiermodel_tryModels,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
AllModelsSonar <- fMultipleClassificationModels(TrainingData = dtSonarTrain,
TrainingLabels = SonarTrainLabels,
TrainingControl = control,
EvalMetric = "accuracy",
TrainingModels = c("vglmAdjCat","lda","rpart","knn","svmLinear","svmRadial","rf","nb"))
```
### 4. Identify the best model
```{r SonarCheckModelAccuracy,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
ClassificationResultsSonar <- resamples(list(LinClassifier = AllModelsSonar[[1]],
LDA = AllModelsSonar[[2]],
CART = AllModelsSonar[[3]],
knn = AllModelsSonar[[4]],
SVMLinear = AllModelsSonar[[5]],
SVMRadial = AllModelsSonar[[6]],
RF = AllModelsSonar[[7]],
NB = AllModelsSonar[[8]]))
summary(ClassificationResultsSonar)
dotplot(ClassificationResultsSonar)
```
**Observation:** SVMRadial & Random Forest do better than the other models, with median accuracies of ~85% & ~79% respectively. We now look at the test data accuracy of these 2 models.
### 5. Predict on test data
```{r Sonarpredict1,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
SonarSVMRadialPrediction <- predict(AllModelsSonar[[6]],
dtSonarTest)
confusionMatrix(SonarSVMRadialPrediction,SonarTestLabels)
SonarRFPrediction <- predict(AllModelsSonar[[7]],
dtSonarTest)
confusionMatrix(SonarRFPrediction,SonarTestLabels)
```
**Observations:** Test data accuracy of both models is ~81%, which is lower than the (cross-validated) training accuracy for SVMRadial but about the same as for Random Forest. Hence Random Forest looks like the better model to implement.
We come back to tuning these models after a first run of a neural network.
### 6. Deep learning on the dataset
```{r SonarNN,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(keras)
#single hidden layer of 60 relu units; sigmoid output for binary classification
SonarNN <- keras_model_sequential() %>%
  layer_dense(units = 60,activation = 'relu',input_shape = 60) %>%
  layer_dense(units = 1,activation = 'sigmoid')
SonarNN %>% compile(optimizer = 'adam',
loss = 'binary_crossentropy',
metrics = c('accuracy'))
history <- SonarNN %>% fit(
dtSonarTrain,
SonarTrainLabelsNN,
epochs = 100,
batch_size = 5,
validation_data = list(dtSonarTest,
SonarTestLabelsNN)
)
plot(history)
```
**Observation:** After epoch 30 the model starts overfitting the training data, with no corresponding gain in test accuracy. Validation accuracy around epoch 30 is ~86% (todo: is this the median/max/min?), which might be better than the models so far if it is a median/min figure.
### 7. Scaling the data
```{r SonarScaling, echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#center & scale each frequency bin to mean 0, sd 1
dtSonarScaled <- scale(dtSonar[,1:60])
# colMeans(dtSonarScaled) #sanity check: ~0
# apply(dtSonarScaled, 2, sd) #sanity check: ~1
```
### 8. PCA on scaled data
```{r PCASonar,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
dtSonarPCA <- prcomp(dtSonarScaled[,1:60])
summary(dtSonarPCA)
# % of total variance explained with 25 components
(cumsum(dtSonarPCA$sdev^2)/sum(dtSonarPCA$sdev^2))[25]
#split pca data into train and test
```
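Note that PCA here is fit on the full (scaled) dataset before the train-test split, so the rotation has seen the test rows. A cleaner variant (a sketch under the same seed-based split used below) fits prcomp on the training rows only and projects the held-out rows with predict():

```{r SonarPCANoLeakSketch, eval = FALSE}
#sketch: fit the rotation on training rows only, so no test information
#leaks into the components
set.seed(10)
trainIdx <- sample(seq_len(nrow(dtSonarScaled)),0.8*nrow(dtSonarScaled))
pcaTrainOnly <- prcomp(dtSonarScaled[trainIdx,])
trainPCA <- pcaTrainOnly$x[,1:25]
testPCA <- predict(pcaTrainOnly, newdata = dtSonarScaled[-trainIdx,])[,1:25]
```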
### 9. Build model: a) Train-test split
```{r SonarPCASplitData,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#split data into train & test (same seed as before, so the same rows are held out)
dtSonarselectedPCA <- dtSonarPCA$x[,1:25]
set.seed(10)
trainingIndex <- sample(seq_len(nrow(dtSonarselectedPCA)),0.8*nrow(dtSonarselectedPCA))
dtSonarPCATrain <- as.matrix(dtSonarselectedPCA[trainingIndex,])
dtSonarPCATest <- as.matrix(dtSonarselectedPCA[-trainingIndex,])
SonarPCATrainLabels <- dtSonar[trainingIndex,ClassNumeric]
SonarPCATestLabels <- dtSonar[-trainingIndex,ClassNumeric]
SonarPCATrainLabelsNN <- as.numeric(as.matrix(SonarPCATrainLabels))
SonarPCATestLabelsNN <- as.numeric(as.matrix(SonarPCATestLabels))
```
### 10. Try non-NN models again after scaling + PCA
```{r SonarClassifiermodel_tryModels_afterScalingAndPCA,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
AllModelsSonarAfterPCA <- fMultipleClassificationModels(TrainingData = dtSonarPCATrain,
TrainingLabels = SonarPCATrainLabels,
TrainingControl = control,
EvalMetric = "accuracy",
TrainingModels = c("vglmAdjCat","lda","rpart","knn","svmLinear","svmRadial","rf","nb"))
```
```{r SonarCheckModelAccuracy2,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
ClassificationResultsSonar2 <- resamples(list(LinClassifier = AllModelsSonarAfterPCA[[1]],
LDA = AllModelsSonarAfterPCA[[2]],
CART = AllModelsSonarAfterPCA[[3]],
knn = AllModelsSonarAfterPCA[[4]],
SVMLinear = AllModelsSonarAfterPCA[[5]],
SVMRadial = AllModelsSonarAfterPCA[[6]],
RF = AllModelsSonarAfterPCA[[7]],
NB = AllModelsSonarAfterPCA[[8]]))
summary(ClassificationResultsSonar2)
dotplot(ClassificationResultsSonar2)
```
**Observation:** knn is the most promising model now, followed by the linear classifier and then the CART decision tree.
### 11. Predict with the new models on test data
```{r Sonarpredict2,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
SonarknnPrediction <- predict(AllModelsSonarAfterPCA[[4]],
                              dtSonarPCATest)
confusionMatrix(SonarknnPrediction,SonarPCATestLabels)
SonarLinClassifierPrediction <- predict(AllModelsSonarAfterPCA[[1]],
                                        dtSonarPCATest)
confusionMatrix(SonarLinClassifierPrediction,SonarPCATestLabels)
SonarCARTPrediction <- predict(AllModelsSonarAfterPCA[[3]],
                               dtSonarPCATest)
confusionMatrix(SonarCARTPrediction, SonarPCATestLabels)
```
**Observations:** knn accuracy on test data is ~90%.
Now we rerun the NN on the scaled data.
### 12. Build Deep Learning model on Scaled Data: a) Train-test split
```{r SonarScaledSplitData,include=FALSE,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
#split data into test train
set.seed(10)
trainingIndex <- sample(seq_len(nrow(dtSonarScaled)),0.8*nrow(dtSonarScaled))
dtSonarScaledTrain <- as.matrix(dtSonarScaled[trainingIndex,])
dtSonarScaledTest <- as.matrix(dtSonarScaled[-trainingIndex,])
SonarScaledTrainLabels <- dtSonar[trainingIndex,ClassNumeric]
SonarScaledTestLabels <- dtSonar[-trainingIndex,ClassNumeric]
SonarScaledTrainLabelsNN <- as.numeric(as.matrix(SonarScaledTrainLabels))
SonarScaledTestLabelsNN <- as.numeric(as.matrix(SonarScaledTestLabels))
```
### 13. Deep learning on the scaled dataset
```{r SonarNN2,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(keras)
#same architecture as before, now trained on the scaled inputs
SonarNN2 <- keras_model_sequential() %>%
  layer_dense(units = 60, activation = 'relu', input_shape = 60) %>%
  layer_dense(units = 1,activation = 'sigmoid')
SonarNN2 %>% compile(optimizer = 'adam',
loss = 'binary_crossentropy',
metrics = c('accuracy'))
history2 <- SonarNN2 %>% fit(
dtSonarScaledTrain,
SonarScaledTrainLabelsNN,
epochs = 100,
batch_size = 5,
validation_data = list(dtSonarScaledTest,
SonarScaledTestLabelsNN)
)
plot(history2)
```
**Observation:** After epoch 9, the accuracy on the validation set is roughly constant. Max validation accuracy, ~90%, is reached at epoch 7.
This is similar to the knn accuracy obtained on the scaled + PCA data. However, the NN can be tuned further for more accuracy.
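One plausible next step for tuning (a sketch only, not run or tuned here): add dropout for regularization and stop training early once validation loss stops improving, since the curves above flatten out quickly.

```{r SonarNNTuningSketch, eval = FALSE}
library(keras)
#sketch: same architecture plus dropout, with early stopping on validation loss
SonarNN3 <- keras_model_sequential() %>%
  layer_dense(units = 60,activation = 'relu',input_shape = 60) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1,activation = 'sigmoid')
SonarNN3 %>% compile(optimizer = 'adam',
                     loss = 'binary_crossentropy',
                     metrics = c('accuracy'))
history3 <- SonarNN3 %>% fit(
  dtSonarScaledTrain,
  SonarScaledTrainLabelsNN,
  epochs = 100,
  batch_size = 5,
  validation_data = list(dtSonarScaledTest,
                         SonarScaledTestLabelsNN),
  callbacks = list(callback_early_stopping(monitor = "val_loss",
                                           patience = 10))
)
```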
## ToDo / ToAnswer
**Iris**
1. understand kappa
2. CI
3. What other metrics, apart from accuracy, can be used here?
**Titanic**
4. Which classification models are distance-based and won't work with data with all categorical variables?
5. rpart vs caret-rpart
6. resamples ran on Titanic without setting the seed for each model. Why? [Titanic]
7. How to broadly estimate how much accuracy a dataset can achieve across models? If the features are just bad, there is no point tuning models.
8. ctree vs rpart
9. Find the max accuracy achievable on the base Titanic dataset (only 3 variables: sex, class, age).
10. Work on the detailed Titanic dataset and see how the predictions fare.
**Sonar**
11. Many details within each step still need to be understood properly.
12. How to even choose an approach here? NN or non-NN? Preprocess the data or not? PCA or not?
13. How to validate the final answer?
14. Any misleading results?
## notes to self
1. Beautiful function that works on tables: mosaicplot. It gave all the info of the Titanic dataset in 1 plot.
2. Not the best summary table, but still: the qwraps2 library's summary_table (qwraps2_markup = "markdown" needs to be configured).
3. Important function: resamples: collation and visualization of resampling results.
resamples checks that the resampling results match; that is, that the indices in trainObject$control$index are the same. Also, the trainControl argument returnResamp should have the value "final" for each model.
The summary function computes summary statistics across each model/metric combination. [https://www.rdocumentation.org/packages/caret/versions/6.0-86/topics/resamples](https://www.rdocumentation.org/packages/caret/versions/6.0-86/topics/resamples)
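A minimal way to guarantee the matching indices that resamples needs (a sketch, using iris as a stand-in): create the CV folds once and pass them to every model's trainControl via the index argument.

```{r sharedFoldsSketch, eval = FALSE}
#sketch: create the CV folds once and share them across all models,
#so that resamples() compares like with like
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)
sharedControl <- trainControl(method = "cv",
                              index = folds,
                              returnResamp = "final")
```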