---
author: "Alvaro Bueno"
date: "11/11/2018"
title: "Assignment 11 - 620"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# uncomment on first run to install the required packages
#install.packages('tm')
#install.packages('klaR')
#install.packages('httr')
#install.packages('maxent')
library('tm')   # corpus handling and document-term matrices
library('klaR') # classification utilities (includes a Naive Bayes implementation)
library('httr') # HTTP requests for the commented-out online retrieval code
```

## Predictive classifier for Spam Email - Alvaro Bueno

### Preparing files

For this exercise I opted to use a different set of files (separated into `spam` and `ham` folders) to train and test the algorithm.

> Note: I'm reading local copies of the files in this presentation purely for speed; the code for online retrieval is included below but commented out.

```{r hamspam}
hamdir <- '/Users/alvbueno/Sites/dataScience/ham'
spamdir <- '/Users/alvbueno/Sites/dataScience/spam'

# online retrieval code (commented out; the local copies above are faster)
# base_url <- 'https://raw.githubusercontent.com/delagroove/dataScience/master/'
# req <- GET("https://api.github.com/repos/delagroove/dataScience/git/trees/master?recursive=1")
# stop_for_status(req)
# filelist <- unlist(lapply(content(req)$tree, "[", "path"), use.names = F)
# spamfiles <- grep("spam/", filelist, value = TRUE, fixed = TRUE)
# hamfiles <- grep("ham/", filelist, value = TRUE, fixed = TRUE)

hamfiles <- list.files(hamdir)
spamfiles <- list.files(spamdir)

# read each message into a single string; start from empty lists so the
# first element is not a stray NA
hamlist <- list()
spamlist <- list()
for(i in 1:length(hamfiles)){
  thelines <- readLines(paste(hamdir, hamfiles[i], sep="/"), encoding = 'UTF-8')
  hamlist <- c(hamlist, list(paste(thelines, collapse="\n")))
}
hamdataframe <- as.data.frame(unlist(hamlist), stringsAsFactors = FALSE)
hamdataframe$type <- "ham"
colnames(hamdataframe) <- c('data', 'type')

for(i in 1:length(spamfiles)){
  thelines <- readLines(paste(spamdir, spamfiles[i], sep="/"), encoding = 'UTF-8')
  spamlist <- c(spamlist, list(paste(thelines, collapse="\n")))
}
spamdataframe <- as.data.frame(unlist(spamlist), stringsAsFactors = FALSE)
spamdataframe$type <- "spam"
colnames(spamdataframe) <- c('data', 'type')

# stack ham and spam into one labeled data frame
thedataframe <- rbind(hamdataframe, spamdataframe)
```
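
As a quick sanity check, we can confirm that both classes loaded and see the class balance (a minimal sketch; the counts depend on the files in the two folders):

```{r sanity}
# number of messages per class in the combined data frame
table(thedataframe$type)
```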

## Process train and test corpus

### Remove punctuation, numbers and whitespace; removing stop words did not seem to affect the result much. After that, create a document-term matrix to feed the classifier.

```{r process}
# 75/25 train/test split with a fixed seed for reproducibility
sample_size <- floor(0.75 * nrow(thedataframe))
set.seed(1000)
train_ind <- sample(seq_len(nrow(thedataframe)), size = sample_size)
train_df <- thedataframe[train_ind, ]
test_df <- thedataframe[-train_ind, ]

spam_count <- subset(thedataframe, thedataframe$type == 'spam')
ham_count <- subset(thedataframe, thedataframe$type == 'ham')

# build corpora and strip punctuation, numbers and extra whitespace
train_corpus <- Corpus(VectorSource(train_df$data))
test_corpus <- Corpus(VectorSource(test_df$data))
train_corpus <- tm_map(train_corpus, removePunctuation)
train_corpus <- tm_map(train_corpus, removeNumbers)
train_corpus <- tm_map(train_corpus, stripWhitespace)
test_corpus <- tm_map(test_corpus, removePunctuation)
test_corpus <- tm_map(test_corpus, removeNumbers)
test_corpus <- tm_map(test_corpus, stripWhitespace)

# document-term matrices: one row per message, one column per term
test_term_matrix <- DocumentTermMatrix(test_corpus)
train_term_matrix <- DocumentTermMatrix(train_corpus)
```
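
To get a feel for the matrix the classifier will train on, we can check its dimensions and the most frequent terms (a minimal sketch using `tm` helpers; the threshold of 100 is arbitrary):

```{r dtmpeek}
# rows are messages, columns are distinct terms
dim(train_term_matrix)
# terms that appear at least 100 times across the training corpus
findFreqTerms(train_term_matrix, lowfreq = 100)
```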

### Once the document-term matrices are ready, we feed the training matrix to the algorithm. I'm using Maxent because it performs much better than Naive Bayes here and there is no need for a filtering function.

```{r classifier}
# train a maximum-entropy classifier on the labeled training matrix
# (requires the maxent package; its install line is commented in setup)
classifier <- maxent::maxent(train_term_matrix, factor(train_df$type))
test_predictions <- predict(classifier, feature_matrix = test_term_matrix)
```
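
Peeking at the first few rows shows the shape the summary function below relies on: the first column should hold the predicted label, and the remaining columns the per-class probabilities ordered by the factor levels (ham, then spam); this column order is an assumption the code below depends on.

```{r peek}
# first column: predicted label; columns 2 and 3: ham and spam probabilities
head(test_predictions)
```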

### Each row of test_predictions holds the predicted label plus a probability between 0 and 1 for ham and for spam, reflecting how confident the algorithm is that the text belongs to each class; we use that information to show the results in a table.

```{r showres}
show_results <- function (predictions, true_labels) {
  # column 1 of the maxent output is the predicted label; columns 2 and 3
  # are the ham and spam probabilities. the actual counts come from the true
  # labels of the test set, so the table compares predictions to reality.
  actual_ham <- sum(true_labels == 'ham')
  actual_spam <- sum(true_labels == 'spam')
  predicted_ham <- 0
  predicted_spam <- 0
  for(i in 1:nrow(predictions)) {
    if (as.numeric(predictions[i, 2]) > 0.5){
      predicted_ham <- predicted_ham + 1
    }
    if (as.numeric(predictions[i, 3]) > 0.5){
      predicted_spam <- predicted_spam + 1
    }
  }
  results <- data.frame(actual_ham, actual_spam, predicted_ham, predicted_spam)
  results
}
arr <- show_results(test_predictions, test_df$type)
colnames(arr) <- c('Actual Ham', 'Actual Spam', 'Predicted Ham', 'Predicted Spam')
arr
```

### The algorithm appears quite precise and the predicted counts are close to the actual ones. Sometimes the algorithm cannot decide between spam and ham and assigns each class a probability of exactly 0.5; since the conditions above require a value strictly greater than 0.5, such ties are counted as neither class.
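
For a single summary number, we can also compare the predicted labels directly against the true ones (a minimal check; it assumes the rows of `test_predictions` line up with `test_df`, which they do since the test corpus was built from `test_df$data` in order):

```{r accuracy}
# fraction of test messages whose predicted label matches the true label
mean(test_predictions[, 1] == test_df$type)
```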