---
title: "Untitled"
author: "Annie Pogosyan"
date: "8/24/2020"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyverse)
library(knitr)
library(tree)
library(maptree)
```
__Background__ The presidential election in 2012 did not come as a surprise. Some, including Nate Silver, predicted the outcome of the election correctly, and many speculated about his approach.
Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even current state-of-the-art technology can surprise us.
__Question 1__ What makes voter behavior (and thus election forecasting) a hard problem?
Voter behavior prediction is a difficult problem for various reasons. First, data is generally collected through polls, which have both sampling variability and signal variability. The article mentions that pollsters may estimate sampling variability, but correcting for the rest of the variation can be difficult. Sometimes people do not respond accurately to polls, perhaps for fear of being judged, and polls can contain selection bias. Further, the article mentions "knowable unknowns": there is random variation that is very difficult to account for when forecasting election results. We have predictors such as campaign ads whose effects can go unaccounted for; these random terms are referred to as shocks. Overall, many factors affect voter behavior, which makes election forecasting difficult.
__Question 2__ What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?
Several unique factors made Nate Silver's approach to predicting the 2012 presidential election outcome accurate, such as hierarchical modeling, time series, and graph theory. Silver's approach allowed for the inclusion of flexibility and variability from prior election cycles (such as polling accuracy in previous elections) in his current model. He accounted for many possible variations of outcomes and voter decisions that could easily change. Silver combined prior approaches (maximum probability based on the likelihood from every voter) with Bayes' Theorem and factored in pollster bias at all levels (national and local) to help estimate any potential bias. He also used time as a factor, since decisions may change and vary over time, adding uncertainty about each voter's actual final decision; he therefore ran daily simulations to estimate the day-to-day change in voters' opinions. All of these together helped him build a model that yielded accurate results.
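To make the simulation idea concrete, below is a minimal sketch (not Silver's actual model) of how repeatedly simulating polling error can turn a hypothetical poll margin into a win probability; the poll share and error size are made-up numbers.
```{r}
# Illustrative sketch only: simulate total polling error around a hypothetical
# poll share and report the fraction of simulations in which candidate A wins.
set.seed(1)
poll_share <- 0.51      # hypothetical two-party poll share for candidate A
total_error_sd <- 0.03  # hypothetical sampling + systematic polling error
sims <- rnorm(10000, mean = poll_share, sd = total_error_sd)
mean(sims > 0.5)        # simulated probability that candidate A wins
```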
__Question 3__ What went wrong in 2016? What do you think should be done to make future predictions better?
The following opinions are based on the article "The Polls Missed Trump. We Asked Pollsters Why." by Carl Bialik and Harry Enten. We believe the main problem with the 2016 predictions was the reputation of the candidates, namely Trump. Because of his unusual "presidential" behavior, many voters exhibited the Shy Tory effect, meaning they were embarrassed to admit that they were supporting Trump; this is why those polled by an automated machine were more honest about their choice of candidate than those polled by a live person. It was also predicted that whites without college degrees would vote for Clinton, which turned out to be false and attributed an average of 7.4 percentage points to Clinton rather than Trump, giving her a false predicted lead. Clinton's reputation was not remarkable either, as she was not the cleanest candidate; with two non-ideal candidates, some voters turned toward third-party candidates, and at the last minute these voters skewed toward Trump.
To make future predictions better, pollsters should consider doing more automated polling as opposed to live polling, and voters should consider voting for more likeable (and thus more likely to win) candidates (such as Sanders or Bush) in the primaries so as not to be stuck between the two least-favorable candidates in the general election.
# Data
```{r, warning=FALSE, echo=FALSE, include=FALSE}
#data
election.raw <- read_delim("data/election/election.csv", delim = ",") %>% mutate(candidate=as.factor(candidate))
census_meta <- read_delim("data/census/metadata.csv", delim = ";", col_names = FALSE)
census <- read_delim("data/census/census.csv", delim = ",")
```
__Election Data__
__Question 4__
The first few rows of the election.raw data are as follows:
```{r, echo=FALSE}
kable(head(election.raw))
```
```{r, echo=FALSE, include=FALSE}
#removing rows with fips=2000
kable(election.raw %>% filter(fips!=2000))
dim(election.raw)
```
We have excluded rows with fips = 2000, which correspond to the state of Alaska. Our data already accounts for Alaska at the state level, so the rows with fips = 2000 were duplicates. We now have 18,351 rows and 5 columns.
__Census Data__
The following are the first few rows of the census metadata:
```{r}
kable(head(census_meta))
```
__Data Wrangling__
__Question 5__
No output is shown for this step; the code is in the Rmd file.
```{r, echo=FALSE}
#separate the summary rows in election.raw into federal-, state-, and county-level data
election_federal <- filter(election.raw, fips == "US")
election_state <- filter(election.raw, state != "US" & is.na(county))
election <- filter(election.raw, !is.na(county))
```
__Question 6__
The following is a bar chart of all votes received by each of the 32 candidates, followed by tables indicating the winner by each county and each state.
```{r, echo=FALSE, include=FALSE}
length(unique(election.raw$candidate))
#32 candidates
```
```{r, echo=FALSE}
ggplot(data=election.raw) +
geom_bar(mapping=aes(x=candidate, y=votes), stat="identity") +
xlab("Presidential Candidate") + ylab("Number of Votes") +
theme(axis.text.x = element_text(angle = -90)) +
ggtitle("2016 Presidential Election Vote Count")
```
__Question 7__
```{r, echo=FALSE, include=FALSE}
#county winner
county_winner <- election %>% group_by(fips) %>%
mutate(total=sum(votes), pct=votes/total) %>%
top_n(1)
#state winner
state_winner <- election_state %>% group_by(fips) %>%
mutate(total=sum(votes), pct=votes/total) %>%
top_n(1)
```
```{r, echo=FALSE}
kable(head(state_winner), caption="Winner of State")
kable(head(county_winner), caption="Winner of County")
```
__Visualization__
__Question 8__
The following is a county-level map, colored by county.
```{r, echo=FALSE, warning=FALSE}
counties <- map_data("county")
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group),
color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
```
__Question 9__
The next map is the county-level map colored by the winning candidate for each state.
```{r, echo=FALSE, fig.align='center'}
states = map_data("state")
#create a common column
fips = state.abb[match(states$region, tolower(state.name))]
states = states %>%
mutate(states=fips)
#combine states and state_winner
state_winnerMap = left_join(states, state_winner,
by=c("states"="fips"))
#map
p = ggplot(data=state_winnerMap) +
geom_polygon(aes(x=long, y=lat, fill=candidate, group=group),
color="white") +
coord_fixed(1.3) + ggtitle("Winning Candidate by State")
p +
scale_fill_manual(values = c("red","blue", "green"))
```
__Question 10__
We next have a map illustrating the winning candidate by county.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
#create fips column for county variable
countyfips = maps::county.fips %>%
separate(polyname, c('region', 'subregion'), sep=",")
#combine county.fips into county
countymap = left_join(counties, countyfips)
countymap$fips = as.factor(countymap$fips)
#left join for county_winner
countywinner_map = left_join(countymap, county_winner)
#map
p2 = ggplot(data=countywinner_map) +
geom_polygon(aes(x=long, y=lat, fill=candidate, group=group),
color="white") +
coord_fixed(1.3) + ggtitle("Winning Candidate by County")
p2 +
scale_fill_manual(values = c("red","blue", "green"))
```
__Question 11__
Many exit polls noted that demographics played a big role in the election. Since this was the first election with a woman as a major-party nominee, we believe the gender composition of the voting population played a role in the results. The following pie chart displays the proportion of women to men in the U.S.
```{r, warning=FALSE, echo=FALSE}
census2 = census[, c("County", "Men", "Women")]
women_total = sum(census2$Women)
men_total = sum(census2$Men)
df = c(women_total, men_total)
pie(df, labels = c("Women", "Men"), col = c("maroon", "steelblue"),
    main = "Proportion of Women to Men in the U.S.")
```
We next look at a county-level map of the male population.
```{r, echo=FALSE, message=FALSE}
#average male population by county (averaged over tracts sharing a county name)
men <- census2 %>%
  group_by(County) %>%
  dplyr::summarise(Men = mean(Men, na.rm = TRUE))
men <- men %>%
  mutate(subregion = tolower(County))
men <- left_join(men, counties)
#plot
ggplot(men, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Men), color = alpha("white", 1/2)) +
  geom_polygon(data = states, color = "white", fill = NA) +
  coord_fixed(1.3) +
  ggtitle("Male Population by County")
```
__Question 12__
The census data contains high-resolution information (more fine-grained than county level). In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Many columns are related, and when a set of columns adds up to 100%, one column from that set is deleted. The following are the first few rows of census.ct.
```{r, echo=FALSE}
#filter out any rows with missing values
census.del <- na.omit(census)
#convert Men, Employed, Citizen to %s
census.del$Men <- (census.del$Men/census.del$TotalPop)*100
census.del$Employed <-(census.del$Employed/census.del$TotalPop)*100
census.del$Citizen <- (census.del$Citizen /census.del$TotalPop)*100
#compute Minority attribute by combining
census.del <- census.del %>%
mutate(Minority = Hispanic+Black+Native+Asian+Pacific)
#remove columns
census.del <- select(census.del, -c(Women,Walk, PublicWork, Construction, Hispanic, Black, Native, Asian, Pacific))
#sub-county census data
census.subct <- census.del %>%
group_by(State, County) %>%
add_tally(TotalPop) %>%
mutate(CountyTotal=n) %>%
mutate(Weight = TotalPop/n) %>%
select(-n)
#county census data
census.ct <- census.subct %>%
summarize_at(vars(Men:CountyTotal),
funs(sum(.*Weight)))
kable(head(census.ct))
```
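To make the weighting scheme concrete, here is a toy illustration with made-up numbers for a county containing two sub-county tracts:
```{r}
# Toy illustration of a TotalPop-weighted average (made-up numbers):
# a county with two tracts of 1,000 and 3,000 people and poverty rates of 10% and 20%.
toy_tracts <- data.frame(TotalPop = c(1000, 3000), Poverty = c(10, 20))
wt <- toy_tracts$TotalPop / sum(toy_tracts$TotalPop)
sum(wt * toy_tracts$Poverty)  # county-level rate = 17.5, not the unweighted mean of 15
```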
__Dimensionality Reduction__
__Question 13__
We now run PCA for both the county- and sub-county-level data. We choose to scale the features because they are not all measured in the same units; since PCA finds directions of maximum variance, unscaled features with large units would dominate the components. The following are the first few rows of the first two principal components of the county data and sub-county data, respectively.
```{r, echo=FALSE}
#county level data
pca.county = prcomp(census.ct[,4:ncol(census.ct)], scale = TRUE)
ct.pc = pca.county$rotation[,1:2]
kable(head(ct.pc),caption='County First 2 principal Components')
#sub-county level data
pca.subcounty = prcomp(census.subct[,3:ncol(census.subct)], scale = TRUE)
subct.pc = pca.subcounty$rotation[,1:2]
kable(head(subct.pc), caption="Sub-County First 2 principal Components")
```
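To illustrate why scaling matters, here is a small sketch on made-up data (the variable names are hypothetical):
```{r}
# Without scaling, the variable with the largest raw variance dominates PC1;
# with scaling, the two loadings are balanced. Data and names are made up.
set.seed(1)
toy <- data.frame(income      = rnorm(100, mean = 50000, sd = 10000),  # large units
                  pct_poverty = rnorm(100, mean = 15,    sd = 5))      # small units
round(prcomp(toy, scale = FALSE)$rotation[, 1], 3)  # PC1 loads almost entirely on income
round(prcomp(toy, scale = TRUE)$rotation[, 1], 3)   # loadings are comparable after scaling
```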
```{r, echo=FALSE, include=FALSE}
#3 features with largest absolute values of first principal component
ct.pc3 = sort(abs(pca.county$rotation[,1]), decreasing=TRUE)
ct.pc3
subct.pc3 = sort(abs(pca.subcounty$rotation[,1]),decreasing=TRUE)
subct.pc3
```
The three features with the largest absolute loadings on the first principal component for the county data are Income per Capita, Child Poverty (children under the poverty level), and Poverty (total population under the poverty level). For the sub-county data, they are Income per Capita, Professional (those employed in management, business, science, and the arts), and Poverty. The loadings on the second principal component generally have the opposite sign, suggesting the data may be roughly centered to begin with.
__Question 14__
```{r, echo=FALSE, include=FALSE}
#county
county.pve = pca.county$sdev^2/sum(pca.county$sdev^2)
county.cumpve = cumsum(county.pve)
county.min = min(which(county.cumpve>=.9))
county.min
subcounty.pve = pca.subcounty$sdev^2/sum(pca.subcounty$sdev^2)
subcounty.cumpve = cumsum(subcounty.pve)
sub.min = min(which(subcounty.cumpve>=.9))
sub.min
```
The minimum number of PCs needed to capture 90% of the variance is 13 for the county analysis and 17 for the sub-county analysis. Below are plots of the PVE and cumulative PVE for both analyses.
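For reference, the proportion of variance explained by the $k$-th principal component and its cumulative version are
$$\text{PVE}_k = \frac{\hat\sigma_k^2}{\sum_{j=1}^{p}\hat\sigma_j^2}, \qquad \text{Cumulative PVE}_m = \sum_{k=1}^{m}\text{PVE}_k,$$
where $\hat\sigma_k$ is the standard deviation of the $k$-th component returned by `prcomp`; this is the `sdev^2/sum(sdev^2)` computation used in the code.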
```{r, echo=FALSE, warning=FALSE}
#County
par(mfrow = c(1,2))
plot(county.pve, ylab = "County PVE", xlab = "Number of PCs",
     type = "l", col = "red")
plot(county.cumpve, ylab = "Cumulative County PVE",
     xlab = "Number of PCs", type = "l", col = "blue")
#Subcounty
par(mfrow = c(1,2))
plot(subcounty.pve, ylab = "Sub-County PVE", xlab = "Number of PCs",
     type = "l", col = "red")
plot(subcounty.cumpve, ylab = "Cumulative Sub-County PVE", xlab = "Number of PCs",
     type = "l", col = "blue")
```
__Clustering__
__Question 15__
We now perform hierarchical clustering with complete linkage. We cut the tree to partition the observations into 10 clusters, and then re-run the algorithm using the first two principal component scores of the county data as inputs instead of the original features. We illustrate the two clusterings with two plots of the cluster assignments.
```{r, echo=FALSE}
#hierarchical clustering on the scaled original features
census.dist = dist(scale(census.ct[, 3:ncol(census.ct)]), method = "euclidean")
census.hierclust = hclust(census.dist, method = "complete")
#partition into 10 clusters
census.hierclust = cutree(census.hierclust, 10)
#re-run using the first 2 principal component scores
census.distpc = dist(pca.county$x[, 1:2], method = "euclidean")
censuspc.hierclust = hclust(census.distpc, method = "complete")
censuspc.hierclust = cutree(censuspc.hierclust, 10)
#plot the cluster assignment of each county under both clusterings
plot(census.hierclust, main = "Original Features Clustering",
     xlab = "County index", ylab = "Cluster")
plot(censuspc.hierclust, main = "First 2 PC Clustering",
     xlab = "County index", ylab = "Cluster")
```
Comparing the results, both algorithms place the majority of observations in clusters 1 and 2. However, the clustering on the original features concentrates more observations in those first clusters and shows less variation across clusters than the PC-based clustering.
We now investigate the cluster that contains San Mateo County. To judge which approach is more appropriate, we calculate the SSE, a measure of variation within a cluster, and look for the lower value. We find that clustering on the principal components produces a better cluster for San Mateo County, since it has the lower SSE. This could be because the original features differ widely and have weak relationships with one another, whereas the principal components summarize that structure more compactly.
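Concretely, for a cluster $C$ with centroid $\bar{x}_C$, the within-cluster sum of squared errors is
$$SSE(C) = \sum_{i \in C} \lVert x_i - \bar{x}_C \rVert^2 ,$$
so a lower value indicates a tighter, more homogeneous cluster. (The code below reports a related normalized quantity for San Mateo County: its squared distance from the cluster centroid divided by the cluster size.)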
```{r, echo=FALSE, include=FALSE}
smclus1 = which(census.hierclust==census.hierclust[227])
smcensus = scale(census.ct[smclus1,-c(1:2)], scale=TRUE)
smclus2 = which(censuspc.hierclust==censuspc.hierclust[227])
smpc= census.ct[smclus2,]
#SSE
meancensus = as.matrix(colMeans(smcensus))
meanpc = as.matrix(colMeans(smpc[,-c(1:2)]))
census.sse = sum((census.ct[227,-c(1:2)]-meancensus)^2)/dim(smcensus)
census.sse[1]
pca.sse = sum((census.ct[227,-c(1:2)]-meanpc)^2)/dim(smpc)
pca.sse[1]
```
__Classification__
In order to train classification models, we combine the county_winner and census.ct data. We then partition the data into 80% training and 20% testing and define 10 cross-validation folds. We first fit a decision tree and prune it to minimize the misclassification error, using the cross-validation folds, and visualize two trees: one before pruning and one after.
```{r, echo=FALSE, fig.align='center'}
#given code
tmpwinner <- county_winner %>% ungroup %>%
mutate(state = state.name[match(state, state.abb)]) %>%
## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus <- census.ct %>%
ungroup(State) %>%
mutate_at(vars(State, County), tolower)
election.cl <- tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))
## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))
#partition data into 80% training and 20% testing
set.seed(10)
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n)
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]
#10 cv folds
set.seed(20)
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
#error rate function
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso")
```
__Question 16__
```{r, echo=FALSE, warning=FALSE}
cl.tree = tree(candidate ~ ., data = trn.cl)
cltree.cv = cv.tree(cl.tree, rand = folds, K = 10, FUN = prune.misclass)
#size with the smallest cross-validated misclassification error
min = max(which(cltree.cv$dev %in% min(cltree.cv$dev)))
bestsize = cltree.cv$size[min]
#tree before pruning
draw.tree(cl.tree, cex = 0.4)
title("Tree Before Pruning")
#prune to the best size
cltree.pruned = prune.tree(cl.tree, best = bestsize, method = "misclass")
#tree after pruning
draw.tree(cltree.pruned, cex = 0.4)
title("Tree After Pruning")
#save training and test errors to the records matrix
cltreetrain = predict(cltree.pruned, type = "class")
records[1,1] = calc_error_rate(cltreetrain, trn.cl$candidate)
cltreetest = predict(cltree.pruned, type = "class", newdata = select(tst.cl, -candidate))
records[1,2] = calc_error_rate(cltreetest, tst.cl$candidate)
```
Looking at the decision trees, we can see that the main features influencing votes were the voter's income, whether the voter was white, and whether they commuted using public transportation. Let us discuss why these might be influential factors. Starting with transit, we may assume that those who use public transportation tend not to be high income and/or tend to live in urban areas with developed transit systems. The tree tells us that those who do not use transit are more likely to vote for Trump. Continuing along the tree, we see that White is the next split. We know that racial demographics were a significant facet of voter prediction: besides white women with a college degree, most of the white population voted for Trump. This matches our tree, as most counties that were predominantly white and high income went for Trump. Thus, counties whose residents did not use public transport, were white, and had high incomes voted for Trump.
__Question 17__
We now run a logistic regression to predict the winning candidate in each county. The most significant variables are Citizen, Professional, Service, Production, Drive, Employed, PrivateWork, and Unemployment. Compared with our decision tree, Professional is the only significant variable the two analyses have in common. Looking at the Production variable, a one-unit increase in the percentage of the population working in the production sector corresponds to a 1.575e-01 increase in the log-odds of a vote for Hillary Clinton. Similarly, a one-unit increase in the percentage of the population employed corresponds to a 1.603e-01 increase in the log-odds of a vote for Clinton.
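To put that coefficient on the odds scale, a log-odds increase of $1.575\times10^{-1}$ corresponds to an odds multiplier of
$$e^{0.1575} \approx 1.17,$$
i.e. each additional percentage point employed in production multiplies the estimated odds of the county going to Clinton by roughly 1.17, holding the other predictors fixed.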
```{r, echo=FALSE, warning=FALSE, include=FALSE}
county.glm <- glm(candidate~., data=trn.cl, family=binomial(link="logit"))
sort(abs(summary(county.glm)$coeff[-1,1]), decreasing=T)
sort(summary(county.glm)$coeff[-1,1], decreasing=T)
names(which(summary(county.glm)$coeff[-1,4] < 0.05)) #important variables
trnprob <- predict(county.glm, type="response")
tstprob <- predict(county.glm, newdata=tst.cl, type="response")
trnpred <- ifelse(trnprob >= 0.5, "Hillary Clinton", "Donald Trump")
tstpred <- ifelse(tstprob >= 0.5, "Hillary Clinton", "Donald Trump")
records[2,1]=calc_error_rate(trnpred, trn.cl$candidate)
records[2,2]=calc_error_rate(tstpred, tst.cl$candidate)
summary(county.glm)
```
Now, we have some indication that there is perfect separation, and thus a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. We will run k-fold cross-validation and select the best regularization parameter for the logistic regression with the LASSO penalty.
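The LASSO-penalized logistic regression chooses coefficients by minimizing the penalized negative log-likelihood
$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i\,(\beta_0 + x_i^\top\beta) - \log\!\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] \;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert ,$$
where larger values of the tuning parameter $\lambda$ shrink more coefficients exactly to zero; cross-validation is used to pick $\lambda$ from a small grid.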
__Question 18__
Following our analysis, we find that the optimal $\lambda$ is 0.001. The non-zero coefficients include all predictors except Men, Child Poverty, Other Transportation, Self Employed, and Minority. We believe Child Poverty is dropped because it is highly correlated with Poverty, so LASSO recognizes that including it would be redundant. Other Transportation overlaps with the other commute-mode variables, and Self Employed is correlated with Employed and Unemployment. Similarly, Minority may get a zero coefficient because the same information is captured by the White variable. Finally, the coefficients of the LASSO logistic regression are smaller than those of the unpenalized logistic regression, which makes sense since LASSO is a shrinkage method.
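As a quick sanity check on the collinearity argument (assuming the aggregated county data keeps the raw census column names `Poverty` and `ChildPoverty`), one could compute their correlation directly:
```{r, eval=FALSE}
# Sketch of a collinearity check (assumes census.ct retains the Poverty and
# ChildPoverty columns from the raw census file); a correlation near 1 would
# support LASSO dropping one of the pair as redundant.
cor(census.ct$Poverty, census.ct$ChildPoverty)
```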
```{r, echo=FALSE, include=FALSE}
library(glmnet)
trn.cl = na.omit(trn.cl)
x=model.matrix(candidate~., election.cl)[,-1]
y1 = trn.cl$candidate
y2 = tst.cl$candidate
ychar = as.character(election.cl$candidate)
grid = c(1,5,10,50) * 1e-4
#optimal lambda
cvlasso = cv.glmnet(x[in.trn,], ychar[in.trn], lambda=grid,
alpha=1, family="binomial", foldid = folds)
bestlambda = cvlasso$lambda.min
bestlambda
#non-zero coeff
model = glmnet(x[in.trn,], ychar[in.trn], alpha=1,
family="binomial")
lassocoef = predict(model, type="coefficients", s=bestlambda)
lassocoef
lassotrain = predict(model, s=bestlambda, newx=x[in.trn,], type="class")
lassotest = predict(model, s=bestlambda, newx=x[-in.trn,], type="class")
#records
records[3,1] = calc_error_rate(lassotrain, y1)
records[3,2] = calc_error_rate(lassotest, y2)
```
__Question 19__
We will now compute ROC curves for the decision tree, logistic regression, and LASSO logistic regression using predictions on the test data.
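For reference, each ROC curve traces the true positive rate against the false positive rate as the classification threshold varies:
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}.$$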
```{r, echo=FALSE, fig.align='center'}
library(ROCR)
#decision tree: predicted probability of a Clinton win on the test set
rocpred = predict(cltree.pruned, newdata = select(tst.cl, -candidate), type = "vector")
predt1 = prediction(rocpred[, "Hillary Clinton"], as.numeric(tst.cl$candidate))
perf1 = performance(predt1, "tpr", "fpr")
#logistic regression: test-set predicted probabilities computed earlier
predt2 = prediction(tstprob, as.numeric(tst.cl$candidate))
perf2 = performance(predt2, "tpr", "fpr")
#LASSO logistic regression: test-set predicted probabilities
pred3 = predict(cvlasso, newx = x[-in.trn, ], type = "response", s = "lambda.min")
predt3 = prediction(pred3, as.numeric(tst.cl$candidate))
perf3 = performance(predt3, "tpr", "fpr")
plot(perf1, col = "steelblue")
plot(perf2, col = "maroon", add = TRUE)
plot(perf3, col = "darkolivegreen", add = TRUE)
legend("bottomright", legend = c("Tree ROC", "GLM ROC", "Lasso ROC"), lty = 1,
       col = c("steelblue", "maroon", "darkolivegreen"))
```
__Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?__
__Taking it Further__
__Question 20__
Earlier in our project we found the minimum number of components necessary to capture 90% of the variance for both the county and sub-county analyses. An interesting follow-up is to perform PCA on the election.cl dataset used to build our classifiers, and then refit those classifiers on the new features to compare accuracy.
Since PCA explains the variance in our data, if our original variables are correlated we should end up with a smaller number of uncorrelated components; with less noise, we would hope for a better model.
We will use the first two principal components to conduct our analysis.
```{r, echo=FALSE, warning=FALSE}
set.seed(123)
#partition
n = nrow(election.cl)
in.trnpca = sample.int(n, 0.8*n)
#training set
train.pca = election.cl[in.trnpca, 2:ncol(election.cl)]
trueTrain = election.cl[in.trnpca, 1]
#test set
test.pca = election.cl[-in.trnpca, 2:ncol(election.cl)]
trueTest = election.cl[-in.trnpca, 1]
#pve and cum pve
pcomp = prcomp(train.pca, scale=TRUE)
pve = pcomp$sdev^2/sum(pcomp$sdev^2)
cumpve = cumsum(pve)
minnum = min(which(cumpve>=0.9))
par(mfrow=c(1,2))
plot(pve, ylab="PVE", xlab="Number of Principal Components", type="l", col="red")
plot(cumpve, ylab="Cumulative PVE", xlab="Number of Principal Components", type="l", col="blue")
train.pca = pcomp$x[,1:2]
train.pca = data.frame(candidate = trueTrain, train.pca)
kable(head(train.pca), caption="PCA Transformed Training Set")
test.pca = predict(pcomp, newdata=test.pca)
test.pca = as.data.frame(test.pca[,1:2])
set.seed(123)
folds.pca = sample(cut(1:nrow(train.pca), breaks=10, labels=FALSE))
records.pca = matrix(NA, nrow=3, ncol=2)
colnames(records.pca) = c("training error","test error")
rownames(records.pca) = c("tree", "logistic", "lasso")
#decision tree
pca.tree = tree(candidate ~., data=train.pca)
pcatree.cv = cv.tree(pca.tree, rand=folds.pca, K=10, FUN=prune.misclass)
min = max(which(pcatree.cv$dev %in% min(pcatree.cv$dev)))
bestsize = pcatree.cv$size[min]
#pruned tree
pcatree.prune = prune.tree(pca.tree, best=bestsize, method="misclass")
#errors
pcatree.train = predict(pcatree.prune, type="class")
pcatree.test=predict(pcatree.prune, type="class", newdata=test.pca)
records.pca[1,1] = calc_error_rate(pcatree.train, trueTrain$candidate)
records.pca[1,2] = calc_error_rate(pcatree.test, trueTest$candidate)
#logistic regression
glm.pca <- glm(candidate~., data=train.pca, family=binomial(link="logit"))
sort(abs(summary(glm.pca)$coeff[-1,1]), decreasing=T)
sort(summary(glm.pca)$coeff[-1,1], decreasing=T)
names(which(summary(glm.pca)$coeff[-1,4] < 0.05))
trnprobpca <- predict(glm.pca, type="response")
tstprobpca <- predict(glm.pca, newdata=test.pca, type="response")
trnpredpca <- ifelse(trnprobpca >= 0.5, "Hillary Clinton", "Donald Trump")
tstpredpca <- ifelse(tstprobpca >= 0.5, "Hillary Clinton", "Donald Trump")
records.pca[2,1]=calc_error_rate(trnpredpca, trueTrain$candidate)
records.pca[2,2]=calc_error_rate(tstpredpca, trueTest$candidate)
#lasso
x = model.matrix(candidate~., data=train.pca)[,-1]
y = as.character(trueTrain$candidate)
testx = model.matrix(candidate~., data=cbind(trueTest, test.pca))[,-1]
lassopca.train = cv.glmnet(x, y, alpha=1, family="binomial",
lambda=grid, foldid=folds.pca)
bestlambda.pca = lassopca.train$lambda.min
lassopca.model = glmnet(x,y,alpha=1, family="binomial")
lassopcatrain.pred = predict(lassopca.model, s=bestlambda.pca, newx=x, type="class")
lassopcatest.pred = predict(lassopca.model, s=bestlambda.pca, newx=testx, type="class")
#errors
records.pca[3,1] = calc_error_rate(lassopcatrain.pred,
as.character(trueTrain$candidate))
records.pca[3,2] = calc_error_rate(lassopcatest.pred, as.character(trueTest$candidate))
kable(records.pca, caption="PCA Records")
kable(records, caption="Original Feature Records")
```
Looking at our error rates, we notice that the PCA-based models perform worse than the models trained on the original features. The decision tree still has the smallest training and test errors, but its training error is no longer equal to the logistic training error. PCA transforms correlated variables into linearly uncorrelated components; since performance decreased after switching to PCA features, we can suspect that our variables were not strongly linearly correlated to begin with. To conduct a better PCA analysis, we could transform the data to make the relationships more linear, for example with a log transformation.
Although PCA can be applied to non-linear data, it is of limited value there, since it is designed to convert linearly correlated variables into uncorrelated components. Thus, the decision tree on the original features remains the best approach for answering these kinds of questions about voter prediction.
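As a possible follow-up, here is a minimal sketch of the log transformation suggested above; the choice of columns is illustrative (skewed, non-negative variables such as per-capita income are natural candidates) and not part of the original analysis.
```{r, eval=FALSE}
# Illustrative sketch only: log1p-transform a couple of right-skewed,
# non-negative columns before re-running PCA (the column choice is an assumption).
census.log <- census.ct %>%
  mutate(IncomePerCap = log1p(IncomePerCap),
         Poverty = log1p(Poverty))
pca.log <- prcomp(census.log[, 4:ncol(census.log)], scale = TRUE)
```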