<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Graham Tierney on Graham Tierney</title>
<link>https://g-tierney.github.io/</link>
<description>Recent content in Graham Tierney on Graham Tierney</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<copyright>&copy; 2018</copyright>
<lastBuildDate>Sun, 15 Oct 2017 00:00:00 -0400</lastBuildDate>
<atom:link href="/" rel="self" type="application/rss+xml" />
<item>
<title>Anonymous Cross-Party Conversations Can Decrease Political Polarization: A Field Experiment on a Mobile Chat Platform</title>
<link>https://g-tierney.github.io/publication/mediation_sensitivity/</link>
<pubDate>Thu, 23 Sep 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/mediation_sensitivity/</guid>
<description></description>
</item>
<item>
<title>Sensitivity Analysis for Causal Mediation through Text: an Application to Political Polarization</title>
<link>https://g-tierney.github.io/publication/discussit/</link>
<pubDate>Thu, 23 Sep 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/discussit/</guid>
<description></description>
</item>
<item>
<title>Author Clustering and Topic Estimation for Short Texts</title>
<link>https://g-tierney.github.io/publication/stldac/</link>
<pubDate>Tue, 15 Jun 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/stldac/</guid>
<description></description>
</item>
<item>
<title>Is the NFL's Home-Field Advantage Over?</title>
<link>https://g-tierney.github.io/post/home_field/</link>
<pubDate>Mon, 11 Jan 2021 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/home_field/</guid>
<description>
<link href="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.css" rel="stylesheet" />
<script src="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.js"></script>
<script src="https://g-tierney.github.io/rmarkdown-libs/kePrint/kePrint.js"></script>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>Home-field advantage (HFA) has been documented in many sports, and there has been much speculation about the potential causal mechanisms (referee bias, travel times, crowd reactions, etc.). In the NFL, playing at home has historically offered about a three-point advantage in the point differential, the equivalent of a field goal. However, with COVID-19-imposed restrictions, many home games were conducted without fans, and the home team's advantage nearly disappeared. Home teams won 127 of 253 games (50.2%),<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> scoring on average only 0.01 more points than their opponents as opposed to the typical 3.00.</p>
<p>I began this project to see if I could plausibly measure the home-field advantage season by season and gauge just how unusual 2020 was. Indeed, the raw statistics are historic lows for the NFL. 2020 saw the third-lowest home point differential (total points scored by home teams minus total points scored by away teams) and the fourth-lowest home win percentage since 1966. When examining past seasons, another year jumps out. Just last year, in the 2019 season, home teams were outscored by away teams for the first time since 1968. The plots below show the win rates and point differentials for the past 21 regular seasons, with games at neutral fields (e.g., international games) removed.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/eda-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The above are essentially just averages without controlling for team strength. Maybe 2020 and 2019 had unusual schedules that consistently placed lopsided match-ups with the favorite at home. Maybe teams with big home-field advantages happened to play unusually strong opponents in 2020 and 2019. The rest of this post investigates these two concerns to determine whether the results change when accounting for team strength and heterogeneity in home-field advantage across teams.</p>
</div>
<div id="adjusting-for-team-strength" class="section level2">
<h2>Adjusting for team strength</h2>
<p>In this section, I will specify a model for measuring home-field advantage while accounting for team strength and analyze home-field advantage over time, both in terms of points and win probability. I will focus primarily on the home team point differential, home team points minus away team points, for each game. This outcome captures what most people care about, the winner and loser, and contains more information than just modeling wins and losses directly. A team that consistently wins by 14 points is probably better than a team that wins by only 7. Using points also avoids having to deal with ties or 16-0 and 0-16 seasons, which pose some technical difficulties that I will explain later.</p>
<p>The model that I will use for point differential is <span class="math inline">\(Y_{hag} = \alpha_0 + \mu_h - \mu_a + \epsilon_{hag}\)</span> with <span class="math inline">\(\epsilon_{hag} \sim N(0,\sigma^2)\)</span>. <span class="math inline">\(Y_{hag}\)</span> is the score of home team <span class="math inline">\(h\)</span> minus the score of away team <span class="math inline">\(a\)</span> in game <span class="math inline">\(g\)</span>. <span class="math inline">\(Y_{hag}\)</span> is <span class="math inline">\(\alpha_0 + \mu_h - \mu_a\)</span> plus some noise, where <span class="math inline">\(\alpha_0\)</span> captures the home team's scoring advantage, <span class="math inline">\(\mu_h\)</span> measures the home team's strength and <span class="math inline">\(\mu_a\)</span> the away team's strength. The model has some nice interpretations of the parameters. <span class="math inline">\(\alpha_0\)</span> is the expected point differential when two equally skilled teams play. <span class="math inline">\(\mu_h - \mu_a\)</span> is the expected number of points <span class="math inline">\(h\)</span> will win or lose by when playing <span class="math inline">\(a\)</span> on a neutral field. Note that this interpretation is only for the difference in team strength. Each game outcome only provides insight on the <em>relative</em> strength of the teams, so the values of <span class="math inline">\(\mu_h\)</span> and <span class="math inline">\(\mu_a\)</span> are not identified, only the differences.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> One could, of course, model home points and away points, with separate offensive and defensive HFAs. However, modeling positive, bivariate outcomes gets much more complicated and the primary question of measuring total home-field advantage would ultimately result in estimating a quantity very similar to <span class="math inline">\(\alpha_0\)</span>.</p>
<p>I will also look at simple wins and losses rather than scores to measure home-field advantage. In this case, let the outcome <span class="math inline">\(W_{hag}\)</span> be a binary variable: 1 if the home team wins and 0 otherwise. Let <span class="math inline">\(p_{hag} = P(W_{hag} = 1)\)</span>, the probability of a home team win. The model is <span class="math inline">\(logit(p) = log\left(\frac{p}{1-p}\right) = \alpha_0 + \mu_h - \mu_a\)</span>, where <span class="math inline">\(logit(p)\)</span> refers to the log-odds of the home team winning. The same sort of interpretation applies, just with some slight transformations to account for the fact that the parameters can be any real number while <span class="math inline">\(p_{hag}\)</span> needs to be between 0 and 1. <span class="math inline">\(e^{\alpha_0}\)</span> is the odds of the home team winning given equal skill. <span class="math inline">\(e^{\mu_h - \mu_a}\)</span> is the odds that team <span class="math inline">\(h\)</span> beats team <span class="math inline">\(a\)</span> on a neutral field. Note the same identification problem arises in the scale of <span class="math inline">\(\mu_i\)</span>.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a></p>
<p>Both models might not look it, but they are actually linear models (or generalized linear for win probabilities) that can be estimated with standard regression packages, just with a clever design matrix. To address the identifiability issues, I enforce a constraint that <span class="math inline">\(\sum_i \mu_i = 0\)</span>. Both models, points and wins, are essentially extensions of the Bradley-Terry model for paired comparisons. See the next section (Implementation Details) for said implementation details. Data, along with team logos and colors used later, were collected from the <code>nflfastR</code> package by Sebastian Carl and Ben Baldwin.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/plot_results-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The left panel shows the estimated expected points advantage for the home teams in each season given an opponent of equal strength. The error bars show the 95% confidence intervals. For all but two seasons prior to 2019, we can reject the null that the home-field points advantage is 0. 2019 and 2020 have estimated home-field advantages of -0.075 and 0.114 respectively. Failing to reject the null is of course different from concluding the null is true, but it would be quite challenging to get point estimates closer to zero than what we observe.</p>
<p>In terms of home win probabilities, the results are similar. When two equally skilled teams play, prior to 2019, the home team had about a 60% chance of winning. In 2019 and 2020, the confidence intervals exclude 60% for the first time, and the home team only had about a 52% and 51% chance of winning, respectively, both statistically indistinguishable from a 50-50 chance.</p>
</div>
<div id="implementation-details" class="section level2">
<h2>Implementation Details</h2>
<p>If you just want to know which teams have the biggest home-field advantages, go ahead and skip to the next section. I assume that if you are still here, you are familiar with multiple regression and with using statistical packages to implement it. I found estimating the previous section's model non-trivial, interesting, and a useful learning experience for dealing with complicated contrasts in OLS and GLM settings, so I thought documenting it here could help others.</p>
<p>Here I will describe how I actually implemented the models, focusing on the points model but both are essentially the same. The statement <span class="math inline">\(Y_{hag} \sim N(\alpha_0 + \mu_h - \mu_a,\sigma^2)\)</span> nicely expresses the model, but it does not look like a standard linear model of the form <span class="math inline">\(Y = \beta_0 + x_1 \beta_1 + \epsilon\)</span> that most statistical packages request, mostly because of the minus sign and the fact that teams switch sides, playing both home and away games. To use standard estimation tools, essentially, for each game <span class="math inline">\(g\)</span>, we need to come up with a vector of variables <span class="math inline">\(\mathbf{x}_g\)</span> such that estimated coefficients <span class="math inline">\(\mathbf{\beta}\)</span> simplify to: <span class="math inline">\(\mathbf{\beta}^T \mathbf{x}_g = \beta_1 x_{g1} + \beta_2 x_{g2} + \ldots = \alpha_0 + \mu_h - \mu_a\)</span>.</p>
<p>Getting <span class="math inline">\(\alpha_0\)</span> is easy: make <span class="math inline">\(x_{g1} = 1\)</span> for every game and we get the intercept. Most stats packages don't actually require you to specify the intercept because it is usually not interpretable, but it is our main variable of interest here. Then, for each team <span class="math inline">\(i\)</span> let <span class="math inline">\(z_{gi} = 1\)</span> if team <span class="math inline">\(i\)</span> is the home team in game <span class="math inline">\(g\)</span>, <span class="math inline">\(-1\)</span> if <span class="math inline">\(i\)</span> is the away team, and 0 otherwise. If we stack these <span class="math inline">\(z\)</span> variables together, <span class="math inline">\(\mu_1 z_{g1} + \mu_2 z_{g2} + \ldots + \mu_{32} z_{g32} = \mu_h - \mu_a\)</span>. But we aren't quite done, as you'll notice I've called these variables <span class="math inline">\(z\)</span> and not the <span class="math inline">\(x\)</span> that we are interested in. If you try to estimate an intercept plus 32 strength variables (one for each NFL team), the coefficient on <span class="math inline">\(z_{32}\)</span> cannot be estimated. That's because if you add up <span class="math inline">\(z_1\)</span> through <span class="math inline">\(z_{31}\)</span> and multiply by -1, you get <span class="math inline">\(z_{32}\)</span> (proof left as an exercise for the reader). This is essentially the identification problem coming back up. Just dropping one of the team strength variables doesn't quite work because the intercept becomes the expected point differential of whatever team was dropped playing at home against a team of strength 0. The dropping uses the identification constraint <span class="math inline">\(\mu_{32} = 0\)</span> and treats team 32 as the &quot;baseline&quot; team.</p>
<p>The constraint we really want is not for one of the team strength variables to be zero, but rather for them to sum to zero. There is no nice way to tell the software this information because the home and away team information is stored in two different columns of the dataset. So, we have to do it ourselves. In terms of the coefficients, we know <span class="math inline">\(\sum_{i=1}^{32} \mu_i = 0\)</span>, so we can write <span class="math inline">\(\sum_{i=1}^{31} \mu_i = -\mu_{32}\)</span>. Thus, we can express the regression equation as:</p>
<span class="math display">\[\begin{align*}
E[Y_g] &amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i z_{i} + \mu_{32} z_{32} \\
&amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i z_{i} - \left(\sum_{i=1}^{31} \mu_i\right) z_{32} \\
&amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i (z_{i} - z_{32}) \\
\end{align*}\]</span>
<p>And here we have the final result! Set the variables <span class="math inline">\(x_{1} = 1\)</span> for the intercept and <span class="math inline">\(x_{i+1} = z_i - z_{32}\)</span> for <span class="math inline">\(i\)</span> in 1 to 31 (number of teams <span class="math inline">\(-1\)</span>), and you have the full equation. We've used the desired constraint to augment the trinary team indicator variables such that the intercept directly measures our quantity of interest. To get the team strength estimates, <span class="math inline">\(\mu_i\)</span> is the regression coefficient on <span class="math inline">\(z_i-z_{32}\)</span> and <span class="math inline">\(\mu_{32}\)</span> is the negative of the sum of the other <span class="math inline">\(\mu_i\)</span> terms (ensuring they all sum to zero). Thinking about uncertainty in the estimates of the <span class="math inline">\(\mu_i\)</span> terms is hard because they are all related to each other: if one is an overestimate, another must be an underestimate. Dealing with that is outside the scope of this post, but I may return to it later.</p>
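<p>To make this concrete, here is a minimal sketch of the construction in R (assuming a data frame <code>games</code> with columns <code>home_team</code>, <code>away_team</code>, and <code>point_diff</code>; this is an illustration, not my actual analysis code):</p>
<pre class="r"><code>## z_gi = 1 if team i is home in game g, -1 if away, 0 otherwise
teams = sort(unique(c(games$home_team, games$away_team)))  # 32 teams
Z = matrix(0, nrow = nrow(games), ncol = length(teams), dimnames = list(NULL, teams))
Z[cbind(seq_len(nrow(games)), match(games$home_team, teams))] = 1
Z[cbind(seq_len(nrow(games)), match(games$away_team, teams))] = -1

## Apply the sum-to-zero constraint: x_i = z_i - z_32 for i = 1, ..., 31
X = Z[, -ncol(Z)] - Z[, ncol(Z)]

## Points model: the intercept estimates alpha_0
fit_points = lm(games$point_diff ~ X)

## Win model: ties coded as home losses, intercept is the log-odds version of alpha_0
fit_wins = glm(I(games$point_diff &gt; 0) ~ X, family = binomial)

## Recover all 32 team strengths (they sum to zero by construction)
mu = c(coef(fit_points)[-1], -sum(coef(fit_points)[-1]))</code></pre>
<p>The intercept of <code>fit_points</code> is the season's estimated home-field advantage in points, and the intercept of <code>fit_wins</code> is the estimated log-odds of a home win between equally skilled teams.</p>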
</div>
<div id="home-field-advantage-by-team" class="section level2">
<h2>Home-Field Advantage by Team</h2>
<p>Finally, I'll look at home-field advantage by team. I'll only do this for the points model because the aforementioned issues with 16-0 and 0-16 seasons become even more common here: an undefeated or winless home record causes the same problem. The extension to the model is simple; I just add a subscript: <span class="math inline">\(Y_{hag} \sim N(\alpha_h + \mu_h - \mu_a,\sigma^2)\)</span>. Now I've written <span class="math inline">\(\alpha_h\)</span> rather than <span class="math inline">\(\alpha_0\)</span> to note that the home-field advantage is home-team specific. Estimating this parameter is, however, a bit trickier. In the last model, <span class="math inline">\(\alpha_0\)</span> was informed directly by all 256 games in a season. Each <span class="math inline">\(\alpha_h\)</span> is informed by only 8. We can use a similar implementation as above to get a best guess of each parameter (the maximum likelihood estimate), but those estimates will be quite noisy. Consequently, I will put my Bayesian hat back on and use a hierarchical model for the home team advantage terms: <span class="math inline">\(\alpha_h \sim N(\alpha_0,\sigma_\alpha^2)\)</span>. I assume that the <span class="math inline">\(\alpha_h\)</span> terms are all drawn from a common distribution. This shrinks the estimates each season towards the &quot;typical&quot; home-field advantage and provides some slight regularization so we don't make overly extreme estimates from limited data. The model is implemented in Stan, and you can find the code on my <a href="https://github.com/g-tierney/NFL_HFA">GitHub page here</a>.</p>
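<p>As a rough off-the-shelf approximation to that hierarchical structure (a sketch with hypothetical object names, not the actual Stan code in the repository), one could fit a single season with <code>rstanarm</code>, which places a common normal distribution over the home-team intercepts:</p>
<pre class="r"><code>## Sketch only: `d` holds one season with point_diff, home_team, and the
## strength contrasts s1, ..., s31 built as in the Implementation Details section.
library(rstanarm)

strength_terms = paste(paste0(&quot;s&quot;, 1:31), collapse = &quot; + &quot;)
form = as.formula(paste(&quot;point_diff ~&quot;, strength_terms, &quot;+ (1 | home_team)&quot;))

fit = stan_lmer(form, data = d)
## The fixed intercept plays the role of alpha_0 (the typical HFA), the random
## intercepts are the team-specific deviations alpha_h - alpha_0, and their
## estimated standard deviation corresponds to sigma_alpha.</code></pre>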
<p>From the model, I recover posterior beliefs about <span class="math inline">\(\alpha_0\)</span>, the home-field advantage of a typical team in each season, and each <span class="math inline">\(\alpha_h\)</span>, the home-field advantage for team <span class="math inline">\(h\)</span> in a season. Another important (and new) variable is <span class="math inline">\(\sigma_{\alpha}\)</span>. This is the standard deviation of home-field advantage across teams in a given season. Standard deviations are easy-to-interpret uncertainty measures: about two-thirds of the actual <span class="math inline">\(\alpha_h\)</span> values will be in the interval <span class="math inline">\(\alpha_0\pm\sigma_\alpha\)</span> and nearly all of the values will be within <span class="math inline">\(\alpha_0\pm 2\sigma_\alpha\)</span> (about 1 or 2 <span class="math inline">\(\alpha_h\)</span> values will fall outside of that interval each season). The figure below shows those results.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/dist_results-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The Typical HFA measures how much an average team would be favored at home when playing an equally skilled opponent. These results track with the above results assuming a constant home-field advantage across teams, but for some years the error bars have gotten wider. Certain years, such as 2003 and 2008, had much more variable home-field advantages, which increases uncertainty about the behavior of an average team. In 2008, most estimates ranged from the home team being favored by about 6 points to being a two-point underdog. The next two plots break out the estimates by team. The top panel shows every team and the bottom just the largest and smallest home-field advantages.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/team_results-1.png" width="672" style="display: block; margin: auto;" /><img src="https://g-tierney.github.io/post/home_field_files/figure-html/team_results-2.png" width="672" style="display: block; margin: auto;" /></p>
<p>The top plot shows the wide range of home advantages even within a single year, with lines connecting each team's estimate. Starting around 2015, that variability drops off and teams all start to look very similar to each other. Most of the time, the worst home-field advantage is about 0. The outlier in 2008 was the Detroit Lions, who were expected to lose by 5 points at home against a team that they would tie on a neutral field. This was of course the season the Lions went 0-16, losing home games by a significantly larger margin than away games. The bottom plot just picks out the best and worst teams. There is significant turnover year-to-year in the NFL, and that pattern continues into home-field advantage. The year prior to the Lions' historically bad year, they had the largest home-field advantage. The 9ers, Dolphins, Jaguars, Panthers, and Steelers also had the smallest and largest home-field advantages in different years, although none of them managed it in consecutive years. To review all teams, the table below reports the average and standard deviation of <span class="math inline">\(\alpha_h\)</span> across all seasons for each team.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;">
</th>
<th style="text-align:left;">
Team
</th>
<th style="text-align:center;">
HFA (Mean)
</th>
<th style="text-align:center;">
HFA (SD)
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/bal.png" width="30" />
</td>
<td style="text-align:left;">
BAL
</td>
<td style="text-align:center;">
3.08
</td>
<td style="text-align:center;">
1.57
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/gb.png" width="30" />
</td>
<td style="text-align:left;">
GB
</td>
<td style="text-align:center;">
2.93
</td>
<td style="text-align:center;">
1.71
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ne.png" width="30" />
</td>
<td style="text-align:left;">
NE
</td>
<td style="text-align:center;">
2.90
</td>
<td style="text-align:center;">
1.60
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/sea.png" width="30" />
</td>
<td style="text-align:left;">
SEA
</td>
<td style="text-align:center;">
2.79
</td>
<td style="text-align:center;">
1.47
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/pit.png" width="30" />
</td>
<td style="text-align:left;">
PIT
</td>
<td style="text-align:center;">
2.62
</td>
<td style="text-align:center;">
1.50
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ind.png" width="30" />
</td>
<td style="text-align:left;">
IND
</td>
<td style="text-align:center;">
2.59
</td>
<td style="text-align:center;">
1.19
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/dal.png" width="30" />
</td>
<td style="text-align:left;">
DAL
</td>
<td style="text-align:center;">
2.59
</td>
<td style="text-align:center;">
1.62
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/min.png" width="30" />
</td>
<td style="text-align:left;">
MIN
</td>
<td style="text-align:center;">
2.58
</td>
<td style="text-align:center;">
1.59
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lar.png" width="30" />
</td>
<td style="text-align:left;">
LA
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
1.87
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/sf.png" width="30" />
</td>
<td style="text-align:left;">
SF
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
1.76
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/no.png" width="30" />
</td>
<td style="text-align:left;">
NO
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
2.19
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/phi.png" width="30" />
</td>
<td style="text-align:left;">
PHI
</td>
<td style="text-align:center;">
2.38
</td>
<td style="text-align:center;">
1.49
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/buf.png" width="30" />
</td>
<td style="text-align:left;">
BUF
</td>
<td style="text-align:center;">
2.38
</td>
<td style="text-align:center;">
1.55
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lac.png" width="30" />
</td>
<td style="text-align:left;">
LAC
</td>
<td style="text-align:center;">
2.36
</td>
<td style="text-align:center;">
1.44
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/den.png" width="30" />
</td>
<td style="text-align:left;">
DEN
</td>
<td style="text-align:center;">
2.34
</td>
<td style="text-align:center;">
1.53
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/kc.png" width="30" />
</td>
<td style="text-align:left;">
KC
</td>
<td style="text-align:center;">
2.32
</td>
<td style="text-align:center;">
1.90
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ten.png" width="30" />
</td>
<td style="text-align:left;">
TEN
</td>
<td style="text-align:center;">
2.30
</td>
<td style="text-align:center;">
1.53
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/hou.png" width="30" />
</td>
<td style="text-align:left;">
HOU
</td>
<td style="text-align:center;">
2.27
</td>
<td style="text-align:center;">
1.17
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/chi.png" width="30" />
</td>
<td style="text-align:left;">
CHI
</td>
<td style="text-align:center;">
2.25
</td>
<td style="text-align:center;">
1.25
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ari.png" width="30" />
</td>
<td style="text-align:left;">
ARI
</td>
<td style="text-align:center;">
2.23
</td>
<td style="text-align:center;">
1.64
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/tb.png" width="30" />
</td>
<td style="text-align:left;">
TB
</td>
<td style="text-align:center;">
2.16
</td>
<td style="text-align:center;">
1.50
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/jax.png" width="30" />
</td>
<td style="text-align:left;">
JAX
</td>
<td style="text-align:center;">
2.14
</td>
<td style="text-align:center;">
1.26
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500-dark/car.png" width="30" />
</td>
<td style="text-align:left;">
CAR
</td>
<td style="text-align:center;">
2.12
</td>
<td style="text-align:center;">
1.91
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/nyj.png" width="30" />
</td>
<td style="text-align:left;">
NYJ
</td>
<td style="text-align:center;">
2.07
</td>
<td style="text-align:center;">
1.25
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/atl.png" width="30" />
</td>
<td style="text-align:left;">
ATL
</td>
<td style="text-align:center;">
2.02
</td>
<td style="text-align:center;">
1.39
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/mia.png" width="30" />
</td>
<td style="text-align:left;">
MIA
</td>
<td style="text-align:center;">
2.02
</td>
<td style="text-align:center;">
1.40
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/cin.png" width="30" />
</td>
<td style="text-align:left;">
CIN
</td>
<td style="text-align:center;">
1.96
</td>
<td style="text-align:center;">
1.52
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/det.png" width="30" />
</td>
<td style="text-align:left;">
DET
</td>
<td style="text-align:center;">
1.92
</td>
<td style="text-align:center;">
2.28
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lv.png" width="30" />
</td>
<td style="text-align:left;">
LV
</td>
<td style="text-align:center;">
1.79
</td>
<td style="text-align:center;">
1.37
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/nyg.png" width="30" />
</td>
<td style="text-align:left;">
NYG
</td>
<td style="text-align:center;">
1.75
</td>
<td style="text-align:center;">
1.47
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/cle.png" width="30" />
</td>
<td style="text-align:left;">
CLE
</td>
<td style="text-align:center;">
1.64
</td>
<td style="text-align:center;">
0.94
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/wsh.png" width="30" />
</td>
<td style="text-align:left;">
WAS
</td>
<td style="text-align:center;">
1.59
</td>
<td style="text-align:center;">
1.21
</td>
</tr>
</tbody>
</table>
<p>The Ravens, Packers, Patriots, and Seahawks have the largest average home-field advantages, at around 2.75- to 3-point favorites at home. The Lions were only the fifth lowest on average, despite their 2008 results. Washington, the Browns, and the Giants are the three worst home teams. The Jets, who play at the same home stadium as the Giants, are about 2 points better at home than away while the Giants are 1.76 points better. Surprisingly, each team has basically the same standard deviation of home-field advantage at around 1.5 points.</p>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>I set out to try and see if the evaporation of home-field advantage in 2020 looked different enough from previous years to claim that the COVID-19 fan and travel restrictions might be a cause of the decline. I found that, yes, home-field advantage was extremely small to non-existent in 2020, but that drop also happened last year in 2019. A few articles discussed it in 2019, but the lack of home-field advantage was discussed much more this year in the context of the global pandemic. However, I don't think the pandemic can be blamed. In recent years, home-field advantage has been very similar across teams, and it dropped to essentially zero last year before the pandemic. Certainly my model could be improved, maybe week 17 games where starters are resting should be dropped, maybe garbage time scores should be removed too, and certainly team strength varies over the course of a season. But I suspect even with more robustness checks and sophisticated tools, the result will remain the same. Other good work on home-field advantage using different methods came to essentially the same conclusions.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> The home-field advantage disappeared last year, before anyone had heard of COVID-19.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Three 49ers games were moved to a neutral field, the Cardinal's home stadium, due to COVID-19 restrictions.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>This kind of problem would still hold if one modeled home and away scores separately, rather than just the difference. The home team's score only provides information in the home offense relative to the away defense.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Regular season football games can end in ties. They are rare enough that I chose to simply code ties as home team losses. Dropping them or changing them to home team wins do not meaningfully change the results because there are so few (10 out of 5,778 games since 1999). Teams with &quot;perfect&quot; records of 16-0 or 0-16 pose challenges as well. The MLE for their skill is <span class="math inline">\(\pm \infty\)</span> because they always win or always lose. This is mostly an issue for interpretation of the team strength variables, but it does make other estimates a bit unstable as well.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>A recent post by <a href="https://www.opensourcefootball.com/posts/2021-01-11-hfa-analysis/#adjusting-home-field-advantage">Adrian Cadena on Open Source Football</a> gives a good overview and similar analysis.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>Why you shouldn’t “hold-out” data in survival-model predictions</title>
<link>https://g-tierney.github.io/post/survival_hold_out_writeup/</link>
<pubDate>Tue, 08 Dec 2020 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/survival_hold_out_writeup/</guid>
<description>
<link href="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.css" rel="stylesheet" />
<script src="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.js"></script>
<p>In nearly all cases, the proper way to make predictions on a subset of your data is by holding-out the data you want to predict, training a model on the remaining data, then predicting the outcome on the held-out data using the trained model. The reason is that this procedure ostensibly captures how you would use this model in practice: train the model on all the data you have, then predict for new data where the outcome is unknown. Cross-validation follows this procedure as well. However, that logic (slightly) broke down for an assignment in a class I TA'ed this semester. The confusion was common enough that I thought it warranted some deeper explanation. This post summarizes an answer I gave during office hours and assumes an advanced undergraduate level of statistics background, along with familiarity with Bayesian statistics.</p>
<p>Suppose you are modeling the lifespan of world leaders. You are given a dataset of Popes, US Presidents, Dalai Lamas, Japanese Emperors, and Chinese Emperors. The data include various demographic fields: how long they lived, the age and year they assumed office, the position held, the year they died, and whether they are currently living. The task given to students was to predict how much longer the currently living leaders would survive (5 Presidents, 2 Japanese Emperors, 2 Popes, and 1 Dalai Lama).<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> Should you train a model on the deceased leaders, then predict lifespans for the living leaders? Many students took this approach. The answer, as you can surmise from the fact that this post exists, is no. You can, and should, train a lifespan model using the data from living leaders as well.</p>
<p>But first, some notation. In the traditional hold-out method, you pretend you do not know the outcome <span class="math inline">\(Y_i\)</span> for some data <span class="math inline">\(i\)</span> in your hold-out set and you predict that <span class="math inline">\(Y_i\)</span> using <span class="math inline">\(X_i\)</span>, covariate information on unit <span class="math inline">\(i\)</span>, and <span class="math inline">\(\{Y_j,X_j\}\)</span> for <span class="math inline">\(j\)</span> in the observed or training data. That is, you build a model that estimates <span class="math inline">\(Y_j\)</span> given data <span class="math inline">\(X_j\)</span>, then apply that model to the hold-out data <span class="math inline">\(X_i\)</span> to get an estimate of <span class="math inline">\(Y_i\)</span>. I will refer to the set of all fully observed data <span class="math inline">\(\{Y_j,X_j\}\)</span> as <span class="math inline">\(Y^{obs}\)</span>.</p>
<p>To make this a little more concrete, suppose you believe that lifespan for leaders follows a log-normal distribution, such that <span class="math inline">\(log(Y_i) \sim N(\beta_0 + \beta_1 X_{i1} + \ldots,\sigma^2)\)</span>. That is, the mean is a linear function of the predictors with a common variance term.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> The form of the model is not particularly important here, just that it has some sort of structure. If we know the parameters <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\sigma^2\)</span>, then we wouldn't need any training data at all. We know the underlying process and can simply predict lifespans for living leaders using the parameters.</p>
<p>Of course, we don't know the parameters. But we can learn the parameters from the training data and use them to predict the outcome. In Bayesian statistics this quantity is called the posterior-predictive distribution. We are interested in describing <span class="math inline">\(p(Y_i|X_i,Y^{obs})\)</span>, our beliefs or uncertainty about <span class="math inline">\(Y_i\)</span> from the hold-out set given our observed data <span class="math inline">\(Y^{obs}\)</span>. Omitting <span class="math inline">\(X_i\)</span> for clarity, this quantity can be analytically expressed as the following:</p>
<p><span class="math display">\[p(Y_i|Y^{obs}) = \int p(Y_i|\beta,\sigma^2) p(\beta,\sigma^2|Y^{obs}) \ d\beta d\sigma^2\]</span></p>
<p>Essentially, <span class="math inline">\(p(Y_i|Y^{obs})\)</span> is a weighted average of the assumed distribution, in this case log-normal, over the parameter space with parameter weights determined by their posterior density. <span class="math inline">\(p(\beta,\sigma^2|Y^{obs})\)</span> is the posterior distribution for the parameters given only the training data. Given samples from the posterior, one can sample from <span class="math inline">\(p(Y_i|\beta,\sigma^2)\)</span> to approximate the posterior predictive distribution.</p>
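<p>Given Monte Carlo draws from the posterior, this approximation is only a few lines of R. A minimal sketch, with hypothetical names (<code>draws</code> holding posterior samples of the coefficients and <code>sigma</code>, and <code>x_i</code> the covariates for one leader):</p>
<pre class="r"><code>## One posterior-predictive draw of Y_i for each posterior draw of (beta, sigma)
mu_i = draws$beta0 + draws$beta1 * x_i[1] + draws$beta2 * x_i[2]  # linear predictor on the log scale
y_rep = rlnorm(nrow(draws), meanlog = mu_i, sdlog = draws$sigma)  # log-normal lifespans

quantile(y_rep, c(0.025, 0.5, 0.975))  # summary of p(Y_i | Y_obs)</code></pre>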
<p>If you know <em>nothing</em> about <span class="math inline">\(Y_i\)</span> then the hold-out method is correct and really the only option. You can't learn from observations <span class="math inline">\(i\)</span> where you don't know anything about the outcome.</p>
<p>However, for survival data we do know <em>something</em> about the outcome. We know living leaders will live to at least their current age. So what you really want to estimate is <span class="math inline">\(p(Y_i|X_i,Y^{obs},\mathbf{Y_i &gt; c_i})\)</span> where <span class="math inline">\(c_i\)</span> is the living leader's current age. You wouldn't want to predict Jimmy Carter would only live to be 94 because he is currently 96. The expression from above becomes the following:</p>
<p><span class="math display">\[p(Y_i|Y^{obs},Y_i&gt;c_i) = \int p(Y_i|\beta,\sigma^2,Y_i&gt;c_i) p(\beta,\sigma^2|Y^{obs},Y_i&gt;c_i) \ d\beta d\sigma^2\]</span></p>
<p>The key difference is the term <span class="math inline">\(p(\beta,\sigma^2|Y^{obs},Y_i&gt;c_i)\)</span>. This is still a posterior distribution, but it is not the same posterior distribution as before because it includes the information on additional leaders who have lived at least <span class="math inline">\(c_i\)</span> years. There are five currently living Presidents, and the fact that they have reached their current ages should inform your beliefs about world leader life expectancy. If you use the hold-out method, you might predict a currently-living leader will die in the past, which is obviously wrong. If you simply force your predictions to predict times of death in the future, then you have trained your model on incomplete data and used the wrong posterior distribution. You modeled <span class="math inline">\(Y_i|Y^{obs},Y_i&gt;c_i\)</span> but learned your parameters <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\sigma^2\)</span> only from <span class="math inline">\(Y^{obs}\)</span> rather than <span class="math inline">\(Y^{obs}\)</span> and <span class="math inline">\(Y_i&gt;c_i\)</span>. I've used Bayesian formulations here because they provide nice ways to estimate survival models and make the distinction between the input data clear, but the logic applies to any estimation of future event times.</p>
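<p>Conditioning on <span class="math inline">\(Y_i &gt; c_i\)</span> only changes the sampling step slightly: draw from the log-normal truncated below at <span class="math inline">\(c_i\)</span>, for example by inverse-CDF sampling. Continuing the hypothetical sketch above (and remembering that <code>draws</code> itself should come from a posterior that also conditions on the censored ages):</p>
<pre class="r"><code>## Truncated posterior-predictive draws of Y_i given Y_i &gt; c_i
p_min = plnorm(c_i, meanlog = mu_i, sdlog = draws$sigma)  # P(Y_i &lt;= c_i) under each draw
u = runif(nrow(draws), min = p_min, max = 1)              # uniform above the truncation point
y_trunc = qlnorm(u, meanlog = mu_i, sdlog = draws$sigma)  # every draw exceeds c_i

quantile(y_trunc - c_i, c(0.025, 0.5, 0.975))  # predicted remaining years</code></pre>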
<p>Hold-out predictions and cross-validation procedures can be deceptively complex. Your predictive model should replicate how you will actually use it in practice. If you want to predict event times in the future, you should include in your model that those events have not happened yet. Including that information can be hard and may require a more complex estimation of the parameters given the data because the likelihood is now a product of densities <span class="math inline">\(p(Y_j)\)</span> and survival functions <span class="math inline">\(P(Y_i&gt;c_i)\)</span>.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> But it is certainly the “correct” way to do it because it includes all of the data currently available.</p>
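<p>For non-Bayesians, that density-times-survival likelihood is exactly what standard survival routines construct when you pass them a censoring indicator. A minimal sketch with the <code>survival</code> package (hypothetical column names, not the assignment's data):</p>
<pre class="r"><code>library(survival)

## died = 1 for deceased leaders, 0 for the living (censored) ones; censored rows
## contribute P(Y_i &gt; c_i) to the likelihood instead of a density term.
fit = survreg(Surv(age_at_end, died) ~ position + year_took_office,
              data = leaders, dist = &quot;lognormal&quot;)</code></pre>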
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The actual assignment had more components and is available <a href="https://amy-herring.github.io/STA440/leaders.html">here</a>.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>The students were tasked with using a more complicated model that is generally better for survival analysis but too complicated for exposition here. The assignment was based on expanding this paper: Stander, J., Dalla Valle, L., and Cortina-Borja, M. (2018). A Bayesian Survival Analysis of a Historical Dataset: How Long Do Popes Live? The American Statistician 72(4):368-375.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Of course, if you are a Bayesian, that combination is trivial.<a href="#fnref3">↩</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>What About the Emails?</title>
<link>https://g-tierney.github.io/post/political_emails/</link>
<pubDate>Fri, 07 Dec 2018 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/political_emails/</guid>
<description>
<script src="https://g-tierney.github.io/rmarkdown-libs/kePrint/kePrint.js"></script>
<div id="the-project" class="section level1">
<h1>The Project</h1>
<p>One fateful day while I was bored during a lecture, I decided to sign up for email messages from each Senate campaign during the 2018 election cycle.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> A lot of people study the impact of political advertisements on various outcomes, and I thought some interesting trends might emerge in the email blasts that campaigns send out.</p>
</div>
<div id="the-data" class="section level1">
<h1>The Data</h1>
<p>I signed up using a new email address, and only filled out the required fields to get on mailing lists. I used my real name, the zip code 00000, and a phone number of all zeros. The data on what information I gave to each campaign are on GitHub. I felt kind of bad signing up for volunteer lists with false information, which were the only email option for some campaigns. The data turned out to be pretty interesting though, so I think next cycle, for the presidential election, I will try to get on a more comprehensive set of emails by signing up for the House races too and providing zip codes and phone numbers in the relevant district.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a></p>
<p>I found the candidates and campaign websites from <a href="https://www.realclearpolitics.com/epolls/2018/senate/2018_elections_senate_map.html">RealClearPolitics's Senate map</a>. I started signing up for emails on 6/6/2018, but didn't sign up for every Senate race until 10/9/2018. In the last month before the election (October 6 through November 6, inclusive) I received 2,650 emails from 50 unique campaigns. Some campaigns did not have an option to sign up for an email list on their website, and some may have filtered out my email address because the zip code and/or phone number were clearly not accurate. Much of the analysis below compares emails from Democrats and Republicans, so I further filter the emails down to races where I received at least one email from both parties' candidates. The final number is 2,397 emails and 44 campaigns. Most analysis uses this sample, and I will specify when that is not the case.</p>
</div>
<div id="who-is-sending-emails-and-when" class="section level1">
<h1>Who is sending emails and when?</h1>
<p>I received emails from both parties in the following states: AZ, FL, IN, MA, MD, MI, MN, MO, MS, ND, NE, NJ, NV, NY, OH, PA, TN, TX, USA, UT, VA, VT, WA, WY. A notable omission is West Virginia, where I only received emails from Joe Manchin. Below I show the number of emails I received each day from each party.</p>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/who_emails-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>What immediately jumped out to me was that the Democratic candidates sent significantly more emails (and, for some reason, campaigns send the fewest emails on Wednesdays). Next, I tally the number of emails I received from each candidate, and show the races where I received at least 100 emails in total.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
State
</th>
<th style="text-align:left;">
Party
</th>
<th style="text-align:left;">
Campaign
</th>
<th style="text-align:right;">
Total Emails
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
NV
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Jacky Rosen
</td>
<td style="text-align:right;">
352
</td>
</tr>
<tr>
<td style="text-align:left;">
NV
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Dean Heller
</td>
<td style="text-align:right;">
147
</td>
</tr>
<tr>
<td style="text-align:left;">
FL
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Bill Nelson
</td>
<td style="text-align:right;">
237
</td>
</tr>
<tr>
<td style="text-align:left;">
FL
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Rick Scott
</td>
<td style="text-align:right;">
7
</td>
</tr>
<tr>
<td style="text-align:left;">
MO
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Claire McCaskill
</td>
<td style="text-align:right;">
185
</td>
</tr>
<tr>
<td style="text-align:left;">
MO
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Josh Hawley
</td>
<td style="text-align:right;">
30
</td>
</tr>
<tr>
<td style="text-align:left;">
ND
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Heidi Heitkamp
</td>
<td style="text-align:right;">
176
</td>
</tr>
<tr>
<td style="text-align:left;">
ND
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Kevin Cramer
</td>
<td style="text-align:right;">
29
</td>
</tr>
<tr>
<td style="text-align:left;">
AZ
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Kyrsten Sinema
</td>
<td style="text-align:right;">
96
</td>
</tr>
<tr>
<td style="text-align:left;">
AZ
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Martha McSally
</td>
<td style="text-align:right;">
69
</td>
</tr>
<tr>
<td style="text-align:left;">
IN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Joe Donnelly
</td>
<td style="text-align:right;">
106
</td>
</tr>
<tr>
<td style="text-align:left;">
IN
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Mike Braun
</td>
<td style="text-align:right;">
50
</td>
</tr>
<tr>
<td style="text-align:left;">
USA
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
DNC
</td>
<td style="text-align:right;">
59
</td>
</tr>
<tr>
<td style="text-align:left;">
USA
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
RNC
</td>
<td style="text-align:right;">
73
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Amy Klobuchar
</td>
<td style="text-align:right;">
19
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Tina Smith
</td>
<td style="text-align:right;">
62
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Karin Housely
</td>
<td style="text-align:right;">
38
</td>
</tr>
</tbody>
</table>
<p>Jacky Rosen, Bill Nelson, Claire McCaskill, Heidi Heitkamp, and Joe Donnelly (all Democrats) sent over 100 emails during the relevant time frame. Dean Heller was the only Republican who sent me over 100 emails. In general, Democrats sent more emails than their Republican opponents. However, I certainly would not be surprised if my sample were biased. People who signed up with in-state addresses probably received more emails than I did. I don't know how sophisticated campaigns are with targeting their emails, but I would be shocked if they did not focus efforts like Get Out the Vote campaigns on people with addresses in their district. I do wonder, though, if there is a connection between the fundraising strategies and email strategies of each party. If Democrats rely more on smaller donations from many individuals, they might need to send more emails to everyone who expresses interest in their campaign. Republicans who either self-fund or court fewer donations from wealthier individuals might simply not have much to gain from emailing out-of-state individuals.</p>
</div>
<div id="email-content" class="section level1">
<h1>Email Content</h1>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/word_clouds-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Next, I will analyze the content of the emails. The word clouds above show each word sized proportionally to the number of times it was used in the emails. Both parties' emails frequently used words like senate, vote, and fight, but some differences are already apparent. Here I do unfortunately need to filter the dataset down more. Many of the emails came in a format that did not download or parse into human-readable text well. I tried to extract the text from all emails, but some (particularly ones with odd formatting or with pictures of text) I could not parse properly. The number of emails analyzed in this section is only 2,043.</p>
<p>Word clouds are a useful visualization, but I will use a statistical technique, the relative risk ratio, to characterize the difference in word-usage between Republican and Democratic emails.</p>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/word_counts-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The chart above merits some further explanation. For each word, I calculated the proportion of Democratic emails that used the word and the proportion of Republican emails that used the word. Then, I took the ratio of those two quantities, often referred to as the relative risk ratio. To put everything on a comparable scale, if Republican emails used the word more frequently, I took the inverse of the ratio and multiplied it by -1. So if the ratio is equal to R and positive, then Democratic emails used the word R times more frequently. If the ratio is equal to R and negative, then Republican emails used the word R times more frequently. I show the 15 words with the largest ratio for each party.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a></p>
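<p>Concretely, the signed ratio for a single word can be computed as follows (a small sketch with hypothetical count variables, not the full text-processing pipeline):</p>
<pre class="r"><code>## Proportion of each party's emails containing the word
p_dem = n_dem_emails_with_word / n_dem_emails
p_rep = n_rep_emails_with_word / n_rep_emails

ratio = p_dem / p_rep
signed_ratio = ifelse(ratio &gt;= 1, ratio, -1 / ratio)  # negative values: Republicans use it more</code></pre>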
<p>Something that I noticed quickly is that election-specific terms from races where the Democrat sent many more emails than the Republican have high risk ratios: “Las Vegas”, “Scott” (Bill Nelson referring to his opponent Rick Scott), and “FL” are all in the top 15 for democrats. “ActBlue” is an organization that helps Democrats fundraise. “politicalemaild” is a truncated version of the email address I provided to campaigns.</p>
<p>I was not surprised that the words “borders” and “conservative” are in the top for Republicans, but I was quite surprised by “web” and “website” showing up in the top 15. “Liberal” is used frequently as a pejorative by Republicans, but apparently Democrats do not use the word anywhere near as frequently when messaging their own supporters.</p>
<p>Of course, words that are used by one party and never used by the other will have a risk ratio of plus or minus infinity. The problem with looking at all of those words is that they often are spelling or parsing errors that happen once or twice for one party and never for the other. To account for that, I show only the 10 most frequently used words that are never used by the opposing party.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="border-bottom:hidden" colspan="1">
</th>
<th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2">
<div style="border-bottom: 1px solid #ddd; padding-bottom: 5px;">
Democratic
</div>
</th>
<th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2">
<div style="border-bottom: 1px solid #ddd; padding-bottom: 5px;">
Republican
</div>
</th>
</tr>
<tr>
<th style="text-align:left;">
Word
</th>
<th style="text-align:right;">
Emails Using Word
</th>
<th style="text-align:right;">
Proportion Using Word
</th>
<th style="text-align:right;">
Emails Using Word
</th>
<th style="text-align:right;">
Proportion Using Word
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
youd
</td>
<td style="text-align:right;">
625
</td>
<td style="text-align:right;">
0.287
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
mitch
</td>
<td style="text-align:right;">
307
</td>
<td style="text-align:right;">
0.141
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
mcconnell
</td>
<td style="text-align:right;">
299
</td>
<td style="text-align:right;">
0.138
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
environment
</td>
<td style="text-align:right;">
262
</td>
<td style="text-align:right;">
0.121
</td>
<td style="text-align:right;">
0