<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Graham Tierney on Graham Tierney</title>
<link>https://g-tierney.github.io/</link>
<description>Recent content in Graham Tierney on Graham Tierney</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<copyright>&copy; 2018</copyright>
<lastBuildDate>Sun, 15 Oct 2017 00:00:00 -0400</lastBuildDate>
<atom:link href="/" rel="self" type="application/rss+xml" />
<item>
<title>Anonymous Cross-Party Conversations Can Decrease Political Polarization: A Field Experiment on a Mobile Chat Platform</title>
<link>https://g-tierney.github.io/publication/mediation_sensitivity/</link>
<pubDate>Thu, 23 Sep 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/mediation_sensitivity/</guid>
<description></description>
</item>
<item>
<title>Sensitivity Analysis for Causal Mediation through Text: an Application to Political Polarization</title>
<link>https://g-tierney.github.io/publication/discussit/</link>
<pubDate>Thu, 23 Sep 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/discussit/</guid>
<description></description>
</item>
<item>
<title>Author Clustering and Topic Estimation for Short Texts</title>
<link>https://g-tierney.github.io/publication/stldac/</link>
<pubDate>Tue, 15 Jun 2021 00:00:00 -0400</pubDate>
<guid>https://g-tierney.github.io/publication/stldac/</guid>
<description></description>
</item>
<item>
<title>Is the NFL's Home-Field Advantage Over?</title>
<link>https://g-tierney.github.io/post/home_field/</link>
<pubDate>Mon, 11 Jan 2021 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/home_field/</guid>
<description>
<link href="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.css" rel="stylesheet" />
<script src="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.js"></script>
<script src="https://g-tierney.github.io/rmarkdown-libs/kePrint/kePrint.js"></script>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>Home-field advantage (HFA) has been documented in many sports, and there has been much speculation about the potential causal mechanisms (referee bias, travel times, crowd reactions, etc.). In the NFL, playing at home has historically offered about a three-point advantage in the point differential, the equivalent of a field goal. However, with COVID-19-imposed restrictions, many home games were conducted without fans, and the home team's advantage nearly disappeared. Home teams won 127 of 253 games (50.2%),<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> scoring on average only 0.01 more points than their opponents as opposed to the typical 3.00.</p>
<p>I began this project to see if I could plausibly measure the home-field advantage season by season and gauge just how unusual 2020 was. Indeed, the raw statistics are historic lows for the NFL. 2020 saw the third-lowest home point differential (total points scored by home teams minus total points scored by away teams) and the fourth-lowest home win percentage since 1966. When examining past seasons, another year jumps out. Just last year, in the 2019 season, home teams were outscored by away teams for the first time since 1968. The plots below show the win rates and point differentials for the past 21 regular seasons, with games at neutral fields (e.g., international games) removed.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/eda-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The above are essentially just averages without controlling for team strength. Maybe 2020 and 2019 had unusual schedules that consistently placed lopsided match-ups with the favorite at home. Maybe teams with big home-field advantages happened to play unusually strong opponents in 2020 and 2019. The rest of this post investigates these two concerns to determine whether the results change when accounting for team strength and heterogeneity in home-field advantage across teams.</p>
</div>
<div id="adjusting-for-team-strength" class="section level2">
<h2>Adjusting for team strength</h2>
<p>In this section, I will specify a model for measuring home-field advantage while accounting for team strength and analyze home-field advantage over time, both in terms of points and win probability. I will focus primarily on the home team point differential, home team points minus away team points, for each game. This outcome captures what most people care about, the winner and loser, and contains more information than just modeling wins and losses directly. A team that consistently wins by 14 points is probably better than a team that wins by only 7. Using points also avoids having to deal with ties or 16-0 and 0-16 seasons, which pose some technical difficulties that I will explain later.</p>
<p>The model that I will use for point differential is <span class="math inline">\(Y_{hag} = \alpha_0 + \mu_h - \mu_a + \epsilon_{hag}\)</span> with <span class="math inline">\(\epsilon_{hag} \sim N(0,\sigma^2)\)</span>. <span class="math inline">\(Y_{hag}\)</span> is the score of home team <span class="math inline">\(h\)</span> minus the score of away team <span class="math inline">\(a\)</span> in game <span class="math inline">\(g\)</span>. <span class="math inline">\(Y_{hag}\)</span> is <span class="math inline">\(\alpha_0 + \mu_h - \mu_a\)</span> plus some noise, where <span class="math inline">\(\alpha_0\)</span> captures the home team's scoring advantage, <span class="math inline">\(\mu_h\)</span> measures the home team's strength and <span class="math inline">\(\mu_a\)</span> the away team's strength. The model has some nice interpretations of the parameters. <span class="math inline">\(\alpha_0\)</span> is the expected point differential when two equally skilled teams play. <span class="math inline">\(\mu_h - \mu_a\)</span> is the expected number of points <span class="math inline">\(h\)</span> will win or lose by when playing <span class="math inline">\(a\)</span> on a neutral field. Note that this interpretation is only for the difference in team strength. Each game outcome only provides insight on the <em>relative</em> strength of the teams, so the values of <span class="math inline">\(\mu_h\)</span> and <span class="math inline">\(\mu_a\)</span> are not identified, only the differences.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> One could, of course, model home points and away points, with separate offensive and defensive HFAs. However, modeling positive, bivariate outcomes gets much more complicated and the primary question of measuring total home-field advantage would ultimately result in estimating a quantity very similar to <span class="math inline">\(\alpha_0\)</span>.</p>
<p>I will also look at simple wins and losses rather than scores to measure home-field advantage. In this case, let the outcome <span class="math inline">\(W_{hag}\)</span> be a binary variable: 1 if the home team wins and 0 otherwise. Let <span class="math inline">\(p_{hag} = P(W_{hag} = 1)\)</span>, the probability of a home team win. The model is <span class="math inline">\(logit(p) = log\left(\frac{p}{1-p}\right) = \alpha_0 + \mu_h - \mu_a\)</span>, where <span class="math inline">\(logit(p)\)</span> refers to the log-odds of the home team winning. The same sort of interpretation applies, just with some slight transformations to account for the fact that the parameters can be any real number while <span class="math inline">\(p_{hag}\)</span> needs to be between 0 and 1. <span class="math inline">\(e^{\alpha_0}\)</span> is the odds of the home team winning given equal skill. <span class="math inline">\(e^{\mu_h - \mu_a}\)</span> is the odds that team <span class="math inline">\(h\)</span> beats team <span class="math inline">\(a\)</span> on a neutral field. Note the same identification problem arises in the scale of <span class="math inline">\(\mu_i\)</span>.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a></p>
<p>Both models might not look it, but they are actually linear models (or generalized linear for win probabilities) that can be estimated with standard regression packages, just with a clever design matrix. To address the identifiability issues, I enforce a constraint that <span class="math inline">\(\sum_i \mu_i = 0\)</span>. Both models, points and wins, are essentially extensions of the Bradley-Terry model for paired comparisons. See the next section (Implementation Details) for said implementation details. Data, along with team logos and colors used later, were collected from the <code>nflfastR</code> package by Sebastian Carl and Ben Baldwin.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/plot_results-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The left panel shows the estimated expected points advantage for the home teams in each season given an opponent of equal strength. The error bars show the 95% confidence intervals. For all but two seasons prior to 2019, we can reject the null that the home-field points advantage is 0. 2019 and 2020 have estimated home-field advantages of -0.075 and 0.114 respectively. Failing to reject the null is of course different from concluding the null is true, but it would be quite challenging to get point estimates closer to zero than what we observe.</p>
<p>In terms of home win probabilities, the results are similar. When two equally skilled teams play, prior to 2019, the home team had about a 60% chance of winning. In 2019 and 2020, the confidence intervals exclude 60% for the first time, and the home team only had about a 52% and 51% chance of winning, respectively, both statistically indistinguishable from a 50-50 chance.</p>
</div>
<div id="implementation-details" class="section level2">
<h2>Implementation Details</h2>
<p>If you just want to know which teams have the biggest home-field advantages, go ahead and skip to the next section. I assume that if you are still here, you are familiar with multiple regression and with using statistical packages to implement it. I found estimating the previous section's model non-trivial, interesting, and a useful learning experience for dealing with complicated contrasts in OLS and GLM settings, so I thought documenting it here could help others.</p>
<p>Here I will describe how I actually implemented the models, focusing on the points model but both are essentially the same. The statement <span class="math inline">\(Y_{hag} \sim N(\alpha_0 + \mu_h - \mu_a,\sigma^2)\)</span> nicely expresses the model, but it does not look like a standard linear model of the form <span class="math inline">\(Y = \beta_0 + x_1 \beta_1 + \epsilon\)</span> that most statistical packages request, mostly because of the minus sign and the fact that teams switch sides, playing both home and away games. To use standard estimation tools, essentially, for each game <span class="math inline">\(g\)</span>, we need to come up with a vector of variables <span class="math inline">\(\mathbf{x}_g\)</span> such that estimated coefficients <span class="math inline">\(\mathbf{\beta}\)</span> simplify to: <span class="math inline">\(\mathbf{\beta}^T \mathbf{x}_g = \beta_1 x_{g1} + \beta_2 x_{g2} + \ldots = \alpha_0 + \mu_h - \mu_a\)</span>.</p>
<p>Getting <span class="math inline">\(\alpha_0\)</span> is easy: make <span class="math inline">\(x_{g1} = 1\)</span> for every game and we get the intercept. Most stats packages don't actually require you to specify the intercept because it is usually not interpretable, but it is our main variable of interest here. Then, for each team <span class="math inline">\(i\)</span> let <span class="math inline">\(z_{gi} = 1\)</span> if team <span class="math inline">\(i\)</span> is the home team in game <span class="math inline">\(g\)</span>, <span class="math inline">\(-1\)</span> if <span class="math inline">\(i\)</span> is the away team, and 0 otherwise. If we stack these <span class="math inline">\(z\)</span> variables together, <span class="math inline">\(\mu_1 z_{g1} + \mu_2 z_{g2} + \ldots + \mu_{32} z_{g32} = \mu_h - \mu_a\)</span>. But we aren't quite done, as you'll notice I've called these variables <span class="math inline">\(z\)</span> and not the <span class="math inline">\(x\)</span> that we are interested in. If you try to estimate an intercept plus 32 strength variables (one for each NFL team), the coefficient on <span class="math inline">\(z_{32}\)</span> cannot be estimated. That's because if you add up <span class="math inline">\(z_1\)</span> through <span class="math inline">\(z_{31}\)</span> and multiply by -1, you get <span class="math inline">\(z_{32}\)</span> (proof left as an exercise for the reader). This is essentially the identification problem coming back up. Just dropping one of the team strength variables doesn't quite work because the intercept becomes the expected point differential of whatever team was dropped playing at home against a team of strength 0. The dropping uses the identification constraint <span class="math inline">\(\mu_{32} = 0\)</span> and treats team 32 as the &quot;baseline&quot; team.</p>
<p>The constraint we really want is not for one of the team strength variables to be zero, but rather for them to sum to zero. There is no nice way to tell the software this information because the home and away team information is stored in two different columns of the dataset. So, we have to do it ourselves. In terms of the coefficients, we know <span class="math inline">\(\sum_{i=1}^{32} \mu_i = 0\)</span>, so we can write <span class="math inline">\(\sum_{i=1}^{31} \mu_i = -\mu_{32}\)</span>. Thus, we can express the regression equation as:</p>
<span class="math display">\[\begin{align*}
E[Y_g] &amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i z_{i} + \mu_{32} z_{32} \\
&amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i z_{i} - \left(\sum_{i=1}^{31} \mu_i\right) z_{32} \\
&amp;= \alpha_0 + \sum_{i=1}^{31} \mu_i (z_{i} - z_{32}) \\
\end{align*}\]</span>
<p>And here we have the final result! Set the variables <span class="math inline">\(x_{1} = 1\)</span> for the intercept and <span class="math inline">\(x_{i+1} = z_i - z_{32}\)</span> for <span class="math inline">\(i\)</span> in 1 to 31 (number of teams <span class="math inline">\(-1\)</span>), and you have the full equation. We've used the desired constraint to augment the trinary team indicator variables such that the intercept directly measures our quantity of interest. To get the team strength estimates, <span class="math inline">\(\mu_i\)</span> is the regression coefficient on <span class="math inline">\(z_i-z_{32}\)</span> and <span class="math inline">\(\mu_{32}\)</span> is the negative of the sum of the other <span class="math inline">\(\mu_i\)</span> terms (ensuring they all sum to zero). Thinking about uncertainty in the estimates of the <span class="math inline">\(\mu_i\)</span> terms is hard because they are all related to each other: if one is an overestimate, another must be an underestimate. Dealing with that is outside the scope of this post, but I may return to it later.</p>
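<p>To make this concrete, here is a minimal sketch of the construction in R (assuming a data frame <code>games</code> with columns <code>home_team</code>, <code>away_team</code>, and <code>point_diff</code>; this is an illustration, not my actual analysis code):</p>
<pre class="r"><code>## z_gi = 1 if team i is home in game g, -1 if away, 0 otherwise
teams = sort(unique(c(games$home_team, games$away_team)))  # 32 teams
Z = matrix(0, nrow = nrow(games), ncol = length(teams), dimnames = list(NULL, teams))
Z[cbind(seq_len(nrow(games)), match(games$home_team, teams))] = 1
Z[cbind(seq_len(nrow(games)), match(games$away_team, teams))] = -1

## Apply the sum-to-zero constraint: x_i = z_i - z_32 for i = 1, ..., 31
X = Z[, -ncol(Z)] - Z[, ncol(Z)]

## Points model: the intercept estimates alpha_0
fit_points = lm(games$point_diff ~ X)

## Win model: ties coded as home losses, intercept is the log-odds version of alpha_0
fit_wins = glm(I(games$point_diff &gt; 0) ~ X, family = binomial)

## Recover all 32 team strengths (they sum to zero by construction)
mu = c(coef(fit_points)[-1], -sum(coef(fit_points)[-1]))</code></pre>
<p>The intercept of <code>fit_points</code> is the season's estimated home-field advantage in points, and the intercept of <code>fit_wins</code> is the estimated log-odds of a home win between equally skilled teams.</p>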
</div>
<div id="home-field-advantage-by-team" class="section level2">
<h2>Home-Field Advantage by Team</h2>
<p>Finally, I'll look at home-field advantage by team. I'll only do this for the points model because the aforementioned issues with 16-0 and 0-16 seasons become even more common here: an undefeated or winless home record causes the same problem. The extension to the model is simple; I just add a subscript: <span class="math inline">\(Y_{hag} \sim N(\alpha_h + \mu_h - \mu_a,\sigma^2)\)</span>. Now I've written <span class="math inline">\(\alpha_h\)</span> rather than <span class="math inline">\(\alpha_0\)</span> to note that the home-field advantage is home-team specific. Estimating this parameter is, however, a bit trickier. In the last model, <span class="math inline">\(\alpha_0\)</span> was informed directly by all 256 games in a season. Each <span class="math inline">\(\alpha_h\)</span> is informed by only 8. We can use a similar implementation as above to get a best guess of each parameter (the maximum likelihood estimate), but those estimates will be quite noisy. Consequently, I will put my Bayesian hat back on and use a hierarchical model for the home team advantage terms: <span class="math inline">\(\alpha_h \sim N(\alpha_0,\sigma_\alpha^2)\)</span>. I assume that the <span class="math inline">\(\alpha_h\)</span> terms are all drawn from a common distribution. This shrinks the estimates each season towards the &quot;typical&quot; home-field advantage and provides some slight regularization so we don't make overly extreme estimates from limited data. The model is implemented in Stan, and you can find the code on my <a href="https://github.com/g-tierney/NFL_HFA">GitHub page here</a>.</p>
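<p>As a rough off-the-shelf approximation to that hierarchical structure (a sketch with hypothetical object names, not the actual Stan code in the repository), one could fit a single season with <code>rstanarm</code>, which places a common normal distribution over the home-team intercepts:</p>
<pre class="r"><code>## Sketch only: `d` holds one season with point_diff, home_team, and the
## strength contrasts s1, ..., s31 built as in the Implementation Details section.
library(rstanarm)

strength_terms = paste(paste0(&quot;s&quot;, 1:31), collapse = &quot; + &quot;)
form = as.formula(paste(&quot;point_diff ~&quot;, strength_terms, &quot;+ (1 | home_team)&quot;))

fit = stan_lmer(form, data = d)
## The fixed intercept plays the role of alpha_0 (the typical HFA), the random
## intercepts are the team-specific deviations alpha_h - alpha_0, and their
## estimated standard deviation corresponds to sigma_alpha.</code></pre>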
<p>From the model, I recover posterior beliefs about <span class="math inline">\(\alpha_0\)</span>, the home-field advantage of a typical team in each season, and each <span class="math inline">\(\alpha_h\)</span>, the home-field advantage for team <span class="math inline">\(h\)</span> in a season. Another important (and new) variable is <span class="math inline">\(\sigma_{\alpha}\)</span>. This is the standard deviation of home-field advantage across teams in a given season. Standard deviations are easy-to-interpret uncertainty measures: about two-thirds of the actual <span class="math inline">\(\alpha_h\)</span> values will be in the interval <span class="math inline">\(\alpha_0\pm\sigma_\alpha\)</span> and nearly all of the values will be within <span class="math inline">\(\alpha_0\pm 2\sigma_\alpha\)</span> (about 1 or 2 <span class="math inline">\(\alpha_h\)</span> values will fall outside of that interval each season). The figure below shows those results.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/dist_results-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The Typical HFA measures how much an average team would be favored at home when playing an equally skilled opponent. These results track with the above results assuming a constant home-field advantage across teams, but for some years the error bars have gotten wider. Certain years, such as 2003 and 2008, had much more variable home-field advantages, which increases uncertainty about the behavior of an average team. In 2008, most estimates ranged from the home team being favored by about 6 points to being a two-point underdog. The next two plots break out the estimates by team. The top panel shows every team and the bottom just the largest and smallest home-field advantages.</p>
<p><img src="https://g-tierney.github.io/post/home_field_files/figure-html/team_results-1.png" width="672" style="display: block; margin: auto;" /><img src="https://g-tierney.github.io/post/home_field_files/figure-html/team_results-2.png" width="672" style="display: block; margin: auto;" /></p>
<p>The top plot shows the wide range of home advantages even within a single year, with lines connecting each team's estimate. Starting around 2015, that variability drops off and teams all start to look very similar to each other. Most of the time, the worst home-field advantage is about 0. The outlier in 2008 was the Detroit Lions, who were expected to lose by 5 points at home against a team that they would tie on a neutral field. This was of course the season the Lions went 0-16, losing home games by a significantly larger margin than away games. The bottom plot just picks out the best and worst teams. There is significant turnover year-to-year in the NFL, and that pattern continues into home-field advantage. The year prior to the Lions' historically bad year, they had the largest home-field advantage. The 9ers, Dolphins, Jaguars, Panthers, and Steelers also had the smallest and largest home-field advantages in different years, although none of them managed it in consecutive years. To review all teams, the table below reports the average and standard deviation of <span class="math inline">\(\alpha_h\)</span> across all seasons for each team.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;">
</th>
<th style="text-align:left;">
Team
</th>
<th style="text-align:center;">
HFA (Mean)
</th>
<th style="text-align:center;">
HFA (SD)
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/bal.png" width="30" />
</td>
<td style="text-align:left;">
BAL
</td>
<td style="text-align:center;">
3.08
</td>
<td style="text-align:center;">
1.57
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/gb.png" width="30" />
</td>
<td style="text-align:left;">
GB
</td>
<td style="text-align:center;">
2.93
</td>
<td style="text-align:center;">
1.71
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ne.png" width="30" />
</td>
<td style="text-align:left;">
NE
</td>
<td style="text-align:center;">
2.90
</td>
<td style="text-align:center;">
1.60
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/sea.png" width="30" />
</td>
<td style="text-align:left;">
SEA
</td>
<td style="text-align:center;">
2.79
</td>
<td style="text-align:center;">
1.47
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/pit.png" width="30" />
</td>
<td style="text-align:left;">
PIT
</td>
<td style="text-align:center;">
2.62
</td>
<td style="text-align:center;">
1.50
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ind.png" width="30" />
</td>
<td style="text-align:left;">
IND
</td>
<td style="text-align:center;">
2.59
</td>
<td style="text-align:center;">
1.19
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/dal.png" width="30" />
</td>
<td style="text-align:left;">
DAL
</td>
<td style="text-align:center;">
2.59
</td>
<td style="text-align:center;">
1.62
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/min.png" width="30" />
</td>
<td style="text-align:left;">
MIN
</td>
<td style="text-align:center;">
2.58
</td>
<td style="text-align:center;">
1.59
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lar.png" width="30" />
</td>
<td style="text-align:left;">
LA
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
1.87
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/sf.png" width="30" />
</td>
<td style="text-align:left;">
SF
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
1.76
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/no.png" width="30" />
</td>
<td style="text-align:left;">
NO
</td>
<td style="text-align:center;">
2.47
</td>
<td style="text-align:center;">
2.19
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/phi.png" width="30" />
</td>
<td style="text-align:left;">
PHI
</td>
<td style="text-align:center;">
2.38
</td>
<td style="text-align:center;">
1.49
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/buf.png" width="30" />
</td>
<td style="text-align:left;">
BUF
</td>
<td style="text-align:center;">
2.38
</td>
<td style="text-align:center;">
1.55
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lac.png" width="30" />
</td>
<td style="text-align:left;">
LAC
</td>
<td style="text-align:center;">
2.36
</td>
<td style="text-align:center;">
1.44
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/den.png" width="30" />
</td>
<td style="text-align:left;">
DEN
</td>
<td style="text-align:center;">
2.34
</td>
<td style="text-align:center;">
1.53
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/kc.png" width="30" />
</td>
<td style="text-align:left;">
KC
</td>
<td style="text-align:center;">
2.32
</td>
<td style="text-align:center;">
1.90
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ten.png" width="30" />
</td>
<td style="text-align:left;">
TEN
</td>
<td style="text-align:center;">
2.30
</td>
<td style="text-align:center;">
1.53
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/hou.png" width="30" />
</td>
<td style="text-align:left;">
HOU
</td>
<td style="text-align:center;">
2.27
</td>
<td style="text-align:center;">
1.17
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/chi.png" width="30" />
</td>
<td style="text-align:left;">
CHI
</td>
<td style="text-align:center;">
2.25
</td>
<td style="text-align:center;">
1.25
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/ari.png" width="30" />
</td>
<td style="text-align:left;">
ARI
</td>
<td style="text-align:center;">
2.23
</td>
<td style="text-align:center;">
1.64
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/tb.png" width="30" />
</td>
<td style="text-align:left;">
TB
</td>
<td style="text-align:center;">
2.16
</td>
<td style="text-align:center;">
1.50
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/jax.png" width="30" />
</td>
<td style="text-align:left;">
JAX
</td>
<td style="text-align:center;">
2.14
</td>
<td style="text-align:center;">
1.26
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500-dark/car.png" width="30" />
</td>
<td style="text-align:left;">
CAR
</td>
<td style="text-align:center;">
2.12
</td>
<td style="text-align:center;">
1.91
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/nyj.png" width="30" />
</td>
<td style="text-align:left;">
NYJ
</td>
<td style="text-align:center;">
2.07
</td>
<td style="text-align:center;">
1.25
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/atl.png" width="30" />
</td>
<td style="text-align:left;">
ATL
</td>
<td style="text-align:center;">
2.02
</td>
<td style="text-align:center;">
1.39
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/mia.png" width="30" />
</td>
<td style="text-align:left;">
MIA
</td>
<td style="text-align:center;">
2.02
</td>
<td style="text-align:center;">
1.40
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/cin.png" width="30" />
</td>
<td style="text-align:left;">
CIN
</td>
<td style="text-align:center;">
1.96
</td>
<td style="text-align:center;">
1.52
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/det.png" width="30" />
</td>
<td style="text-align:left;">
DET
</td>
<td style="text-align:center;">
1.92
</td>
<td style="text-align:center;">
2.28
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/lv.png" width="30" />
</td>
<td style="text-align:left;">
LV
</td>
<td style="text-align:center;">
1.79
</td>
<td style="text-align:center;">
1.37
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/nyg.png" width="30" />
</td>
<td style="text-align:left;">
NYG
</td>
<td style="text-align:center;">
1.75
</td>
<td style="text-align:center;">
1.47
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/cle.png" width="30" />
</td>
<td style="text-align:left;">
CLE
</td>
<td style="text-align:center;">
1.64
</td>
<td style="text-align:center;">
0.94
</td>
</tr>
<tr>
<td style="text-align:center;">
<img src="https://a.espncdn.com/i/teamlogos/nfl/500/wsh.png" width="30" />
</td>
<td style="text-align:left;">
WAS
</td>
<td style="text-align:center;">
1.59
</td>
<td style="text-align:center;">
1.21
</td>
</tr>
</tbody>
</table>
<p>The Ravens, Packers, Patriots, and Seahawks have the largest average home-field advantages, at around 2.75- to 3-point favorites at home. The Lions were only the fifth lowest on average, despite their 2008 results. Washington, the Browns, and the Giants are the three worst home teams. The Jets, who play at the same home stadium as the Giants, are about 2 points better at home than away while the Giants are 1.76 points better. Surprisingly, each team has basically the same standard deviation of home-field advantage at around 1.5 points.</p>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>I set out to try and see if the evaporation of home-field advantage in 2020 looked different enough from previous years to claim that the COVID-19 fan and travel restrictions might be a cause of the decline. I found that, yes, home-field advantage was extremely small to non-existent in 2020, but that drop also happened last year in 2019. A few articles discussed it in 2019, but the lack of home-field advantage was discussed much more this year in the context of the global pandemic. However, I don't think the pandemic can be blamed. In recent years, home-field advantage has been very similar across teams, and it dropped to essentially zero last year before the pandemic. Certainly my model could be improved, maybe week 17 games where starters are resting should be dropped, maybe garbage time scores should be removed too, and certainly team strength varies over the course of a season. But I suspect even with more robustness checks and sophisticated tools, the result will remain the same. Other good work on home-field advantage using different methods came to essentially the same conclusions.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> The home-field advantage disappeared last year, before anyone had heard of COVID-19.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Three 49ers games were moved to a neutral field, the Cardinal's home stadium, due to COVID-19 restrictions.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>This kind of problem would still hold if one modeled home and away scores separately, rather than just the difference. The home team's score only provides information in the home offense relative to the away defense.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Regular season football games can end in ties. They are rare enough that I chose to simply code ties as home team losses. Dropping them or changing them to home team wins do not meaningfully change the results because there are so few (10 out of 5,778 games since 1999). Teams with &quot;perfect&quot; records of 16-0 or 0-16 pose challenges as well. The MLE for their skill is <span class="math inline">\(\pm \infty\)</span> because they always win or always lose. This is mostly an issue for interpretation of the team strength variables, but it does make other estimates a bit unstable as well.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>A recent post by <a href="https://www.opensourcefootball.com/posts/2021-01-11-hfa-analysis/#adjusting-home-field-advantage">Adrian Cadena on Open Source Football</a> gives a good overview and similar analysis.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>Why you shouldn’t “hold-out” data in survival-model predictions</title>
<link>https://g-tierney.github.io/post/survival_hold_out_writeup/</link>
<pubDate>Tue, 08 Dec 2020 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/survival_hold_out_writeup/</guid>
<description>
<link href="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.css" rel="stylesheet" />
<script src="https://g-tierney.github.io/rmarkdown-libs/anchor-sections/anchor-sections.js"></script>
<p>In nearly all cases, the proper way to make predictions on a subset of your data is by holding-out the data you want to predict, training a model on the remaining data, then predicting the outcome on the held-out data using the trained model. The reason is that this procedure ostensibly captures how you would use this model in practice: train the model on all the data you have, then predict for new data where the outcome is unknown. Cross-validation follows this procedure as well. However, that logic (slightly) broke down for an assignment in a class I TA'ed this semester. The confusion was common enough that I thought it warranted some deeper explanation. This post summarizes an answer I gave during office hours and assumes an advanced undergraduate level of statistics background, along with familiarity with Bayesian statistics.</p>
<p>Suppose you are modeling the lifespan of world leaders. You are given a dataset of Popes, US Presidents, Dalai Lamas, Japanese Emperors, and Chinese Emperors. The data include various demographic fields: how long they lived, the age and year they assumed office, the position held, the year they died, and whether they are currently living. The task given to students was to predict how much longer the currently living leaders would survive (5 Presidents, 2 Japanese Emperors, 2 Popes, and 1 Dalai Lama).<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> Should you train a model on the deceased leaders, then predict lifespans for the living leaders? Many students took this approach. The answer, as you can surmise from the fact that this post exists, is no. You can, and should, train a lifespan model using the data from living leaders as well.</p>
<p>But first, some notation. In the traditional hold-out method, you pretend you do not know the outcome <span class="math inline">\(Y_i\)</span> for some data <span class="math inline">\(i\)</span> in your hold-out set and you predict that <span class="math inline">\(Y_i\)</span> using <span class="math inline">\(X_i\)</span>, covariate information on unit <span class="math inline">\(i\)</span>, and <span class="math inline">\(\{Y_j,X_j\}\)</span> for <span class="math inline">\(j\)</span> in the observed or training data. That is, you build a model that estimates <span class="math inline">\(Y_j\)</span> given data <span class="math inline">\(X_j\)</span>, then apply that model to the hold-out data <span class="math inline">\(X_i\)</span> to get an estimate of <span class="math inline">\(Y_i\)</span>. I will refer to the set of all fully observed data <span class="math inline">\(\{Y_j,X_j\}\)</span> as <span class="math inline">\(Y^{obs}\)</span>.</p>
<p>To make this a little more concrete, suppose you believe that lifespan for leaders follows a log-normal distribution, such that <span class="math inline">\(log(Y_i) \sim N(\beta_0 + \beta_1 X_{i1} + \ldots,\sigma^2)\)</span>. That is, the mean is a linear function of the predictors with a common variance term.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> The form of the model is not particularly important here, just that it has some sort of structure. If we know the parameters <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\sigma^2\)</span>, then we wouldn't need any training data at all. We know the underlying process and can simply predict lifespans for living leaders using the parameters.</p>
<p>Of course, we don't know the parameters. But we can learn the parameters from the training data and use them to predict the outcome. In Bayesian statistics this quantity is called the posterior-predictive distribution. We are interested in describing <span class="math inline">\(p(Y_i|X_i,Y^{obs})\)</span>, our beliefs or uncertainty about <span class="math inline">\(Y_i\)</span> from the hold-out set given our observed data <span class="math inline">\(Y^{obs}\)</span>. Omitting <span class="math inline">\(X_i\)</span> for clarity, this quantity can be analytically expressed as the following:</p>
<p><span class="math display">\[p(Y_i|Y^{obs}) = \int p(Y_i|\beta,\sigma^2) p(\beta,\sigma^2|Y^{obs}) \ d\beta d\sigma^2\]</span></p>
<p>Essentially, <span class="math inline">\(p(Y_i|Y^{obs})\)</span> is a weighted average of the assumed distribution, in this case log-normal, over the parameter space with parameter weights determined by their posterior density. <span class="math inline">\(p(\beta,\sigma^2|Y^{obs})\)</span> is the posterior distribution for the parameters given only the training data. Given samples from the posterior, one can sample from <span class="math inline">\(p(Y_i|\beta,\sigma^2)\)</span> to approximate the posterior predictive distribution.</p>
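<p>Given Monte Carlo draws from the posterior, this approximation is only a few lines of R. A minimal sketch, with hypothetical names (<code>draws</code> holding posterior samples of the coefficients and <code>sigma</code>, and <code>x_i</code> the covariates for one leader):</p>
<pre class="r"><code>## One posterior-predictive draw of Y_i for each posterior draw of (beta, sigma)
mu_i = draws$beta0 + draws$beta1 * x_i[1] + draws$beta2 * x_i[2]  # linear predictor on the log scale
y_rep = rlnorm(nrow(draws), meanlog = mu_i, sdlog = draws$sigma)  # log-normal lifespans

quantile(y_rep, c(0.025, 0.5, 0.975))  # summary of p(Y_i | Y_obs)</code></pre>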
<p>If you know <em>nothing</em> about <span class="math inline">\(Y_i\)</span> then the hold-out method is correct and really the only option. You can't learn from observations <span class="math inline">\(i\)</span> where you don't know anything about the outcome.</p>
<p>However, for survival data we do know <em>something</em> about the outcome. We know living leaders will live to at least their current age. So what you really want to estimate is <span class="math inline">\(p(Y_i|X_i,Y^{obs},\mathbf{Y_i &gt; c_i})\)</span> where <span class="math inline">\(c_i\)</span> is the living leader's current age. You wouldn't want to predict Jimmy Carter would only live to be 94 because he is currently 96. The expression from above becomes the following:</p>
<p><span class="math display">\[p(Y_i|Y^{obs},Y_i&gt;c_i) = \int p(Y_i|\beta,\sigma^2,Y_i&gt;c_i) p(\beta,\sigma^2|Y^{obs},Y_i&gt;c_i) \ d\beta d\sigma^2\]</span></p>
<p>The key difference is the term <span class="math inline">\(p(\beta,\sigma^2|Y^{obs},Y_i&gt;c_i)\)</span>. This is still a posterior distribution, but it is not the same posterior distribution as before because it includes the information on additional leaders who have lived at least <span class="math inline">\(c_i\)</span> years. There are five currently living Presidents, and the fact that they have reached their current ages should inform your beliefs about world leader life expectancy. If you use the hold-out method, you might predict a currently-living leader will die in the past, which is obviously wrong. If you simply force your predictions to predict times of death in the future, then you have trained your model on incomplete data and used the wrong posterior distribution. You modeled <span class="math inline">\(Y_i|Y^{obs},Y_i&gt;c_i\)</span> but learned your parameters <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\sigma^2\)</span> only from <span class="math inline">\(Y^{obs}\)</span> rather than <span class="math inline">\(Y^{obs}\)</span> and <span class="math inline">\(Y_i&gt;c_i\)</span>. I've used Bayesian formulations here because they provide nice ways to estimate survival models and make the distinction between the input data clear, but the logic applies to any estimation of future event times.</p>
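<p>Conditioning on <span class="math inline">\(Y_i &gt; c_i\)</span> only changes the sampling step slightly: draw from the log-normal truncated below at <span class="math inline">\(c_i\)</span>, for example by inverse-CDF sampling. Continuing the hypothetical sketch above (and remembering that <code>draws</code> itself should come from a posterior that also conditions on the censored ages):</p>
<pre class="r"><code>## Truncated posterior-predictive draws of Y_i given Y_i &gt; c_i
p_min = plnorm(c_i, meanlog = mu_i, sdlog = draws$sigma)  # P(Y_i &lt;= c_i) under each draw
u = runif(nrow(draws), min = p_min, max = 1)              # uniform above the truncation point
y_trunc = qlnorm(u, meanlog = mu_i, sdlog = draws$sigma)  # every draw exceeds c_i

quantile(y_trunc - c_i, c(0.025, 0.5, 0.975))  # predicted remaining years</code></pre>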
<p>Hold-out predictions and cross-validation procedures can be deceptively complex. Your predictive model should replicate how you will actually use it in practice. If you want to predict event times in the future, you should include in your model that those events have not happened yet. Including that information can be hard and may require a more complex estimation of the parameters given the data because the likelihood is now a product of densities <span class="math inline">\(p(Y_j)\)</span> and survival functions <span class="math inline">\(P(Y_i&gt;c_i)\)</span>.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> But it is certainly the “correct” way to do it because it includes all of the data currently available.</p>
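<p>For non-Bayesians, that density-times-survival likelihood is exactly what standard survival routines construct when you pass them a censoring indicator. A minimal sketch with the <code>survival</code> package (hypothetical column names, not the assignment's data):</p>
<pre class="r"><code>library(survival)

## died = 1 for deceased leaders, 0 for the living (censored) ones; censored rows
## contribute P(Y_i &gt; c_i) to the likelihood instead of a density term.
fit = survreg(Surv(age_at_end, died) ~ position + year_took_office,
              data = leaders, dist = &quot;lognormal&quot;)</code></pre>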
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The actual assignment had more components and is available <a href="https://amy-herring.github.io/STA440/leaders.html">here</a>.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>The students were tasked with using a more complicated model that is generally better for survival analysis but too complicated for exposition here. The assignment was based on expanding this paper: Stander, J., Dalla Valle, L., and Cortina-Borja, M. (2018). A Bayesian Survival Analysis of a Historical Dataset: How Long Do Popes Live? The American Statistician 72(4):368-375.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Of course, if you are a Bayesian, that combination is trivial.<a href="#fnref3">↩</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>What About the Emails?</title>
<link>https://g-tierney.github.io/post/political_emails/</link>
<pubDate>Fri, 07 Dec 2018 00:00:00 +0000</pubDate>
<guid>https://g-tierney.github.io/post/political_emails/</guid>
<description>
<script src="https://g-tierney.github.io/rmarkdown-libs/kePrint/kePrint.js"></script>
<div id="the-project" class="section level1">
<h1>The Project</h1>
<p>One fateful day while I was bored during a lecture, I decided to sign up for email messages from each Senate campaign during the 2018 election cycle.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> A lot of people study the impact of political advertisements on various outcomes, and I thought some interesting trends might emerge in the email blasts that campaigns send out.</p>
</div>
<div id="the-data" class="section level1">
<h1>The Data</h1>
<p>I signed up using a new email address, and only filled out the required fields to get on mailing lists. I used my real name, the zip code 00000, and a phone number of all zeros. The data on what information I gave to each campaign are on GitHub. I felt kind of bad signing up for volunteer lists with false information, which were the only email option for some campaigns. The data turned out to be pretty interesting though, so I think next cycle, for the presidential election, I will try to get on a more comprehensive set of emails by signing up for the House races too and providing zip codes and phone numbers in the relevant district.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a></p>
<p>I found the candidates and campaign websites from <a href="https://www.realclearpolitics.com/epolls/2018/senate/2018_elections_senate_map.html">RealClearPolitics's Senate map</a>. I started signing up for emails on 6/6/2018, but didn't sign up for every Senate race until 10/9/2018. In the last month before the election (October 6 through November 6, inclusive) I received 2,650 emails from 50 unique campaigns. Some campaigns did not have an option to sign up for an email list on their website, and some may have filtered out my email address because the zip code and/or phone number were clearly not accurate. Much of the analysis below compares emails from Democrats and Republicans, so I further filter the emails down to races where I received at least one email from both parties' candidates. The final number is 2,397 emails and 44 campaigns. Most analysis uses this sample, and I will specify when that is not the case.</p>
</div>
<div id="who-is-sending-emails-and-when" class="section level1">
<h1>Who is sending emails and when?</h1>
<p>I received emails from both parties in the following states: AZ, FL, IN, MA, MD, MI, MN, MO, MS, ND, NE, NJ, NV, NY, OH, PA, TN, TX, USA, UT, VA, VT, WA, WY. A notable omission is West Virginia, where I only received emails from Joe Manchin. Below I show the number of emails I received each day from each party.</p>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/who_emails-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>What immediately jumped out to me was that the Democratic candidates sent significantly more emails (and, for some reason, campaigns send the fewest emails on Wednesdays). Next, I tally the number of emails I received from each candidate, and show the races where I received at least 100 emails in total.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">
State
</th>
<th style="text-align:left;">
Party
</th>
<th style="text-align:left;">
Campaign
</th>
<th style="text-align:right;">
Total Emails
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
NV
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Jacky Rosen
</td>
<td style="text-align:right;">
352
</td>
</tr>
<tr>
<td style="text-align:left;">
NV
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Dean Heller
</td>
<td style="text-align:right;">
147
</td>
</tr>
<tr>
<td style="text-align:left;">
FL
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Bill Nelson
</td>
<td style="text-align:right;">
237
</td>
</tr>
<tr>
<td style="text-align:left;">
FL
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Rick Scott
</td>
<td style="text-align:right;">
7
</td>
</tr>
<tr>
<td style="text-align:left;">
MO
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Claire McCaskill
</td>
<td style="text-align:right;">
185
</td>
</tr>
<tr>
<td style="text-align:left;">
MO
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Josh Hawley
</td>
<td style="text-align:right;">
30
</td>
</tr>
<tr>
<td style="text-align:left;">
ND
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Heidi Heitkamp
</td>
<td style="text-align:right;">
176
</td>
</tr>
<tr>
<td style="text-align:left;">
ND
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Kevin Cramer
</td>
<td style="text-align:right;">
29
</td>
</tr>
<tr>
<td style="text-align:left;">
AZ
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Kyrsten Sinema
</td>
<td style="text-align:right;">
96
</td>
</tr>
<tr>
<td style="text-align:left;">
AZ
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Martha McSally
</td>
<td style="text-align:right;">
69
</td>
</tr>
<tr>
<td style="text-align:left;">
IN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Joe Donnelly
</td>
<td style="text-align:right;">
106
</td>
</tr>
<tr>
<td style="text-align:left;">
IN
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Mike Braun
</td>
<td style="text-align:right;">
50
</td>
</tr>
<tr>
<td style="text-align:left;">
USA
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
DNC
</td>
<td style="text-align:right;">
59
</td>
</tr>
<tr>
<td style="text-align:left;">
USA
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
RNC
</td>
<td style="text-align:right;">
73
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Amy Klobuchar
</td>
<td style="text-align:right;">
19
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
D
</td>
<td style="text-align:left;">
Tina Smith
</td>
<td style="text-align:right;">
62
</td>
</tr>
<tr>
<td style="text-align:left;">
MN
</td>
<td style="text-align:left;">
R
</td>
<td style="text-align:left;">
Karin Housely
</td>
<td style="text-align:right;">
38
</td>
</tr>
</tbody>
</table>
<p>Jacky Rosen, Bill Nelson, Claire McCaskill, Heidi Heitkamp, and Joe Donnelly (all Democrats) sent over 100 emails during the relevant time frame. Dean Heller was the only Republican who sent me over 100 emails. In general, Democrats sent more emails than their Republican opponents. However, I certainly would not be surprised if my sample were biased. People who signed up with in-state addresses probably received more emails than I did. I don't know how sophisticated campaigns are with targeting their emails, but I would be shocked if they did not focus efforts like Get Out the Vote campaigns on people with addresses in their district. I do wonder, though, if there is a connection between the fundraising strategies and email strategies of each party. If Democrats rely more on smaller donations from many individuals, they might need to send more emails to everyone who expresses interest in their campaign. Republicans who either self-fund or court fewer donations from wealthier individuals might simply not have much to gain from emailing out-of-state individuals.</p>
</div>
<div id="email-content" class="section level1">
<h1>Email Content</h1>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/word_clouds-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Next, I will analyze the content of the emails. The word clouds above show each word sized proportionally to the number of times it was used in the emails. Both parties' emails frequently used words like senate, vote, and fight, but some differences are already apparent. Here I do unfortunately need to filter the dataset down more. Many of the emails came in a format that did not download or parse into human-readable text well. I tried to extract the text from all emails, but some (particularly ones with odd formatting or with pictures of text) I could not parse properly. The number of emails analyzed in this section is only 2,043.</p>
<p>Word clouds are a useful visualization, but I will use a statistical technique, the relative risk ratio, to characterize the difference in word-usage between Republican and Democratic emails.</p>
<p><img src="https://g-tierney.github.io/post/political_emails_files/figure-html/word_counts-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The chart above merits some further explanation. For each word, I calculated the proportion of Democratic emails that used the word and the proportion of Republican emails that used the word. Then, I took the ratio of those two quantities, often referred to as the relative risk ratio. To put everything on a comparable scale, if Republican emails used the word more frequently, I took the inverse of the ratio and multiplied it by -1. So if the ratio is equal to R and positive, then Democratic emails used the word R times more frequently. If the ratio is equal to R and negative, then Republican emails used the word R times more frequently. I show the 15 words with the largest ratio for each party.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a></p>
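<p>Concretely, the signed ratio for a single word can be computed as follows (a small sketch with hypothetical count variables, not the full text-processing pipeline):</p>
<pre class="r"><code>## Proportion of each party's emails containing the word
p_dem = n_dem_emails_with_word / n_dem_emails
p_rep = n_rep_emails_with_word / n_rep_emails

ratio = p_dem / p_rep
signed_ratio = ifelse(ratio &gt;= 1, ratio, -1 / ratio)  # negative values: Republicans use it more</code></pre>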
<p>Something that I noticed quickly is that election-specific terms from races where the Democrat sent many more emails than the Republican have high risk ratios: “Las Vegas”, “Scott” (Bill Nelson referring to his opponent Rick Scott), and “FL” are all in the top 15 for democrats. “ActBlue” is an organization that helps Democrats fundraise. “politicalemaild” is a truncated version of the email address I provided to campaigns.</p>
<p>I was not surprised that the words “borders” and “conservative” are in the top for Republicans, but I was quite surprised by “web” and “website” showing up in the top 15. “Liberal” is used frequently as a pejorative by Republicans, but apparently Democrats do not use the word anywhere near as frequently when messaging their own supporters.</p>
<p>Of course, words that are used by one party and never used by the other will have a risk ratio of plus or minus infinity. The problem with looking at all of those words is that they often are spelling or parsing errors that happen once or twice for one party and never for the other. To account for that, I show only the 10 most frequently used words that are never used by the opposing party.</p>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="border-bottom:hidden" colspan="1">
</th>
<th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2">
<div style="border-bottom: 1px solid #ddd; padding-bottom: 5px;">
Democratic
</div>
</th>
<th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2">
<div style="border-bottom: 1px solid #ddd; padding-bottom: 5px;">
Republican
</div>
</th>
</tr>
<tr>
<th style="text-align:left;">
Word
</th>
<th style="text-align:right;">
Emails Using Word
</th>
<th style="text-align:right;">
Proportion Using Word
</th>
<th style="text-align:right;">
Emails Using Word
</th>
<th style="text-align:right;">
Proportion Using Word
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
youd
</td>
<td style="text-align:right;">
625
</td>
<td style="text-align:right;">
0.287
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
mitch
</td>
<td style="text-align:right;">
307
</td>
<td style="text-align:right;">
0.141
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
mcconnell
</td>
<td style="text-align:right;">
299
</td>
<td style="text-align:right;">
0.138
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
0.000
</td>
</tr>
<tr>
<td style="text-align:left;">
environment
</td>
<td style="text-align:right;">
262
</td>
<td style="text-align:right;">
0.121
</td>
<td style="text-align:right;">
0