---
title: "Simple Stats"
format: html
editor: visual
author: "Tyler Riccio"
editor_options:
  chunk_output_type: console
---
## How to Code in R
Object
: Some complex but structured "thing" that holds some information, like data.
Function
: Code that takes some input, does some things and returns some output. For example, $y=mx+b$ takes an input `x` and returns some output `y`. In R, that's written as `function(x) m*x + b`. You'll learn to understand functions more by using them.
Variable (programming)
: A variable holds some information you'll need later. You create one with the assignment operator `<-`.
Piping
: The pipe is one of the big reasons people use R: it lets you chain operations together using `%>%`. It takes whatever is on the left-hand side of the pipe and forwards it to the right-hand side. It's a subtle concept you'll never really appreciate until you have to use another language. The short sketch below ties these ideas together.
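A minimal sketch of all four ideas, assuming the `tidyverse` (which provides `%>%`) has been loaded as shown in the next section; the names `slope` and `predict_y` are purely illustrative.
```{r}
#| eval: false
# A variable: store a value for later with `<-`
slope <- 2

# A function: takes an input `x`, does something with it, returns an output
predict_y <- function(x) slope * x + 3

# An object: a vector holding several numbers
some_numbers <- c(1, 2, 3)

# The pipe: forward the thing on the left into the function on the right
some_numbers %>% predict_y()
```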
## Loading Dependencies
At the beginning of each script, you'll have to load dependencies. Dependencies can be libraries or data. **Libraries** are collections of pre-built functions that someone else already wrote; they're the building blocks of the R world.
For most scripts, you'll load the `tidyverse`, a *very* common collection of packages that is integral to many operations. `gt` and `gtExtras` are for creating nice tables, and `nflreadr` is for querying NFL data.
```{r}
#| message: false
library(tidyverse)
library(gt)
library(gtExtras)
library(nflreadr)
```
## Data
Data should follow certain conventions to make it consistent and effective. First, each variable should get a column and each observation should get a row. For example, if you want a table of games, each row should be a game and each column should be a detail about that game. Under no circumstances should a game have two rows.
In R, data is loaded into memory as a variable in the global "environment", which means it can be accessed anywhere in the script. The typical data structure, in both R and Python, is a data frame. This structure is like a regular Excel sheet: an object with columns and rows.
For most things, we'll use the `tibble` function instead of the `data.frame` function. The reasons behind this are complicated, so don't worry about why; just know it's much better.
```{r}
tibble(
  game_id = seq(5),   # game IDs 1 through 5
  team_epa = rnorm(5) # five random values standing in for team EPA
)
```
### Loading Data
Data is loaded by calling some function and assigning the results to some variable.
```{r}
#| output: false
raw_ngs_data <- load_nextgen_stats(seasons = 2023, stat_type = 'receiving')
```
Now you can check your data by typing the variable name into the console. This gives you a quick preview of what the data looks like, truncated if need be.
```{r}
raw_ngs_data
```
You can check your column names by doing this.
```{r}
colnames(raw_ngs_data)
```
## Filtering and Aggregating Data
Once you have the data, you'll need to check whether it's already aggregated by the unit of measurement you want. In this case, it's done on a weekly basis, so each row is one player's week.
::: callout-note
## NGS data has two units of aggregation
NGS data is a very rare case of two units of aggregation, seasonal and weekly, in the same dataset. It's hard to stress how unusual this is, and I only know it because I've read the data dictionary. Seasonal aggregation is indicated by week 0, so I'm removing it.
:::
```{r}
clean_ngs_data <- filter(raw_ngs_data, week != 0)
n_distinct(clean_ngs_data$week) # how many distinct weeks are in the data?
```
Once the data is clean, you can aggregate it by each player.
```{r}
# Take the data and forward it using the pipe
agg_data <- clean_ngs_data %>%
# Take the data forwarded and group by the name
group_by(player_display_name) %>%
# take the grouped data and compute the median of the `avg_separation` column
# you can add other summary stats here too like `avg_cushion`
summarize(avg_separation = median(avg_separation),
n = n(), # This just counts the number of observations (weeks) per group (player)
avg_cushion = median(avg_cushion)) %>%
# ungroup the data
ungroup()
agg_data
```
Once you have this aggregated data, you can sort it using the `arrange` function.
```{r}
# take the data and pipe it to the arrange function
agg_data <- agg_data %>%
# pass the column to the arrange function
# use the `-` to make it descending (highest number first)
arrange(-avg_separation)
agg_data
```
As you can see, there are some very high average separation numbers for some random players. This is because there are only a few weeks of logged data for those players, which is why I put the `n` column in `summarize()`. This is a very common problem, and it's reasonable to set some minimum, like 10 games or so.
```{r}
# Take the agg data and pipe it to filter
filtered_agg_data <- agg_data %>%
filter(n > 10)
filtered_agg_data
```
## Visualizing Results
Once you have your data, you can actually do something with it by visualizing the results. This is made very easy by the `gt` package, which builds tables from data.
```{r}
filtered_agg_data %>%
# slice to the top 15 players only for ease of use
slice_head(n = 15) %>%
# pipe the data to the gt() function
gt() %>%
# take the table and hide some columns you don't really care about
cols_hide(columns = c(n, avg_cushion)) %>%
# format the average separation to remove some decimals
fmt_number(columns = avg_separation, decimals = 2) %>%
# add a little title
tab_header(title = "Top Separators this Season",
subtitle = "The top 15 players by average separation.") %>%
# add a little color
gt_hulk_col_numeric(columns = avg_separation) %>%
# theme our table like pff does
gt_theme_pff()
```
## Deeper Analysis
Many times a single statistic is misleading. Separation is heavily influenced by cushion: the more cushion a player is given, the more they may be able to separate, since they can close space with more freedom. To check this interaction, it's useful to plot the two variables against each other.
This part is a little more advanced, but it's integral to digging into stats past the initial value. Later we'll account for the interaction ourselves by creating a new, adjusted stat.
::: callout-note
## Why `+` instead of the pipe
You may notice the use of `+` instead of the `%>%` pipe. The `ggplot2` functions we use require `+`. It's the only package like this, because it was written before the traditional pipe.
:::
```{r}
# take the data and pipe it to ggplot, which is a plotting function
filtered_agg_data %>%
# avg_cushion as x, separation as y and the name as the label
ggplot(aes(x = avg_cushion,
y = avg_separation)) +
# point plot
geom_point() +
# draw a little line
geom_smooth(method = 'lm',
formula = 'y ~ x',
se = FALSE) +
ggrepel::geom_text_repel(mapping = aes(label = player_display_name),
max.overlaps = 10) +
# add some theming and titles
ggthemes::theme_fivethirtyeight() +
labs(title = "Expected Separation by Cushion",
subtitle = "Using the cushion given, we can predict some portion of the separation. In math terms, cushion is responsible for some significant variance in the separation.") +
# don't worry about this line
theme(axis.title.x = element_text(),
axis.title.y = element_text()) +
# add your axis labels
xlab("Cushion Given") +
ylab("Separation")
```
You can measure this relationship mathematically by taking the correlation. The result is about 0.41; squaring it says roughly 17% of the variance in average separation is *likely* (big assumption) related to cushion in one way or another. That's a meaningful relationship for two stats that are nominally independent.
```{r}
cor(x = filtered_agg_data$avg_separation, y = filtered_agg_data$avg_cushion)
```
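To see that variance-explained figure directly, square the correlation; this is the $R^2$ of a simple straight-line fit of separation on cushion.
```{r}
# Squared correlation = share of variance in separation that a straight-line
# fit on cushion can account for (the R-squared)
cor(filtered_agg_data$avg_separation, filtered_agg_data$avg_cushion)^2
```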
## Machine Learning!
So we know separation is a helpful statistic, but it's still heavily influenced by another factor like cushion. The question is what to do about it. There are a ton of approaches you could take here, but I tend to like the machine learning approach.
The approach works by fitting a simple model that predicts separation from cushion. Once we have this predicted value, we calculate the separation **over** that prediction. So for a player like Drake London, who receives a good deal of cushion but posts low separation, we'd say he has separation *under* expectation. Logically, a receiver who generates separation above expectation is the better separator, since they're (on average) being covered more closely.
### The Code
This part is obviously more advanced, but I'll try to do it in the simplest way possible; eventually it'll be second nature.
```{r}
# define the model
# predict separation using cushion
mod <- lm(formula = avg_separation ~ avg_cushion, data = filtered_agg_data)
```
Use the model to compute the `x_separation` column, or predicted separation. Don't worry about much of this code, since it's a little involved. Just know it's appending the predicted separation to the data.
```{r}
# augment our data
aug_data <- predict(object = mod, newdata = filtered_agg_data) %>%
as_tibble_col("x_separation") %>%
bind_cols(filtered_agg_data)
aug_data
```
Now we'll compute the new column that measures the separation over the expected.
```{r}
aug_data <- aug_data %>%
# use mutate to create the new column
mutate(separation_oe = avg_separation - x_separation)
```
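Since the model was fit on these same rows, `separation_oe` is just the regression residual (actual minus fitted). A quick sanity check, assuming no rows were dropped for missing values:
```{r}
# separation_oe should match the model residuals, because we predicted on
# the same data the model was trained on
all.equal(aug_data$separation_oe, unname(residuals(mod)))
```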
## Visualizing the Adjusted Results
Now we can create a new table, just like the old one, that measures this adjusted stat.
```{r}
aug_data %>%
# slice to the top 15 players only for ease of use
slice_max(order_by = separation_oe, n = 15) %>%
# pipe the data to the gt() function
gt() %>%
# take the table and hide some columns you don't really care about
cols_hide(columns = c(n, avg_cushion)) %>%
# format the average separation to remove some decimals
fmt_number(columns = avg_separation, decimals = 2) %>%
# add a little title
tab_header(title = "Top Separators over Expected",
subtitle = "Separation adjusting for variance due to cushion.") %>%
# add a little color
gt_hulk_col_numeric(columns = c(avg_separation, separation_oe)) %>%
# theme our table like pff does
gt_theme_pff()
```
Let's plot raw separation against expected separation to see the biggest movers; players furthest above the line generated the most separation over expected.
```{r}
# take the data and pipe it to ggplot, which is a plotting function
aug_data %>%
# expected separation as x, actual separation as y and the name as the label
ggplot(aes(x = x_separation,
y = avg_separation)) +
# point plot
geom_point() +
# draw a little line
geom_smooth(method = 'lm',
formula = 'y ~ x',
se = FALSE) +
ggrepel::geom_text_repel(mapping = aes(label = player_display_name),
max.overlaps = 10) +
# add some theming and titles
ggthemes::theme_fivethirtyeight() +
# don't worry about this line
theme(axis.title.x = element_text(),
axis.title.y = element_text()) +
# add your axis labels
xlab("Expected Separation") +
ylab("Separation")
```
## What Next
This was a very simple example walking through how to get the data, aggregate and compute a stat, visualize it, then go deeper by creating a new adjusted stat. There are some obvious issues with this stat, and it's not intended to be the be-all and end-all:
- Only 2023 data - a model needs **much** more data.
- Low-quality stat - separation and cushion are pre-aggregated by week, which makes them lower quality than play-level data.
- Biased model - this is a more advanced concept, but the model simply has low predictive ability.
- Additional features - the model clearly needs more information to make a good prediction (see the rough sketch below).
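As a rough, untested sketch of the first and last points, you could load several seasons and give the model more predictors. The extra column `avg_intended_air_yards` is assumed from the NGS receiving data dictionary; check `colnames()` before relying on it.
```{r}
#| eval: false
# Load several seasons of weekly NGS receiving data (week 0 = seasonal rows, dropped)
more_seasons <- load_nextgen_stats(seasons = 2019:2023, stat_type = 'receiving') %>%
  filter(week != 0) %>%
  # aggregate per player per season so one big season doesn't dominate
  group_by(player_display_name, season) %>%
  summarize(avg_separation = median(avg_separation),
            avg_cushion = median(avg_cushion),
            avg_intended_air_yards = median(avg_intended_air_yards),
            n = n(),
            .groups = "drop") %>%
  filter(n > 10)

# Still a simple linear model, but with more than one feature
bigger_mod <- lm(avg_separation ~ avg_cushion + avg_intended_air_yards,
                 data = more_seasons)
```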