```{r categorical-setup, include = FALSE}
library(tidymodels)
library(embed)
library(textrecipes)
library(kableExtra)
tidymodels_prefer()
source("ames_snippets.R")
neighborhood_counts <- count(ames_train, Neighborhood)
```
# Encoding Categorical Data {#categorical}
For statistical modeling in R, the preferred representation for categorical or nominal data is a _factor_, which is a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels.[^python] In Section \@ref(dummies) we introduced feature engineering approaches to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the `Bldg_Type` in our Ames housing data (with levels `r knitr::combine_words(glue::backtick(levels(ames_train$Bldg_Type)))`), to a set of dummy or indicator variables like those shown in Table \@ref(tab:encoding-dummies).
[^python]: This is in contrast to statistical modeling in Python, where categorical variables are often directly represented by integers alone, such as `0, 1, 2` representing red, blue, and green.
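
As a small aside, here is a minimal sketch (not part of the Ames analysis) of how a factor pairs integer codes with text labels:

```{r, eval = FALSE}
# A minimal sketch: a factor stores integer codes alongside a set of
# text labels (the "levels").
colors <- factor(c("red", "blue", "green", "blue"))
levels(colors)      # "blue"  "green" "red"
as.integer(colors)  # 3 1 2 1
typeof(colors)      # "integer"
```
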
```{r encoding-dummies, echo = FALSE, results = 'asis'}
show_rows <-
ames_train %>%
mutate(.row = row_number()) %>%
group_by(Bldg_Type) %>% dplyr::select(Bldg_Type, .row) %>%
slice(1) %>%
pull(.row)
recipe(~Bldg_Type, data = ames_train) %>%
step_mutate(`Raw Data` = Bldg_Type) %>%
step_dummy(Bldg_Type, naming = function(var, lvl, ordinal = FALSE, sep = "_") lvl) %>%
prep() %>%
bake(ames_train) %>%
slice(show_rows) %>%
arrange(`Raw Data`) %>%
kable(caption = "Dummy or indicator variable encodings for the building type predictor in the Ames training set.",
label = "encoding-dummies") %>%
kable_styling(full_width = FALSE)
```
Many model implementations require such a transformation to a numeric representation for categorical data.
:::rmdnote
Appendix \@ref(pre-proc-table) presents a table of recommended preprocessing techniques for different models; notice how many of the models in the table require a numeric encoding for all predictors.
:::
However, for some realistic data sets, straightforward dummy variables are not a good fit. This often happens because there are *too many* categories or there are *new* categories at prediction time. In this chapter, we discuss more sophisticated options for encoding categorical predictors that address these issues. These options are available as tidymodels recipe steps in the [`r pkg(embed)`](https://embed.tidymodels.org/) and [`r pkg(textrecipes)`](https://textrecipes.tidymodels.org/) packages.
## Is an Encoding Necessary?
A minority of models, such as those based on trees or rules, can handle categorical data natively and do not require encoding or transformation of these kinds of features. A tree-based model can natively partition a variable like `Bldg_Type` into groups of factor levels, perhaps `OneFam` alone in one group and `Duplex` and `Twnhs` together in another group. Naive Bayes models are another example where the structure of the model can deal with categorical variables natively; distributions are computed within each level, for example, for all the different kinds of `Bldg_Type` in the data set.
These models that can handle categorical features natively can _also_ deal with numeric, continuous features, making the transformation or encoding of such variables optional. Does this help in some way, perhaps with model performance or time to train models? Typically no, as Section 5.7 of @fes shows using benchmark data sets with untransformed factor variables compared with transformed dummy variables for those same features. In short, using dummy encodings did not typically result in better model performance but often required more time to train the models.
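
As a hypothetical sketch (not code from this chapter's analysis), a regression tree can be fit on the raw factor directly, with no encoding step at all:

```{r, eval = FALSE}
# A hypothetical sketch: the rpart engine partitions the raw Bldg_Type
# levels itself, so no dummy variable step is required.
tree_spec <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_fit <-
  workflow() %>%
  add_model(tree_spec) %>%
  add_formula(Sale_Price ~ Bldg_Type + Gr_Liv_Area) %>%
  fit(data = ames_train)
```
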
:::rmdnote
We advise starting with untransformed categorical variables when a model allows it; note that more complex encodings often do not result in better performance for such models.
:::
## Encoding Ordinal Predictors
Sometimes qualitative columns can be *ordered*, such as "low," "medium," and "high". In base R, the default encoding strategy is to make new numeric columns that are polynomial expansions of the data. For columns that have five ordinal values, like the example shown in Table \@ref(tab:encoding-ordered-table), the factor column is replaced with columns for linear, quadratic, cubic, and quartic terms:
```{r encoding-ordered-table, echo = FALSE, results = 'asis'}
ord_vals <- c("none", "a little", "some", "a bunch", "copious amounts")
ord_data <- tibble::tibble(`Raw Data` = ordered(ord_vals, levels = ord_vals))
ord_contrasts <-
model.matrix(~., data = ord_data) %>%
round(2) %>%
as.data.frame() %>%
dplyr::select(-`(Intercept)`) %>%
setNames(c("Linear", "Quadratic", "Cubic", "Quartic"))
bind_cols(ord_data, ord_contrasts) %>%
  kable(caption = "Polynomial expansions for encoding an ordered variable.",
label = "encoding-ordered-table") %>%
kable_styling(full_width = FALSE)
```
While this is not unreasonable, it is not an approach that people tend to find useful. For example, an 11th-degree polynomial is probably not the most effective way of encoding an ordinal factor for the months of the year. Instead, consider trying recipe steps related to ordered factors, such as `step_unorder()`, to convert to regular factors, and `step_ordinalscore()`, which maps specific numeric values to each factor level.
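
As a hypothetical sketch (the `month` column, `outcome`, and `example_data` are invented for illustration), these steps are used like any other recipe step:

```{r, eval = FALSE}
# A hypothetical sketch; `example_data`, `outcome`, and `month` are
# invented for illustration.
recipe(outcome ~ month, data = example_data) %>%
  step_unorder(month)        # treat month as a regular, unordered factor

recipe(outcome ~ month, data = example_data) %>%
  step_ordinalscore(month)   # map the ordered levels to numeric scores
```
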
## Using the Outcome for Encoding Predictors
There are multiple options for encodings more complex than dummy or indicator variables. One method called _effect_ or _likelihood encodings_ replaces the original categorical variables with a single numeric column that measures the effect of those data [@MicciBarreca2001; @Zumel2019]. For example, for the neighborhood predictor in the Ames housing data, we can compute the mean or median sale price for each neighborhood (as shown in Figure \@ref(fig:encoding-mean-price)) and substitute these means for the original data values:
```{r encoding-mean-price}
#| fig.cap = "Mean home price for neighborhoods in the Ames training set, which can be used as an effect encoding for this categorical variable",
#| fig.alt = "A chart with points and error bars for the mean home price for neighborhoods in the Ames training set. The most expensive neighborhoods are Northridge and Stone Brook, while the least expensive are Iowa DOT and Railroad and Meadow Village."
ames_train %>%
group_by(Neighborhood) %>%
summarize(mean = mean(Sale_Price),
std_err = sd(Sale_Price) / sqrt(length(Sale_Price))) %>%
ggplot(aes(y = reorder(Neighborhood, mean), x = mean)) +
geom_point() +
geom_errorbar(aes(xmin = mean - 1.64 * std_err, xmax = mean + 1.64 * std_err)) +
labs(y = NULL, x = "Price (mean, log scale)")
```
This kind of effect encoding works well when your categorical variable has many levels. In tidymodels, the `r pkg(embed)` package includes several recipe step functions for different kinds of effect encodings, such as `step_lencode_glm()`, `step_lencode_mixed()`, and `step_lencode_bayes()`. These steps use a generalized linear model to estimate the effect of each level in a categorical predictor on the outcome. When using a recipe step like `step_lencode_glm()`, specify the variable being encoded first and then the outcome using `vars()`:
```{r}
library(embed)
ames_glm <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_glm
```
As detailed in Section \@ref(recipe-functions), we can `prep()` our recipe to fit or estimate parameters for the preprocessing transformations using training data. We can then `tidy()` this prepared recipe to see the results:
```{r}
glm_estimates <-
prep(ames_glm) %>%
tidy(number = 2)
glm_estimates
```
When we use the newly encoded `Neighborhood` numeric variable created via this method, we substitute the original level (such as `"North_Ames"`) with the estimate for `Sale_Price` from the GLM.
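
For example, a quick way to see the encoding that replaces a single level:

```{r, eval = FALSE}
# A small sketch: the numeric value that replaces the "North_Ames" level
glm_estimates %>%
  filter(level == "North_Ames")
```
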
Effect encoding methods like this one can also seamlessly handle situations where a novel factor level is encountered in the data. This `value` is the predicted price from the GLM when we don't have any specific neighborhood information:
```{r}
glm_estimates %>%
filter(level == "..new")
```
:::rmdwarn
Effect encodings can be powerful but should be used with care. The effects should be computed from the training set, after data splitting. This type of supervised preprocessing should be rigorously resampled to avoid overfitting (see Chapter \@ref(resampling)).
:::
When you create an effect encoding for your categorical variable, you are effectively layering a mini-model inside your actual model. The possibility of overfitting with effect encodings is a representative example for why feature engineering _must_ be considered part of the model process, as described in Chapter \@ref(workflows), and why feature engineering must be estimated together with model parameters inside resampling.
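
As a rough sketch of what that looks like in practice (the linear model and resampling scheme here are illustrative assumptions), the recipe is bundled into a workflow so that the effect encoding is re-estimated within each resample, along with the model:

```{r, eval = FALSE}
# A rough sketch; the linear model and 10-fold cross-validation are
# illustrative assumptions. The effect encoding in ames_glm is
# re-estimated inside every resample, along with the model parameters.
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)

glm_wflow <-
  workflow() %>%
  add_recipe(ames_glm) %>%
  add_model(linear_reg())

glm_res <- fit_resamples(glm_wflow, resamples = ames_folds)
collect_metrics(glm_res)
```
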
### Effect encodings with partial pooling
Creating an effect encoding with `step_lencode_glm()` estimates the effect separately for each factor level (in this example, neighborhood). However, some of these neighborhoods have many houses in them, and some have only a few. There is much more uncertainty in our measurement of price for the single training set home in the `r neighborhood_counts %>% slice_min(n) %>% pull(Neighborhood) %>% as.character()` neighborhood than the `r neighborhood_counts %>% slice_max(n) %>% pull(n)` training set homes in North Ames. We can use *partial pooling* to adjust these estimates so that levels with small sample sizes are shrunken toward the overall mean. The effects for each level are modeled all at once using a mixed or hierarchical generalized linear model:
```{r}
ames_mixed <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_lencode_mixed(Neighborhood, outcome = vars(Sale_Price)) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_mixed
```
Let's `prep()` and `tidy()` this recipe to see the results:
```{r}
mixed_estimates <-
prep(ames_mixed) %>%
tidy(number = 2)
mixed_estimates
```
New levels are then encoded at close to the same value as with the GLM:
```{r}
mixed_estimates %>%
filter(level == "..new")
```
:::rmdnote
You can use a fully Bayesian hierarchical model for the effects in the same way with `step_lencode_bayes()`.
:::
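
A sketch of the Bayesian version is nearly identical; note that it fits a full Bayesian hierarchical model (via rstanarm) and can take considerably longer to run:

```{r, eval = FALSE}
# A sketch of the Bayesian analog; it can be much slower than the GLM or
# mixed-model versions because it fits a full Bayesian hierarchical model.
ames_bayes <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_lencode_bayes(Neighborhood, outcome = vars(Sale_Price)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
  step_ns(Latitude, Longitude, deg_free = 20)
```
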
Let's visually compare the effects using partial pooling vs. no pooling in Figure \@ref(fig:encoding-compare-pooling):
```{r encoding-compare-pooling, message=FALSE}
#| fig.cap = "Comparing the effect encodings for neighborhood estimated without pooling to those with partial pooling",
#| fig.alt = "A scatter chart comparing the effect encodings for neighborhood estimated without pooling to those with partial pooling. Almost all neighborhoods are very close to the slope = 1 line, but the neighborhoods with the fewest homes are further away."
glm_estimates %>%
rename(`no pooling` = value) %>%
left_join(
mixed_estimates %>%
rename(`partial pooling` = value), by = "level"
) %>%
left_join(
ames_train %>%
count(Neighborhood) %>%
mutate(level = as.character(Neighborhood))
) %>%
ggplot(aes(`no pooling`, `partial pooling`, size = sqrt(n))) +
geom_abline(color = "gray50", lty = 2) +
geom_point(alpha = 0.7) +
coord_fixed()
```
Notice in Figure \@ref(fig:encoding-compare-pooling) that most estimates for neighborhood effects are about the same when we compare pooling to no pooling. However, the neighborhoods with the fewest homes in them have been pulled (either up or down) toward the mean effect. When we use pooling, we shrink the effect estimates toward the mean because we don't have as much evidence about the price in those neighborhoods.
## Feature Hashing
Traditional dummy variables as described in Section \@ref(dummies) require that all of the possible categories be known to create a full set of numeric features. _Feature hashing_ methods [@weinberger2009feature] also create dummy variables, but only consider the value of the category to assign it to a predefined pool of dummy variables. Let's look at the `Neighborhood` values in Ames again and use the `rlang::hash()` function to understand more:
```{r}
library(rlang)
ames_hashed <-
ames_train %>%
mutate(Hash = map_chr(Neighborhood, hash))
ames_hashed %>%
select(Neighborhood, Hash)
```
If we input `Briardale` to this hashing function, we will always get the same output. The neighborhoods in this case are called the "keys," while the outputs are the "hashes."
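
A quick sketch confirms that hashing is deterministic; the same key always yields the same hash:

```{r, eval = FALSE}
# A minimal sketch: the same key always produces the same hash value.
hash("Briardale")
hash("Briardale")
identical(hash("Briardale"), hash("Briardale"))  # TRUE
```
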
:::rmdnote
A hashing function takes an input of variable size and maps it to an output of fixed size. Hashing functions are commonly used in cryptography and databases.
:::
The `rlang::hash()` function generates a 128-bit hash, which means there are `2^128` possible hash values. This is great for some applications but doesn't help with feature hashing of *high cardinality* variables (variables with many levels). In feature hashing, the number of possible hashes is a hyperparameter and is set by the model developer through computing the modulo of the integer hashes. We can get sixteen possible hash values by using `Hash %% 16`:
```{r}
ames_hashed %>%
## first make a smaller hash for integers that R can handle
mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L),
## now take the modulo
Hash = Hash %% 16) %>%
select(Neighborhood, Hash)
```
Now, instead of the `r n_distinct(ames_train$Neighborhood)` neighborhoods in our original data or the astronomically large number of possible original hash values, we have sixteen hash values. This method is very fast and memory efficient, and it can be a good strategy when there are a large number of possible categories.
:::rmdnote
Feature hashing is useful for text data as well as high cardinality categorical data. See Section 6.7 of @Hvitfeldt2021 for a case study demonstration with text predictors.
:::
We can implement feature hashing using a tidymodels recipe step from the `r pkg(textrecipes)` package:
```{r}
library(textrecipes)
ames_hash <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_dummy_hash(Neighborhood, signed = FALSE, num_terms = 16L) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_hash
```
Feature hashing is fast and efficient but has a few downsides. For example, different category values often map to the same hash value. This is called a _collision_ or _aliasing_. How often did this happen with our neighborhoods in Ames? Table \@ref(tab:encoding-hash) presents the distribution of number of neighborhoods per hash value.
```{r encoding-hash, echo=FALSE}
hash_table <-
prep(ames_hash) %>%
bake(new_data = NULL, starts_with("dummyhash_Neighborhood_")) %>%
bind_cols(ames_train %>% select(Neighborhood)) %>%
distinct() %>%
select(-Neighborhood) %>%
map_dbl(sum) %>%
enframe() %>%
count(value)
hash_table %>%
kable(col.names = c("Number of neighborhoods within a hash feature",
"Number of occurrences"),
caption = "The number of hash features at each number of neighborhoods.",
label = "encoding-hash") %>%
kable_styling(full_width = FALSE)
```
The number of neighborhoods mapped to each hash value varies between `r xfun::numbers_to_words(min(hash_table$value))` and `r xfun::numbers_to_words(max(hash_table$value))`. All of the hash values greater than one are examples of hash collisions.
What are some things to consider when using feature hashing?
- Feature hashing is not directly interpretable because hash functions cannot be reversed. We can't determine what the input category levels were from the hash value, or if a collision occurred.
- The number of hash values is a _tuning parameter_ of this preprocessing technique, and you should try several values to determine what is best for your particular modeling approach. A lower number of hash values results in more collisions, but a high number may not be an improvement over your original high cardinality variable.
- Feature hashing can handle new category levels at prediction time, since it does not rely on pre-determined dummy variables.
- You can reduce hash collisions with a _signed_ hash by using `signed = TRUE`. This expands the values from only 1 to either +1 or -1, depending on the sign of the hash; a sketch follows this list.
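
A sketch of the signed variant, changing only the `signed` argument relative to the hashing recipe above:

```{r, eval = FALSE}
# A sketch of the signed variant: identical to ames_hash above except
# that signed = TRUE assigns +1 or -1 rather than only 1.
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
         Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_dummy_hash(Neighborhood, signed = TRUE, num_terms = 16L) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
  step_ns(Latitude, Longitude, deg_free = 20)
```
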
:::rmdwarn
It is likely that some hash columns will contain all zeros, as we see in this example. We recommend a zero-variance filter via `step_zv()` to filter out such columns.
:::
## More Encoding Options
Even more options are available for transforming factors to a numeric representation.
We can build a full set of *entity embeddings* [@Guo2016] to transform a categorical variable with many levels to a set of lower-dimensional vectors. This approach is best suited to a nominal variable with many category levels, many more than the example we've used with neighborhoods in Ames.
:::rmdnote
The idea of entity embeddings comes from the methods used to create word embeddings from text data. See Chapter 5 of @Hvitfeldt2021 for more on word embeddings.
:::
Embeddings for a categorical variable can be learned via a TensorFlow neural network with the `step_embed()` function in `r pkg(embed)`. We can use the outcome alone or optionally the outcome plus a set of additional predictors. Like in feature hashing, the number of new encoding columns to create is a hyperparameter of the feature engineering. We also must make decisions about the neural network structure (the number of hidden units) and how to fit the neural network (how many epochs to train, how much of the data to use for validation in measuring metrics).
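
A sketch of what this might look like follows; the specific `num_terms`, `hidden_units`, and training options are illustrative assumptions rather than recommendations, and fitting this step requires a working TensorFlow/keras installation:

```{r, eval = FALSE}
# A sketch of entity embeddings via step_embed(); the number of embedding
# columns, hidden units, epochs, and validation split are illustrative
# assumptions. Requires TensorFlow/keras.
ames_embed <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_embed(Neighborhood, outcome = vars(Sale_Price),
             num_terms = 5, hidden_units = 10,
             options = embed_control(epochs = 20, validation_split = 0.2)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_ns(Latitude, Longitude, deg_free = 20)
```
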
Yet one more option available for dealing with a binary outcome is to transform a set of category levels based on their association with the binary outcome. This *weight of evidence* (WoE) transformation [@Good1985] uses the logarithm of the "Bayes factor" (the ratio of the posterior odds to the prior odds) and creates a dictionary mapping each category level to a WoE value. WoE encodings can be determined with the `step_woe()` function in `r pkg(embed)`.
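
A sketch of the WoE step, under the assumption of a hypothetical data set with a binary outcome (`binary_data` and its `purchased` column below are invented for illustration; the Ames outcome used in this chapter is numeric, not binary):

```{r, eval = FALSE}
# A sketch with a hypothetical binary outcome; `binary_data` and its
# `purchased` column are invented for illustration only.
recipe(purchased ~ Neighborhood + Gr_Liv_Area, data = binary_data) %>%
  step_woe(Neighborhood, outcome = vars(purchased))
```
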
## Chapter Summary {#categorical-summary}
In this chapter, you learned about using preprocessing recipes for encoding categorical predictors. The most straightforward option for transforming a categorical variable to a numeric representation is to create dummy variables from the levels, but this option does not work well when you have a variable with high cardinality (too many levels) or when you may see novel values at prediction time (new levels). One option in such a situation is to create _effect encodings_, a supervised encoding method that uses the outcome. Effect encodings can be learned with or without pooling the categories. Another option uses a _hashing_ function to map category levels to a new, smaller set of dummy variables. Feature hashing is fast and has a low memory footprint. Other options include entity embeddings (learned via a neural network) and the weight of evidence transformation.
Most model algorithms require some kind of transformation or encoding of this type for categorical variables. A minority of models, including those based on trees and rules, can handle categorical variables natively and do not require such encodings.