```{r include = FALSE}
if(!knitr:::is_html_output())
{
options("width"=56)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=56, indent = 2), tidy = TRUE)
knitr::opts_chunk$set(fig.pos = 'H')
}
```
# Data Preparation {#data_preparation}
## Handling Data Types {#data_types}
### What is this about?
One of the first things to do when we start a data project is to assign the correct data type to each variable. Although this seems like a straightforward task, it matters because some algorithms only work with certain data types. Here, we'll try to cover these conversions while explaining, with examples, the implications of each case.
```{r Fibonnacci-curve, echo=FALSE, out.height="150px", fig.cap="Fibonacci spiral", out.extra=''}
knitr::include_graphics("data_preparation/fibonacci_spiral.jpg")
```
_The Fibonacci series. A sequence of numbers present in nature and human bodies._
<br>
**What are we going to review in this chapter?**
* Detecting the correct data type
* How to convert from categorical to numerical
* How to convert from numerical to categorical (discretization methods)
* Theoretical and practical aspects (examples in R)
* How a predictive model looks at numerical variables
<br>
### The universe of data types
There are two main data types, **numerical** and **categorical**. Other names for categorical are **string** and **nominal**.
A subset of categorical is the ordinal or, as it is named in R, an **ordered** factor. At least in R, this type is only relevant when plotting categories in a certain order. An example in R:
```{r}
# Creating an ordinal or ordered factor
var_factor=factor(c("3_high", "2_mid", "1_low"))
var_ordered=factor(var_factor, ordered = T)
var_ordered
```
Don't pay too much attention to this data type, since numerical and categorical are the ones needed most often.
<br>
#### Binary variable, numerical, or categorical?
This book suggests using binary variables as numeric, where `0` is `FALSE` and `1` is `TRUE`. This makes it easier to profile data.
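As a minimal sketch of why this helps profiling (the `smokes` flag below is a made-up example, not part of the book's data):

```r
# Hypothetical yes/no flag converted to numeric 0/1
smokes <- c("yes", "no", "no", "yes")
smokes_num <- ifelse(smokes == "yes", 1, 0)

# Profiling becomes direct: the mean of a 0/1 variable
# is simply the proportion of TRUE (1) cases
mean(smokes_num)
```

With the categorical version, we would need a frequency table first; with the 0/1 version, standard numeric summaries already carry the answer.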
<br>
### Data types per algorithm
Some algorithms work as follows:
* `r emo::ji("bar_chart")` Only with categorical data
* `r emo::ji("straight_ruler")` Only with numerical data
* `r emo::ji("bar_chart")``r emo::ji("straight_ruler")` With both types
Moreover, not every predictive model can handle **missing values**.
The **Data Science Live Book** tries to cover all of these situations.
<br>
### Converting categorical variables into numerical
A straightforward approach is to convert every categorical variable into a set of **flag** variables, also known as _dummy_ variables.
If the original categorical variable has 30 possible values, then it will result in 30 new columns holding the value `0` or `1`, where `1` represents the presence of that category in the row.
Using the `caret` package in R, this conversion takes only two lines of code:
```{r, message=FALSE, warning=FALSE}
library(caret) # contains dummyVars function
library(dplyr) # data munging library
library(funModeling) # df_status function
# Checking categorical variables
status=df_status(heart_disease, print_results = F)
filter(status, type %in% c("factor", "character")) %>% select(variable)
# It converts all categorical variables (factor and character) into numerical variables
# It skips the original variable, so no need to remove it after the conversion, the data is ready to use.
dmy = dummyVars(" ~ .", data = heart_disease)
heart_disease_2 = data.frame(predict(dmy, newdata = heart_disease))
# Checking the new numerical data set:
colnames(heart_disease_2)
```
```{r, message=FALSE, echo=FALSE}
detach("package:caret")
#detach("package:plyr")
```
Original data `heart_disease` has been converted into `heart_disease_2` with no categorical variables, only numerical and dummy. Note that every new variable has a _dot_ followed by the _value_.
If we check the before and after for the 7th patient (row) in the variable `chest_pain`, which can take the values `1`, `2`, `3`, or `4`, we see:
```{r, message=FALSE}
# before
as.numeric(heart_disease[7, "chest_pain"])
# after
heart_disease_2[7, c("chest_pain.1", "chest_pain.2", "chest_pain.3", "chest_pain.4")]
```
Having kept and transformed only numeric variables while excluding the nominal ones, the data `heart_disease_2` are ready to be used.
More info about `dummyVars`: http://amunategui.github.io/dummyVar-Walkthrough/
<br>
### Is it categorical or numerical? Think about it.
Consider the `chest_pain` variable, which can take values `1`, `2`, `3`, or `4`. Is this variable categorical or numerical?
If the values are ordered, then it can be considered as numerical as it exhibits an **order** i.e., 1 is less than 2, 2 is less than 3, and 3 is less than 4.
If we create a decision tree model, then we may find rules like: "`If chest_pain > 2.5, then...`". Does it make sense? The algorithm splits the variable at a value that is not present (`2.5`); however, we interpret it as "if `chest_pain` is equal to or higher than 3, then…".
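To see why a split at a non-present value such as `2.5` is harmless for integer-coded data, here is a tiny sketch (toy values only, not a fitted tree):

```r
# Toy integer-coded values, as in chest_pain
chest_pain_values <- c(1, 2, 3, 4)

# The tree's rule: split at a value not present in the data
rule_tree <- chest_pain_values > 2.5

# Our reading of the same rule: "equal to or higher than 3"
rule_human <- chest_pain_values >= 3

# Both rules select exactly the same cases
identical(rule_tree, rule_human)
```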
<br>
#### Thinking as an algorithm
Consider two numerical input variables and a target binary variable. The algorithm will _see_ both input variables as dots in a rectangle, considering that there are infinite values between each number.
For example, a **Support Vector Machine** (SVM) will create _several_ vectors in order to separate the target variable classes. It will **find regions** based on these vectors. How would it be possible to find these regions based on categorical variables? It isn't possible, and that's why SVMs, like artificial neural networks, support only numerical variables.
```{r Support-Vector-Machine, echo=FALSE, out.width="200px", fig.cap="Support Vector Machine", out.extra=''}
knitr::include_graphics("data_preparation/svm.png")
```
_Image credit: ZackWeinberg_
The last image shows three lines, representing three different decision boundaries or regions.
For a quick introduction to this SVM concept, please go to this short video: [SVM Demo](https://www.youtube.com/watch?v=1NxnPkZM9bc).
However, if the model is tree-based, like decision trees, random forest, or gradient boosting machine, then they handle both types because their search space can be regions (same as SVM) and categories. Like the rule "`if postal_code is AX441AG and age > 55, then...`".
Going back to the heart disease example, the variable `chest_pain` exhibits order. We should take advantage of this because if we convert this to a categorical variable, then **we are losing information** and this is an important point when handling data types.
<br>
#### Is the solution to treat all as categorical?
No... A numerical variable carries more information than a nominal one because of its order. In categorical variables, the values cannot be compared; for example, it's not possible to make a rule like `If postal code is higher than "AX2004-P"`.
The values of a nominal variable can be compared if we have another variable to use as a reference (usually an outcome to predict).
For example, postal code "AX2004-P" is _higher_ than "MA3942-H" because there are more people interested in attending photography lessons.
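A tiny sketch of this idea, using made-up data (the postal codes and the `attended` outcome are hypothetical):

```r
# Hypothetical postal codes with a binary outcome
# (whether the person attended photography lessons)
df_pc_toy <- data.frame(
  postal_code = c("AX2004-P", "AX2004-P", "MA3942-H", "MA3942-H", "MA3942-H"),
  attended    = c(1, 1, 0, 1, 0)
)

# The mean outcome per category gives the nominal
# values a comparable order with respect to the target
ord <- aggregate(attended ~ postal_code, data = df_pc_toy, FUN = mean)
ord[order(-ord$attended), ]
```

Here "AX2004-P" ranks _higher_ than "MA3942-H" only because of its higher average outcome, not because of any intrinsic order in the codes themselves.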
In addition, **high cardinality** is an issue in categorical variables, e.g., a `postal code` variable containing hundreds of different values. This book addresses the topic in two chapters: handling high-cardinality variables in [descriptive statistics](#high_cardinality_descriptive_stats) and in [predictive modelling](#high_cardinality_predictive_modeling).
Anyway, you can do the _free test_ of converting all variables into categorical ones and see what happens. Compare the results with the numerical variables. Remember to use some good error measure for the test, like Kappa or ROC statistic, and to cross-validate the results.
<br>
#### Be aware when converting categorical into numerical variables
Imagine we have a categorical variable that we need to convert to numerical. As in the previous case, but this time trying a different **transformation**: assigning a different number to each category.
We have to be careful when doing such transformations because we are **introducing order** to the variable.
Consider the following data example having four rows. The first two variables are `visits` and `postal_code` (this works as either two input variables or `visits` as input and `postal_code` as output).
The following code will show the `visits` depending on `postal_code` transformed according to two criteria:
* `transformation_1`: Assign a sequence number based on the given order.
* `transformation_2`: Assign a number based on the number of `visits`.
```{r, data-types-in-machine-learning, warning=FALSE, fig.width=9, fig.height=3, message=FALSE, fig.cap="Data transformations comparison", out.extra=''}
# creating data -toy- sample
df_pc=data.frame(visits=c(10, 59, 27, 33), postal_code=c("AA1", "BA5", "CG3", "HJ1"), transformation_1=c(1,2,3,4), transformation_2=c(1, 4, 2, 3 ))
# printing table
knitr::kable(df_pc)
library(gridExtra)
# transformation 1
plot_1=ggplot(df_pc, aes(x=transformation_1, y=visits, label=postal_code)) + geom_point(aes(color=postal_code), size=4)+ geom_smooth(method=loess, group=1, se=FALSE, color="lightblue", linetype="dashed") + theme_minimal() + theme(legend.position="none") + geom_label(aes(fill = factor(postal_code)), colour = "white", fontface = "bold")
# transformation 2
plot_2=ggplot(df_pc, aes(x=transformation_2, y=visits, label=postal_code)) + geom_point(aes(color=postal_code), size=4)+ geom_smooth(method=lm, group=1, se=FALSE, color="lightblue", linetype="dashed") + theme_minimal() + theme(legend.position="none") + geom_label(aes(fill = factor(postal_code)), colour = "white", fontface = "bold")
# arranging plots side-by-side
grid.arrange(plot_1, plot_2, ncol=2)
```
To be sure, nobody builds a predictive model using only four rows; however, the intention of this example is to show how the relationship changes from non-linear (`transformation_1`) to linear (`transformation_2`). This makes things easier for the predictive model and makes the relationship easier to explain.
The effect is the same when we handle millions of rows of data and the number of variables scales to hundreds: learning from small data is a sound approach in these cases.
<br>
### Discretizing numerical variables {#discretizing_numerical_variables}
This process converts a numerical variable into a categorical one by splitting it into bins. For a fancy definition, we can quote _Wikipedia_: _Discretization concerns the process of transferring continuous functions, models, and equations into discrete counterparts._
Bins are also known as buckets or segments. Let's continue with the examples.
#### About the data
The data contain information regarding the percentage of children that are stunted. The ideal value is zero.
> The indicator reflects the share of children younger than 5 years who suffer from stunting. Children with stunted growth are at greater risk for illness and death.
Data source: [ourworldindata.org, hunger and undernourishment](https://ourworldindata.org/hunger-and-undernourishment/#undernourishment-of-children).
First of all, we have to do a quick **data preparation**. Each row represents a country–year pair, so we have to obtain the most recent indicator per country.
```{r equal-range-discretization-a, tidy=FALSE}
data_stunting=read.csv(file = "https://goo.gl/hFEUfN",
header = T,
stringsAsFactors = F)
# renaming the metric
data_stunting=
dplyr::rename(
data_stunting,
share_stunted_child=
Share.of.stunted.children.under.5
)
# doing the grouping mentioned before
d_stunt_grp = group_by(data_stunting, Entity) %>%
filter(Year == max(Year)) %>%
dplyr::summarise(share_stunted_child=
max(share_stunted_child)
)
```
The most standard binning criteria are:
* Equal range
* Equal frequency
* Custom bins
They are all explained below.
<br>
#### Equal range
Equal range is commonly used in histograms to look at distributions, but it is highly susceptible to outliers. To create, for example, four bins, the range between the min and max values is divided by 4.
```{r, equal-range-discretization, fig.width=4, fig.height=3, tidy=FALSE, fig.cap="Equal range discretization", out.extra=''}
# funModeling contains equal_freq (discretization)
library(funModeling)
# ggplot2 it provides 'cut_interval' function used to
# split the variables based on equal range criteria
library(ggplot2)
# Creating equal range variable, add `dig.lab=9`
# parameter to deactivate scientific notation as with
# the `cut` function.
d_stunt_grp$share_stunted_child_eq_range=
cut_interval(d_stunt_grp$share_stunted_child, n = 4)
# The ‘describe’ function from Hmisc package is
# extremely useful to profile data
describe(d_stunt_grp$share_stunted_child_eq_range)
# Plotting the variable
p2=ggplot(d_stunt_grp,
aes(share_stunted_child_eq_range)
) +
geom_bar(fill="#009E73") +
theme_bw()
p2
```
The `describe` output tells us that there are four categories in the variable and, in parentheses/square brackets, the total number of cases per category in both absolute and relative terms, respectively. For example, the category `(15.8,30.3]` contains all the cases that have `share_stunted_child` from `15.8` (not inclusive) to `30.3` (inclusive).
It appears `45` times and represents `29%` of total cases.
<br>
#### Equal frequency {#equal-frequency-binning}
This technique groups the same number of observations using criteria based on percentiles. More information about percentiles at [Annex 1: The magic of percentiles](#appendix-percentiles) chapter.
The `funModeling` package includes the `equal_freq` function to create bins based on these criteria:
```{r equal-frequency-discretization, fig.width=4, fig.height=3, tidy=FALSE, fig.cap="Equal frequency example", out.extra=''}
d_stunt_grp$stunt_child_ef=
equal_freq(var = d_stunt_grp$share_stunted_child,
n_bins = 4
)
# profiling variable
describe(d_stunt_grp$stunt_child_ef)
p3=ggplot(d_stunt_grp, aes(stunt_child_ef)) +
geom_bar(fill="#CC79A7") + theme_bw()
p3
```
In this case, we select four bins so that each bin will contain an approximate 25% share.
<br>
#### Custom bins
If we already have the points for which we want the segments, we can use the `cut` function.
```{r discretization-custom-bins, fig.width=4, fig.height=3, tidy=FALSE, fig.cap="Manual discretization", out.extra=''}
# parameter dig.lab "disable" scientific notation
d_stunt_grp$share_stunted_child_custom=
cut(d_stunt_grp$share_stunted_child,
breaks = c(0, 2, 9.4, 29, 100)
)
describe(d_stunt_grp$share_stunted_child_custom)
p4=ggplot(d_stunt_grp, aes(share_stunted_child_custom)) +
geom_bar(fill="#0072B2") +
theme_bw()
p4
```
Please note it’s only needed to define the maximum value per bucket.
In general, we don't know the minimum or maximum value beforehand. In those cases, we can use the values `-Inf` and `Inf`; otherwise, if a value falls outside the defined range, `cut` will assign it `NA`.
In this case, the variable is a percentage, so we know its scale is from 0 to 100; however, `r emo::ji("warning")` _what would happen if we did not know the range?_
The `cut` function would return `NA` for those values below or above the cut points. One solution is to obtain the variable's min and max values:
```{r, tidy=FALSE}
# obtaining the min and max
min_value=min(d_stunt_grp$share_stunted_child)
max_value=max(d_stunt_grp$share_stunted_child)
# set `include.lowest=T` to include the min value,
# otherwise it will be assigned as NA.
d_stunt_grp$share_stunted_child_custom_2=
cut(d_stunt_grp$share_stunted_child,
breaks = c(min_value, 2, 9.4, 29, max_value),
include.lowest = T)
describe(d_stunt_grp$share_stunted_child_custom_2)
```
<br>
### Discretization with new data
All of these transformations are made given a training dataset, based on the variables' distributions. Such is the case with equal frequency and equal range discretization. _But what would happen if new data arrived?_
If a new min or max value appears, then it will affect the bin range in the **equal range** method.
If any new values arrive, they will shift the percentile-based cut points, as we saw in the **equal frequency** method.
As an example, imagine that in the proposed example we add four more cases with values `88`, `2`, `7` and `3`:
```{r}
# Simulating that four new values arrive
updated_data=c(d_stunt_grp$share_stunted_child, 88, 2, 7, 3)
# discretization by equal frequency
updated_data_eq_freq=equal_freq(updated_data,4)
# results in...
describe(updated_data_eq_freq)
```
Now we compare with the bins we created before:
```{r}
describe(d_stunt_grp$stunt_child_ef)
```
**All the bins changed!** `r emo::ji("scream")` Since these are new categories, a predictive model trained on the old bins will fail to handle them.
The solution is to save the cut points when we do the data preparation. Then, when we run the model in production, we use custom-bin discretization and thereby force every new case into the proper category. This way, the predictive model always _sees_ the same bins.
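A minimal base-R sketch of this save-and-reuse idea (the cut points and new values below are hypothetical):

```r
# Sketch: persist cut points learned at training time,
# then reuse them to bin new data identically in production.
cuts <- c(-Inf, 2, 9.4, 29, Inf)      # hypothetical thresholds from training
cuts_path <- tempfile(fileext = ".rds")
saveRDS(cuts, cuts_path)              # training time: save the cut points

cuts_prod <- readRDS(cuts_path)       # production time: load the same cuts
new_values <- c(1.5, 10, 95)          # hypothetical new cases
binned <- cut(new_values, breaks = cuts_prod)
binned
```

Because the breaks are fixed (and open-ended via `-Inf`/`Inf`), any new value, however extreme, falls into one of the original categories.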
The solution is in the next section.
<br>
### Automatic data frame discretization {#data-discretization}
The package `funModeling` (from version > 1.6.6) introduces two functions, `discretize_get_bins` and `discretize_df`, that work together to help us with the discretization task.
```{r, warning=FALSE, message=FALSE}
# First we load the libraries
# install.packages("funModeling")
library(funModeling)
library(dplyr)
```
Let's see an example. First, we check current data types:
```{r}
df_status(heart_disease, print_results = F) %>% select(variable, type, unique, q_na) %>% arrange(type)
```
We've got factor, integer, and numeric variables: a good mix! The transformation has two steps. First, it gets the cut points or threshold values at which each segment begins. The second step uses those thresholds to convert the variables into categorical ones.
Two variables will be discretized in the following example: `max_heart_rate` and `oldpeak`. Also, we'll introduce some `NA` values into `oldpeak` to test how the function works with missing data.
```{r}
# creating a copy to keep original data clean
heart_disease_2=heart_disease
# Introducing some missing values in the first 30 rows of the oldpeak variable
heart_disease_2$oldpeak[1:30]=NA
```
Step 1) Getting the bin thresholds for each input variable:
`discretize_get_bins` returns a data frame that needs to be used in the `discretize_df` function, which returns the final processed data frame.
```{r}
d_bins=discretize_get_bins(data=heart_disease_2, input=c("max_heart_rate", "oldpeak"), n_bins=5)
# Checking `d_bins` object:
d_bins
```
Parameters:
- `data`: the data frame containing the variables to be processed.
- `input`: vector of strings containing the variable names.
- `n_bins`: the number of bins/segments to have in the discretized data.
We can see each threshold point (or upper boundary) for each variable.
**Note**: Changes from version 1.6.6 to 1.6.7:
* `discretize_get_bins` doesn't create the `-Inf` threshold anymore, since that value was always considered to be the minimum.
* A one-value category is now represented as a range; for example, what was `"5"` is now `"[5, 6)"`.
* Bucket formatting may have changed; if you were using this function in production, you should check the new values.
Time to continue with the next step!
Step 2) Applying the thresholds for each variable:
```{r, tidy=FALSE}
# Now it can be applied on the same data frame or in
# a new one (for example, in a predictive model that
# changes data over time)
heart_disease_discretized =
discretize_df(data=heart_disease_2,
data_bins=d_bins,
stringsAsFactors=T)
```
Parameters:
- `data`: data frame containing the numerical variables to be discretized.
- `data_bins`: data frame returned by `discretize_get_bins`. If it is changed by the user, then each upper boundary must be separated by a pipe character (`|`) as shown in the example.
- `stringsAsFactors`: `TRUE` by default; final variables will be factors (instead of characters), which is useful when plotting.
#### Final results and their plots
Before and after:
```{r, echo=FALSE}
df_compare=data.frame(max_heart_rate_before=heart_disease_2$max_heart_rate[29:34], max_heart_rate_after=heart_disease_discretized$max_heart_rate[29:34], oldpeak_before=heart_disease_2$oldpeak[29:34], oldpeak_after=heart_disease_discretized$oldpeak[29:34])
df_compare
```
Final distribution:
```{r, binning-variable, tidy=FALSE, fig.cap="Automatic discretization results", out.extra=''}
freq(heart_disease_discretized %>%
select(max_heart_rate,oldpeak),
plot = F)
p5=ggplot(heart_disease_discretized,
aes(max_heart_rate)) +
geom_bar(fill="#0072B2") +
theme_bw() +
theme(axis.text.x =
element_text(angle = 45, vjust = 1, hjust=1)
)
p6=ggplot(heart_disease_discretized,
aes(oldpeak)) +
geom_bar(fill="#CC79A7") +
theme_bw() +
theme(axis.text.x =
element_text(angle = 45, vjust = 1, hjust=1)
)
gridExtra::grid.arrange(p5, p6, ncol=2)
```
Sometimes, it is not possible to get the same number of cases per bucket when computing **equal frequency** as is shown in the `oldpeak` variable.
#### NA handling
Regarding the `NA` values, the new `oldpeak` variable has six categories: five categories defined in `n_bins=5` plus the `NA.` value. Note the point at the end indicating the presence of missing values.
#### More info
- `discretize_df` will never return an `NA` value without transforming it to the string `NA.`.
- `n_bins` sets the number of bins for all the variables.
- If `input` is missing, then it will run for all numeric/integer variables whose number of unique values is greater than the number of bins (`n_bins`).
- Only the variables defined in `input` will be processed while remaining variables will **not be modified at all**.
- `discretize_get_bins` returns just a data frame that can be changed by hand as needed, either in a text file or in the R session.
#### Discretization with new data
In our data, the minimum value for `max_heart_rate` is 71. The data preparation must be robust with new data; e.g., if a new patient arrives whose `max_heart_rate` is 68, then the current process will assign her/him to the lowest category.
In other functions from other packages, this preparation may return an `NA` because it is out of the segment.
As we pointed out before, if new data come in over time, we're likely to get new min/max values. This can break our process. To solve this, `discretize_df` always uses `-Inf`/`Inf` as the min/max; thus, any new value falling below the minimum or above the maximum will be assigned to the lowest or highest segment, as applicable.
The data frame returned by `discretize_get_bins` must be saved in order to apply it to new data. If the discretization were not intended to run with new data, there would be no sense in having two functions; one would suffice, and there would be no need to save the results of `discretize_get_bins`.
With this two-step approach, we can handle both cases.
#### Conclusions about two-step discretization
The combination of `discretize_get_bins` and `discretize_df` provides quick data preparation, resulting in a clean data frame that is ready to use. It clearly shows where each segment begins and ends, which is indispensable when making statistical reports.
The decision of _not fail_ when dealing with a new min/max in new data is **just a decision**. In some contexts, failure would be the desired behavior.
**The human intervention**: The easiest way to discretize a data frame is to select the same number of bins to apply to every variable—just like the example we saw—however, if tuning is needed, then some variables may need a **different number of bins**. For example, a variable with less dispersion can work well with a low number of bins.
Common values for the number of segments could be 3, 5, 10, or 20 (but no more). It is up to the data scientist to make this decision.
<br>
#### Bonus track: The trade-off art `r emo::ji("balance_scale")`
- A high number of bins => More noise captured.
- A low number of bins => Oversimplification, less variance.
Do these terms sound similar to any other ones in machine learning?
The answer: **Yes!** Just to mention one example: the trade-off between adding or subtracting variables from a predictive model.
- More variables: Overfitting alert (too detailed predictive model).
- Fewer variables: Underfitting danger (not enough information to capture general patterns).
_Just like oriental philosophy has pointed out for thousands of years, there is an art in finding the right balance between one value and its opposite._
<br>
### Final thoughts
As we can see, **there is no free lunch** in discretization or data preparation. How could an _automatic or intelligent system_ handle all of these situations without human intervention or analysis?
To be sure, we can delegate some tasks to automatic processes; however, **humans are indispensable in the data preparation stage**, supplying the correct input data to process.
The assignment of variables as categorical or numerical (the two most used data types) varies according to the nature of the data and the selected algorithms, as some support only one data type.
The conversion **introduces some bias** to the analysis. A similar case exists when we deal with missing values: [Handling and Imputation of Missing Data](#missing_data).
When we work with categorical variables, we can change their distribution by re-arranging the categories according to a target variable in order to **better expose their relationship**, converting a non-linear relationship into a linear one.
<br>
### Bonus track `r emo::ji("boom")`
Let's go back to the discretization variable section and plot all the transformations we've seen so far:
```{r discretization-methods, fig.width=9, fig.height=4, fig.cap="Same data, different visualizations", out.extra=''}
grid.arrange(p2, p3, p4, ncol = 3)
```
The input data is always the same. However, all of these methods **exhibit different perspectives of the same _thing_**.
Some perspectives are more suitable than others for certain situations, such as the use of **equal frequency** for **predictive modeling**.
Although this case is only considering one variable, the reasoning is the same if we have several variables at once, i.e., an `N-dimensional` space.
When we build predictive models, we describe the same bunch of points in different ways as when people give an opinion regarding some object.
<br>
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## High Cardinality Variable in Descriptive Stats {#high_cardinality_descriptive_stats}
### What is this about?
A **high cardinality** variable is one that can take _many_ different values, for example, `country`.
This chapter covers cardinality reduction based on the Pareto rule, using the `freq` function, which gives a quick view of where most values are concentrated and of the variable's distribution.
<br>
### High Cardinality in Descriptive Statistics
The following example contains a survey of 910 cases with 3 columns: `person`, `country` and `has_flu`, which indicates whether the person had the flu in the last month.
```{r , message=F}
library(funModeling)
```
`data_country` data comes inside `funModeling` package (please update to release 1.6).
Quick `data_country` profiling (first 10 rows)
```{r data-preparation-nominal-variable, fig.height=9, fig.width=5, dpi=300, fig.cap="Country frequency analysis", out.extra=''}
# printing first 10 rows
head(data_country, 10)
# exploring data, displaying only first 10 rows
head(freq(data_country, "country"), 10)
```
```{r data-preparation-nominal-variable-2, fig.height=1.5, fig.width=5, fig.cap="Has flu frequency analysis", out.extra=''}
# exploring data
freq(data_country, "has_flu")
```
<br>
The last table shows that there are only 83 rows where `has_flu="yes"`, representing around 9% of all people.
Many countries, however, have almost no share in the data. This is the _long tail_, so one technique to reduce cardinality is to keep only the categories that account for a high percentage of the data share, for example 70, 80 or 90%, following the Pareto principle.
```{r data-preparation-profiling-nominal-variable}
# 'freq' function, from 'funModeling' package, retrieves the cumulative_percentage that will help to do the cut.
country_freq=freq(data_country, 'country', plot = F)
# Since 'country_freq' is an ordered table by frequency, let's inspect the first 10 rows with the most share.
country_freq[1:10,]
```
<br>
So 10 countries represent more than 70% of cases. We can assign the category `other` to the remaining cases and plot:
```{r data-preparation-profiling-nominal-variable-2, fig.height=2.3, fig.width=4.2, fig.cap="Modified country variable - frequency analysis", out.extra=''}
data_country$country_2=ifelse(data_country$country %in% country_freq[1:10,'country'], data_country$country, 'other')
freq(data_country, 'country_2')
```
<br>
### Final comments
Low-representative categories are sometimes errors in the data, such as having both `Egypt` and `Eggypt.`; they may be evidence of bad habits in data collection and/or possible errors when collecting from the source.
There is no general rule for shrinking data; it depends on each case.
<br>
**Next recommended chapter:** High Cardinality Variable in Predictive Modeling.
<br>
---
```{r, echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## High Cardinality Variable in Predictive Modeling {#high_cardinality_predictive_modeling}
### What is this about?
As we saw in the last chapter, _High Cardinality in Descriptive Statistics_, we keep the categories with the greatest representativeness. But what if we have another variable to predict with it? That is, predicting `has_flu` based on `country`.
Using the last method may destroy information in the variable, so it **loses predictive power**. In this chapter we'll go further with the method described above, using an automatic grouping function -`auto_grouping`- that surfs through the variable's structure. It gives some ideas about how to optimize a categorical variable, but more importantly, it encourages the reader to perform her or his own optimizations.
Other literature refers to this re-grouping as cardinality reduction or **encoding**.
<br>
**What are we going to review in this chapter?**
* Concept of representativeness of data (sample size).
* Sample size having a target or outcome variable.
* From R: Present a method to help reduce cardinality and profile categorical variables.
* A practical before-and-after example reducing cardinality and insights extraction.
* How different models, such as random forest or a gradient boosting machine, deal with categorical variables.
<br>
### But is it necessary to re-group the variable?
It depends on the case, but the quickest answer is yes. In this chapter we will see one case in which this data preparation increases overall accuracy (measured by the area under the ROC curve).
There is a tradeoff between the **representation of the data** (how many rows each category has) and how each category is related to the outcome variable. E.g.: some countries are more prone to cases of flu than others.
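As a side note, the AUC mentioned above can be computed without extra packages. A minimal sketch using the rank (Mann-Whitney) formulation, on made-up scores and labels (not `data_country` data):

```{r}
# Minimal AUC sketch via the Mann-Whitney/rank formulation.
# 'scores' and 'labels' below are illustrative, made-up data.
auc_rank=function(scores, labels) {
  # labels: 1 = positive class, 0 = negative class
  r=rank(scores)                    # ranks of all scores together
  n_pos=sum(labels == 1)
  n_neg=sum(labels == 0)
  # sum of positive ranks, minus its minimum possible value, over all pairs
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
labels=c(rep(1, 20), rep(0, 80))
scores=c(rnorm(20, mean=1), rnorm(80, mean=0)) # positives tend to score higher
auc_rank(scores, labels) # well above 0.5
```

A perfect ranking yields an AUC of 1, and a random one hovers around 0.5; this is the number we'll use to compare models before and after the data preparation.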
```{r , message=FALSE, warning=FALSE}
# Loading funModeling >=1.6 which contains functions to deal with this.
library(funModeling)
library(dplyr)
```
Profiling `data_country`, which comes inside `funModeling` package (please update to release > 1.6.5).
Quick `data_country` profiling (first 10 rows)
```{r high-cardinality-variable, fig.height=11, fig.width=6, dpi=200, fig.cap="First 10 countries", out.extra=''}
# plotting first 10 rows
head(data_country, 10)
# exploring data, displaying only first 10 rows
head(freq(data_country, "country"), 10)
```
```{r data-preparation-high-cardinality-variable, fig.height=2, fig.width=5.3, fig.cap="Has flu distribution", out.extra=''}
# exploring data
freq(data_country, "has_flu")
```
<br>
### The case `r emo::ji("mag")` {#analyze-binary-outcome-vs-categorical-variable}
The predictive model will try to map certain values with certain outcomes, in our case the target variable is binary.
We'll compute a complete profiling of `country` regarding the target variable `has_flu`, based on `categ_analysis`.
Each row represents a unique category of the `input` variable. Within each row you can find attributes that define each category in terms of representativeness and likelihood.
```{r, results='hide'}
# `categ_analysis` is available in "funModeling" >= v1.6, please install it before using it.
country_profiling=categ_analysis(data=data_country, input="country", target = "has_flu")
# Printing first 15 rows (countries) out of 70.
head(country_profiling, 15)
```
```{r data-preparation-in-predictivemodeling, echo=FALSE ,out.width="75%", fig.cap="Analyzing target vs. input", out.extra=''}
knitr::include_graphics("data_preparation/country_profiling.png")
```
<br>
* Note 1: _The first column automatically adjusts its name based on `input` variable_
* Note 2: _`has_flu` variable has values `yes` and `no`. `categ_analysis` internally assigns the number **1** to the least representative class, `yes` in this case, in order to calculate the mean, sum and percentage._
These are the metrics returned by `categ_analysis`:
* `country`: name of each category in `input` variable.
* `mean_target`: `sum_target/q_rows`, the proportion of `has_flu="yes"` within that category. This is the likelihood.
* `sum_target`: quantity of `has_flu="yes"` values in each category.
* `perc_target`: the same as `sum_target` but as a percentage: `sum_target of each category / total sum_target`. This column sums to `1.00`.
* `q_rows`: quantity of rows that, regardless of the `has_flu` variable, fell in that category. It's the distribution of `input`. This column sums to the total rows analyzed.
* `perc_rows`: related to `q_rows`, it represents the share or percentage of each category. This column sums to `1.00`.
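These metrics can be reproduced with plain `dplyr`. A minimal sketch on a made-up toy data frame -the column names mimic the `categ_analysis` output, but this is not its actual implementation:

```{r}
library(dplyr)

# Made-up toy survey data, not the real data_country.
d=data.frame(
  country=c("A", "A", "A", "B", "B", "C"),
  has_flu=c("yes", "no", "no", "yes", "yes", "no")
)

country_profile_sketch=d %>%
  group_by(country) %>%
  summarise(
    mean_target=mean(has_flu == "yes"), # likelihood within the category
    sum_target=sum(has_flu == "yes"),   # positive cases in the category
    q_rows=n()                          # rows in the category
  ) %>%
  ungroup() %>%
  mutate(
    perc_target=sum_target / sum(sum_target), # column sums to 1.00
    perc_rows=q_rows / sum(q_rows)            # column sums to 1.00
  ) %>%
  arrange(-mean_target)

country_profile_sketch
```

Note how `perc_target` and `perc_rows` each sum to `1.00`, exactly as described in the list above.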
<br>
#### What conclusions can we draw from this?
Reading example based on 1st `France` row:
* 41 people have flu (`sum_target=41`). These 41 people represent almost 50% of the total people having flu (`perc_target=0.494`).
* Likelihood of having flu in France is 14.2% (`mean_target=0.142`)
* Total rows from France=288 -out of 910-. This is the `q_rows` variable; `perc_rows` is the same number but in percentage.
Without considering the filter by country, we've got:
* Column `sum_target` sums the total people with flu present in data.
* Column `perc_target` sums `1.00` -or 100%
* Column `q_rows` sums total rows present in `data_country` data frame.
* Column `perc_rows` sums `1.00` -or 100%.
<br>
---
### Analysis for Predictive Modeling `r emo::ji("crystal_ball")`
When developing predictive models, we may be interested in those values which increase the likelihood of a certain event. In our case:
**What are the countries that maximize the likelihood of finding people with flu?**
Easy, take `country_profiling` in a descending order by `mean_target`:
```{r}
# Ordering country_profiling by mean_target and then take the first 6 countries
arrange(country_profiling, -mean_target) %>% head(.)
```
<br>
Great! We've got `Malaysia` as the country with the highest likelihood of having flu! 100% of its people have flu (`mean_has_flu=1.000`).
But our common sense advises us that _perhaps_ something is wrong...
How many rows does `Malaysia` have? Answer: 1 -column: `q_rows=1`.
How many positive cases does `Malaysia` have? Answer: 1 -column: `sum_target=1`.
Since the sample is too small to tell whether this proportion would stay high with more data, keeping the category as-is will contribute to **overfitting** and create a bias in the predictive model.
How about `Mexico`? 2 out of 3 people have flu, but the sample is still tiny. However, `Uruguay` has a 17.3% likelihood -11 out of 63 cases- and those 63 cases represent almost 7% of the total population (`perc_rows=0.069`); this ratio seems more credible.
Here are some ideas for treating this:
---
#### Case 1: Reducing by re-categorizing less representative values {#reduce_cardinality_by_re_categorizing_less_representative_values}
Keep all cases with at least a certain percentage of representation in the data. Let's rename the countries that have less than 1% presence in the data to `other`.
```{r}
country_profiling=categ_analysis(data=data_country, input="country", target = "has_flu")
countries_high_rep=filter(country_profiling, perc_rows>0.01) %>% .$country
# If not in countries_high_rep then assign `other` category
data_country$country_new=ifelse(data_country$country %in% countries_high_rep, data_country$country, "other")
```
Checking again the likelihood:
```{r}
country_profiling_new=categ_analysis(data=data_country, input="country_new", target = "has_flu")
country_profiling_new
```
We've reduced the quantity of countries drastically -**74% less**- just by grouping together those below a 1% share, keeping 18 out of 70 countries.
The likelihood of the target variable has been stabilized a little in the `other` category. Now when the predictive model _sees_ `Malaysia` it will **not assign 100% likelihood, but 4.1%** (`mean_has_flu=0.041`).
**Advice about this last method:**
Be careful about applying this technique blindly. Sometimes in a **highly imbalanced** target prediction -e.g. **anomaly detection**- the abnormal behavior is present in less than 1% of cases.
```{r anomaly-data-analysis, fig.cap="Inspecting anomaly data", out.extra=''}
# replicating the data
d_abnormal=data_country
# simulating abnormal behavior with some countries
d_abnormal$abnormal=ifelse(d_abnormal$country %in% c("Brazil", "Chile"), 'yes', 'no')
# categorical analysis
ab_analysis=categ_analysis(d_abnormal, input = "country", target = "abnormal")
# displaying only first 6 elements
head(ab_analysis)
# inspecting the distribution, only a few belong to the 'yes' category
freq(d_abnormal, "abnormal", plot = F)
```
_How many abnormal values are there?_
Only 15, and they represent 1.65% of total values.
Checking the table returned by `categ_analysis`, we can see that this _abnormal behavior_ occurs **only** in categories with a really low participation: `Brazil` which is present in only 1.4% of cases, and `Chile` with 0.2%.
Creating a category `other` based on the distribution is not a good idea here.
**Conclusion:**
Despite the fact that this is a prepared example, there are some data preparation techniques that can be really useful in terms of accuracy, but they need some supervision. This supervision can be aided by algorithms.
<br>
#### Case 2: Reducing by automatic grouping
This procedure uses the `kmeans` clustering technique and the table returned by `categ_analysis` in order to create groups -clusters- containing categories that exhibit similar behavior in terms of:
* `perc_rows`
* `perc_target`
The combination of both leads to groups that account for likelihood and representativeness.
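To make the idea concrete, here is a minimal sketch of the underlying k-means step on a made-up profiling table. This mimics what `auto_grouping` does internally (clustering categories on scaled `perc_rows` and `perc_target`), but it is not its actual implementation:

```{r}
# Made-up categ_analysis-like table; not the real data_country profiling.
prof=data.frame(
  country=c("A", "B", "C", "D", "E", "F"),
  perc_rows=c(0.30, 0.25, 0.20, 0.10, 0.10, 0.05),
  perc_target=c(0.50, 0.30, 0.10, 0.05, 0.05, 0.00)
)

set.seed(999)
# Scale both metrics so neither dominates the distance, then cluster.
fit=kmeans(scale(prof[, c("perc_rows", "perc_target")]), centers=3)

# Map each category to its group, like df_equivalence does.
data.frame(country=prof$country, country_rec=paste0("group_", fit$cluster))
```

Categories that land in the same cluster share a similar likelihood/representativeness profile, so they can be treated as one value by the predictive model.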
**Hands on R:**
We define the `n_groups` parameter: the number of desired groups. The right number depends on the data and on the total number of categories, but a general value would be between 3 and 10.
Function `auto_grouping` comes in `funModeling` >=1.6. Please note that the `target` parameter currently only supports binary variables.
_Note: the `seed` parameter is optional, but assigning a number will retrieve always the same results._
```{r}
# Reducing the cardinality
country_groups=auto_grouping(data = data_country, input = "country", target="has_flu", n_groups=9, seed = 999)
country_groups$df_equivalence
```
`auto_grouping` returns a list containing 3 objects:
* `df_equivalence`: data frame which contains a table to map old to new values.
* `fit_cluster`: k-means model used to reduce the cardinality (values are scaled).
* `recateg_results`: data frame containing the profiling of each group regarding the target variable. The first column adjusts its name to the input variable; in this case we've got `country_rec`. Each group corresponds to one or many categories of the input variable (as seen in `df_equivalence`).
Let's explore how the new groups behave, this is what the predictive model will _see_:
```{r}
country_groups$recateg_results
```
The last table is ordered by `mean_target`, so we can quickly see the groups maximizing and minimizing the likelihood.
* `group_2` is the most common: it is present in 31.6% of cases and its `mean_target` (likelihood) is 14.2%.
* `group_8` has the highest likelihood (17.6%), followed by `group_6` with a 15.6% chance of a positive case (`has_flu="yes"`).
* `group_4`, `group_5` and `group_9` look alike. They can be merged into one group since the likelihood is 0 in all three cases.
* `group_7` and `group_3` have 1 and 2 countries with positive cases. We could consider these numbers as roughly the same, merging them into one group which, in the end, will represent the countries with the lowest likelihood.
First we need to add the new category column to the original dataset.
```{r, message=F}
data_country_2=data_country %>% inner_join(country_groups$df_equivalence, by="country")
```
Now we do the additional transformations replacing:
* `group_4`, `group_5` and `group_9` will be `low_likelihood` (countries with no positive cases or a low target share).
* `group_7` and `group_3` will be `low_target_share`.
```{r, hide=TRUE, tidy=FALSE}
data_country_2$country_rec=
ifelse(data_country_2$country_rec %in%
c("group_4", "group_5", "group_9"),
"low_likelihood",
data_country_2$country_rec
)
data_country_2$country_rec=
ifelse(data_country_2$country_rec %in%
c("group_7", "group_3"),
"low_target_share",
data_country_2$country_rec
)
```
Checking the final grouping (`country_rec` variable):
```{r}
categ_analysis(data=data_country_2, input="country_rec", target = "has_flu")
```
Each group seems to have a good sample size regarding the `sum_target` distribution. Our transformation left `low_likelihood` with a representation of 21% of total cases, still with 0 positive cases (`sum_target=0`), and `low_target_share` with 3 positive cases, which represents 3.6% of positive cases.
All the groups seem to have good representation. This can be checked in the `perc_rows` variable: every group is above a 7% share.
Trying a lower number of clusters may help reduce this manual task a little. This was just a demonstration of how to optimize a variable with lots of different categories.
<br>
### Handling new categories when the predictive model is on production
Let's imagine a new country appears, `new_country_hello_world`. The predictive model will fail, since it was trained on a fixed set of values. One technique is to assign such unseen values to a group which has `mean_target=0`.
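A minimal sketch of this fallback, assuming a `df_equivalence`-style mapping table learned at training time. The table contents and the `low_likelihood` fallback name are hypothetical here:

```{r}
# Hypothetical mapping learned at training time (df_equivalence-style).
df_equivalence=data.frame(
  country=c("France", "Uruguay", "Brazil"),
  country_rec=c("group_2", "group_6", "group_2"),
  stringsAsFactors=FALSE
)

# Assumed fallback: the group whose mean_target was 0 during training.
fallback_group="low_likelihood"

# Map production data; unseen countries get the fallback group.
map_country=function(x) {
  idx=match(x, df_equivalence$country)
  out=df_equivalence$country_rec[idx]
  out[is.na(idx)]=fallback_group  # new, never-seen values land here
  out
}

map_country(c("France", "new_country_hello_world"))
```

This way the scoring pipeline never breaks on unseen categories, and the model assigns them the most conservative likelihood it knows.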