<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>R to the max</title>
<link>https://kwhkim.github.io/maxR/</link>
<description>Recent content on R to the max</description>
<generator>Hugo -- gohugo.io</generator>
<lastBuildDate>Wed, 24 Aug 2022 00:00:00 +0000</lastBuildDate><atom:link href="https://kwhkim.github.io/maxR/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Relative risk regression (1/2)</title>
<link>https://kwhkim.github.io/maxR/2022/08/24/relative-risk-regression/</link>
<pubDate>Wed, 24 Aug 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/08/24/relative-risk-regression/</guid>
<description>
<p>When the outcome variable is binary such as alive/dead or yes/no, the most popular analytic method is <strong>logistic regression</strong>.</p>
<p><span class="math display">\[\textrm{logit}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \]</span></p>
<p>The name “<strong>logistic</strong>” might have come from the equation below, which can be derived by applying the inverse of the logit function to both sides of the equation above.</p>
<p><span class="math display">\[ \mathbb{E}[y] = \textrm{logistic}( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)\]</span></p>
<p>The link function of the <strong>logistic</strong> regression is <code>logit()</code>. We can replace it with <code>log()</code> and the result looks like the following.</p>
<p><span class="math display">\[ \textrm{log}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \cdots \]</span>
This equation represents “<strong>Relative Risk Regression</strong>”, also known as <strong>log-binomial regression</strong>.</p>
<div id="risk-relative-risk" class="section level2">
<h2>Risk, Relative Risk</h2>
<p><strong>Risk</strong> is just another term for probability. For instance, “the probability of being struck by lightning” can be rephrased as “the <strong>risk</strong> of being struck by lightning”.</p>
<p><strong>Relative risk</strong> or <strong>risk ratio (RR)</strong> is the ratio of two probabilities (risks). Relative risk compares the probabilities of two events. For example, compare the probability of being struck by lightning when standing with nothing with the probability of being struck by lightning while holding an umbrella open. If we divide the second probability by the first, we get how many times more likely we are to be struck when holding an umbrella open compared to holding nothing at all. This is <strong>relative risk</strong>, or <strong>risk ratio</strong>. If it is 2, on average we will get struck twice (with an umbrella open) for every one strike (with nothing).</p>
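<p>As a tiny numerical illustration (the probabilities here are made up for illustration, not taken from any data), the risk ratio is simply one risk divided by the other:</p>
<pre class="r"><code>p_nothing  &lt;- 0.001  # hypothetical risk of being struck with nothing
p_umbrella &lt;- 0.002  # hypothetical risk with an umbrella open
p_umbrella / p_nothing  # relative risk (risk ratio) = 2</code></pre>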
<p>The name “<strong>Relative Risk</strong> Regression” seems to come from the fact that the coefficients of relative risk regression are closely related to relative risk! Let’s imagine a relative risk regression with only one predictor <span class="math inline">\(x\)</span> , which is <span class="math inline">\(1\)</span> for having an umbrella open, and <span class="math inline">\(0\)</span> for having nothing. We can compare <span class="math inline">\(y|x=0\)</span> and <span class="math inline">\(y|x=1\)</span> .</p>
<p><span class="math display">\[\log(y_{x=1}) = \beta_0 + \beta_1\]</span>
<span class="math display">\[\Rightarrow y_{x=1} = \exp(\beta_0 + \beta_1)\]</span></p>
<p><span class="math display">\[\Rightarrow y_{x=1} = \exp(\beta_0)\exp(\beta_1)\]</span></p>
<p><span class="math display">\[y_{x=0} = \exp(\beta_0)\]</span></p>
<p>Combining the last two equations, we can derive the following.</p>
<p><span class="math display">\[y_{x=1}/y_{x=0} = \exp(\beta_1)\]</span></p>
<p>If we interpret <span class="math inline">\(y_{x=1}\)</span> as the probability of being struck when <span class="math inline">\(x=1\)</span> (with an umbrella open), then the relative risk, or risk ratio, is <span class="math inline">\(\exp(\beta_1)\)</span> !</p>
<p>The risk of being struck with an umbrella open over the risk of being struck with nothing is the exponential of the coefficient <span class="math inline">\(\beta_1\)</span>. So if <span class="math inline">\(\beta_1\)</span> equals 1, having an umbrella open makes the risk approximately 2.718 ( <span class="math inline">\(\exp(1) = 2.718\cdots\)</span> ) times bigger: on average, you are likely to be struck 2.718 times (with an umbrella open) for every one time someone with nothing is struck.</p>
</div>
<div id="difficulties-of-applying-mle" class="section level2">
<h2>Difficulties of applying MLE</h2>
<p>Open any mathematical statistics textbook and you will see the wonderful characteristics of the MLE (<strong>M</strong>aximum <strong>L</strong>ikelihood <strong>E</strong>stimate). So MLE is the way to go when we estimate the coefficients of a relative risk regression. But estimating a relative risk regression is difficult because it requires optimizing the likelihood under parameter constraints. See the equations below.</p>
<p><span class="math display">\[\log(y|x_1) = \beta_0 + \beta_1 x_1\]</span>
<span class="math display">\[y|x_1 = \exp(\beta_0 + \beta_1 x_1)\]</span></p>
<p>Since <span class="math inline">\(y\)</span> stands for a probability, <span class="math inline">\(\exp(\beta_0 + \beta_1 x_1)\)</span> cannot be less than <span class="math inline">\(0\)</span> or greater than <span class="math inline">\(1\)</span> for any possible <span class="math inline">\(x_1\)</span> ! Another problem is that, since the parameters can lie on the edge of the feasible parameter space, it becomes difficult to estimate the variance of the parameters.</p>
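<p>A minimal sketch of the constraint, with made-up coefficient values: under a log link the implied probability <span class="math inline">\(\exp(\beta_0 + \beta_1 x_1)\)</span> is not automatically bounded above by <span class="math inline">\(1\)</span>.</p>
<pre class="r"><code>b0 &lt;- -2; b1 &lt;- 0.5  # hypothetical coefficients
x1 &lt;- 0:6
exp(b0 + b1 * x1)    # exceeds 1 as soon as b0 + b1*x1 &gt; 0</code></pre>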
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="using-r-for-relative-risk-regression" class="section level2">
<h2>Using R for <strong>Relative Risk Regression</strong></h2>
<p>We can use the traditional function <code>glm()</code> for relative risk regression, but the package <code>logbin</code> seems to offer more convenience and functionality; with <code>logbin</code> we can also choose the estimation method. Let’s get to it!</p>
<p>First we will use the Heart Attack Data (<code>data(heart)</code>). The description of the data can be found with <code>?heart</code>.</p>
<blockquote>
<p>This data set is a cross-tabulation of data on 16949 individuals who experienced a heart attack (ASSENT-2 Investigators, 1999). There are 4 categorical factors each at 3 levels, together with the number of patients and the number of deaths for each observed combination of the factors. This data set is useful for illustrating the convergence properties of glm and glm2.</p>
</blockquote>
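<p>As a minimal sketch of the base-R route mentioned above (the variable names here are just for illustration): the same log-link binomial model can be attempted with <code>glm()</code>, though without sensible starting values it may fail to converge, which is part of the motivation for <code>logbin</code>.</p>
<pre class="r"><code>require(glm2, quietly = TRUE)  # the heart data lives in glm2
data(heart)
start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit.base &lt;-
  glm(cbind(Deaths, Patients - Deaths) ~
        factor(AgeGroup) + factor(Severity)
      + factor(Delay) + factor(Region),
      family = binomial(link = &quot;log&quot;),
      data = heart,
      start = c(log(start.p), -rep(1e-4, 8)))
fit.base$converged</code></pre>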
<pre class="r"><code>library(dplyr)</code></pre>
<pre><code>##
## Attaching package: &#39;dplyr&#39;</code></pre>
<pre><code>## The following objects are masked from &#39;package:stats&#39;:
##
## filter, lag</code></pre>
<pre><code>## The following objects are masked from &#39;package:base&#39;:
##
## intersect, setdiff, setequal, union</code></pre>
<pre class="r"><code>library(tidyr)
library(ggplot2)
library(logbin) # https://github.com/mdonoghoe/logbin
require(glm2, quietly = TRUE)
data(heart)
head(heart)</code></pre>
<pre><code>## Deaths Patients AgeGroup Severity Delay Region
## 1 49 2611 1 1 1 1
## 2 1 74 1 1 1 2
## 3 2 96 1 1 1 3
## 4 30 2888 1 1 2 1
## 5 0 81 1 1 2 2
## 6 8 155 1 1 2 3</code></pre>
<p>We can fit the relative risk regression model to the data as follows. Notice that the response part of the formula is <code>cbind(# of successes, # of failures)</code>.</p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart)
fit$converged</code></pre>
<p>Using a binary response variable, we can do it like the following.</p>
<pre class="r"><code>sum(duplicated(heart %&gt;% select(AgeGroup:Region)))</code></pre>
<pre><code>## [1] 0</code></pre>
<pre class="r"><code>heart2 &lt;- heart %&gt;%
group_by(AgeGroup, Severity, Delay, Region) %&gt;%
summarise(data.frame(dead = c(rep(1,Deaths),
rep(0,Patients-Deaths)))) %&gt;%
ungroup()</code></pre>
<pre><code>## `summarise()` has grouped output by &#39;AgeGroup&#39;, &#39;Severity&#39;, &#39;Delay&#39;, &#39;Region&#39;.
## You can override using the `.groups` argument.</code></pre>
<pre class="r"><code>fit2 &lt;-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2)
fit2$converged</code></pre>
<p>For me, it took a LONG time! Here is a faster way.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4,8)),
method = &#39;glm2&#39;)
cat(&#39;Is fit converged? &#39;, fit$converged, &#39;\n&#39;)</code></pre>
<pre><code>## Is fit converged? TRUE</code></pre>
<pre class="r"><code>fit2 &lt;-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2,
start = c(log(start.p), -rep(1e-4,8)),
method = &#39;glm2&#39;)
cat(&#39;Is fit2 converged? &#39;, fit2$converged, &#39;\n&#39;)</code></pre>
<pre><code>## Is fit2 converged? TRUE</code></pre>
<p>Here is a tip: use the form with the number of successes and the number of failures. Using the binary response took longer!</p>
<p>The results are almost identical.</p>
<pre class="r"><code>library(car)</code></pre>
<pre><code>## Loading required package: carData</code></pre>
<pre><code>##
## Attaching package: &#39;car&#39;</code></pre>
<pre><code>## The following object is masked from &#39;package:dplyr&#39;:
##
## recode</code></pre>
<pre class="r"><code>compareCoefs(fit, fit2)</code></pre>
<pre><code>## Calls:
## 1: logbin(formula = cbind(Deaths, Patients - Deaths) ~ factor(AgeGroup) +
## factor(Severity) + factor(Delay) + factor(Region), data = heart, start =
## c(log(start.p), -rep(1e-04, 8)), method = &quot;glm2&quot;)
## 2: logbin(formula = dead ~ factor(AgeGroup) + factor(Severity) +
## factor(Delay) + factor(Region), data = heart2, start = c(log(start.p),
## -rep(1e-04, 8)), method = &quot;glm2&quot;)
##
## Model 1 Model 2
## (Intercept) -4.0275 -4.0273
## SE 0.0889 0.0889
##
## factor(AgeGroup)2 1.104 1.104
## SE 0.089 0.089
##
## factor(AgeGroup)3 1.9268 1.9266
## SE 0.0924 0.0924
##
## factor(Severity)2 0.7035 0.7035
## SE 0.0701 0.0701
##
## factor(Severity)3 1.3767 1.3768
## SE 0.0955 0.0955
##
## factor(Delay)2 0.0590 0.0589
## SE 0.0693 0.0693
##
## factor(Delay)3 0.1718 0.1720
## SE 0.0808 0.0808
##
## factor(Region)2 0.0757 0.0757
## SE 0.1775 0.1775
##
## factor(Region)3 0.483 0.483
## SE 0.111 0.111
## </code></pre>
<p>The authors of <code>logbin</code> state that <code>logbin</code> solves problems that might pop up when using other packages.</p>
<p>Let’s compare!</p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
t.glm &lt;- system.time(
fit.glm &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4, 8)),
method = &quot;glm&quot;,
maxit = 10000)
)
t.glm2 &lt;- system.time(
fit.glm2 &lt;- update(fit.glm, method=&#39;glm2&#39;))
t.cem &lt;- system.time(
fit.cem &lt;- update(fit.glm, method = &quot;cem&quot;)
#fit.cem &lt;- update(fit.glm, method=&#39;cem&#39;, start = NULL)
)
t.em &lt;- system.time(
fit.em &lt;- update(fit.glm, method = &quot;em&quot;))
t.cem.acc &lt;- system.time(
fit.cem.acc &lt;- update(fit.cem, accelerate = &quot;squarem&quot;))
t.em.acc &lt;- system.time(
fit.em.acc &lt;- update(fit.em, accelerate = &quot;squarem&quot;))
objs = list(&quot;glm&quot;=fit.glm,
&quot;glm2&quot;=fit.glm2,
&quot;cem&quot;=fit.cem,
&quot;em&quot;=fit.em,
&quot;cem.acc&quot; = fit.cem.acc,
&quot;em.acc&quot; = fit.em.acc)
params = c(&#39;converged&#39;, &quot;loglik&quot;, &quot;iter&quot;)
to_dataframe = function(objs, params) {
#param = params[1]
#obj[[param]]
dat = data.frame(model=names(objs))
for (param in params) {
dat[[param]] = sapply(objs,
function(x)
x[[param]])
}
return(dat)
}
dat = to_dataframe(objs, params)
dat$time = c(t.glm[&#39;elapsed&#39;],
t.glm2[&#39;elapsed&#39;],
t.cem[&#39;elapsed&#39;],
t.em[&#39;elapsed&#39;],
t.cem.acc[&#39;elapsed&#39;],
t.em.acc[&#39;elapsed&#39;])</code></pre>
<p>Let’s see the result.</p>
<pre class="r"><code>print(dat)</code></pre>
<pre><code>## model converged loglik iter time
## 1 glm FALSE -186.7366 10000 1.61
## 2 glm2 TRUE -179.9016 14 0.00
## 3 cem TRUE -179.9016 223196, 8451 42.47
## 4 em TRUE -179.9016 6492 2.34
## 5 cem.acc TRUE -179.9016 4215, 114 3.78
## 6 em.acc TRUE -179.9016 81 0.09</code></pre>
<p>The authors of the package <code>logbin</code> state that <code>cem</code> is the best method, but here it took the longest time. <code>glm2</code> was the fastest and converged, but <code>glm2</code> requires sensible starting points. So we cannot tell which will win when the data are larger and the model is more complex.</p>
<p>In the next post, I will explain how the model and the meaning of the coefficients change with different link functions.</p>
</div>
<div class="footnotes footnotes-end-of-document">
<hr />
<ol>
<li id="fn1"><p>This one uses <code>glm2</code> package. I think <code>logbin</code> is just a wrapper in this case. I omitted warnings and messages.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>regression</tag>
<tag>binary</tag>
<category>R</category>
</item>
<item>
<title>Why mean substitution is a bad idea, almost always</title>
<link>https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/</link>
<pubDate>Fri, 25 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/index_files/header-attrs/header-attrs.js"></script>
<p>Missing values can cause bias, so most books introduce imputation methods like mean substitution or LOCF (<strong>L</strong>ast <strong>O</strong>bservation <strong>C</strong>arried <strong>F</strong>orward). But in this post, I will explain, with a simple example, why people say unconditional mean substitution is bad.</p>
<div id="mechanisms-of-missingness" class="section level2">
<h2>Mechanisms of Missingness</h2>
<p>Little and Rubin(2002) categorized missingness into three categories.</p>
<ol style="list-style-type: decimal">
<li>MCAR(<strong>M</strong>issing <strong>C</strong>ompletely <strong>A</strong>t <strong>R</strong>andom)</li>
<li>MAR(<strong>M</strong>issing <strong>A</strong>t <strong>R</strong>andom)</li>
<li>NMAR(<strong>N</strong>ot <strong>M</strong>issing <strong>A</strong>t <strong>R</strong>andom)</li>
</ol>
<p>In simple terms, MCAR means missing <strong>unconditionally</strong> at random, MAR means missing <strong>conditionally</strong> at random, and NMAR is neither of the two.</p>
<p>Here is an (unrealistic, but) simple example model.</p>
<p><span class="math display">\[\textrm{Weight} = 0.48 \times \textrm{Height} + e\]</span></p>
<p>I will cover only the missing weight values in this post. Assume that we can somehow find out what the real value is even if it is missing.</p>
<p>If the missingness of weight is <strong>not related to any variable, including weight itself</strong>, it is called <strong>MCAR</strong>. It is as if missingness were <strong>totally determined by flipping coins</strong>.</p>
<p>If the missingness of weight is <strong>conditional on height (and other variables in the model) but independent of the weight value itself</strong>, it is called <strong>MAR</strong>. The overall distribution of weight can differ depending on missingness, but given the information of height (and the other variables in the model), it is identical: missingness is independent of weight, given height (and the other variables in the model). So we can say missingness is <strong>determined by flipping coins conditional on the value of height (and the other variables in the model)</strong>.</p>
<p>If we digest the above using some basic probability rules, we reach the conclusion below.</p>
<blockquote>
<p>Missing value distribution is not different from observed value distribution, given the value of other variables in the model.</p>
</blockquote>
<p>The following might be true.</p>
<p><span class="math display">\[p(y|y_{\textrm{missing}}) \neq p(y|y_{\textrm{observed}})\]</span></p>
<p>But the following holds true.</p>
<p><span class="math display">\[p(y|y_{\textrm{missing}}, x_1, x_2, \cdots, x_p) = p(y|y_{\textrm{observed}}, x_1, x_2, \cdots, x_p)\]</span></p>
<p>It is like flipping a coin again, but it could be a different coin for different values of the explanatory variables.</p>
<p>If it is neither MCAR nor MAR, it is called <strong>NMAR</strong>. For example, if the missingness of weight is dependent on weight itself, it is NMAR.</p>
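<p>Here is a minimal simulation sketch, with made-up numbers based on the toy weight-height model above, showing what the missingness indicator depends on under each mechanism.</p>
<pre class="r"><code>set.seed(1)
n &lt;- 1000
height &lt;- rnorm(n, 170, 15)
weight &lt;- 0.48 * height + rnorm(n, 0, 7)
# MCAR: a plain coin flip, unrelated to any variable
miss_mcar &lt;- runif(n) &lt; 0.3
# MAR: the coin depends on height (observed), not on the weight value
miss_mar  &lt;- runif(n) &lt; plogis((height - 170) / 10)
# NMAR: the coin depends on the weight value itself
miss_nmar &lt;- runif(n) &lt; plogis((weight - 81.6) / 5)
c(mean(miss_mcar), mean(miss_mar), mean(miss_nmar))</code></pre>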
<p>For predictive models, it is sufficient to check whether missingness is random conditional on the other observed variables. But for causal models, we should also consider whether there are unobserved variables that might cause or be related to missingness. For now, I will consider only predictive models.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="why-not-just-use-complete-data" class="section level2">
<h2>Why not just use complete data?</h2>
<p>The main problem is <strong>the bias</strong> introduced by missing data. As the number of missing values increases, the bias can become huge. Another problem is the decreasing sample size: a smaller sample size means <strong>less power</strong>.</p>
</div>
<div id="how-was-missing-data-handled-traditionally" class="section level2">
<h2>How was missing data handled, traditionally?</h2>
<ol style="list-style-type: decimal">
<li>Listwise Deletion : Use only complete data</li>
<li>Pairwise Deletion : Use all data available for each analysis</li>
<li>Unconditional mean substitution</li>
<li>Regression Imputation(Conditional mean substitution)</li>
<li>Stochastic Regression Imputation</li>
<li>Hot-Deck Imputation</li>
<li>Last Observation Carried Forward</li>
</ol>
<p><strong>Listwise Deletion</strong> means using only complete data: any record with at least one missing value is ignored.</p>
<p><strong>Pairwise Deletion</strong> means using all available data for each estimate. Let’s say we need to compute the covariance matrix of variables <span class="math inline">\(X_1\)</span> , <span class="math inline">\(X_2\)</span> , and <span class="math inline">\(Y\)</span> . We need to compute the covariance of each pair, and we can use records with missing <span class="math inline">\(Y\)</span> when computing the covariance of <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> . This method can produce a covariance matrix estimate that is not positive definite.</p>
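<p>As a small sketch of the difference (using simulated data, not data from this post), base R’s <code>cov()</code> supports both strategies through its <code>use=</code> argument.</p>
<pre class="r"><code>set.seed(2)
d &lt;- data.frame(x1 = rnorm(50), x2 = rnorm(50), y = rnorm(50))
d$y[1:20] &lt;- NA                         # some y values are missing
cov(d, use = &quot;complete.obs&quot;)          # listwise deletion
cov(d, use = &quot;pairwise.complete.obs&quot;) # pairwise deletion</code></pre>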
<p><strong>Unconditional mean substitution</strong> imputes missing values with the variable mean.</p>
<p><strong>Regression Imputation</strong> uses regression analysis and imputes missing values with the regression mean.</p>
<p><strong>Stochastic Regression Imputation</strong> also uses regression analysis, but imputes missing values with an additional stochastic error term added to the regression mean.</p>
<p><strong>Hot-Deck Imputation</strong> imputes missing values with values from other complete records. Wikipedia describes it as follows.</p>
<blockquote>
<p>A once-common method of imputation was hot-deck imputation where a missing value was imputed from a <strong>randomly selected similar record</strong>. The term “hot deck” dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients.</p>
</blockquote>
<p>Statisticians at the Census Bureau originally developed the hot-deck to deal with missing data in public-use data sets, and the procedure has a long history in survey applications (Scheuren, 2005; Enders, 2010).</p>
<p><strong>Last Observation Carried Forward</strong> imputes a missing value with the last observed value for the same group in longitudinal data.</p>
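<p>A minimal LOCF sketch on hypothetical longitudinal data, here using <code>tidyr::fill()</code> (one of several ways to do it):</p>
<pre class="r"><code>library(dplyr)
library(tidyr)
long &lt;- data.frame(id   = c(1, 1, 1, 2, 2, 2),
                   time = c(1, 2, 3, 1, 2, 3),
                   y    = c(5.1, NA, NA, 3.0, 3.4, NA))
long %&gt;%
  group_by(id) %&gt;%
  fill(y, .direction = &quot;down&quot;) %&gt;%  # carry the last observed y forward
  ungroup()</code></pre>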
<p>Which of the missing data methods above is unbiased depends on <strong>whether the missingness is MCAR, MAR, or NMAR</strong> and on <strong>what one is estimating</strong>. For instance, for estimating the regression parameter <span class="math inline">\(b_1\)</span> of <span class="math inline">\(y=b_0 + b_1 x_1\)</span>, listwise deletion is fairly appropriate for MCAR or MAR data, unless we care much about losing power. But estimating the mean of <span class="math inline">\(y\)</span> by averaging only the observed <span class="math inline">\(y\)</span> could be seriously biased for MAR data if we use listwise deletion.</p>
<p>The deletion or imputation methods above result in complete data, so we can then use complete-data analysis methods. That is why we prefer imputation or deletion over other special methods developed for dealing with missing data.</p>
</div>
<div id="estimating-mean-y-mar-listwise-deletion" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MCAR &amp; listwise deletion</h2>
<p>If <span class="math inline">\(y\)</span> is missing completely at random, estimating the mean of <span class="math inline">\(y\)</span> using only the observed data is okay because <span class="math inline">\(p(y|y_\textrm{missing}) = p(y|y_\textrm{observed})\)</span> .</p>
</div>
<div id="estimating-mean-y-mar-mean-substitution" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MAR &amp; listwise deletion or mean substitution</h2>
<p>If <span class="math inline">\(y\)</span> is missing conditionally at random, using only the observed data could be problematic because <span class="math inline">\(p(y|y_\textrm{missing}) \neq p(y|y_\textrm{observed})\)</span> . Let’s say <span class="math inline">\(p(\textrm{missing}) \sim \textrm{height}\)</span> : the probability that weight is missing increases as height increases. In that case, the probability that weight is missing also increases as weight increases. So just using the complete data means deleting higher weight values and introducing a bias in estimating the mean weight, and substituting the missing values with the observed mean leaves that bias in place.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></p>
</div>
<div id="estimating-mean-y-mar-mean-substitution-1" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MAR &amp; regression imputation (conditional mean substitution)</h2>
<p>So we would do better to use the regression mean (the estimated mean weight given the height). Using the conditional mean for missing data might lead to a variance estimate that is too small, but the mean estimate is not biased.</p>
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="simulation" class="section level2">
<h2>Simulation</h2>
<div id="data" class="section level3">
<h3>Data</h3>
<pre class="r"><code>library(dplyr)
library(tidyr)
library(ggplot2)
# sample size 100
n &lt;- 100
# height mean 170, std 15
h &lt;- rnorm(100, 170, 15)
# true relation : weight = 0.48 * height
# given height, weight distribution N(0,7^2)
w &lt;- 0.48 * h + rnorm(n, 0, 7)
# weight population mean
w_pop &lt;- 170*0.48 # 81.6
h_pop &lt;- 170
# missing is dependent on height
w_missing &lt;- runif(n, 0, 1) &lt; (h-min(h))/(max(h)-min(h))
dat = data.frame(h=h,
w=ifelse(w_missing, NA, w),
w_complete = w,
w_missing = w_missing)
#dat %&gt;% gather()
ggplot(dat, aes(x=h, y=w_complete, col=factor(w_missing))) +
geom_point() +
scale_color_manual(values=c(&#39;black&#39;, &#39;grey&#39;))</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/index_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="listwise-deletion" class="section level3">
<h3>Listwise deletion</h3>
<p>The weight mean estimated from only the observed cases seems biased compared with the mean computed from all weights.</p>
<pre class="r"><code>## average weight?
mean(dat$w_complete)</code></pre>
<pre><code>## [1] 79.86512</code></pre>
<pre class="r"><code>wmean_est = mean(dat$w, na.rm = TRUE)
wmean_est</code></pre>
<pre><code>## [1] 76.49322</code></pre>
<pre class="r"><code>t.test(dat$w)</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w
## t = 58.46, df = 47, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 73.86092 79.12552
## sample estimates:
## mean of x
## 76.49322</code></pre>
</div>
<div id="mean-substitution" class="section level3">
<h3>Mean substitution</h3>
<p>A simple but bad alternative, in terms of bias for MAR data, is mean substitution.</p>
<pre class="r"><code>w_mean &lt;- mean(dat$w, na.rm=TRUE)
dat$w_imputed &lt;- ifelse(dat$w_missing, w_mean, w)
wmean_est = mean(dat$w_imputed)
wmean_est</code></pre>
<pre><code>## [1] 76.49322</code></pre>
<pre class="r"><code>res &lt;- t.test(dat$w_imputed)
res</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_imputed
## t = 122.46, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 75.25384 77.73260
## sample estimates:
## mean of x
## 76.49322</code></pre>
<p>The population weight mean is 81.6 but the estimated mean is 76.49. The confidence interval is 75.25-77.73.</p>
</div>
<div id="regression-imputation" class="section level3">
<h3>Regression imputation</h3>
<p>We can use a regression model to impute the missing values.</p>
<pre class="r"><code>mod &lt;- lm(w ~ h)
w_hat &lt;- predict(mod, dat)
#w_hat &lt;- coef(mod) %*% rbind(1,dat$h)
dat$w_imputed &lt;- ifelse(dat$w_missing, w_hat, w)
wmean_est = mean(dat$w_imputed)
wmean_est</code></pre>
<pre><code>## [1] 79.89369</code></pre>
<pre class="r"><code>res &lt;- t.test(dat$w_imputed)
res</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_imputed
## t = 94.065, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 78.20839 81.57898
## sample estimates:
## mean of x
## 79.89369</code></pre>
<p>The estimated weight mean is 79.89 and the 95% confidence interval is 78.21-81.58. Compare this with the mean of the fully observed weights (as if nothing were missing) below.</p>
<pre class="r"><code>mean(dat$w_complete)</code></pre>
<pre><code>## [1] 79.86512</code></pre>
<pre class="r"><code>t.test(dat$w_complete)</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_complete
## t = 81.261, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 77.91499 81.81525
## sample estimates:
## mean of x
## 79.86512</code></pre>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Here are some explanatory posts and a paper that show how the causal model can be beneficial to understanding missing mechanisms: <a href="https://www.rdatagen.net/post/musings-on-missing-data/">Musings on missing data</a>, <a href="http://jakewestfall.org/blog/index.php/2017/08/22/using-causal-graphs-to-understand-missingness-and-how-to-deal-with-it/">Using causal graphs to understand missingness and how to deal with it</a>, <a href="https://proceedings.neurips.cc/paper/2013/file/0ff8033cf9437c213ee13937b1c4c455-Paper.pdf">Graphical Models for Inference with Missing Data</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>In fact, simulation studies suggest that mean imputation is possibly the worst missing data handling method available(Enders, 2010).<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>missing</tag>
<category>R</category>
</item>
<item>
<title>measurement units</title>
<link>https://kwhkim.github.io/maxR/2022/03/15/measurement-units/</link>
<pubDate>Tue, 15 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/15/measurement-units/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/15/measurement-units/index_files/header-attrs/header-attrs.js"></script>
<p>Data <code>mtcars</code> has a column named <code>mpg</code>. <code>mpg</code> means <strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon. ‘Mile’ and ‘gallon’ are units for length and volume: a mile is approximately 1.6 kilometers and a gallon is approximately 3.7 liters. Mile and gallon sound unfamiliar to people who live outside the U.K. or the U.S.A. because the international standard units for length and volume are the meter and the liter.</p>
<p>In this post, we will learn how to convert a unit to another unit, for instance, we will convert mpg to km/L, which is more comprehensible to people who use SI units.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<div id="units-in-r" class="section level2">
<h2>Units in R</h2>
<p>Vectors (the most common data structure in R) do not carry information about measurement units. Units are implicit, and it is up to the user to convert them. But as history tells us, unit conversion should be treated carefully because it can cause serious damage to a whole project<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.</p>
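<p>For comparison, a manual conversion looks like the sketch below (the factors are exact by definition: 1 international mile = 1.609344 km and 1 US liquid gallon = 3.785411784 L); the package introduced next does this for us and keeps track of the unit.</p>
<pre class="r"><code>data(mtcars)
kmL_per_mpg &lt;- 1.609344 / 3.785411784  # km/L corresponding to 1 mpg (US)
head(mtcars$mpg * kmL_per_mpg)         # manual equivalent of the converted values shown below</code></pre>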
</div>
<div id="package-units" class="section level2">
<h2>Package <code>units</code></h2>
<p>Using the package <code>units</code>, we can convert units easily and accurately. And when the data are plotted, the units are included in the x- or y-axis label automatically.</p>
<p>First install package <code>units</code>.</p>
<pre class="r"><code>install.packages(&#39;units&#39;)</code></pre>
<p>And load all the necessary packages and data</p>
<pre class="r"><code>library(units)</code></pre>
<pre><code>## udunits database from C:/Users/Seul/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml</code></pre>
<pre class="r"><code>library(dplyr, warn.conflicts = FALSE)
data(mtcars)</code></pre>
<p>To get the information about the data <code>mtcars</code>, we can do <code>help(mtcars)</code>. It will show the measurement unit for each column. <code>mpg</code> is measured in unit of <strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon, <code>disp</code> is measured in unit of cubic inch, <code>hp</code> is measured in unit of gross <strong>h</strong>orse<strong>p</strong>ower, <code>wt</code> is measured in unit of 1000 lbs, and <code>qsec</code> is measured in unit of sec per 1/4 mile.</p>
<p>It is sad that mpg(<strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon) is not registered in the package <code>units</code>, but we can register it ourselves. The code below installs a new unit called <code>mpg_US</code>, defined as <code>international_mile/US_liquid_gallon</code>.</p>
<pre class="r"><code>install_unit(name=&#39;mpg_US&#39;, def=&#39;international_mile/US_liquid_gallon&#39;)</code></pre>
<p>Now we can use <code>mpg_US</code>. Below we set the unit of <code>mtcars$mpg</code> to mpg (US) and the unit of <code>mtcars$wt</code> to kilograms.</p>
<pre class="r"><code>units(mtcars$mpg) = &#39;mpg_US&#39;
units(mtcars$wt) = &#39;kg&#39;
mtcars$mpg %&gt;% head</code></pre>
<pre><code>## Units: [mpg_US]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1</code></pre>
<p>If we want to convert the unit mpg (US) to the metric unit km/L, we do the following.<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a></p>
<pre class="r"><code>units(mtcars$mpg) = &#39;km/L&#39;
mtcars$mpg %&gt;% head</code></pre>
<pre><code>## Units: [km/L]
## [1] 8.928017 8.928017 9.693276 9.098075 7.950187 7.695101</code></pre>
<p>We can easily plot the relation between <code>mpg</code> and <code>wt</code> using the package <code>ggplot2</code>. But do not forget to load <code>ggforce</code> beforehand.</p>
<pre class="r"><code>library(ggplot2)
library(ggforce) # without this, the code below will raise error!</code></pre>
<pre><code>## Registered S3 method overwritten by &#39;ggforce&#39;:
## method from
## scale_type.units units</code></pre>
<pre class="r"><code>ggplot(data=mtcars,
aes(x=mpg, y=wt)) +
geom_point()</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/15/measurement-units/index_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<ul>
<li>Use <code>units::units()&lt;-</code> to set the unit of a measurement.
<ul>
<li>Use <code>units::units()&lt;-</code> to convert units.</li>
<li>Use <code>units::units()&lt;-NULL</code> to delete the unit.</li>
</ul></li>
<li>Use <code>install_unit(name=, def=)</code> for introducing new units.</li>
<li>Use <code>valid_udunits()</code> to show all the units available from the package <code>units</code>.</li>
</ul>
<pre class="r"><code>valid_udunits() %&gt;% head </code></pre>
<pre><code>## udunits database from C:/Users/Seul/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml</code></pre>
<pre><code>## # A tibble: 6 x 11
## symbol symbol_aliases name_singular name_singular_aliases name_plural
## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
## 1 m &quot;&quot; meter &quot;metre&quot; &quot;&quot;
## 2 kg &quot;&quot; kilogram &quot;&quot; &quot;&quot;
## 3 s &quot;&quot; second &quot;&quot; &quot;&quot;
## 4 A &quot;&quot; ampere &quot;&quot; &quot;&quot;
## 5 K &quot;&quot; kelvin &quot;&quot; &quot;&quot;
## 6 mol &quot;&quot; mole &quot;&quot; &quot;&quot;
## # ... with 6 more variables: name_plural_aliases &lt;chr&gt;, def &lt;chr&gt;,
## # definition &lt;chr&gt;, comment &lt;chr&gt;, dimensionless &lt;lgl&gt;, source_xml &lt;chr&gt;</code></pre>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="https://en.wikipedia.org/wiki/International_System_of_Units">International System of Units</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>It is well known that the failure of MCO(<strong>M</strong>ars <strong>C</strong>limate <strong>O</strong>rbiter) is due to inadequate unit coversion.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p>If the objective is simply to reset the unit, do <code>units(mtcars$mpg)=NULL; units(mtcars$mpg)='km/L'</code>. This will not convert unit but just replace the unit with another unit.<a href="#fnref3" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>visualization</tag>
<tag>preprocessing</tag>
<category>R</category>
</item>
<item>
<title>Better Visualization of y|x for Big Data</title>
<link>https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/</link>
<pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/header-attrs/header-attrs.js"></script>
<div id="plotting-big-data-and-alpha" class="section level2">
<h2>Plotting Big Data and Alpha</h2>
<p>When plotting too many data points, we use <code>alpha=</code> because points overlap and become indistinguishable.</p>
<pre class="r"><code>library(dplyr)
library(data.table)
library(cowplot)
library(ggplot2)
#N &lt;- 1000
N &lt;- 1000000
x &lt;- rnorm(N)
y &lt;- x + rnorm(N)
dat &lt;- data.table(x = x,
y = y)</code></pre>
<pre class="r"><code>dat %&gt;% ggplot(aes(x=x, y=y)) +
geom_point() +
labs(title=&#39;Original plot&#39;)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-2-1.png" width="70%" /></p>
<pre class="r"><code>dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;Using alpha=0.01&#39;)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-3-1.png" width="70%" /></p>
<p>The smallest meaningful <code>alpha</code> for <code>ggplot2</code> seems to be <code>0.01</code>. For very big data, <code>alpha=0.01</code> is not small enough. Looking at the plot above, we see a big black region in the center. This might mean that the densities in the center are the same, or it might mean that they have reached the ceiling of blackness even though the densities are not equal.</p>
<p>A bivariate normal distribution is too simple. Let’s try more complex data.</p>
<pre class="r"><code>x1 &lt;- rnorm(N/2)
y1 &lt;- 2*sin(x1) + rnorm(N/2)
x2 &lt;- rnorm(N/2)
y2 &lt;- 2*cos(x2) + rt(N/2, df=30)
dat &lt;- data.table(x=c(x1,x2),
y=c(y1,y2))</code></pre>
<div id="using-multiple-alphas" class="section level3">
<h3>Using multiple <code>alpha</code>s</h3>
<p>We can use multiple <code>alpha</code>s to avoid the ceiling effect of a constant <code>alpha</code>.</p>
<pre class="r"><code>p1 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point() +
labs(title=&#39;alpha=1&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.1) +
labs(title=&#39;alpha=0.1&#39;) + theme_minimal()
p3 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.05) +
labs(title=&#39;alpha=0.05&#39;) + theme_minimal()
p4 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01&#39;) + theme_minimal()
plot_grid(p1,p2,p3,p4,ncol=2)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-5-1.png" width="70%" /></p>
<p>But using the minimal <code>alpha=0.01</code> does not reveal the density differences in the center. We can try sampling in this case.</p>
</div>
<div id="sampling" class="section level3">
<h3>Sampling</h3>
<pre class="r"><code>p1 &lt;- dat %&gt;% sample_n(N/5) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 20% of data&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% sample_n(N/10) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 10% of data&#39;) + theme_minimal()
p3 &lt;- dat %&gt;% sample_n(N/50) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 2% of data&#39;) + theme_minimal()
p4 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 1% of data&#39;) + theme_minimal()
library(cowplot)
plot_grid(p1,p2,p3,p4,ncol=2)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-6-1.png" width="70%" /></p>
<p>But sampling uses only part of the data, and it depends on chance, so the results are different every time we plot.</p>
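<p>One small mitigation, sketched below: fixing the random seed before sampling makes a given sampled plot reproducible, though it is still only a sample.</p>
<pre class="r"><code>set.seed(2022)  # any fixed seed makes the sample, and hence the plot, reproducible
dat %&gt;% sample_n(N/100) %&gt;%
  ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) + theme_minimal()</code></pre>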
</div>
</div>
<div id="contidional-density-plot" class="section level2">
<h2>Conditional density plot</h2>
<p>There are several reasons for plotting. One is doing EDA(<strong>E</strong>xploratory <strong>D</strong>ata <strong>A</strong>nalysis) before fitting a regression model such as a linear model, ML, or DL.</p>
<p>The important thing in this case is to see what the conditional density <span class="math inline">\(\mathbb{p}(y|x)\)</span> is like, whereas all the plots above focus on the bivariate density.</p>
<p>To visualize the expectation of <span class="math inline">\(y\)</span> conditional on <span class="math inline">\(x\)</span> , a non-parametric regression line like the following will help.</p>
<div id="regression-line" class="section level3">
<h3>Regression line</h3>
<pre class="r"><code>p1 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
geom_smooth(method=&#39;loess&#39;) +
labs(title=&#39;alpha=0.01 with 1% of data, loess&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
geom_smooth(method=&#39;auto&#39;) +
labs(title=&#39;alpha=0.01 with 1% of data, gam&#39;) + theme_minimal()
print(p1)</code></pre>
<pre><code>## `geom_smooth()` using formula &#39;y ~ x&#39;</code></pre>
<pre class="r"><code>print(p2)</code></pre>
<pre><code>## `geom_smooth()` using method = &#39;gam&#39; and formula &#39;y ~ s(x, bs = &quot;cs&quot;)&#39;</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-7-1.png" width="70%" /><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-7-2.png" width="70%" /></p>
<p>We can definitely see conditional expectation( <span class="math inline">\(\mathbb{E}[y|x] = \int y\ \mathbb{p}(y|x) dy\)</span> ), but we cannot figure out what the conditional density would be like.</p>
</div>
<div id="conditional-density" class="section level3">
<h3>Conditional density</h3>
<div id="binning-x" class="section level4">
<h4>binning <span class="math inline">\(x\)</span></h4>
<p>As we saw above, using a small constant <code>alpha</code> prevents us from identifying density differences where the points are densely packed, and from identifying individual data points where the data are scarce. One possible solution is binning <span class="math inline">\(x\)</span> and sampling within each bin.</p>
<pre class="r"><code>dat %&gt;%
mutate(xCut = cut(x, breaks=10)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 10000, replace=TRUE)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-8-1.png" width="70%" /></p>
<p>This visualizes the conditional density better, but we can see artifacts. It must be because the bin size is too big. Let’s try a smaller bin size.</p>
<pre class="r"><code>dat %&gt;%
mutate(xCut = cut(x, breaks=50)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 2000, replace=TRUE)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-9-1.png" width="70%" /></p>
</div>
<div id="estimating-density-of-x" class="section level4">
<h4>Estimating density of <span class="math inline">\(x\)</span></h4>
<p>We can never say what the best bin size is. We would do better to estimate the probability density function of <span class="math inline">\(x\)</span>.</p>
<p>We also treated every <span class="math inline">\(x\)</span> as identical. We might take the estimated probability density of <span class="math inline">\(x\)</span> into consideration, using either the density itself or some function (e.g. <span class="math inline">\(\log\)</span> ) of it as a sampling weight.</p>
<pre class="r"><code>xDensity &lt;- ks::kde(dat$x)
dat$prob &lt;- predict(xDensity, x = dat$x)
#head(dat)
dat %&gt;%
mutate(xCut = cut(x, breaks=50)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 2000, replace=TRUE, weight=1/prob)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01) </code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-10-1.png" width="70%" /></p>
</div>
</div>
</div>
</description>
<tag>big data</tag>
<tag>visualization</tag>
<category>R</category>
</item>
<item>
<title>character in UTF-8</title>
<link>https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/</link>
<pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/index_files/header-attrs/header-attrs.js"></script>
<div id="encoding" class="section level2">
<h2>Encoding</h2>
<p>A computer can store data only with 0s and 1s. By putting together many 0s and 1s, a computer can represent bigger numbers. But if it wants to store a letter, it needs a mapping between numbers and letters. This mapping is called “<strong>encoding</strong>”.</p>
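<p>In R, this number-letter mapping can be inspected directly with base functions; a quick sketch:</p>
<pre class="r"><code>utf8ToInt(&quot;A&quot;)    # the code point of &quot;A&quot; is 65 (hex 41)
intToUtf8(65)       # and 65 maps back to &quot;A&quot;
intToUtf8(0xD55C)   # code point U+D55C is the Hangul syllable 한</code></pre>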
<p>The <strong>encoding</strong> depends on the letters to store, and the letters people use differ across countries and languages. There are over 1000 encodings worldwide, but over 90% of the pages on the internet use UTF-8<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>.</p>
</div>
<div id="unicode" class="section level2">
<h2>Unicode</h2>
<p>We are in the internet era. It has become ordinary to send documents across national borders. But encodings were usually made for use within one country, so documents from a foreign country might not be read properly because the encoding was different<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.</p>
<p>Unicode was developed for this kind of problem. Unicode tries to have a mapping for all the characters that exist today or have existed throughout history. The Unicode Consortium<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> is a non-profit organization that develops Unicode. The number that a character maps to is called its <strong>code point</strong>. You can look up code points in the <strong>Unicode Code Charts</strong><a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>. As the Unicode version increases, the number of characters that Unicode can represent also increases<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a>.</p>
<p>Version 1.0.0 (1991) supported 7,129 characters from 24 scripts. Version 14 (2021) includes 144,697 characters from 159 scripts. Version 6.0 (2010) decided to support emojis because cellular phone makers demanded them; otherwise emojis could have evolved into different encodings for different phone makers, which is exactly the kind of fragmentation Unicode was created to avoid. Unicode tries to incorporate all the characters worldwide so that any encoding can be converted to Unicode. Unicode is materialized into specific encoding schemes such as UTF-8, UTF-16, and UTF-32; these encoding schemes add an additional layer that determines how code points are stored as bytes.</p>
</div>
<div id="character-in-r" class="section level2">
<h2><code>character</code> in R</h2>
<p>R supports UTF-8. If a <code>character</code> is not in UTF-8, it can be converted to UTF-8 using the following code.</p>
<pre class="r"><code>x = &#39;한글&#39; # Hangul in Korean
y = iconv(x, to=&#39;UTF-8&#39;)
x
## [1] &quot;한글&quot;
y
## [1] &quot;한글&quot;
Encoding(x)
## [1] &quot;unknown&quot;
Encoding(y)
## [1] &quot;UTF-8&quot;</code></pre>
<p>UTF-8 is powerful. It can represent almost any character. But fonts are limited: most fonts cannot display all characters. So there are cases where characters stored in a vector cannot be displayed properly because the font in use does not support them.</p>
<p>I propose the following function <code>u_chars()</code> for inspecting UTF-8 characters.</p>
</div>
<div id="u_chars" class="section level2">
<h2><code>u_chars</code></h2>
<p>The following function <code>u_chars()</code> utilizes the package <code>Unicode</code> and prints out information about each character in a string. Unicode characters have labels, so characters that cannot be displayed properly can still be identified.</p>
<pre class="r"><code>u_chars = function(s, encodings) {
stopifnot(class(s) == &quot;character&quot;)
stopifnot(length(s)==1)
if (Encoding(s) == &quot;unknown&quot;) {
s = iconv(s, to = &#39;UTF-8&#39;)
} else if (Encoding(s) != &#39;UTF-8&#39;) {
s = iconv(s, from = Encoding(s), to=&#39;UTF-8&#39;) }
dat = data.frame(ch = unlist(strsplit(s, &quot;&quot;))) # split characters
cps = sapply(dat$ch, utf8ToInt) # unicode codepoint
cps_hex = sprintf(&quot;%02x&quot;, cps) # convert to hexadecimal number
# hexadecimal code points are displayed in one of the following styles &quot; ..ff&quot;, &quot; a1ff&quot;, &quot;011f3e&quot;
# the first two digits are rarely used, so they are shown blank when they are 00
# the following two digits are 00 when the code point is in ASCII, so they are shown as ..
cps_hex =
ifelse(nchar(cps_hex) &gt; 2,
stringi::stri_pad(cps_hex, width = 4, side = &#39;left&#39;, pad = &#39;0&#39;),
stringi::stri_pad(cps_hex, width = 4, side = &#39;left&#39;, pad = &#39;.&#39;))
dat$codepoint =
ifelse(nchar(cps_hex) &gt; 4,
stringi::stri_pad(cps_hex, width=6, side=&#39;left&#39;, pad=&#39;0&#39;),
stringi::stri_pad(cps_hex, width=6, side=&#39;left&#39;, pad=&#39; &#39;))
# if given encodings=
if (!missing(encodings)) {
for (encoding in encodings) {
ch_enc = vector(mode=&#39;character&#39;, length=nrow(dat))
for (i in 1:nrow(dat)) {
ch = dat$ch[i]
ch_enc[i] =
paste0(sprintf(&quot;%02x&quot;,
as.integer(unlist(
iconv(ch, from = &#39;UTF-8&#39;,
to=encoding, toRaw=TRUE)))),
collapse = &#39; &#39;)
}
dat$enc = ch_enc
names(dat)[length(names(dat))] = paste0(&#39;enc.&#39;, encoding)
}
}
dat$label = Unicode::u_char_label(cps);
dat
}</code></pre>
<pre class="r"><code>u_chars(&quot;\ufeff\u0041\ub098\u2211\U00010384&quot;)</code></pre>
<pre><code>## ch codepoint label
## 1 &lt;U+FEFF&gt; feff ZERO WIDTH NO-BREAK SPACE
## 2 A ..41 LATIN CAPITAL LETTER A
## 3 나 b098 HANGUL SYLLABLE NA
## 4 ∑ 2211 N-ARY SUMMATION
## 5 &lt;U+00010384&gt; 010384 UGARITIC LETTER DELTA</code></pre>
<p>You can see how the characters would be encoded in other encoding schemes using <code>encodings =</code>.</p>
<pre class="r"><code>u_chars(&quot;\ufeff\u0041똠\u2211\U00010384&quot;, encodings = c(&quot;CP949&quot;, &quot;latin1&quot;))</code></pre>
<pre><code>## ch codepoint enc.CP949 enc.latin1 label
## 1 &lt;U+FEFF&gt; feff ZERO WIDTH NO-BREAK SPACE
## 2 A ..41 41 41 LATIN CAPITAL LETTER A
## 3 똠 b620 8c 63 HANGUL SYLLABLE DDOM
## 4 ∑ 2211 a2 b2 N-ARY SUMMATION
## 5 &lt;U+00010384&gt; 010384 UGARITIC LETTER DELTA</code></pre>
</div>
<div id="another-application" class="section level2">
<h2>Another application</h2>
<pre class="r"><code>library(stringi)
x = &#39;\u0423\u043a\u0440\u0430\u0457\u043d\u0430&#39;
Encoding(x)
## [1] &quot;UTF-8&quot;
#x = iconv(x, to=&#39;UTF-8&#39;)
cat(x); cat(&#39;\n&#39;)
## Укра&lt;U+0457&gt;на
y = stri_trans_nfd(x)
cat(y); cat(&#39;\n&#39;)
## Укра&lt;U+0456&gt;&lt;U+0308&gt;на
u_chars(x)
## ch codepoint label
## 1 У 0423 CYRILLIC CAPITAL LETTER U
## 2 к 043a CYRILLIC SMALL LETTER KA
## 3 р 0440 CYRILLIC SMALL LETTER ER
## 4 а 0430 CYRILLIC SMALL LETTER A
## 5 &lt;U+0457&gt; 0457 CYRILLIC SMALL LETTER YI
## 6 н 043d CYRILLIC SMALL LETTER EN
## 7 а 0430 CYRILLIC SMALL LETTER A
u_chars(y)
## ch codepoint label
## 1 У 0423 CYRILLIC CAPITAL LETTER U
## 2 к 043a CYRILLIC SMALL LETTER KA
## 3 р 0440 CYRILLIC SMALL LETTER ER
## 4 а 0430 CYRILLIC SMALL LETTER A
## 5 &lt;U+0456&gt; 0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
## 6 &lt;U+0308&gt; 0308 COMBINING DIAERESIS
## 7 н 043d CYRILLIC SMALL LETTER EN
## 8 а 0430 CYRILLIC SMALL LETTER A</code></pre>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="https://w3techs.com/technologies/cross/character_encoding/ranking" class="uri">https://w3techs.com/technologies/cross/character_encoding/ranking</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p><a href="https://en.wikipedia.org/wiki/Mojibake" class="uri">https://en.wikipedia.org/wiki/Mojibake</a><a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p><a href="https://en.wikipedia.org/wiki/Unicode_Consortium" class="uri">https://en.wikipedia.org/wiki/Unicode_Consortium</a><a href="#fnref3" class="footnote-back">↩︎</a></p></li>
<li id="fn4"><p><a href="http://www.unicode.org/charts/" class="uri">http://www.unicode.org/charts/</a><a href="#fnref4" class="footnote-back">↩︎</a></p></li>
<li id="fn5"><p>There are several reasons for this. Unicode can embrace new languages or new characters can be found for only embraced languages.<a href="#fnref5" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>character</tag>
<tag>preprocessing</tag>
<tag>encoding</tag>
<category>R</category>
</item>
<item>
<title>About</title>
<link>https://kwhkim.github.io/maxR/about/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/about/</guid>
<description><h2 id="greetings">Greetings!</h2>
<p>Welcome to &ldquo;R to the max!&rdquo; This website is about R for data analysis. The posts are mostly translated from <a href="http://ds.sumeun.org/">Sumeun Data Science</a>. Here is some info about the maintainer of the site. Check out the links.</p>
<h3 id="education">Education</h3>
<ul>
<li>
<p>Seoul National University. B.S. in Physics.</p>
</li>
<li>
<p>Seoul National University. Ph.D. in Cognitive Science.</p>
</li>
</ul>
<h3 id="books">Books</h3>
<ul>
<li>
<p>Kim, K. H. (2022). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9791196014445&amp;orderClick=LEa&amp;Kc=">R로 하는 빅데이터 분석: 데이터 전처리와 시각화.</a> Big Data Analysis with R: Data preprocessing and visualization, 3rd. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H. (2019). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9791196014407&amp;orderClick=LAG&amp;Kc=">고등학교 인수분해 완전 정복.</a> Conquering high school factoring. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H., Kwak, M. Y., Lee, C. S. (2017). <a href="http://www.yes24.com/Product/Goods/43244145">수학의 숨은 원리.</a> Hidden principles of Mathematics. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H. (2013). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9788961057103&amp;orderClick=LAG&amp;Kc=">기초 통계학의 숨은 원리 이해하기.</a> Understanding the hidden principles of basic statistics. Kyungmoon. Seoul.</p>
</li>
</ul>
<h3 id="papers">Papers</h3>
<ul>
<li>
<p>Hyosoo Moon, Kwonhyun Kim, Hyun-Soo Lee, Moonseo Park, Trefor P. Williams, Bosik
Son, and Jae-Youl Chun(2020). Cost Performance Comparison of Design-Build and
Design-Bid-Build for Building and Civil Projects Using Mediation Analysis. Journal of Construction Engineering and Management.</p>
</li>
<li>
<p>Kim, Y. H., Jeon, J. H., Choe, E. K., Lee, B., Kim, K., Seo, J. (2016). TimeAware:
Leveraging framing effects to enhance personal productivity. In Proceedings of the
SIGCHI conference on human factors in computing systems.</p>
</li>
</ul>
<h3 id="awards">Awards</h3>
<ul>
<li><a href="https://www.asce.org/career-growth/awards-and-honors/thomas-fitch-rowland-prize">ASCE(<strong>A</strong>merican <strong>S</strong>ociety of <strong>C</strong>ivil <strong>E</strong>ngineers) Thomas Fitch Rowland Prize.</a> <a href="https://www.mk.co.kr/news/society/view/2022/02/170329/">(2022).</a></li>
</ul>
<!---
your comment goes here
and here
## Links
Github](https://github.com/kwhkim)
Twitter](https://twitter.com/kwnhkim)
-->
<h3 id="links">Links</h3>
<ul>
<li>
<p><a href="https://www.bigbookofr.com/?fbclid=IwAR0LFCPsikgV_qgIZOhgHPCJ5ZWsSQbEEPNzm8-EM9ci0IyL5d0Jo3HvYbM">Big Book of R</a></p>
</li>
<li>
<p><a href="https://www.r-bloggers.com/">R-bloggers</a></p>
</li>
</ul>
</description>
</item>
</channel>
</rss>