<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>R to the max</title>
<link>https://kwhkim.github.io/maxR/</link>
<description>Recent content on R to the max</description>
<generator>Hugo -- gohugo.io</generator>
<lastBuildDate>Wed, 24 Aug 2022 00:00:00 +0000</lastBuildDate><atom:link href="https://kwhkim.github.io/maxR/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Relative risk regression (1/2)</title>
<link>https://kwhkim.github.io/maxR/2022/08/24/relative-risk-regression/</link>
<pubDate>Wed, 24 Aug 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/08/24/relative-risk-regression/</guid>
<description>
<p>When the outcome variable is binary such as alive/dead or yes/no, the most popular analytic method is <strong>logistic regression</strong>.</p>
<p><span class="math display">\[\textrm{logit}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \]</span></p>
<p>The name “<strong>logistic</strong>” might have come from the equation below, which can be derived by applying the inverse of the logit function to both sides of the equation above.</p>
<p><span class="math display">\[ \mathbb{E}[y] = \textrm{logistic}( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)\]</span></p>
<p>The link function of the <strong>logistic</strong> regression is <code>logit()</code>. We can replace it with <code>log()</code> and the result looks like the following.</p>
<p><span class="math display">\[ \textrm{log}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \cdots \]</span>
This equation represents “<strong>Relative Risk Regression</strong>”, also known as <strong>log-binomial regression</strong>.</p>
<div id="risk-relative-risk" class="section level2">
<h2>Risk, Relative Risk</h2>
<p><strong>Risk</strong> is just another term for probability. For instance, “the probability of being struck by lightning” can be rephrased as “the <strong>risk</strong> of being struck by lightning”.</p>
<p><strong>Relative risk</strong> or <strong>risk ratio (RR)</strong> is the ratio of two probabilities (risks). Relative risk compares the probabilities of two events. For example, compare the probability of being struck by lightning when standing with nothing with the probability of being struck by lightning while holding an umbrella open. If we divide the second probability by the first, we get how many times more likely we are to be struck when holding an umbrella open compared to holding nothing at all. This is <strong>relative risk</strong>, or <strong>risk ratio</strong>. If it is 2, on average we will get struck twice (with an umbrella open) for every one strike (with nothing).</p>
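<p>As a tiny numerical illustration (the probabilities here are made up for illustration, not taken from any data), the risk ratio is simply one risk divided by the other:</p>
<pre class="r"><code>p_nothing  &lt;- 0.001  # hypothetical risk of being struck with nothing
p_umbrella &lt;- 0.002  # hypothetical risk with an umbrella open
p_umbrella / p_nothing  # relative risk (risk ratio) = 2</code></pre>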
<p>The name “<strong>Relative Risk</strong> Regression” seems to come from the fact that the coefficients of relative risk regression are closely related to relative risk! Let’s imagine a relative risk regression with only one predictor <span class="math inline">\(x\)</span> , which is <span class="math inline">\(1\)</span> for having an umbrella open, and <span class="math inline">\(0\)</span> for having nothing. We can compare <span class="math inline">\(y|x=0\)</span> and <span class="math inline">\(y|x=1\)</span> .</p>
<p><span class="math display">\[\log(y_{x=1}) = \beta_0 + \beta_1\]</span>
<span class="math display">\[\Rightarrow y_{x=1} = \exp(\beta_0 + \beta_1)\]</span></p>
<p><span class="math display">\[\Rightarrow y_{x=1} = \exp(\beta_0)\exp(\beta_1)\]</span></p>
<p><span class="math display">\[y_{x=0} = \exp(\beta_0)\]</span></p>
<p>Combining the last two equations, we can derive the following.</p>
<p><span class="math display">\[y_{x=1}/y_{x=0} = \exp(\beta_1)\]</span></p>
<p>If we interpret <span class="math inline">\(y_{x=1}\)</span> as the probability of being struck when <span class="math inline">\(x=1\)</span> (with an umbrella open), then the relative risk, or risk ratio, is <span class="math inline">\(\exp(\beta_1)\)</span> !</p>
<p>The risk of being struck with an umbrella open over the risk of being struck with nothing is the exponential of the coefficient <span class="math inline">\(\beta_1\)</span>. So if <span class="math inline">\(\beta_1\)</span> equals 1, having an umbrella open makes the risk approximately 2.718 ( <span class="math inline">\(\exp(1) = 2.718\cdots\)</span> ) times bigger: on average, you are likely to be struck 2.718 times (with an umbrella open) for every one time someone with nothing is struck.</p>
</div>
<div id="difficulties-of-applying-mle" class="section level2">
<h2>Difficulties of applying MLE</h2>
<p>Open any mathematical statistics textbook and you will see the wonderful characteristics of the MLE (<strong>M</strong>aximum <strong>L</strong>ikelihood <strong>E</strong>stimate). So MLE is the way to go when we estimate the coefficients of a relative risk regression. But estimating a relative risk regression is difficult because it requires optimizing the likelihood under parameter constraints. See the equations below.</p>
<p><span class="math display">\[\log(y|x_1) = \beta_0 + \beta_1 x_1\]</span>
<span class="math display">\[y|x_1 = \exp(\beta_0 + \beta_1 x_1)\]</span></p>
<p>Since <span class="math inline">\(y\)</span> stands for a probability, <span class="math inline">\(\exp(\beta_0 + \beta_1 x_1)\)</span> cannot be less than <span class="math inline">\(0\)</span> or greater than <span class="math inline">\(1\)</span> for any possible <span class="math inline">\(x_1\)</span> ! Another problem is that, since the parameters can lie on the edge of the feasible parameter space, it becomes difficult to estimate the variance of the parameters.</p>
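<p>A minimal sketch of the constraint, with made-up coefficient values: under a log link the implied probability <span class="math inline">\(\exp(\beta_0 + \beta_1 x_1)\)</span> is not automatically bounded above by <span class="math inline">\(1\)</span>.</p>
<pre class="r"><code>b0 &lt;- -2; b1 &lt;- 0.5  # hypothetical coefficients
x1 &lt;- 0:6
exp(b0 + b1 * x1)    # exceeds 1 as soon as b0 + b1*x1 &gt; 0</code></pre>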
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="using-r-for-relative-risk-regression" class="section level2">
<h2>Using R for <strong>Relative Risk Regression</strong></h2>
<p>We can use the traditional function <code>glm()</code> for relative risk regression, but the package <code>logbin</code> seems to offer more convenience and functionality; with <code>logbin</code> we can also choose the estimation method. Let’s get to it!</p>
<p>First we will use the Heart Attack Data (<code>data(heart)</code>). The description of the data can be found with <code>?heart</code>.</p>
<blockquote>
<p>This data set is a cross-tabulation of data on 16949 individuals who experienced a heart attack (ASSENT-2 Investigators, 1999). There are 4 categorical factors each at 3 levels, together with the number of patients and the number of deaths for each observed combination of the factors. This data set is useful for illustrating the convergence properties of glm and glm2.</p>
</blockquote>
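<p>As a minimal sketch of the base-R route mentioned above (the variable names here are just for illustration): the same log-link binomial model can be attempted with <code>glm()</code>, though without sensible starting values it may fail to converge, which is part of the motivation for <code>logbin</code>.</p>
<pre class="r"><code>require(glm2, quietly = TRUE)  # the heart data lives in glm2
data(heart)
start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit.base &lt;-
  glm(cbind(Deaths, Patients - Deaths) ~
        factor(AgeGroup) + factor(Severity)
      + factor(Delay) + factor(Region),
      family = binomial(link = &quot;log&quot;),
      data = heart,
      start = c(log(start.p), -rep(1e-4, 8)))
fit.base$converged</code></pre>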
<pre class="r"><code>library(dplyr)</code></pre>
<pre><code>##
## Attaching package: &#39;dplyr&#39;</code></pre>
<pre><code>## The following objects are masked from &#39;package:stats&#39;:
##
## filter, lag</code></pre>
<pre><code>## The following objects are masked from &#39;package:base&#39;:
##
## intersect, setdiff, setequal, union</code></pre>
<pre class="r"><code>library(tidyr)
library(ggplot2)
library(logbin) # https://github.com/mdonoghoe/logbin
require(glm2, quietly = TRUE)
data(heart)
head(heart)</code></pre>
<pre><code>## Deaths Patients AgeGroup Severity Delay Region
## 1 49 2611 1 1 1 1
## 2 1 74 1 1 1 2
## 3 2 96 1 1 1 3
## 4 30 2888 1 1 2 1
## 5 0 81 1 1 2 2
## 6 8 155 1 1 2 3</code></pre>
<p>We can fit the relative risk regression model to the data as follows. Notice that the response part of the formula is <code>cbind(# of successes, # of failures)</code>.</p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart)
fit$converged</code></pre>
<p>Using a binary response variable, we can do it like the following.</p>
<pre class="r"><code>sum(duplicated(heart %&gt;% select(AgeGroup:Region)))</code></pre>
<pre><code>## [1] 0</code></pre>
<pre class="r"><code>heart2 &lt;- heart %&gt;%
group_by(AgeGroup, Severity, Delay, Region) %&gt;%
summarise(data.frame(dead = c(rep(1,Deaths),
rep(0,Patients-Deaths)))) %&gt;%
ungroup()</code></pre>
<pre><code>## `summarise()` has grouped output by &#39;AgeGroup&#39;, &#39;Severity&#39;, &#39;Delay&#39;, &#39;Region&#39;.
## You can override using the `.groups` argument.</code></pre>
<pre class="r"><code>fit2 &lt;-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2)
fit2$converged</code></pre>
<p>For me, it took a LONG time! Here is a faster way.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
fit &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4,8)),
method = &#39;glm2&#39;)
cat(&#39;Is fit converged? &#39;, fit$converged, &#39;\n&#39;)</code></pre>
<pre><code>## Is fit converged? TRUE</code></pre>
<pre class="r"><code>fit2 &lt;-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2,
start = c(log(start.p), -rep(1e-4,8)),
method = &#39;glm2&#39;)
cat(&#39;Is fit2 converged? &#39;, fit2$converged, &#39;\n&#39;)</code></pre>
<pre><code>## Is fit2 converged? TRUE</code></pre>
<p>Here is a tip: use the form with the number of successes and the number of failures. Using the binary response took longer!</p>
<p>The results are almost identical.</p>
<pre class="r"><code>library(car)</code></pre>
<pre><code>## Loading required package: carData</code></pre>
<pre><code>##
## Attaching package: &#39;car&#39;</code></pre>
<pre><code>## The following object is masked from &#39;package:dplyr&#39;:
##
## recode</code></pre>
<pre class="r"><code>compareCoefs(fit, fit2)</code></pre>
<pre><code>## Calls:
## 1: logbin(formula = cbind(Deaths, Patients - Deaths) ~ factor(AgeGroup) +
## factor(Severity) + factor(Delay) + factor(Region), data = heart, start =
## c(log(start.p), -rep(1e-04, 8)), method = &quot;glm2&quot;)
## 2: logbin(formula = dead ~ factor(AgeGroup) + factor(Severity) +
## factor(Delay) + factor(Region), data = heart2, start = c(log(start.p),
## -rep(1e-04, 8)), method = &quot;glm2&quot;)
##
## Model 1 Model 2
## (Intercept) -4.0275 -4.0273
## SE 0.0889 0.0889
##
## factor(AgeGroup)2 1.104 1.104
## SE 0.089 0.089
##
## factor(AgeGroup)3 1.9268 1.9266
## SE 0.0924 0.0924
##
## factor(Severity)2 0.7035 0.7035
## SE 0.0701 0.0701
##
## factor(Severity)3 1.3767 1.3768
## SE 0.0955 0.0955
##
## factor(Delay)2 0.0590 0.0589
## SE 0.0693 0.0693
##
## factor(Delay)3 0.1718 0.1720
## SE 0.0808 0.0808
##
## factor(Region)2 0.0757 0.0757
## SE 0.1775 0.1775
##
## factor(Region)3 0.483 0.483
## SE 0.111 0.111
## </code></pre>
<p>The authors of <code>logbin</code> state that <code>logbin</code> solves problems that might pop up when using other packages.</p>
<p>Let’s compare!</p>
<pre class="r"><code>start.p &lt;- sum(heart$Deaths) / sum(heart$Patients)
t.glm &lt;- system.time(
fit.glm &lt;-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4, 8)),
method = &quot;glm&quot;,
maxit = 10000)
)
t.glm2 &lt;- system.time(
fit.glm2 &lt;- update(fit.glm, method=&#39;glm2&#39;))
t.cem &lt;- system.time(
fit.cem &lt;- update(fit.glm, method = &quot;cem&quot;)
#fit.cem &lt;- update(fit.glm, method=&#39;cem&#39;, start = NULL)
)
t.em &lt;- system.time(
fit.em &lt;- update(fit.glm, method = &quot;em&quot;))
t.cem.acc &lt;- system.time(
fit.cem.acc &lt;- update(fit.cem, accelerate = &quot;squarem&quot;))
t.em.acc &lt;- system.time(
fit.em.acc &lt;- update(fit.em, accelerate = &quot;squarem&quot;))
objs = list(&quot;glm&quot;=fit.glm,
&quot;glm2&quot;=fit.glm2,
&quot;cem&quot;=fit.cem,
&quot;em&quot;=fit.em,
&quot;cem.acc&quot; = fit.cem.acc,
&quot;em.acc&quot; = fit.em.acc)
params = c(&#39;converged&#39;, &quot;loglik&quot;, &quot;iter&quot;)
to_dataframe = function(objs, params) {
#param = params[1]
#obj[[param]]
dat = data.frame(model=names(objs))
for (param in params) {
dat[[param]] = sapply(objs,
function(x)
x[[param]])
}
return(dat)
}
dat = to_dataframe(objs, params)
dat$time = c(t.glm[&#39;elapsed&#39;],
t.glm2[&#39;elapsed&#39;],
t.cem[&#39;elapsed&#39;],
t.em[&#39;elapsed&#39;],
t.cem.acc[&#39;elapsed&#39;],
t.em.acc[&#39;elapsed&#39;])</code></pre>
<p>Let’s see the result.</p>
<pre class="r"><code>print(dat)</code></pre>
<pre><code>## model converged loglik iter time
## 1 glm FALSE -186.7366 10000 1.61
## 2 glm2 TRUE -179.9016 14 0.00
## 3 cem TRUE -179.9016 223196, 8451 42.47
## 4 em TRUE -179.9016 6492 2.34
## 5 cem.acc TRUE -179.9016 4215, 114 3.78
## 6 em.acc TRUE -179.9016 81 0.09</code></pre>
<p>The authors of the package <code>logbin</code> state that <code>cem</code> is the best method, but here it took the longest time. <code>glm2</code> was the fastest and converged, but <code>glm2</code> requires sensible starting points. So we cannot tell which will win when the data are larger and the model is more complex.</p>
<p>In the next post, I will explain how the model and the meaning of the coefficients change with different link functions.</p>
</div>
<div class="footnotes footnotes-end-of-document">
<hr />
<ol>
<li id="fn1"><p>This one uses <code>glm2</code> package. I think <code>logbin</code> is just a wrapper in this case. I omitted warnings and messages.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>regression</tag>
<tag>binary</tag>
<category>R</category>
</item>
<item>
<title>Why mean substitution is a bad idea, almost always</title>
<link>https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/</link>
<pubDate>Fri, 25 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/index_files/header-attrs/header-attrs.js"></script>
<p>Missing values can cause bias, so most books introduce imputation methods like mean substitution or LOCF (<strong>L</strong>ast <strong>O</strong>bservation <strong>C</strong>arried <strong>F</strong>orward). But in this post, I will explain, with a simple example, why people say unconditional mean substitution is bad.</p>
<div id="mechanisms-of-missingness" class="section level2">
<h2>Mechanisms of Missingness</h2>
<p>Little and Rubin(2002) categorized missingness into three categories.</p>
<ol style="list-style-type: decimal">
<li>MCAR(<strong>M</strong>issing <strong>C</strong>ompletely <strong>A</strong>t <strong>R</strong>andom)</li>
<li>MAR(<strong>M</strong>issing <strong>A</strong>t <strong>R</strong>andom)</li>
<li>NMAR(<strong>N</strong>ot <strong>M</strong>issing <strong>A</strong>t <strong>R</strong>andom)</li>
</ol>
<p>In simple terms, MCAR means missing <strong>unconditionally</strong> at random, MAR means missing <strong>conditionally</strong> at random, and NMAR is neither of the two.</p>
<p>Here is an (unrealistic, but) simple example model.</p>
<p><span class="math display">\[\textrm{Weight} = 0.48 \times \textrm{Height} + e\]</span></p>
<p>I will cover only the missing weight values in this post. Assume that we can somehow find out what the real value is even if it is missing.</p>
<p>If the missingness of weight is <strong>not related to any variable, including weight itself</strong>, it is called <strong>MCAR</strong>. It is as if missingness were <strong>totally determined by flipping coins</strong>.</p>
<p>If the missingness of weight is <strong>conditional on height (and other variables in the model) but independent of the weight value itself</strong>, it is called <strong>MAR</strong>. The overall distribution of weight can differ depending on missingness, but given the information of height (and the other variables in the model), it is identical: missingness is independent of weight, given height (and the other variables in the model). So we can say missingness is <strong>determined by flipping coins conditional on the value of height (and the other variables in the model)</strong>.</p>
<p>If we digest the above using some basic probability rules, we reach the conclusion below.</p>
<blockquote>
<p>Missing value distribution is not different from observed value distribution, given the value of other variables in the model.</p>
</blockquote>
<p>The following might be true.</p>
<p><span class="math display">\[p(y|y_{\textrm{missing}}) \neq p(y|y_{\textrm{observed}})\]</span></p>
<p>But the following holds true.</p>
<p><span class="math display">\[p(y|y_{\textrm{missing}}, x_1, x_2, \cdots, x_p) = p(y|y_{\textrm{observed}}, x_1, x_2, \cdots, x_p)\]</span></p>
<p>It is like flipping a coin again, but it could be a different coin for different values of the explanatory variables.</p>
<p>If it is neither MCAR nor MAR, it is called <strong>NMAR</strong>. For example, if the missingness of weight is dependent on weight itself, it is NMAR.</p>
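<p>Here is a minimal simulation sketch, with made-up numbers based on the toy weight-height model above, showing what the missingness indicator depends on under each mechanism.</p>
<pre class="r"><code>set.seed(1)
n &lt;- 1000
height &lt;- rnorm(n, 170, 15)
weight &lt;- 0.48 * height + rnorm(n, 0, 7)
# MCAR: a plain coin flip, unrelated to any variable
miss_mcar &lt;- runif(n) &lt; 0.3
# MAR: the coin depends on height (observed), not on the weight value
miss_mar  &lt;- runif(n) &lt; plogis((height - 170) / 10)
# NMAR: the coin depends on the weight value itself
miss_nmar &lt;- runif(n) &lt; plogis((weight - 81.6) / 5)
c(mean(miss_mcar), mean(miss_mar), mean(miss_nmar))</code></pre>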
<p>For predictive models, it is sufficient to check whether missingness is random conditional on the other observed variables. But for causal models, we should also consider whether there are unobserved variables that might cause or be related to missingness. For now, I will consider only predictive models.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="why-not-just-use-complete-data" class="section level2">
<h2>Why not just use complete data?</h2>
<p>The main problem is <strong>the bias</strong> introduced by missing data. As the number of missing values increases, the bias can become huge. Another problem is the decreasing sample size: a smaller sample size means <strong>less power</strong>.</p>
</div>
<div id="how-was-missing-data-handled-traditionally" class="section level2">
<h2>How was missing data handled, traditionally?</h2>
<ol style="list-style-type: decimal">
<li>Listwise Deletion : Use only complete data</li>
<li>Pairwise Deletion : Use all data available for each analysis</li>
<li>Unconditional mean substitution</li>
<li>Regression Imputation(Conditional mean substitution)</li>
<li>Stochastic Regression Imputation</li>
<li>Hot-Deck Imputation</li>
<li>Last Observation Carried Forward</li>
</ol>
<p><strong>Listwise Deletion</strong> means using only complete data: any record with at least one missing value is ignored.</p>
<p><strong>Pairwise Deletion</strong> means using all available data for each estimate. Let’s say we need to compute the covariance matrix of variables <span class="math inline">\(X_1\)</span> , <span class="math inline">\(X_2\)</span> , and <span class="math inline">\(Y\)</span> . We need to compute the covariance of each pair, and we can use records with missing <span class="math inline">\(Y\)</span> when computing the covariance of <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> . This method can produce a covariance matrix estimate that is not positive definite.</p>
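<p>As a small sketch of the difference (using simulated data, not data from this post), base R’s <code>cov()</code> supports both strategies through its <code>use=</code> argument.</p>
<pre class="r"><code>set.seed(2)
d &lt;- data.frame(x1 = rnorm(50), x2 = rnorm(50), y = rnorm(50))
d$y[1:20] &lt;- NA                         # some y values are missing
cov(d, use = &quot;complete.obs&quot;)          # listwise deletion
cov(d, use = &quot;pairwise.complete.obs&quot;) # pairwise deletion</code></pre>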
<p><strong>Unconditional mean substitution</strong> imputes missing values with the variable mean.</p>
<p><strong>Regression Imputation</strong> uses regression analysis and imputes missing values with the regression mean.</p>
<p><strong>Stochastic Regression Imputation</strong> also uses regression analysis, but imputes missing values with an additional stochastic error term added to the regression mean.</p>
<p><strong>Hot-Deck Imputation</strong> imputes missing values with values from other complete records. Wikipedia describes it as follows.</p>
<blockquote>
<p>A once-common method of imputation was hot-deck imputation where a missing value was imputed from a <strong>randomly selected similar record</strong>. The term “hot deck” dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients.</p>
</blockquote>
<p>Statisticians at the Census Bureau originally developed the hot-deck to deal with missing data in public-use data sets, and the procedure has a long history in survey applications (Scheuren, 2005; Enders, 2010).</p>
<p><strong>Last Observation Carried Forward</strong> imputes a missing value with the last observed value for the same group in longitudinal data.</p>
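<p>A minimal LOCF sketch on hypothetical longitudinal data, here using <code>tidyr::fill()</code> (one of several ways to do it):</p>
<pre class="r"><code>library(dplyr)
library(tidyr)
long &lt;- data.frame(id   = c(1, 1, 1, 2, 2, 2),
                   time = c(1, 2, 3, 1, 2, 3),
                   y    = c(5.1, NA, NA, 3.0, 3.4, NA))
long %&gt;%
  group_by(id) %&gt;%
  fill(y, .direction = &quot;down&quot;) %&gt;%  # carry the last observed y forward
  ungroup()</code></pre>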
<p>Which of the missing data methods above is unbiased depends on <strong>whether the missingness is MCAR, MAR, or NMAR</strong> and on <strong>what one is estimating</strong>. For instance, for estimating the regression parameter <span class="math inline">\(b_1\)</span> of <span class="math inline">\(y=b_0 + b_1 x_1\)</span>, listwise deletion is fairly appropriate for MCAR or MAR data, unless we care much about losing power. But estimating the mean of <span class="math inline">\(y\)</span> by averaging only the observed <span class="math inline">\(y\)</span> could be seriously biased for MAR data if we use listwise deletion.</p>
<p>The deletion or imputation methods above result in complete data, so we can then use complete-data analysis methods. That is why we prefer imputation or deletion over other special methods developed for dealing with missing data.</p>
</div>
<div id="estimating-mean-y-mar-listwise-deletion" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MCAR &amp; listwise deletion</h2>
<p>If <span class="math inline">\(y\)</span> is missing completely at random, estimating the mean of <span class="math inline">\(y\)</span> using only the observed data is okay because <span class="math inline">\(p(y|y_\textrm{missing}) = p(y|y_\textrm{observed})\)</span> .</p>
</div>
<div id="estimating-mean-y-mar-mean-substitution" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MAR &amp; listwise deletion or mean substitution</h2>
<p>If <span class="math inline">\(y\)</span> is missing conditionally at random, using only the observed data could be problematic because <span class="math inline">\(p(y|y_\textrm{missing}) \neq p(y|y_\textrm{observed})\)</span> . Let’s say <span class="math inline">\(p(\textrm{missing}) \sim \textrm{height}\)</span> : the probability that weight is missing increases as height increases. In that case, the probability that weight is missing also increases as weight increases. So just using the complete data means deleting higher weight values and introducing a bias in estimating the mean weight, and substituting the missing values with the observed mean leaves that bias in place.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></p>
</div>
<div id="estimating-mean-y-mar-mean-substitution-1" class="section level2">
<h2>Estimating mean <span class="math inline">\(y\)</span> : MAR &amp; regression imputation (conditional mean substitution)</h2>
<p>So we would do better to use the regression mean (the estimated mean weight given the height). Using the conditional mean for missing data might lead to a variance estimate that is too small, but the mean estimate is not biased.</p>
<ul>
<li>[<strong>AD</strong>] Book for <strong>R power users</strong> : <a href="http://books.sumeun.org/?p=190">Data Analysis with R: Data Preprocessing and Visualization</a></li>
</ul>
</div>
<div id="simulation" class="section level2">
<h2>Simulation</h2>
<div id="data" class="section level3">
<h3>Data</h3>
<pre class="r"><code>library(dplyr)
library(tidyr)
library(ggplot2)
# sample size 100
n &lt;- 100
# height mean 170, std 15
h &lt;- rnorm(100, 170, 15)
# true relation : weight = 0.48 * height
# given height, weight distribution N(0,7^2)
w &lt;- 0.48 * h + rnorm(n, 0, 7)
# weight population mean
w_pop &lt;- 170*0.48 # 81.6
h_pop &lt;- 170
# missing is dependent on height
w_missing &lt;- runif(n, 0, 1) &lt; (h-min(h))/(max(h)-min(h))
dat = data.frame(h=h,
w=ifelse(w_missing, NA, w),
w_complete = w,
w_missing = w_missing)
#dat %&gt;% gather()
ggplot(dat, aes(x=h, y=w_complete, col=factor(w_missing))) +
geom_point() +
scale_color_manual(values=c(&#39;black&#39;, &#39;grey&#39;))</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/25/why-single-mean-imputation-is-a-bad-idea-almost-always/index_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="listwise-deletion" class="section level3">
<h3>Listwise deletion</h3>
<p>The weight mean estimated from only the observed cases seems biased compared with the mean computed from all weights.</p>
<pre class="r"><code>## average weight?
mean(dat$w_complete)</code></pre>
<pre><code>## [1] 79.86512</code></pre>
<pre class="r"><code>wmean_est = mean(dat$w, na.rm = TRUE)
wmean_est</code></pre>
<pre><code>## [1] 76.49322</code></pre>
<pre class="r"><code>t.test(dat$w)</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w
## t = 58.46, df = 47, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 73.86092 79.12552
## sample estimates:
## mean of x
## 76.49322</code></pre>
</div>
<div id="mean-substitution" class="section level3">
<h3>Mean substitution</h3>
<p>A simple but bad alternative, in terms of bias for MAR data, is mean substitution.</p>
<pre class="r"><code>w_mean &lt;- mean(dat$w, na.rm=TRUE)
dat$w_imputed &lt;- ifelse(dat$w_missing, w_mean, w)
wmean_est = mean(dat$w_imputed)
wmean_est</code></pre>
<pre><code>## [1] 76.49322</code></pre>
<pre class="r"><code>res &lt;- t.test(dat$w_imputed)
res</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_imputed
## t = 122.46, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 75.25384 77.73260
## sample estimates:
## mean of x
## 76.49322</code></pre>
<p>The population weight mean is 81.6 but the estimated mean is 76.49. The confidence interval is 75.25-77.73.</p>
</div>
<div id="regression-imputation" class="section level3">
<h3>Regression imputation</h3>
<p>We can use a regression model to impute the missing values.</p>
<pre class="r"><code>mod &lt;- lm(w ~ h)
w_hat &lt;- predict(mod, dat)
#w_hat &lt;- coef(mod) %*% rbind(1,dat$h)
dat$w_imputed &lt;- ifelse(dat$w_missing, w_hat, w)
wmean_est = mean(dat$w_imputed)
wmean_est</code></pre>
<pre><code>## [1] 79.89369</code></pre>
<pre class="r"><code>res &lt;- t.test(dat$w_imputed)
res</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_imputed
## t = 94.065, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 78.20839 81.57898
## sample estimates:
## mean of x
## 79.89369</code></pre>
<p>The estimated weight mean is 79.89 and the 95% confidence interval is 78.21-81.58. Compare this with the mean of the fully observed weights (as if nothing were missing) below.</p>
<pre class="r"><code>mean(dat$w_complete)</code></pre>
<pre><code>## [1] 79.86512</code></pre>
<pre class="r"><code>t.test(dat$w_complete)</code></pre>
<pre><code>##
## One Sample t-test
##
## data: dat$w_complete
## t = 81.261, df = 99, p-value &lt; 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 77.91499 81.81525
## sample estimates:
## mean of x
## 79.86512</code></pre>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Here are some explanatory posts and a paper that show how the causal model can be beneficial to understanding missing mechanisms: <a href="https://www.rdatagen.net/post/musings-on-missing-data/">Musings on missing data</a>, <a href="http://jakewestfall.org/blog/index.php/2017/08/22/using-causal-graphs-to-understand-missingness-and-how-to-deal-with-it/">Using causal graphs to understand missingness and how to deal with it</a>, <a href="https://proceedings.neurips.cc/paper/2013/file/0ff8033cf9437c213ee13937b1c4c455-Paper.pdf">Graphical Models for Inference with Missing Data</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>In fact, simulation studies suggest that mean imputation is possibly the worst missing data handling method available(Enders, 2010).<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>missing</tag>
<category>R</category>
</item>
<item>
<title>measurement units</title>
<link>https://kwhkim.github.io/maxR/2022/03/15/measurement-units/</link>
<pubDate>Tue, 15 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/15/measurement-units/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/15/measurement-units/index_files/header-attrs/header-attrs.js"></script>
<p>Data <code>mtcars</code> has a column named <code>mpg</code>. <code>mpg</code> means <strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon. ‘Mile’ and ‘gallon’ are units for length and volume: a mile is approximately 1.6 kilometers and a gallon is approximately 3.7 liters. Mile and gallon sound unfamiliar to people who live outside the U.K. or the U.S.A. because the international standard units for length and volume are the meter and the liter.</p>
<p>In this post, we will learn how to convert a unit to another unit, for instance, we will convert mpg to km/L, which is more comprehensible to people who use SI units.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<div id="units-in-r" class="section level2">
<h2>Units in R</h2>
<p>Vectors (the most common data structure in R) do not carry information about measurement units. Units are implicit, and it is up to the user to convert them. But as history tells us, unit conversion should be treated carefully because it can cause serious damage to a whole project<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.</p>
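<p>For comparison, a manual conversion looks like the sketch below (the factors are exact by definition: 1 international mile = 1.609344 km and 1 US liquid gallon = 3.785411784 L); the package introduced next does this for us and keeps track of the unit.</p>
<pre class="r"><code>data(mtcars)
kmL_per_mpg &lt;- 1.609344 / 3.785411784  # km/L corresponding to 1 mpg (US)
head(mtcars$mpg * kmL_per_mpg)         # manual equivalent of the converted values shown below</code></pre>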
</div>
<div id="package-units" class="section level2">
<h2>Package <code>units</code></h2>
<p>Using the package <code>units</code>, we can convert units easily and accurately. And when the data are plotted, the units are included in the x- or y-axis label automatically.</p>
<p>First install package <code>units</code>.</p>
<pre class="r"><code>install.packages(&#39;units&#39;)</code></pre>
<p>And load all the necessary packages and data</p>
<pre class="r"><code>library(units)</code></pre>
<pre><code>## udunits database from C:/Users/Seul/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml</code></pre>
<pre class="r"><code>library(dplyr, warn.conflicts = FALSE)
data(mtcars)</code></pre>
<p>To get the information about the data <code>mtcars</code>, we can do <code>help(mtcars)</code>. It will show the measurement unit for each column. <code>mpg</code> is measured in unit of <strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon, <code>disp</code> is measured in unit of cubic inch, <code>hp</code> is measured in unit of gross <strong>h</strong>orse<strong>p</strong>ower, <code>wt</code> is measured in unit of 1000 lbs, and <code>qsec</code> is measured in unit of sec per 1/4 mile.</p>
<p>It is sad that mpg(<strong>m</strong>iles <strong>p</strong>er <strong>g</strong>allon) is not registered in the package <code>units</code>, but we can register it ourselves. The code below installs a new unit called <code>mpg_US</code>, defined as <code>international_mile/US_liquid_gallon</code>.</p>
<pre class="r"><code>install_unit(name=&#39;mpg_US&#39;, def=&#39;international_mile/US_liquid_gallon&#39;)</code></pre>
<p>Now we can use <code>mpg_US</code>. Below we set the unit of <code>mtcars$mpg</code> to mpg (US) and the unit of <code>mtcars$wt</code> to kilograms.</p>
<pre class="r"><code>units(mtcars$mpg) = &#39;mpg_US&#39;
units(mtcars$wt) = &#39;kg&#39;
mtcars$mpg %&gt;% head</code></pre>
<pre><code>## Units: [mpg_US]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1</code></pre>
<p>If we want to convert the unit mpg (US) to the metric unit km/L, we do the following.<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a></p>
<pre class="r"><code>units(mtcars$mpg) = &#39;km/L&#39;
mtcars$mpg %&gt;% head</code></pre>
<pre><code>## Units: [km/L]
## [1] 8.928017 8.928017 9.693276 9.098075 7.950187 7.695101</code></pre>
<p>We can easily plot the relation between <code>mpg</code> and <code>wt</code> using the package <code>ggplot2</code>. But do not forget to load <code>ggforce</code> beforehand.</p>
<pre class="r"><code>library(ggplot2)
library(ggforce) # without this, the code below will raise error!</code></pre>
<pre><code>## Registered S3 method overwritten by &#39;ggforce&#39;:
## method from
## scale_type.units units</code></pre>
<pre class="r"><code>ggplot(data=mtcars,
aes(x=mpg, y=wt)) +
geom_point()</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/15/measurement-units/index_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="summary" class="section level2">
<h2>Summary</h2>
<ul>
<li>Use <code>units::units()&lt;-</code> to set the unit of a measurement.
<ul>
<li>Use <code>units::units()&lt;-</code> to convert units.</li>
<li>Use <code>units::units()&lt;-NULL</code> to delete the unit.</li>
</ul></li>
<li>Use <code>install_unit(name=, def=)</code> for introducing new units.</li>
<li>Use <code>valid_udunits()</code> to show all the units available from the package <code>units</code>.</li>
</ul>
<pre class="r"><code>valid_udunits() %&gt;% head </code></pre>
<pre><code>## udunits database from C:/Users/Seul/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml</code></pre>
<pre><code>## # A tibble: 6 x 11
## symbol symbol_aliases name_singular name_singular_aliases name_plural
## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
## 1 m &quot;&quot; meter &quot;metre&quot; &quot;&quot;
## 2 kg &quot;&quot; kilogram &quot;&quot; &quot;&quot;
## 3 s &quot;&quot; second &quot;&quot; &quot;&quot;
## 4 A &quot;&quot; ampere &quot;&quot; &quot;&quot;
## 5 K &quot;&quot; kelvin &quot;&quot; &quot;&quot;
## 6 mol &quot;&quot; mole &quot;&quot; &quot;&quot;
## # ... with 6 more variables: name_plural_aliases &lt;chr&gt;, def &lt;chr&gt;,
## # definition &lt;chr&gt;, comment &lt;chr&gt;, dimensionless &lt;lgl&gt;, source_xml &lt;chr&gt;</code></pre>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="https://en.wikipedia.org/wiki/International_System_of_Units">International System of Units</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>It is well known that the failure of MCO(<strong>M</strong>ars <strong>C</strong>limate <strong>O</strong>rbiter) is due to inadequate unit coversion.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p>If the objective is simply to reset the unit, do <code>units(mtcars$mpg)=NULL; units(mtcars$mpg)='km/L'</code>. This will not convert unit but just replace the unit with another unit.<a href="#fnref3" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>visualization</tag>
<tag>preprocessing</tag>
<category>R</category>
</item>
<item>
<title>Better Visualization of y|x for Big Data</title>
<link>https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/</link>
<pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/header-attrs/header-attrs.js"></script>
<div id="plotting-big-data-and-alpha" class="section level2">
<h2>Plotting Big Data and Alpha</h2>
<p>When plotting too many data points, we use <code>alpha=</code> because points overlap and become indistinguishable.</p>
<pre class="r"><code>library(dplyr)
library(data.table)
library(cowplot)
library(ggplot2)
#N &lt;- 1000
N &lt;- 1000000
x &lt;- rnorm(N)
y &lt;- x + rnorm(N)
dat &lt;- data.table(x = x,
y = y)</code></pre>
<pre class="r"><code>dat %&gt;% ggplot(aes(x=x, y=y)) +
geom_point() +
labs(title=&#39;Original plot&#39;)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-2-1.png" width="70%" /></p>
<pre class="r"><code>dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;Using alpha=0.01&#39;)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-3-1.png" width="70%" /></p>
<p>The smallest meaningful <code>alpha</code> for <code>ggplot2</code> seems to be <code>0.01</code>. For very big data, <code>alpha=0.01</code> is not small enough. Looking at the plot above, we see a big black region in the center. This might mean that the densities in the center are the same, or it might mean that they have reached the ceiling of blackness even though the densities are not equal.</p>
<p>A bivariate normal distribution is too simple. Let’s try more complex data.</p>
<pre class="r"><code>x1 &lt;- rnorm(N/2)
y1 &lt;- 2*sin(x1) + rnorm(N/2)
x2 &lt;- rnorm(N/2)
y2 &lt;- 2*cos(x2) + rt(N/2, df=30)
dat &lt;- data.table(x=c(x1,x2),
y=c(y1,y2))</code></pre>
<div id="using-multiple-alphas" class="section level3">
<h3>Using multiple <code>alpha</code>s</h3>
<p>We can use multiple <code>alpha</code>s to avoid the ceiling effect of a constant <code>alpha</code>.</p>
<pre class="r"><code>p1 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point() +
labs(title=&#39;alpha=1&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.1) +
labs(title=&#39;alpha=0.1&#39;) + theme_minimal()
p3 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.05) +
labs(title=&#39;alpha=0.05&#39;) + theme_minimal()
p4 &lt;- dat %&gt;% ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01&#39;) + theme_minimal()
plot_grid(p1,p2,p3,p4,ncol=2)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-5-1.png" width="70%" /></p>
<p>But using the minimal <code>alpha=0.01</code> does not reveal the density differences in the center. We can try sampling in this case.</p>
</div>
<div id="sampling" class="section level3">
<h3>Sampling</h3>
<pre class="r"><code>p1 &lt;- dat %&gt;% sample_n(N/5) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 20% of data&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% sample_n(N/10) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 10% of data&#39;) + theme_minimal()
p3 &lt;- dat %&gt;% sample_n(N/50) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 2% of data&#39;) + theme_minimal()
p4 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
labs(title=&#39;alpha=0.01 with 1% of data&#39;) + theme_minimal()
library(cowplot)
plot_grid(p1,p2,p3,p4,ncol=2)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-6-1.png" width="70%" /></p>
<p>But sampling uses only part of the data, and it depends on chance, so the results are different every time we plot.</p>
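<p>One small mitigation, sketched below: fixing the random seed before sampling makes a given sampled plot reproducible, though it is still only a sample.</p>
<pre class="r"><code>set.seed(2022)  # any fixed seed makes the sample, and hence the plot, reproducible
dat %&gt;% sample_n(N/100) %&gt;%
  ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) + theme_minimal()</code></pre>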
</div>
</div>
<div id="contidional-density-plot" class="section level2">
<h2>Conditional density plot</h2>
<p>There are several reasons for plotting. One is doing EDA(<strong>E</strong>xploratory <strong>D</strong>ata <strong>A</strong>nalysis) before fitting a regression model such as a linear model, ML, or DL.</p>
<p>The important thing in this case is to see what the conditional density <span class="math inline">\(\mathbb{p}(y|x)\)</span> is like, whereas all the plots above focus on the bivariate density.</p>
<p>To visualize the expectation of <span class="math inline">\(y\)</span> conditional on <span class="math inline">\(x\)</span> , a non-parametric regression line like the following will help.</p>
<div id="regression-line" class="section level3">
<h3>Regression line</h3>
<pre class="r"><code>p1 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
geom_smooth(method=&#39;loess&#39;) +
labs(title=&#39;alpha=0.01 with 1% of data, loess&#39;) + theme_minimal()
p2 &lt;- dat %&gt;% sample_n(N/100) %&gt;%
ggplot(aes(x=x, y=y)) + geom_point(alpha=0.01) +
geom_smooth(method=&#39;auto&#39;) +
labs(title=&#39;alpha=0.01 with 1% of data, gam&#39;) + theme_minimal()
print(p1)</code></pre>
<pre><code>## `geom_smooth()` using formula &#39;y ~ x&#39;</code></pre>
<pre class="r"><code>print(p2)</code></pre>
<pre><code>## `geom_smooth()` using method = &#39;gam&#39; and formula &#39;y ~ s(x, bs = &quot;cs&quot;)&#39;</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-7-1.png" width="70%" /><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-7-2.png" width="70%" /></p>
<p>We can definitely see conditional expectation( <span class="math inline">\(\mathbb{E}[y|x] = \int y\ \mathbb{p}(y|x) dy\)</span> ), but we cannot figure out what the conditional density would be like.</p>
</div>
<div id="conditional-density" class="section level3">
<h3>Conditional density</h3>
<div id="binning-x" class="section level4">
<h4>binning <span class="math inline">\(x\)</span></h4>
<p>As we saw above, using a small constant <code>alpha</code> prevents us from identifying density differences where the points are densely packed, and from identifying individual data points where the data are scarce. One possible solution is binning <span class="math inline">\(x\)</span> and sampling within each bin.</p>
<pre class="r"><code>dat %&gt;%
mutate(xCut = cut(x, breaks=10)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 10000, replace=TRUE)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-8-1.png" width="70%" /></p>
<p>This visualizes the conditional density better, but we can see artifacts. It must be because the bin size is too big. Let’s try a smaller bin size.</p>
<pre class="r"><code>dat %&gt;%
mutate(xCut = cut(x, breaks=50)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 2000, replace=TRUE)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01)</code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-9-1.png" width="70%" /></p>
</div>
<div id="estimating-density-of-x" class="section level4">
<h4>Estimating density of <span class="math inline">\(x\)</span></h4>
<p>We can never say what the best bin size is. We would do better to estimate the probability density function of <span class="math inline">\(x\)</span>.</p>
<p>We also treated every <span class="math inline">\(x\)</span> as identical. We might take the estimated probability density of <span class="math inline">\(x\)</span> into consideration, using either the density itself or some function (e.g. <span class="math inline">\(\log\)</span> ) of it as a sampling weight.</p>
<pre class="r"><code>xDensity &lt;- ks::kde(dat$x)
dat$prob &lt;- predict(xDensity, x = dat$x)
#head(dat)
dat %&gt;%
mutate(xCut = cut(x, breaks=50)) %&gt;%
group_by(xCut) %&gt;%
do(sample_n(., 2000, replace=TRUE, weight=1/prob)) %&gt;%
ggplot(aes(x=x, y=y)) +
geom_point(alpha=0.01) </code></pre>
<p><img src="https://kwhkim.github.io/maxR/2022/03/06/better-visualization-of-y-x-for-big-data/index_files/figure-html/unnamed-chunk-10-1.png" width="70%" /></p>
</div>
</div>
</div>
</description>
<tag>big data</tag>
<tag>visualization</tag>
<category>R</category>
</item>
<item>
<title>character in UTF-8</title>
<link>https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/</link>
<pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/</guid>
<description>
<script src="https://kwhkim.github.io/maxR/2022/03/06/character-in-utf-8/index_files/header-attrs/header-attrs.js"></script>
<div id="encoding" class="section level2">
<h2>Encoding</h2>
<p>A computer can store data only with 0s and 1s. By putting together many 0s and 1s, a computer can represent bigger numbers. But if it wants to store a letter, it needs a mapping between numbers and letters. This mapping is called “<strong>encoding</strong>”.</p>
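<p>In R, this number-letter mapping can be inspected directly with base functions; a quick sketch:</p>
<pre class="r"><code>utf8ToInt(&quot;A&quot;)    # the code point of &quot;A&quot; is 65 (hex 41)
intToUtf8(65)       # and 65 maps back to &quot;A&quot;
intToUtf8(0xD55C)   # code point U+D55C is the Hangul syllable 한</code></pre>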
<p>The <strong>encoding</strong> depends on the letters to store, and the letters people use differ across countries and languages. There are over 1000 encodings worldwide, but over 90% of the pages on the internet use UTF-8<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>.</p>
</div>
<div id="unicode" class="section level2">
<h2>Unicode</h2>
<p>We are in the internet era. It has become ordinary to send documents across national borders. But encodings were usually made for use within one country, so documents from a foreign country might not be read properly because the encoding was different<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.</p>
<p>Unicode was developed for this kind of problem. Unicode tries to have a mapping for all the characters that exist today or have existed throughout history. The Unicode Consortium<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> is a non-profit organization that develops Unicode. The number that a character maps to is called its <strong>code point</strong>. You can look up code points in the <strong>Unicode Code Charts</strong><a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>. As the Unicode version increases, the number of characters that Unicode can represent also increases<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a>.</p>
<p>Version 1.0.0 (1991) supported 7,129 characters from 24 scripts. Version 14 (2021) includes 144,697 characters from 159 scripts. Version 6.0 (2010) decided to support emojis because cellular phone makers demanded them; otherwise emojis could have evolved into different encodings for different phone makers, which is exactly the kind of fragmentation Unicode was created to avoid. Unicode tries to incorporate all the characters worldwide so that any encoding can be converted to Unicode. Unicode is materialized into specific encoding schemes such as UTF-8, UTF-16, and UTF-32; these encoding schemes add an additional layer that determines how code points are stored as bytes.</p>
</div>
<div id="character-in-r" class="section level2">
<h2><code>character</code> in R</h2>
<p>R supports UTF-8. If a <code>character</code> is not in UTF-8, it can be converted to UTF-8 using the following code.</p>
<pre class="r"><code>x = &#39;한글&#39; # Hangul in Korean
y = iconv(x, to=&#39;UTF-8&#39;)
x
## [1] &quot;한글&quot;
y
## [1] &quot;한글&quot;
Encoding(x)
## [1] &quot;unknown&quot;
Encoding(y)
## [1] &quot;UTF-8&quot;</code></pre>
<p>UTF-8 is powerful. It can represent almost any character. But fonts are limited: most fonts cannot display all characters. So there are cases where characters stored in a vector cannot be displayed properly because the font in use does not support them.</p>
<p>I propose the following function <code>u_chars()</code> for inspecting UTF-8 characters.</p>
</div>
<div id="u_chars" class="section level2">
<h2><code>u_chars</code></h2>
<p>The following function <code>u_chars()</code> utilizes the package <code>Unicode</code> and prints out information about each character in a string. Unicode characters have labels, so characters that cannot be displayed properly can still be identified.</p>
<pre class="r"><code>u_chars = function(s, encodings) {
stopifnot(class(s) == &quot;character&quot;)
stopifnot(length(s)==1)
if (Encoding(s) == &quot;unknown&quot;) {
s = iconv(s, to = &#39;UTF-8&#39;)
} else if (Encoding(s) != &#39;UTF-8&#39;) {
s = iconv(s, from = Encoding(s), to=&#39;UTF-8&#39;) }
dat = data.frame(ch = unlist(strsplit(s, &quot;&quot;))) # split characters
cps = sapply(dat$ch, utf8ToInt) # unicode codepoint
cps_hex = sprintf(&quot;%02x&quot;, cps) # convert to hexadecimal number
# hexadecimal code points are displayed in one of the following styles &quot; ..ff&quot;, &quot; a1ff&quot;, &quot;011f3e&quot;
# the first two digits are rarely used, so they are shown blank when they are 00
# the following two digits are 00 when the code point is in ASCII, so they are shown as ..
cps_hex =
ifelse(nchar(cps_hex) &gt; 2,
stringi::stri_pad(cps_hex, width = 4, side = &#39;left&#39;, pad = &#39;0&#39;),
stringi::stri_pad(cps_hex, width = 4, side = &#39;left&#39;, pad = &#39;.&#39;))
dat$codepoint =
ifelse(nchar(cps_hex) &gt; 4,
stringi::stri_pad(cps_hex, width=6, side=&#39;left&#39;, pad=&#39;0&#39;),
stringi::stri_pad(cps_hex, width=6, side=&#39;left&#39;, pad=&#39; &#39;))
# if given encodings=
if (!missing(encodings)) {
for (encoding in encodings) {
ch_enc = vector(mode=&#39;character&#39;, length=nrow(dat))
for (i in 1:nrow(dat)) {
ch = dat$ch[i]
ch_enc[i] =
paste0(sprintf(&quot;%02x&quot;,
as.integer(unlist(
iconv(ch, from = &#39;UTF-8&#39;,
to=encoding, toRaw=TRUE)))),
collapse = &#39; &#39;)
}
dat$enc = ch_enc
names(dat)[length(names(dat))] = paste0(&#39;enc.&#39;, encoding)
}
}
dat$label = Unicode::u_char_label(cps);
dat
}</code></pre>
<pre class="r"><code>u_chars(&quot;\ufeff\u0041\ub098\u2211\U00010384&quot;)</code></pre>
<pre><code>## ch codepoint label
## 1 &lt;U+FEFF&gt; feff ZERO WIDTH NO-BREAK SPACE
## 2 A ..41 LATIN CAPITAL LETTER A
## 3 나 b098 HANGUL SYLLABLE NA
## 4 ∑ 2211 N-ARY SUMMATION
## 5 &lt;U+00010384&gt; 010384 UGARITIC LETTER DELTA</code></pre>
<p>You can see how the characters would be encoded in other encoding schemes using <code>encodings =</code>.</p>
<pre class="r"><code>u_chars(&quot;\ufeff\u0041똠\u2211\U00010384&quot;, encodings = c(&quot;CP949&quot;, &quot;latin1&quot;))</code></pre>
<pre><code>## ch codepoint enc.CP949 enc.latin1 label
## 1 &lt;U+FEFF&gt; feff ZERO WIDTH NO-BREAK SPACE
## 2 A ..41 41 41 LATIN CAPITAL LETTER A
## 3 똠 b620 8c 63 HANGUL SYLLABLE DDOM
## 4 ∑ 2211 a2 b2 N-ARY SUMMATION
## 5 &lt;U+00010384&gt; 010384 UGARITIC LETTER DELTA</code></pre>
</div>
<div id="another-application" class="section level2">
<h2>Another application</h2>
<pre class="r"><code>library(stringi)
x = &#39;\u0423\u043a\u0440\u0430\u0457\u043d\u0430&#39;
Encoding(x)
## [1] &quot;UTF-8&quot;
#x = iconv(x, to=&#39;UTF-8&#39;)
cat(x); cat(&#39;\n&#39;)
## Укра&lt;U+0457&gt;на
y = stri_trans_nfd(x)
cat(y); cat(&#39;\n&#39;)
## Укра&lt;U+0456&gt;&lt;U+0308&gt;на
u_chars(x)
## ch codepoint label
## 1 У 0423 CYRILLIC CAPITAL LETTER U
## 2 к 043a CYRILLIC SMALL LETTER KA
## 3 р 0440 CYRILLIC SMALL LETTER ER
## 4 а 0430 CYRILLIC SMALL LETTER A
## 5 &lt;U+0457&gt; 0457 CYRILLIC SMALL LETTER YI
## 6 н 043d CYRILLIC SMALL LETTER EN
## 7 а 0430 CYRILLIC SMALL LETTER A
u_chars(y)
## ch codepoint label
## 1 У 0423 CYRILLIC CAPITAL LETTER U
## 2 к 043a CYRILLIC SMALL LETTER KA
## 3 р 0440 CYRILLIC SMALL LETTER ER
## 4 а 0430 CYRILLIC SMALL LETTER A
## 5 &lt;U+0456&gt; 0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
## 6 &lt;U+0308&gt; 0308 COMBINING DIAERESIS
## 7 н 043d CYRILLIC SMALL LETTER EN
## 8 а 0430 CYRILLIC SMALL LETTER A</code></pre>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="https://w3techs.com/technologies/cross/character_encoding/ranking" class="uri">https://w3techs.com/technologies/cross/character_encoding/ranking</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p><a href="https://en.wikipedia.org/wiki/Mojibake" class="uri">https://en.wikipedia.org/wiki/Mojibake</a><a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p><a href="https://en.wikipedia.org/wiki/Unicode_Consortium" class="uri">https://en.wikipedia.org/wiki/Unicode_Consortium</a><a href="#fnref3" class="footnote-back">↩︎</a></p></li>
<li id="fn4"><p><a href="http://www.unicode.org/charts/" class="uri">http://www.unicode.org/charts/</a><a href="#fnref4" class="footnote-back">↩︎</a></p></li>
<li id="fn5"><p>There are several reasons for this. Unicode can embrace new languages or new characters can be found for only embraced languages.<a href="#fnref5" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
<tag>character</tag>
<tag>preprocessing</tag>
<tag>encoding</tag>
<category>R</category>
</item>
<item>
<title>About</title>
<link>https://kwhkim.github.io/maxR/about/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://kwhkim.github.io/maxR/about/</guid>
<description><h2 id="greetings">Greetings!</h2>
<p>Welcome to &ldquo;R to the max!&rdquo; This website is about R for data analysis. The posts are mostly translated from <a href="http://ds.sumeun.org/">Sumeun Data Science</a>. Here is some info about the maintainer of the site. Check out the links.</p>
<h3 id="education">Education</h3>
<ul>
<li>
<p>Seoul National University. B.S. in Physics.</p>
</li>
<li>
<p>Seoul National University. Ph.D. in Cognitive Science.</p>
</li>
</ul>
<h3 id="books">Books</h3>
<ul>
<li>
<p>Kim, K. H. (2022). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9791196014445&amp;orderClick=LEa&amp;Kc=">R로 하는 빅데이터 분석: 데이터 전처리와 시각화.</a> Big Data Analysis with R: Data preprocessing and visualization, 3rd. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H. (2019). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9791196014407&amp;orderClick=LAG&amp;Kc=">고등학교 인수분해 완전 정복.</a> Conquering high school factoring. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H., Kwak, M. Y., Lee, C. S. (2017). <a href="http://www.yes24.com/Product/Goods/43244145">수학의 숨은 원리.</a> Hidden principles of Mathematics. Sumeun. Seoul.</p>
</li>
<li>
<p>Kim, K. H. (2013). <a href="http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&amp;mallGb=KOR&amp;barcode=9788961057103&amp;orderClick=LAG&amp;Kc=">기초 통계학의 숨은 원리 이해하기.</a> Understanding the hidden principles of basic statistics. Kyungmoon. Seoul.</p>
</li>
</ul>
<h3 id="papers">Papers</h3>
<ul>
<li>
<p>Hyosoo Moon, Kwonhyun Kim, Hyun-Soo Lee, Moonseo Park, Trefor P. Williams, Bosik
Son, and Jae-Youl Chun(2020). Cost Performance Comparison of Design-Build and
Design-Bid-Build for Building and Civil Projects Using Mediation Analysis. Journal of Construction Engineering and Management.</p>
</li>
<li>
<p>Kim, Y. H., Jeon, J. H., Choe, E. K., Lee, B., Kim, K., Seo, J. (2016). TimeAware:
Leveraging framing effects to enhance personal productivity. In Proceedings of the
SIGCHI conference on human factors in computing systems.</p>
</li>
</ul>
<h3 id="awards">Awards</h3>
<ul>
<li><a href="https://www.asce.org/career-growth/awards-and-honors/thomas-fitch-rowland-prize">ASCE(<strong>A</strong>merican <strong>S</strong>ociety of <strong>C</strong>ivil <strong>E</strong>ngineers) Thomas Fitch Rowland Prize.</a> <a href="https://www.mk.co.kr/news/society/view/2022/02/170329/">(2022).</a></li>
</ul>
<!---
your comment goes here
and here
## Links
Github](https://github.com/kwhkim)
Twitter](https://twitter.com/kwnhkim)
-->
<h3 id="links">Links</h3>
<ul>
<li>
<p><a href="https://www.bigbookofr.com/?fbclid=IwAR0LFCPsikgV_qgIZOhgHPCJ5ZWsSQbEEPNzm8-EM9ci0IyL5d0Jo3HvYbM">Big Book of R</a></p>
</li>
<li>
<p><a href="https://www.r-bloggers.com/">R-bloggers</a></p>
</li>
</ul>
</description>
</item>
</channel>
</rss>