-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathChapter_2.Rmd
95 lines (70 loc) · 2.91 KB
/
Chapter_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
# Monte Carlo Simulations
```{r, warning=F, message=F}
library(tidyverse)
```
## The Confidence Interval
In this exercise I will try to repeat the example given by [Gerko Vink](https://www.gerkovink.com/markup/Wk1/Solution_to_Ex1.html)
The main idea of this exercise is to illustrate the nature of the *Confidence Interval* as described by [Neyman (1934)](http://www.stat.cmu.edu/~brian/905-2008/papers/neyman-1934-jrss.pdf)
We set a seed to make our results reproducible:
```{r}
set.seed(6465)
```
- The first step is to take 100 samples (in this case of size 800) from a *normal distributuon* with $\mu = 0$ and $\sigma = 1$:
```{r}
samples <-plyr::rlply(100, rnorm(800, 0, 1))
```
- Secondly, we need to calculate for the mean of each sample: the absolute bias; standard error lower bound of the 95% confidence interval and upper bound of the 95% confidence interval.
We can construct a function that does this:
```{r}
samp_function <- function(x) {
m <- mean(x)
n <- length(x)
se <- 1/sqrt(n)
bias <- abs(-0 - m)
df <- n - 1
interval <- qt(.975, df) * se
return(c(m, bias, se, m - interval, m + interval))
}
format <- c("Mean" = 0, "Bias" = 0, "Std.Err" = 0, "Lower" = 0, "Upper" = 0)
```
Now we use the constructed function `samp_function` on all 100 samples contained in the object `samples`. And we also add a new column to the results that indicates which CI of the respective samples does contain $\mu$.
```{r}
results <- samples %>%
vapply(., samp_function, format) %>%
t %>%
as_tibble %>%
mutate(Covered = ifelse(Lower < 0 & Upper > 0, 1, 0))
```
We can also add a table with the sample statistics of the samples whose CI's do not contain $\mu$.
```{r}
results %>%
filter(Covered ==0) %>%
kableExtra::kable(caption = "Here is a table of the samples" )
```
And finally we can also make a nice plot illustrating everything that we did so far.
```{r, fig.cap= "Here is a plot of the CI's"}
lims <- aes(ymax = results$Upper, ymin = results$Lower)
ggplot(results, aes(y=Mean, x=1:100, colour = Covered)) +
geom_hline(aes(yintercept = 0)) +
geom_pointrange(lims) +
xlab("Simulations") +
ylab("Means and 95% Confidence Intervals")
```
In this case only two out of 100 CI's do not include the true population mean.
## The Central Limit Theorem
Here we will also try to illustrate the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), in it's most basic form, with a very simple example.
First we draw 1000 samples (again of size 800), form , say, a *Poisson* distribution, of course we could've drawn them from a uniform or an exponential as well.
```{r}
samples_2 <- samples <-plyr::rlply(1000, rpois(800, 2))
```
Now we calculate the mean for each sample:
```{r}
means <- samples_2 %>%
lapply(., mean) %>%
as.data.frame() %>%
t()
```
And now we plot a histogram of the resulting means:
```{r, fig.cap= "Histogram of the sampling distribution of the mean"}
hist(t(means))
```