# Statistical tests - Part 1
## Comparing 2 groups
### Motivation
We consider again the `decathlon2` dataset. After exploring the dataset and understanding its structure, we want to compare the performances between the two competitions, i.e. the Decastar and the Olympic Games.
```{r}
library(factoextra)
data(decathlon2)
```
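As a first, hedged sketch (assuming, as in the *factoextra* version of the data, that the `Competition` column distinguishes the two events and that `X100m` contains the 100 m times), we can summarise one event per competition:
```{r}
# Mean 100 m time per competition (column names assumed from factoextra's decathlon2)
tapply(decathlon2$X100m, decathlon2$Competition, mean)
boxplot(X100m ~ Competition, data = decathlon2, ylab = "100 m time (s)")
```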
When comparing the distribution of a continuous variable between two groups, we commonly compare the means of the values obtained in both groups. Sample means may look different, but are the samples really drawn from two different distributions?
### Working Example
Let us consider a dataset containing two continuous variables, which we will call `VariableX` and `VariableY`, and a categorical variable indicating which group each sample is from.
```{r}
set.seed(352)
sample.size <- 30
variableX.A <- rnorm(sample.size, mean = 2, sd = 1)
variableX.B <- rnorm(sample.size, mean = 2.4, sd = 1.1)
variableY.A <- rgamma(sample.size, shape = 2, scale = 1)
variableY.B <- rgamma(sample.size, shape = 2, scale = 1.1)
group <- rep(c('A', 'B'), each = sample.size)
dataset <- data.frame(VariableX = c(variableX.A, variableX.B),
                      VariableY = c(variableY.A, variableY.B),
                      group = group, stringsAsFactors = TRUE)
```
We start by plotting boxplots of both variables for each group.
```{r}
#Boxplots for variable X and Y
par(mfrow=c(1,2))
boxplot(VariableX ~ group, data = dataset, col = 'blue', main = 'VariableX')
boxplot(VariableY ~ group, data = dataset, col = 'blue', main = 'VariableY')
```
We can identify differences between the distributions of the two groups 'by eye', but how can we quantify these differences, if there are any? To do so we use statistical tests.
The most common test to compare two groups is the Student's t test, which is named after the Student's t distribution on which it relies. Let's assume that two sets of values are drawn from distributions with means $\mu_1$ and $\mu_2$, and variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$, respectively. The Student's test statistic is computed as follows:
$t = \frac{\overline{x_{1}}-\overline{x_{2}}}{SEDM}$
where the *standard error of difference of means* is:
$SEDM = \sqrt{SEM_{1}^{2}+SEM_{2}^{2}}$
i.e. the square root of the sum of the squared standard errors of the mean (SEM) of each group. The standard error of the mean is itself computed as the standard deviation divided by the square root of the number of samples:
$SEM_{1} = \frac{\sigma_1}{\sqrt{n_1}}$
When using a statistical test two hypotheses are formulated, the null hypothesis $H_0$ and the alternative hypothesis $H_a$, respectively. In the case of the Student's t-test, the hypotheses are:
$H_0$ : $\mu_{1} = \mu_{2}$ and $H_a$ : $\mu_{1} \neq \mu_{2}$
Note that there can be three different alternative hypotheses depending on what we want to test. The test with the alternative hypothesis presented above is called the *two-sided* Student's t test, as we test for different means without making any assumption on the direction of the difference. Two other alternatives can be tested, namely $H_a$ : $\mu_{1} < \mu_{2}$ and $H_a$ : $\mu_{1} > \mu_{2}$. Under the null hypothesis the statistic $t$ follows a Student's t distribution with $n_1 + n_2 - 2$ degrees of freedom if the variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ are equal. If this is not the case, the statistic $t$ can still be approximated by a Student's t distribution with a number of degrees of freedom given by the Welch procedure, which we won't detail here.
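To make the formulas above concrete, here is a minimal sketch computing the Welch version of the statistic by hand from `dataset`; the value should match the `t` reported by `t.test()` below, which uses the Welch procedure by default:
```{r}
# Manual computation of the t statistic for VariableX
x1 <- dataset$VariableX[dataset$group == 'A']
x2 <- dataset$VariableX[dataset$group == 'B']
SEM1 <- sd(x1) / sqrt(length(x1))  # standard error of the mean, group A
SEM2 <- sd(x2) / sqrt(length(x2))  # standard error of the mean, group B
SEDM <- sqrt(SEM1^2 + SEM2^2)      # standard error of the difference of means
(mean(x1) - mean(x2)) / SEDM       # t statistic
```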
Let's apply the Student's t test to our dataset:
```{r}
#t-test for VariableX
test.variableX <- t.test(dataset$VariableX ~ dataset$group)
test.variableX
```
Let's look at the results obtained. We can see that the estimated means are equal to `r test.variableX$estimate[1]` and `r test.variableX$estimate[2]` in group A and group B respectively. This leads to an estimated difference in means of `r test.variableX$estimate[1]-test.variableX$estimate[2]`. The computed test statistic is equal to `r test.variableX$statistic`. The estimated number of degrees of freedom is equal to `r test.variableX$parameter`. We can then plot the density of this Student's t distribution.
```{r}
curve(dt(x, df = test.variableX$parameter), col = 'blue', from = -10,
to = 10, ylab = 'density')
abline(v = qt(c(0.025, 0.975), df = test.variableX$parameter),
col = 'red')
abline(v = test.variableX$statistic, col= 'green')
```
As we can see, the value of the statistic is outside the range delimited by the 2.5% and 97.5% quantiles. It is therefore unlikely that the statistic was drawn from this distribution, which allows us to reject the null hypothesis and accept the alternative hypothesis with less than a 5% risk of making a mistake. The p-value reported by the test is the probability, under the null hypothesis, of observing a statistic at least as extreme as the one computed.
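As a quick check, the two-sided p-value can be recomputed directly from the statistic and the degrees of freedom reported above:
```{r}
# Probability, under H0, of a statistic at least as extreme as the observed one
2 * pt(-abs(test.variableX$statistic), df = test.variableX$parameter)
```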
Another important piece of information in the output is the 95% confidence interval. This interval quantifies the uncertainty of the estimate of the mean difference: it is a range of values that we can be 95% certain contains the true mean difference between the two populations. As we can see, 0 is not in this range, allowing us, once again, to reject the null hypothesis of equality of the means. The confidence interval can be computed from the SEDM and the mean difference obtained in the test. For a significance level $\alpha$, the confidence interval of a two-sided Student's t test is given by:
$upper\ boundary = (\overline{x_{1}}-\overline{x_{2}}) + t_{1-\frac{\alpha}{2}} \cdot SEDM$
$lower\ boundary = (\overline{x_{1}}-\overline{x_{2}}) - t_{1-\frac{\alpha}{2}} \cdot SEDM$
We can see from this formula that two factors influence the size of the confidence interval, since the SEDM depends on the standard deviations and on the number of samples in each group. On one hand, increasing the number of samples improves the accuracy of the test and narrows the confidence interval. On the other hand, samples with large standard deviations lead to larger confidence intervals. We can compute the confidence interval of the test we performed earlier:
```{r}
low.boundary <- test.variableX$estimate[1] - test.variableX$estimate[2] -
  qt(0.975, df = test.variableX$parameter) * test.variableX$stderr
high.boundary <- test.variableX$estimate[1] - test.variableX$estimate[2] +
  qt(0.975, df = test.variableX$parameter) * test.variableX$stderr
c(low.boundary, high.boundary)
```
The Student's t test is a great tool to compare two distributions. However, it has its weaknesses. In case of skewed data, the standard deviations of the two compared samples can be quite different, which reduces the power of the test to detect relevant differences. In such cases, it is preferable to transform the data, for example with a log transformation. The Student's t test also requires that the sampled values follow a normal distribution, or that there are at least 30 samples in each group (central limit theorem). If this is not the case, it is better to use a non-parametric test called the Mann-Whitney-Wilcoxon test (also known as the Mann-Whitney U test or Wilcoxon rank-sum test).
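Before switching tests, we can get a rough idea of how far `VariableY` (simulated from a skewed gamma distribution) is from normality, and what a log transformation does. This is one possible sketch, using base R's `qqnorm` and the Shapiro-Wilk normality test:
```{r}
y.A <- dataset$VariableY[dataset$group == 'A']
par(mfrow = c(1, 2))
qqnorm(y.A, main = "VariableY, group A"); qqline(y.A)
qqnorm(log(y.A), main = "log(VariableY), group A"); qqline(log(y.A))
shapiro.test(y.A)
```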
The Mann-Whitney-Wilcoxon test is non-parametric: it does not test a specific parameter to differentiate two populations. Instead, we test whether the probability that an observation from group A is greater than an observation from group B differs from the probability that an observation from group B is greater than an observation from group A. The main assumption of the Mann-Whitney-Wilcoxon test is that the samples from both groups are independent, so it cannot be used for paired observations. For paired observations we would prefer the Wilcoxon signed-rank test, which we won't detail here. If the observations in group A are drawn from a random variable X and the observations in group B from a random variable Y, the null and alternative hypotheses are:
$H_0$ : $P(X > Y) = P(Y > X)$ and $H_a$ : $P(X > Y) \neq P(Y > X)$
To simplify, the test relies on comparing the medians of both populations. The test statistic, called U, is computed as follows. All observations are pooled together and ranked based on their values, with ties receiving the average of the tied ranks. For example, the pooled set of 6 values (2, 4, 4, 4, 1, 5) gets the ranks (2, 4, 4, 4, 1, 6). Let $R_1$ denote the sum of the ranks of the observations in the first group. We can then compute U as:
$U = R_1 - \frac{n_1(n_1 + 1)}{2}$
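A minimal sketch of this computation on our data, taking group `A` as the first group; the result should match the `W` statistic reported by `wilcox.test()` below:
```{r}
# Rank all pooled values, sum the ranks of group A, and subtract n1*(n1+1)/2
pooled.ranks <- rank(dataset$VariableY)
R1 <- sum(pooled.ranks[dataset$group == 'A'])
n1 <- sum(dataset$group == 'A')
R1 - n1 * (n1 + 1) / 2
```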
With this statistic U we can then compute the p-value, using a Wilcoxon rank-sum table for small samples or a normal approximation of U for large samples:
```{r}
wilcox.test(dataset$VariableY ~ dataset$group, alternative = "two.sided")
```
We can notice that the output of the Mann-Whitney-Wilcoxon test is shorter than that of the Student's t test. As a non-parametric test, it does not directly provide an effect size, but we can obtain a confidence interval by setting the `conf.int` argument to `TRUE`. Furthermore, relying only on the p-value is not good practice, and one should also report the difference between the medians of the two samples.
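For example, we can request the confidence interval for the location shift and report the difference in medians alongside the p-value:
```{r}
wilcox.test(dataset$VariableY ~ dataset$group, conf.int = TRUE)
# Difference between the group medians
median(dataset$VariableY[dataset$group == 'A']) -
  median(dataset$VariableY[dataset$group == 'B'])
```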
```{r results="asis", echo=FALSE, comment=""}
catLinkTaskSection("exercises/01-comparing-2-groups.Rmd" );
```
## Power of a statistical test
### Motivation
After looking at the results of the group comparison on the `decathlon2` dataset, the researcher wants to design another study to show that the material used for the stadium surface impacts performance in the racing and long jump events. While writing the study protocol the researcher faces the question: how many athletes should be recruited in order to obtain satisfactory testing results?
```{r}
library(factoextra)
data(decathlon2)
```
### Working example
To answer this question we need to introduce an important notion in statistics: the power of a statistical test. Let's simulate two variables for two different groups and look at two tests:
```{r}
set.seed(352)
# Creation of the dataset
variableX.A <- rnorm(10, mean = 0, sd = 1)
variableX.B <- rnorm(10, mean = 2, sd = 1)
variableY.A <- rnorm(10, mean = 0, sd = 1)
variableY.B <- rnorm(10, mean = 0.3, sd = 1)
dataset <- data.frame(VariableX = c(variableX.A, variableX.B),
                      VariableY = c(variableY.A, variableY.B),
                      group = c(rep('A', 10), rep('B', 10)))
```
We created both variables to have different means depending on the group. Now let's test these differences using the Student's t test.
```{r}
t.test(VariableX ~ group, data = dataset)
t.test(VariableY ~ group, data = dataset)
```
We can see that, even though we simulated both variables from distributions that differ between the groups, one test detects a significant difference while the other does not. Typically, there are four possible outcomes of a statistical test:
Reality \ Test results | Non-significant | Significant |
------------|------|-----|
$H_0$ True | $1-\alpha$ (True Negative) | $\alpha$ (False Positive) |
$H_0$ False | $\beta$ (False Negative) | $1-\beta$ (True Positive) |
The $\alpha$ represents the type I error, i.e. the probability of finding a significant difference when there is none. This coefficient $\alpha$ is often fixed at 5%. The $\beta$ coefficient represents the probability of obtaining a non-significant test while the alternative hypothesis is true, and is called the type II error. This is what happened when we tested the `VariableY` we simulated. The power of a statistical test corresponds to $1-\beta$, and is the probability of correctly rejecting the null hypothesis.
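These error rates can be illustrated by simulation. The sketch below repeatedly simulates data under the null hypothesis and under the alternative used for `VariableY`, and counts how often the t test rejects at the 5% level; the first proportion estimates $\alpha$ and the second estimates the power $1-\beta$:
```{r}
set.seed(42)
n.sim <- 1000
# Proportion of rejections when H0 is true (should be close to alpha = 0.05)
p.null <- replicate(n.sim, t.test(rnorm(10), rnorm(10))$p.value)
mean(p.null < 0.05)
# Proportion of rejections when the true difference is 0.3 (estimated power)
p.alt <- replicate(n.sim, t.test(rnorm(10), rnorm(10, mean = 0.3))$p.value)
mean(p.alt < 0.05)
```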
### Illustrating two-sided critical regions
We will now illustrate the notion of power. For `variableX`, group `A` follows a standard normal distribution and group `B` has mean 2. We don't know in advance that values in group `B` tend to be larger than those in group `A`, so we use a two-sided test to compare the two groups. Under the null hypothesis the test statistic follows a Student's t distribution with $20-2$ degrees of freedom; its probability density function, with the rejection regions, is shown below. Here we use $\alpha = 0.05$, so we assign half of the probability of making a type I error to each tail.
```{r}
x <- seq(from = -5, to = 5, by = 0.01)
df <- 20-2
y <- dt(x, df = df)
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under the null hypothesis",
ylim = c(0, max(y) + .1*diff(range(y))))
# Quantiles delimiting the 2.5% lower and upper tails of the t distribution under H0
x.up <- qt(0.975, df = df)
x.lo <- qt(0.025, df = df)
# create a red colour with some transparency
redt <- rgb(red = 1, green = 0, blue = 0, alpha = 0.5)
polygon(c(x[ x >= x.up ], max(x), x.up), c(y[ x >= x.up], 0, 0), col = redt)
polygon(c(min(x), x[ x <= x.lo ], x.lo), c(0, y[ x <= x.lo], 0), col = redt)
legend("topleft", legend = "rejection regions", fill = redt)
segments(x.up, 0, x.up, 1, lty = "dashed", col = "gray", lwd = 2)
text(x.up, max(y) - .1*diff(range(y)), "critical level", pos = 4)
segments(x.lo, 0, x.lo, max(y), lty = "dashed", col = "gray", lwd = 2)
text(x.lo, max(y) - .1*diff(range(y)), "critical level", pos = 2)
```
### Illustrating the power
Say we now want to be able to detect an effect size of 2, given that the variance in the second group is the same as in the first group. To compute the power, we evaluate the probability of rejecting the null hypothesis if the data is generated according to the distribution under the alternative hypothesis. Here we use only a positive effect size, for illustration.
First we make the graph of the probability density functions under the null and the alternative hypotheses.
```{r}
y.a <- dt(x, df = df, ncp = 2)
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under different hypotheses",
ylim = c(0, max(y) + .1*diff(range(y))), lwd = 2)
lines(x, y.a, col = "blue", lwd = 2)
legend("topleft", legend = c("Null", "Alternative"), lty = "solid", lwd = 2,
col = c("black", "blue"))
```
Let us now add the critical regions.
```{r}
y.a <- dt(x, df = df, ncp = 2)
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under different hypotheses",
ylim = c(0, max(y) + .1*diff(range(y))), lwd = 2)
lines(x, y.a, col = "blue", lwd = 2)
legend("topleft", legend = c("Null", "Alternative"), lty = "solid", lwd = 2,
col = c("black", "blue"))
polygon(c(x[ x >= x.up ], max(x), x.up), c(y[ x >= x.up], 0, 0), col = redt)
polygon(c(min(x), x[ x <= x.lo ], x.lo), c(0, y[ x <= x.lo], 0), col = redt)
legend("topright", legend = "rejection regions", fill = redt)
segments(x.up, 0, x.up, 1, lty = "dashed", col = "gray", lwd = 2)
segments(x.lo, 0, x.lo, max(y), lty = "dashed", col = "gray", lwd = 2)
```
Then the power is the probability of rejecting the null hypothesis, i.e. obtaining a value in a critical region, if the data is generated by the distribution under the alternative (in blue):
```{r}
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under different hypotheses",
ylim = c(0, max(y) + .1*diff(range(y))), lwd = 2)
lines(x, y.a, col = "blue", lwd = 2)
legend("topleft", legend = c("Null", "Alternative"), lty = "solid", lwd = 2,
col = c("black", "blue"))
# create a blue colour with some transparency
bluet <- rgb(red = 0, green = 0, blue = 1, alpha = 0.5)
polygon(c(x[ x >= x.up ], max(x), x.up), c(y[ x >= x.up], 0, 0), col = redt)
polygon(c(min(x), x[ x <= x.lo ], x.lo), c(0, y[ x <= x.lo], 0), col = redt)
legend("topright", legend = "rejection regions", fill = redt)
segments(x.up, 0, x.up, 1, lty = "dashed", col = "gray", lwd = 2)
segments(x.lo, 0, x.lo, max(y), lty = "dashed", col = "gray", lwd = 2)
polygon(c(x[ x >= x.up ], max(x), x.up), c(y.a[ x >= x.up], 0, 0), col = bluet)
legend("bottomleft", legend = "power", fill = bluet)
```
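The blue area can also be computed directly from the noncentral t distribution used in the plot (a sketch, with the same `ncp = 2` as above; the lower-tail term is negligible here):
```{r}
# Probability of landing in a rejection region under the alternative
1 - pt(x.up, df = df, ncp = 2) + pt(x.lo, df = df, ncp = 2)
```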
### What happens if the effect size is smaller?
In the example, the values of `variableY` in group `B` were generated with an effect size of 0.3. Let us now compare the power obtained with an effect size of 2, as for `variableX`, with the power for this smaller effect size. First we plot only the probability density functions:
```{r}
y.a <- dt(x, df = df, ncp = .3)
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under different hypotheses",
ylim = c(0, max(y) + .1*diff(range(y))), lwd = 2)
lines(x, y.a, col = "blue", lwd = 2)
legend("topleft", legend = c("Null", "Alternative"), lty = "solid", lwd = 2,
col = c("black", "blue"))
polygon(c(x[ x >= x.up ], max(x), x.up), c(y[ x >= x.up], 0, 0), col = redt)
polygon(c(min(x), x[ x <= x.lo ], x.lo), c(0, y[ x <= x.lo], 0), col = redt)
legend("topright", legend = "rejection regions", fill = redt)
segments(x.up, 0, x.up, 1, lty = "dashed", col = "gray", lwd = 2)
segments(x.lo, 0, x.lo, max(y), lty = "dashed", col = "gray", lwd = 2)
```
Now we colour the area corresponding to the power:
```{r}
plot(x, y, type="l", xlab = "variable values", ylab = "density",
main = "Distribution of X under different hypotheses",
ylim = c(0, max(y) + .1*diff(range(y))), lwd = 2)
lines(x, y.a, col = "blue", lwd = 2)
legend("topleft", legend = c("Null", "Alternative"), lty = "solid", lwd = 2,
col = c("black", "blue"))
# create a blue colour with some transparency
bluet <- rgb(red = 0, green = 0, blue = 1, alpha = 0.5)
polygon(c(x[ x >= x.up ], max(x), x.up), c(y[ x >= x.up], 0, 0), col = redt)
polygon(c(min(x), x[ x <= x.lo ], x.lo), c(0, y[ x <= x.lo], 0), col = redt)
legend("topright", legend = "rejection regions", fill = redt)
segments(x.up, 0, x.up, 1, lty = "dashed", col = "gray", lwd = 2)
segments(x.lo, 0, x.lo, max(y), lty = "dashed", col = "gray", lwd = 2)
polygon(c(x[ x >= x.up ], max(x), x.up), c(y.a[ x >= x.up], 0, 0), col = bluet)
legend("bottomleft", legend = "power", fill = bluet)
```
The power of a statistical test depends on four different parameters: the size of the effect (small in our second example), the sample size, the pooled variance of the samples, and the significance level $\alpha$ of the test.
If the effect size of the phenomenon we want to study cannot easily be changed, we can influence the other three to obtain good power:
- Increasing the sample size
- Decreasing the variability of the samples
- Increasing the level $\alpha$, i.e. increasing the chance of false positives
As the latter is not advisable in most studies, the easiest way to influence the power of a test is to increase the sample size. Let's compute the power to reject the null hypothesis in our example. To do so we can, for example, use the function `pwr.t.test` from the R package *pwr* (or `power.t.test` from base R). The function has four main arguments:
- n: the sample size in each group
- d: the effect size, i.e. the difference between the group means divided by the pooled standard deviation of both groups
- sig.level: the significance level
- power: the power of the test
```{r}
library(pwr)
# Power computation for VariableX
pwr.t.test(n = 10, d = 2/1, sig.level = 0.05, power = NULL,
           type = "two.sample", alternative = "two.sided")
# Power computation for VariableY
pwr.t.test(n = 10, d = 0.5/1, sig.level = 0.05, power = NULL,
           type = "two.sample", alternative = "two.sided")
```
We can see that the power to reject the null hypothesis is 98.8% for `VariableX` but only 18.5% for `VariableY`. The relation between power and effect size is not linear: the effect size is 4 times larger (0.5 versus 2), but the power to reject $H_0$ is more than 5 times larger for `VariableX`. For simplicity we will not present the exact formula for the power here, as it involves more statistical background than this course aims at.
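To illustrate this non-linearity, we can sketch the power over a range of effect sizes for n = 10 per group (this relies on `pwr.t.test` accepting a vector of effect sizes, just as it accepts a vector of sample sizes below):
```{r}
# Power as a function of the effect size, for n = 10 per group
effect.sizes <- seq(0.1, 2, by = 0.1)
powers.d <- pwr.t.test(n = 10, d = effect.sizes, sig.level = 0.05, power = NULL,
                       type = "two.sample", alternative = "two.sided")$power
plot(effect.sizes, powers.d, type = "b", xlab = "effect size d", ylab = "power")
```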
We can also plot the evolution of the power as a function of the sample size:
```{r}
# Computation of the power for different sample sizes
powers <- pwr.t.test(n = 1:30, d = 0.5/1, sig.level = 0.05, power = NULL,
                     type = "two.sample", alternative = "two.sided")$power
# Plot of the power as a function of the sample size
plot(1:30, powers, xlab = "sample size per group", ylab = "power")
```
We see that even with a sample size of 30 the power to detect the difference between the two groups is only `r powers[30]`. Instead, we can compute the sample size needed for a given power. The powers commonly requested in studies are 80% and 90%. We can compute the sample sizes needed to reach both powers for `VariableY`.
```{r}
# Computation of the sample size for a fixed power of 80%
test.80.percent <- pwr.t.test(n = NULL, d = 0.5/1, sig.level = 0.05, power = 0.8,
                              type = "two.sample", alternative = "two.sided")
test.80.percent
# Computation of the sample size for a fixed power of 90%
test.90.percent <- pwr.t.test(n = NULL, d = 0.5/1, sig.level = 0.05, power = 0.9,
                              type = "two.sample", alternative = "two.sided")
test.90.percent
```
We can see that we would need `r round(test.80.percent$n, 0)` samples in each group for a power of 80% and `r round(test.90.percent$n, 0)` for a power of 90%.
We have seen that, in order to determine the proper sample size for a study, we need to know or estimate the effect size we want to detect and the standard deviation in the sample, and to fix both the type I error and the power to detect a true positive. To do so, the usual approach is to go back to the literature or to use pilot studies.
Note that for the Student's t test, obtaining power and sample sizes is relatively simple thanks to the exact mathematical formula linking all four parameters. But for many tests (especially non-parametric ones) there is no exact formula and simulations are needed. Fortunately, there are many functions in R that can perform such simulations and power computations.
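As an illustration, here is a minimal simulation sketch estimating the power of the Mann-Whitney-Wilcoxon test for the two gamma distributions used earlier, with 30 samples per group:
```{r}
set.seed(123)
n.sim <- 1000
# Fraction of simulated datasets in which the Wilcoxon test rejects at the 5% level
p.values <- replicate(n.sim,
  wilcox.test(rgamma(30, shape = 2, scale = 1),
              rgamma(30, shape = 2, scale = 1.1))$p.value)
mean(p.values < 0.05)
```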
```{r results="asis", echo=FALSE, comment=""}
catLinkTaskSection("exercises/02-poweranalysis.Rmd" );
```