generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path05_2_pertussis_sol.Rmd
212 lines (150 loc) · 6.28 KB
/
05_2_pertussis_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
---
title: "Exercise 5.2: Hypothesis testing on the pertussis example - solution"
author: "Lieven Clement, and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aim of this tutorial
Upon this exercise you can
1. conduct a statistical hypothesis test for a two group comparison in R and to interpret the results.
2. critically evaluate the assumptions for a two sample t-test
3. use simulations to assist you when evaluating the assumptions of a two group comparison
4. select the correct test based on the data exploration.
# Background
Researchers wanted to study the immune response on pertussis.
They have set up an experiments with 40 rats.
16 rats were infected with pertussis and 24 rats received a control treatment.
Researchers measured the white blood cell concentration (WBC) in each rat (count per mm$^3$).
De data consists of two variables:
- WBC: white blood cell count (counts/mm$^3$).
- trt: treatment
- control: rat recieved control treatment
- pertussis: rat was infected with pertussis
# Import the dataset
Load the libraries
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
Data path:
`https://raw.githubusercontent.com/statOmics/PSLSData/main/wbcon.csv`
```{r}
wbcon <- read_csv("https://raw.githubusercontent.com/statOmics/PSLSData/main/wbcon.csv")
```
```{r}
glimpse(wbcon)
```
## Aim of the study
The overarching goal of this study was to assess if the white blood cell count changes upon pertussis infection. To this end, researchers randomized 40 rats
to two treatments: A control treatment and a treatment in which the rat was infected with pertussis.
# Data exploration
A crucial first step in a data analysis is to visualize and to explore the raw data.
This will allow us to gain insight in the data.
A secondary goal of the data exploration is to check the assumptions of the test that we will perform.
## Which test will you perform to compare the white blood cell count between infected and control rats?
40 different rats are assigned at random to a control treatment or a treatment where they are infected with pertussis.
The white blood cells of the rats are measured on different animals and we can assume that the data between rats are independent.
Note, that this will not be the case if multiple rats are housed in the same cages!
We will therefore perform a two-sample t-test to assess if the average white blood cell count differs between infected rats and rats that received the control treatment.
## What are the assumptions of this test?
1. The data are independent
2. The data in both groups have the same variance
3. The data in both groups are normally distributed.
## Boxplots
We can assess the second assumption using boxplots.
```{r}
wbcon %>% ggplot(aes(x = trt, y = WBC, fill = trt)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = "jitter") +
ylab("WBC (count/mm3)") +
xlab("treatment") +
stat_summary(
fun = mean, geom = "point",
shape = 5, size = 3, color = "black"
)
```
What do you observe?
Both the mean and variance of the data seems to differ between control rats and rats infected with pertussis.
So the second assumption is not valid.
If the data are normally distributed we can compare the groups using a Welch modified t-test, which is valid if the data in both groups are exhibiting a different variance.
## QQ-plots
To assess the assumption that the data are normally distributed in each treatment group, we will use QQ plots.
```{r}
wbcon %>%
ggplot(aes(sample = WBC)) +
geom_qq() +
geom_qq_line() +
facet_grid(cols = vars(trt))
```
What do you observe?
The white blood cell counts appear to be normally distributed in both treatment groups.
# Assess the research question with the appropriate t-test
## Analysis
```{r}
output <- t.test(WBC ~ trt, data = wbcon, var.equal = FALSE)
output
```
## Conclusion
The average white blood cell count is extremely different between rats that are infected with pertussis and rads that received a control treatment (p << 0.001).
The white blood cell count is on average `r output$estimate %>% diff %>% abs %>% format(.,digits = 5)` blood cells/mm$^3$ higher for rats with pertussis than for rats that received a control treatment (95% CI [`r output$conf.int %>% abs %>% sort %>% format(.,digits = 5)`]).
# ADDENDUM: Train yourself in checking the assumptions
In order for the learners to get more proficient in evaluating the assumptions we will simulate 8 dataset with sample sizes similar to our data for which the assumptions of normality and equal variance do hold. For the QQ-plots we will only plot the one from one of the groups.
## Simulate data
We simulate 8 datasets with the same sample sizes, means and pooled variance as in the sample.
```{r}
set.seed(21532)
nSim <- 8
## descriptive statistics
wbcSum <- wbcon %>%
group_by(trt) %>%
summarize(
mean = mean(WBC, na.rm = TRUE),
sd = sd(WBC, na.rm = TRUE),
n = n()
) %>%
mutate(se = sd / sqrt(n))
sigma <- sqrt(sum(wbcSum$sd^2 * (wbcSum$n - 1)) / (sum(wbcSum$n) - 2))
normSim <- matrix(
c(
wbcon$WBC,
rnorm(sum(wbcSum$n) * nSim,
mean = c(
rep(wbcSum$mean[1], wbcSum$n[1]),
rep(wbcSum$mean[2], wbcSum$n[2])
),
sd = sigma)
),
nrow = sum(wbcSum$n)) %>%
as.data.frame() %>%
mutate(trt = wbcon$trt)
```
## Comparisons of variances
```{r}
names(normSim)[1:(nSim+1)] <- c("orig",paste0("sim",1:nSim))
normSim %>%
gather(samp, data, -trt) %>%
ggplot(aes(x = trt, y = data)) +
geom_boxplot() +
facet_wrap(~samp)
```
We observe that the difference in variability of the original observations (top left plot) is larger than in the each simulations. Indicating that there is indeed evidence that the variances in each group differ.
## Evaluation of normality
### Control group
```{r}
normSim %>%
gather(samp, data, -trt) %>%
filter(trt == "control") %>%
ggplot(aes(sample = data)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~samp)
```
### Pertussis group
```{r}
normSim %>%
gather(samp, data, -trt) %>%
filter(trt == "pertussis") %>%
ggplot(aes(sample = data)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~samp)
```
The data in each group does not show larger deviations from normality than what we observe in each of the eight datasets that were simulated under the model assumptions.