-
Notifications
You must be signed in to change notification settings - Fork 10
/
lab_statistics.Rmd
288 lines (220 loc) · 10 KB
/
lab_statistics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
title: "Basic statistic"
subtitle: "R Programming Foundation for Data Analysis"
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
<br>
```{r,child="assets/header-lab.Rmd"}
```
# Introduction
In this lab, we will use statistics to gain a deeper understanding and insight into the dataset. Through statistical methods, we can evaluate data quality, identify patterns, and detect any outliers that may be present. You will apply several descriptive measures to explore and summarize the dataset, allowing for a more informed analysis.
# Generating the Data
We create a dataset which consists of:
- Two numerical variables (Normally distributed).
- Three categorical.
**Note:** The following chunk has a function by which you generate a random dataset (`n` samples with `[0-100]%` **missing data**). You can also adjust these values to create a new dataset. We will talk about **functions** tomorrow.
```{r generate-data, accordion = T}
# Function to generate the data
generate_dataset <- function(n, na_frac) {
# Set seed for reproducibility
set.seed(123)
# Gender variable
Gender <- sample(c('Male', 'Female'), n, replace = TRUE)
# Continuous Variable 1 with shifted distribution between genders
Variable_1 <- ifelse(Gender == 'Male', rnorm(n, mean = 55, sd = 10), rnorm(n, mean = 45, sd = 10))
# Continuous Variable 2: negative correlation with Variable_1 for one of the sexes
Variable_2 <- ifelse(Gender == 'Male', -0.5 * Variable_1 + rnorm(n, mean = 10, sd = 5),
0.5 * Variable_1 + rnorm(n, mean = 10, sd = 5))
# Introduce NA values (10% of the data by default)
Variable_1[sample(1:n, round(n*na_frac))] <- NA
Variable_2[sample(1:n, round(n*na_frac))] <- NA
# Introduce outliers
Variable_1[c(n)] <- max(Variable_1, na.rm = T) + sd(Variable_1, na.rm = T)
Variable_1[c(round(n)/2)] <- min(Variable_1, na.rm = T) - sd(Variable_1, na.rm = T)
Variable_2[c(n)] <- max(Variable_2, na.rm = T) + sd(Variable_2, na.rm = T)
Variable_2[c(round(n)/2)] <- min(Variable_2, na.rm = T) - sd(Variable_2, na.rm = T)
# Categorical variables with more levels for Spearman correlation
Category_1 <- sample(c('Category_A', 'Category_B', 'Category_C', 'Category_D', 'Category_E'), n, replace = TRUE)
Category_2 <- sample(c('Group1', 'Group2', 'Group3'), n, replace = TRUE)
# Create data frame
dataset <- data.frame(Variable_1, Variable_2, Category_1, Category_2, Gender)
return(dataset)
}
n <- 1000 # Number of samples (observation).
na_frac <- 0.1 # Introducing missing values in the dataset.
dataset <- generate_dataset(n, na_frac = na_frac )
```
- Check the content of the dataset.
- How many rows and columns do you have?
```{r head,accordion=TRUE}
head(dataset)
dim(dataset)
```
# Measures of Central Tendency
## Mean
```{r mean, accordion = T}
mean_var1 <- mean(dataset$Variable_1)
mean_var2 <- mean(dataset$Variable_2)
mean_var1
mean_var2
```
- What is the `mean` value of `Variable_1` and `Variable_2`?
- Is it NA? Why?
- How many NA values do you find in your dataset?
```{r n-na, accordion = T}
n_na <- sum(is.na(dataset))
sprintf('There are %g missing values', n_na)
```
- Can you fix it?
<details>
<summary>Click to expand</summary>
To handle the `NA` values, you can set `na.rm = TRUE` in the function `mean(x, na.rm = T)`, which removes the queries with `NA` values.
</details>
You can plot the distribution of the values by `hist()`. You will learn more about plots in **Basic Graphics** lecture.
```{r hist, accordion = T}
hist(dataset$Variable_1, main = 'Distribution of Variable_1', xlab = 'Variable_1')
```
Can you create a separate histogram for each gender?
```{r gender-plot,accordion = T}
dataset_f <- dataset[(dataset$Gender == 'Female'), ]
dataset_m <- dataset[(dataset$Gender == 'Male'), ]
hist(dataset_f$Variable_1, main = 'Female', xlab = 'Variable_1')
hist(dataset_m$Variable_1, main = '', xlab = '')
```
## Median
```{r median, accordion = T}
median_var1 <- median(dataset$Variable_1, na.rm = T)
median_var2 <- median(dataset$Variable_2, na.rm = T)
median_var1
median_var2
```
# Measures of Spread
## Range
We can find the range of values in numerical variables.
```{r range, accordion = T}
range_var1 <- range(dataset$Variable_1, na.rm = TRUE)
range_var2 <- range(dataset$Variable_2, na.rm = TRUE)
range_var1
range_var2
```
For categorical variables we can count the queries.
```{r table-cat, accordion = T}
table(dataset$Category_1)
table(dataset[,c('Category_1', 'Category_2')])
table(dataset[,c('Category_1', 'Category_2', 'Gender')])
```
## Interquartile.
Calculate the following for `Variable_1`:
- Min.
- Median.
- Max.
- Quantiles (q1:25% and q2:75%) by using `quantile` function.
- Interquartile range (q3 - q1).
**Note:** Make sure to control for missing values (`na.rm = T`)
```{r interquartile, accordion = T}
min_val <- min(dataset$Variable_1, na.rm = T)
q1 <- quantile(dataset$Variable_1, 0.25, na.rm = T)
median_val <- median(dataset$Variable_1, na.rm = T)
q3 <- quantile(dataset$Variable_1, 0.75, na.rm = T)
max_val <- max(dataset$Variable_1, na.rm = T)
print('Variable_1 range:')
c(min_val, q1, median_val, q3, max_val)
# Interqurtile:
print('Interquartile:')
iqr_val1 <- q3 - q1
unname(iqr_val1)
# or
iqr_val1 <- IQR(dataset$Variable_1, na.rm = T)
iqr_val1
```
## Standard deviation
- `sd` is a built-in function in R to calculate standard deviation.
- Calculate standard deviation of `Variable_1` for males and females separately.
**Note:** Make sure to control for missing values (`na.rm = T`)
```{r sd, accordion = T}
sd_var1_f <- sd(dataset_f$Variable_1, na.rm = TRUE)
sd_var1_m <- sd(dataset_m$Variable_1, na.rm = TRUE)
sd_var1_f
sd_var1_m
```
## Variance
- `var` is a built-in function in R to calculate variance.
- Calculate variance of `Variable_1` for males and females separately.
**Note:** Make sure to control for missing values (`na.rm = T`).
```{r var, accordion = T}
var_var1_f <- var(dataset_f$Variable_1, na.rm = TRUE)
var_var1_m <- var(dataset_m$Variable_1, na.rm = TRUE)
var_var1_f
var_var1_m
```
# Correlation
## Pearson's correlation
- Let's check the correlation of `Variable_1` and `Variable_2` by Pearson's method implemented in `cor` function. You can specify the method by `method = 'pearson'`.
**Note:** Here we handle the missing values a bit differently. In `cor` function, you control the missing values (`NA`) by setting `use = 'complete.obs'` which means to include observations that there are values for both variables. In other words, rows with `NA` in either of the values will be excluded.
```{r pearson, accordion = T}
pearson_correlation <- cor(dataset$Variable_1, dataset$Variable_2, use = 'complete.obs', method = "pearson")
pearson_correlation
```
- You can also plot the values in a simple plot using `plot` function. How do you interpret the results?
```{r v1v2plot, accordion = T}
plot(dataset$Variable_1, dataset$Variable_2, main = 'Variable_1 vs Variable_2', xlab = 'Variable_1', ylab = 'Variable_2')
```
## Spearman's correlation
- You can calculate Spearman's correlation using the same function (`cor`) for both numerical and categorical data.
- Check the result of Spearman's correlation and Pearson's correlation for `Variable_1` and `Variable_2`.
- Calculate the correlation between `Category_1` and `Variable_1`. Consider `Category_1` with ordinal variable such as level of satisfaction labeled as **Category_A to E**.
**Note:** Categorical_data cannot be directly passed to `cor`. You first need to encode the values as factors (the values are presented in form of `levels`) by `as.factor`. Then by converting the vector (e.g. `dataset$Category_1`) to numeric by `as.numeric` function, you can pass it to `cor` function.
```{r spearman, accordion = T}
# Spearman correlation for numerical variables
spearman_corr_numeric <- cor(dataset$Variable_1, dataset$Variable_2, use = "complete.obs", method = "spearman")
# Spearman correlation between numeric and categorical variables
spearman_corr_cat_var1 <- cor(as.numeric(as.factor(dataset$Category_1)), dataset$Variable_1, use = "complete.obs", method = "spearman")
spearman_corr_cat_var2 <- cor(as.numeric(as.factor(dataset$Category_1)), dataset$Variable_2, use = "complete.obs", method = "spearman")
spearman_corr_numeric
spearman_corr_cat_var1
spearman_corr_cat_var2
```
- Plot categorical data using `plot` function can be fun! If you pass the categorical as factor ( `as.factor(dataset$Category_1)`) together with a numerical values, you will generate a boxplot.
```{r boxplotcatg1var1, accordion = T}
plot((as.factor(dataset$Category_1)), dataset$Variable_1)
```
## Optional exercise
- Calculate the correlation between `Gender` and `Variable_1`, `Gender` as well as `Variable_2`.
```{r gendervarscor}
cor_gender_var1 <- cor(as.numeric(as.factor(dataset$Gender)), dataset$Variable_1, use = 'complete.obs', method = 'spearman')
cor_gender_var2 <- cor(as.numeric(as.factor(dataset$Gender)), dataset$Variable_2, use = 'complete.obs', method = 'spearman')
cor_gender_var1
cor_gender_var2
```
- Now plot `Gender` vs `Variable_1` or `Variable_2`. How do you interpret the result?
```{r gendervarsplot1}
plot((as.factor(dataset$Gender)), dataset$Variable_1)
```
```{r gendervarsplot2}
plot(as.factor(dataset$Gender), dataset$Variable_2)
```
# Bonus
You can show both histograms (males and females) in the same plot. You will learn about this in "Basic Graphics".
```{r bonus, accordion = T, eval = T, echo = T}
par(mfrow = c(2,1), mar = c(5, 4, 1, 2) + 0.1)
range_val <- c(min(dataset$Variable_2, na.rm = T), max(dataset$Variable_2, na.rm = T) )
hist(dataset_f$Variable_2, main = 'Female', xlab = 'Variable_2', xlim = range_val, breaks = 30)
hist(dataset_m$Variable_2, main = 'Male', xlab = 'Variable_2', xlim = range_val, breaks = 30)
```