forked from TrevorFrench/R-for-Data-Analysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
p3c3-outliers.qmd
123 lines (80 loc) · 3.64 KB
/
p3c3-outliers.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
# Outliers
Outliers are observations that fall outside the expected scope of the dataset. It's important to identify outliers in your data and determine the necessary treatment for them before moving into the next analysis phase.
For example, it might be necessary to impute values, remove a row, perform sensitivity analysis, or choose analysis methods that are robust in the presence of outliers.
## Finding Outliers Visually
One common first step many people employ when looking for outliers is visualizing their datasets so that extreme values can quickly be spotted This section will briefly cover several common visualizations used to identify outliers; however, each of these plots will be explored more in-depth later in the book.
### Scatter Plot
This is probably the first plot you'll reach for when trying to visualize outliers in your data. The scatter plot is a great tool to quickly visualize your data at a high level and see if anything major stands out.
```{r}
plot(mtcars$mpg)
```
Here's how a scatter plot with an extreme outlier might look.
```{r}
data <- c(1,4,7,9,2,6,3,99,4,2,7,8)
plot(data)
```
### Box Plot
Another way to quickly visualize outliers is to use the "boxplot" function. This plot will allow you to evaluate outliers in a more systematic way.
```{r}
boxplot(mtcars$mpg)
```
The solid black line represents the median value of your dataset. The top and bottom "whiskers" represent your extreme values (minimum and maximum). The top and bottom of the "box" represent the first and third quartile.
Here's an example of a box plot with an extreme outlier.
```{r}
boxplot(data)
```
### Histogram
Histograms will allow you to see how often values occur within certain buckets.
```{r}
hist(mtcars$mpg)
```
Here's a histogram with data that contains an outlier.
```{r}
hist(data)
```
### Density Plot
Density plots can be thought of as a smoothed version of a histogram. (You
can tune the degree of smoothing, e.g. via the `adjust` argument to the
`density()` function.)
```{r}
plot(density(mtcars$mpg))
```
Here's an example of a density plot with data that contains an outlier.
```{r}
plot(density(data))
```
## Finding Outliers Statistically
While examining your data visually may be a convenient and sufficient way to detect outliers in your data, sometimes you may require a more rigorous approach to outlier detection.
### Standard Deviation
One simple way to check the extremity of your observation is to calculate how many standard deviations it falls from the mean.
Let's start by calculating the standard deviation of our dataset by using the "sd" function.
```{r}
sd <- sd(data)
print(sd)
```
Next, let's calculate the mean of our dataset.
```{r}
mean <- mean(data)
print(mean)
```
Finally, for each record in our vector, let's calculate how many standard deviations it falls from the mean.
```{r}
extremity <- abs(data - mean) / sd
print(extremity)
```
## Removing Outliers
After identifying your outliers you have several options to remove them.
Your first option would be to manually remove a specific outlier.
```{r}
manually_cleaned <- data[data != 99]
print(manually_cleaned)
```
A more robust option would be to rely on your previously performed calculations to remove any observations which are located too far away from the mean.
```{r}
statistically_cleaned <- data[extremity < 3]
print(statistically_cleaned)
```
## Resources
"Statistics - Standard Deviation" by W3 Schools: <https://www.w3schools.com/statistics/statistics_standard_deviation.php>
"Identifying outliers with the 1.5xIQR rule":
<https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule>